Hadoop single dimension partitioning: No suitable partitioning dimension found

I’m working on getting Hadoop indexing with single-dimension partitioning to work in Druid 0.15 and am running into a blocker with the error below. We were able to index several days’ worth of data before this started happening, and I can’t figure out what’s significantly different between the days that are failing and the ones that worked.


Error: org.apache.druid.java.util.common.ISE: No suitable partitioning dimension found!
    at org.apache.druid.indexer.DeterminePartitionsJob$DeterminePartitionsDimSelectionReducer.innerReduce(DeterminePartitionsJob.java:801)
    at org.apache.druid.indexer.DeterminePartitionsJob$DeterminePartitionsDimSelectionBaseReducer.reduce(DeterminePartitionsJob.java:548)
    at org.apache.druid.indexer.DeterminePartitionsJob$DeterminePartitionsDimSelectionBaseReducer.reduce(DeterminePartitionsJob.java:524)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)

From reading this thread, I understand this can happen if a single value of the partition dimension contains more rows than the maxPartitionSize limit:
https://groups.google.com/forum/#!topic/druid-development/rJhJ6hxkysc

So I checked, and the largest value does contain about 9 million rows. I bumped the limit to 10M but it still failed, then tried again with 20M and it failed again. I’ve also tried adjusting the targetPartitionSize up and down to no effect and am running out of ideas. There are 213 distinct values in the partition column we are using for this time period, with a maximum row count per value of around 9M and a minimum of 1.

Any ideas what I should try next? This is a very large dataset that eats up a lot of CPU time on our cluster, and our testing has shown that this feature’s ability to prune shards during segment scans when filtering on the partition dimension provides a big performance boost and reduces CPU utilization, so we’d really like to figure this out.

Part of the indexing spec:

"tuningConfig": {
  "type": "hadoop",
  "useCombiner": "true",
  "combineText": "true",
  "maxRowsInMemory": 50000,
  "numBackgroundPersistThreads": "1",
  "partitionsSpec": {
    "type": "dimension",
    "targetPartitionSize": 15000000,
    "maxPartitionSize": 25000000,
    "partitionDimension": "mdse_dept_ref_i"
  },
  "jobProperties": {
    "mapreduce.job.running.reduce.limit": "250",
    "mapreduce.job.maps": "20",
    "mapreduce.job.queuename": "SVGFDRUID01P",
    "mapreduce.job.jvm.numtasks": "20",
    "mapreduce.reduce.java.opts": "-Xmx9216m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.indexing.doubleStorage=double",
    "mapreduce.map.output.compress": "true",
    "mapreduce.map.memory.mb": "5461",
    "mapreduce.job.running.map.limit": "250",
    "mapreduce.input.fileinputformat.split.minsize": "200000000",
    "mapreduce.reduce.memory.mb": "12288",
    "mapreduce.map.java.opts": "-Xmx4096m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.indexing.doubleStorage=double",
    "mapreduce.input.fileinputformat.split.maxsize": "500000000"
  },

Has anyone else using single-dimension partitioning run into this error?

For anyone who runs into this in the future: at least in our case, the cause was null values in the partition dimension column for a small number of rows in the dataset.
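If the nulls can’t be cleaned up in the source data itself, one possible approach is to drop those rows at ingestion time with an ingestion-time filter. The snippet below is only a sketch, assuming your Druid version honors a transformSpec filter inside the dataSchema for Hadoop batch ingestion and that, under the default null handling, a selector filter with a null value matches null/empty dimension values; the dimension name is the mdse_dept_ref_i column from the spec above:

"dataSchema": {
  "transformSpec": {
    "filter": {
      "type": "not",
      "field": {
        "type": "selector",
        "dimension": "mdse_dept_ref_i",
        "value": null
      }
    }
  }
}

Alternatively, mapping the nulls to a sentinel value upstream would keep those rows in the datasource while still giving the partitioner a usable value for every row.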