Batch Ingestion Failing - Dimension Based Partitioning - Druid 0.13 - URGENT

Hi All,

We upgraded our production Druid cluster from Druid 0.9.1 to Druid 0.13 because on Druid 0.9.1, when we changed from hash-based partitioning to dimension-based partitioning, our Hadoop batch ingestion started failing with the following exception:

2018-06-27 14:35:05,763 INFO [IPC Server handler 15 on 34918] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1515062285112_1094_r_000013_3: Error: java.lang.IllegalStateException: Wrote[2208368448] bytes, which is too many.
    at com.google.common.base.Preconditions.checkState(Preconditions.java:200)
    at io.druid.segment.data.GenericIndexedWriter.close(GenericIndexedWriter.java:109)
    at io.druid.segment.serde.ComplexMetricColumnSerializer.close(ComplexMetricColumnSerializer.java:76)
    at io.druid.segment.IndexMerger.makeIndexFiles(IndexMerger.java:877)
    at io.druid.segment.IndexMerger.merge(IndexMerger.java:423)
    at io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:244)
    at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.mergeQueryableIndex(IndexGeneratorJob.java:519)
    at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:686)
    at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:469)

When we debugged the issue at the time, we found out that this is expected in Druid 0.9.1 and that it was fixed in Druid 0.13 (see this PR: https://github.com/druid-io/druid/pull/3743).

So we decided to upgrade our cluster to Druid 0.13. But now, when we tried running the lambda job, it started failing again, not with the above exception, but because of a condition related to Integer.MAX_VALUE somewhere in the GenericIndexedWriter.java class.

Could someone please help us out here: what are the limitations of single-dimension partitioning in Druid 0.13, in terms of the maximum number of rows possible in a segment, the maximum size of a segment, the maximum size of a column, etc.?

We urgently need to figure this out, as we cannot find any documentation online and the code does not give us much confidence in our understanding.

Thanks,

Pravesh Gupta (Adobe)

Hey Pravesh,

Before #3743 the limits were that the number of rows and the maximum size in bytes of each column must both fit in ints (<2 billion / <2GB or so). #3743 was meant to raise the limit in byte size for complex columns, but did not do it for all other types of columns, and did not change the row limit. What error & stack trace are you getting in 0.13.0?
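To put rough numbers on the 2GB part (the per-row sizes here are just illustrative assumptions, not from your data): a column that serializes to about 8 bytes per row would hit the 2^31-byte limit at roughly 2^31 / 8 ≈ 268 million rows, while a complex sketch column at around 1KB per row would hit it at only about 2 million rows.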

Btw, you might be able to work around whatever problem you’re seeing by reducing your targetPartitionSize.
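For reference, a dimension-based partitionsSpec inside the Hadoop tuningConfig looks roughly like the sketch below. The dimension name and sizes are placeholders and the field names are from memory, so double-check them against the docs for your version:

    "partitionsSpec" : {
      "type" : "dimension",
      "partitionDimension" : "yourDimension",
      "targetPartitionSize" : 5000000,
      "assumeGrouped" : false
    }

A smaller targetPartitionSize means each segment targets fewer rows, which keeps each column's serialized size smaller.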

Gian

Thanks for the reply Gian.

How would reducing the targetPartitionSize fix this? I guess that when using dimension-based partitioning, all the rows having the same dimension value would go into EXACTLY one segment, no matter how many rows come in. Correct?

As far as the error is concerned, we were having issues at the reduce stage in EMR; the reduce jobs were failing. We tried increasing our EMR cluster storage, with no luck. There were a lot of cascading errors.

One more thing, Gian:

Our data is such that, for the dimension we want to partition on, we expect to get 180 million rows in one day for some particular value of that dimension. Are we going to run into issues with that? What should the targetPartitionSize be in that case?

Any help on this, please?