Persist index error and ArrayIndexOutOfBoundsException during batch ingestion...

I am using the Hadoop indexer (java -Xmx12G -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:conf/druid/_common io.druid.cli.Main index hadoop specfile.json) because the other indexer (the Indexing Service task) gives me errors I cannot make sense of.

Here is the end of my log file:

2016-05-18T15:12:51,908 INFO [communication thread] org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce

2016-05-18T15:12:52,184 INFO [pool-10-thread-1] io.druid.indexer.IndexGeneratorJob - Maximum number of rows [50000] reached

2016-05-18T15:12:52,185 INFO [pool-10-thread-1] io.druid.indexer.IndexGeneratorJob - 50,001 lines to 50,000 rows in 1,402 millis

2016-05-18T15:12:52,185 INFO [index00020-persist] io.druid.segment.IndexMerger - Starting persist for interval[2016-02-21T00:00:00.000Z/2016-02-22T00:00:00.000Z], rows[50,000]

2016-05-18T15:12:52,232 INFO [index00020-persist] io.druid.segment.IndexMerger - outDir[/tmp/base880361292817636286flush/index00020/v8-tmp] completed index.drd in 1 millis.

2016-05-18T15:12:52,258 INFO [index00020-persist] io.druid.segment.IndexMerger - outDir[/tmp/base880361292817636286flush/index00020/v8-tmp] completed dim conversions in 26 millis.

2016-05-18T15:12:52,262 ERROR [index00020-persist] io.druid.indexer.IndexGeneratorJob - persist index error

2016-05-18T15:12:53,635 INFO [pool-10-thread-1] io.druid.indexer.IndexGeneratorJob - Maximum number of rows [50000] reached

2016-05-18T15:12:53,635 INFO [pool-10-thread-1] io.druid.indexer.IndexGeneratorJob - 50,001 lines to 50,000 rows in 1,370 millis

2016-05-18T15:12:53,636 INFO [index00021-persist] io.druid.segment.IndexMerger - Starting persist for interval[2016-02-21T00:00:00.000Z/2016-02-22T00:00:00.000Z], rows[50,000]

2016-05-18T15:12:53,709 INFO [index00021-persist] io.druid.segment.IndexMerger - outDir[/tmp/base880361292817636286flush/index00021/v8-tmp] completed index.drd in 0 millis.

2016-05-18T15:12:53,745 INFO [index00021-persist] io.druid.segment.IndexMerger - outDir[/tmp/base880361292817636286flush/index00021/v8-tmp] completed dim conversions in 36 millis.

2016-05-18T15:12:53,750 ERROR [index00021-persist] io.druid.indexer.IndexGeneratorJob - persist index error

2016-05-18T15:12:54,908 INFO [communication thread] org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce

2016-05-18T15:12:55,019 INFO [pool-10-thread-1] io.druid.indexer.IndexGeneratorJob - Maximum number of rows [50000] reached

2016-05-18T15:12:55,019 INFO [pool-10-thread-1] io.druid.indexer.IndexGeneratorJob - 50,006 lines to 50,000 rows in 1,267 millis

2016-05-18T15:12:55,019 INFO [index00022-persist] io.druid.segment.IndexMerger - Starting persist for interval[2016-02-21T00:00:00.000Z/2016-02-22T00:00:00.000Z], rows[50,000]

2016-05-18T15:12:55,084 INFO [index00022-persist] io.druid.segment.IndexMerger - outDir[/tmp/base880361292817636286flush/index00022/v8-tmp] completed index.drd in 0 millis.

2016-05-18T15:12:55,126 INFO [index00022-persist] io.druid.segment.IndexMerger - outDir[/tmp/base880361292817636286flush/index00022/v8-tmp] completed dim conversions in 42 millis.

2016-05-18T15:12:55,131 ERROR [index00022-persist] io.druid.indexer.IndexGeneratorJob - persist index error

2016-05-18T15:12:56,474 INFO [pool-10-thread-1] io.druid.indexer.IndexGeneratorJob - Maximum number of rows [50000] reached

2016-05-18T15:12:56,474 INFO [pool-10-thread-1] io.druid.indexer.IndexGeneratorJob - 50,000 lines to 50,000 rows in 1,341 millis

2016-05-18T15:12:56,474 INFO [index00023-persist] io.druid.segment.IndexMerger - Starting persist for interval[2016-02-21T00:00:00.000Z/2016-02-22T00:00:00.000Z], rows[50,000]

2016-05-18T15:12:56,536 INFO [index00023-persist] io.druid.segment.IndexMerger - outDir[/tmp/base880361292817636286flush/index00023/v8-tmp] completed index.drd in 0 millis.

2016-05-18T15:12:56,567 INFO [index00023-persist] io.druid.segment.IndexMerger - outDir[/tmp/base880361292817636286flush/index00023/v8-tmp] completed dim conversions in 31 millis.

2016-05-18T15:12:56,572 ERROR [index00023-persist] io.druid.indexer.IndexGeneratorJob - persist index error

2016-05-18T15:12:56,816 INFO [pool-10-thread-1] io.druid.indexer.IndexGeneratorJob - 1,209,683 lines completed.

2016-05-18T15:12:56,816 INFO [pool-10-thread-1] io.druid.segment.IndexMerger - Starting persist for interval[2016-02-21T00:00:00.000Z/2016-02-22T00:00:00.000Z], rows[9,628]

2016-05-18T15:12:56,832 INFO [pool-10-thread-1] io.druid.segment.IndexMerger - outDir[/tmp/base880361292817636286flush/final/v8-tmp] completed index.drd in 0 millis.

2016-05-18T15:12:56,849 INFO [pool-10-thread-1] io.druid.segment.IndexMerger - outDir[/tmp/base880361292817636286flush/final/v8-tmp] completed dim conversions in 17 millis.

2016-05-18T15:12:56,854 INFO [Thread-232] org.apache.hadoop.mapred.LocalJobRunner - reduce task executor complete.

2016-05-18T15:12:57,267 WARN [Thread-232] org.apache.hadoop.mapred.LocalJobRunner - job_local1298338564_0002

java.lang.Exception: java.lang.ArrayIndexOutOfBoundsException: 39

at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]

at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529) [hadoop-mapreduce-client-common-2.3.0.jar:?]

Caused by: java.lang.ArrayIndexOutOfBoundsException: 39

at io.druid.segment.IndexMerger$10.apply(IndexMerger.java:1057) ~[druid-processing-0.9.0.jar:0.9.0]

at io.druid.segment.IndexMerger$10.apply(IndexMerger.java:1038) ~[druid-processing-0.9.0.jar:0.9.0]

at com.google.common.collect.Iterators$8.transform(Iterators.java:794) ~[guava-16.0.1.jar:?]

at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48) ~[guava-16.0.1.jar:?]

at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48) ~[guava-16.0.1.jar:?]

at com.google.common.collect.Iterators$PeekingImpl.next(Iterators.java:1145) ~[guava-16.0.1.jar:?]

at com.metamx.common.guava.MergeIterator.next(MergeIterator.java:75) ~[java-util-0.27.7.jar:?]

at com.google.common.collect.Iterators$PeekingImpl.next(Iterators.java:1145) ~[guava-16.0.1.jar:?]

at io.druid.collections.CombiningIterator.next(CombiningIterator.java:75) ~[druid-common-0.9.0.jar:0.9.0]

at io.druid.segment.IndexMerger.makeIndexFiles(IndexMerger.java:791) ~[druid-processing-0.9.0.jar:0.9.0]

at io.druid.segment.IndexMerger.merge(IndexMerger.java:421) ~[druid-processing-0.9.0.jar:0.9.0]

at io.druid.segment.IndexMerger.persist(IndexMerger.java:193) ~[druid-processing-0.9.0.jar:0.9.0]

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.persist(IndexGeneratorJob.java:501) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:676) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:469) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]

at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]

at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]

at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_74]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_74]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_74]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_74]

at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_74]

2016-05-18T15:12:57,531 INFO [main] org.apache.hadoop.mapreduce.Job - Job job_local1298338564_0002 failed with state FAILED due to: NA

2016-05-18T15:12:57,558 INFO [main] org.apache.hadoop.mapreduce.Job - Counters: 38

File System Counters

FILE: Number of bytes read=1352583638495

FILE: Number of bytes written=626472806399

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=0

HDFS: Number of bytes written=0

HDFS: Number of read operations=0

HDFS: Number of large read operations=0

HDFS: Number of write operations=0

Map-Reduce Framework

Map input records=8592036

Map output records=8535605

Map output bytes=6084709857

Map output materialized bytes=6118858955

Input split bytes=42771

Combine input records=0

Combine output records=0

Reduce input groups=1

Reduce shuffle bytes=6118858955

Reduce input records=1146208

Reduce output records=0

Spilled Records=8535605

Shuffled Maps =1113

Failed Shuffles=0

Merged Map outputs=1113

GC time elapsed (ms)=4982

CPU time spent (ms)=0

Physical memory (bytes) snapshot=0

Virtual memory (bytes) snapshot=0

Total committed heap usage (bytes)=877354024960

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=0

File Output Format Counters

Bytes Written=56

2016-05-18T15:12:57,573 INFO [main] io.druid.indexer.JobHelper - Deleting path[file:/tmp/testdata/2016-05-18T150008.063Z/27fd216025c941a1a096775f5ddc5fdd]

2016-05-18T15:12:57,584 ERROR [main] io.druid.cli.CliHadoopIndexer - failure!!!

java.lang.reflect.InvocationTargetException

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_74]

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_74]

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_74]

at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_74]

at io.druid.cli.CliHadoopIndexer.run(CliHadoopIndexer.java:115) [druid-services-0.9.0.jar:0.9.0]

at io.druid.cli.Main.main(Main.java:105) [druid-services-0.9.0.jar:0.9.0]

Caused by: com.metamx.common.ISE: Job[class io.druid.indexer.IndexGeneratorJob] failed!

at io.druid.indexer.JobHelper.runJobs(JobHelper.java:343) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]

at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]

at io.druid.indexer.JobHelper.runJobs(JobHelper.java:323) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]

at io.druid.cli.CliInternalHadoopIndexer.run(CliInternalHadoopIndexer.java:130) ~[druid-services-0.9.0.jar:0.9.0]

at io.druid.cli.Main.main(Main.java:105) ~[druid-services-0.9.0.jar:0.9.0]


How can I overcome these errors? (In Druid 0.8.3 I was getting Java heap space errors with both the Overlord indexing task and Hadoop indexing. I worked around that by reducing maxRowsInMemory to 50000, and I am using the same setting now.)
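For context, maxRowsInMemory sits in the Hadoop tuningConfig of my spec file; below is a trimmed sketch of just that part, not my full spec (and if your Druid version expects the older rowFlushBoundary name instead, the idea is the same):

```
"tuningConfig": {
  "type": "hadoop",
  "maxRowsInMemory": 50000
}
```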

Thanks!

Addition: using the Indexing Service with 10G peons, I got the exact same error.

Another addition: I have a lot of metrics (50+). After reducing them to 3 in the spec file, ingestion completes. Is there an explanation for this?

Hi prinull,
There is a 2 GB maximum byte size limit per column in Druid. I wonder if one of your metrics is reaching that limit and causing this error.

Try reducing your targetPartitionSize in order to have smaller segments.
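targetPartitionSize lives under partitionsSpec in the Hadoop tuningConfig; a rough sketch of that section is below (the value shown is only illustrative, just pick something smaller than the 5,000,000 default):

```
"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "targetPartitionSize": 1000000
  }
}
```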

Hi,

I have the same issue. Did you find a resolution? If I reduce the metrics it works; if I keep all 25 metrics, the task fails when writing to disk on completion, even when it only runs for a minute.

Hi Fatih, can you post your task log?

Hi David,

I believe it is due to null fields in the CSV format; after moving to JSON it works without problems.
Somehow the parser does not like the nulls and shifts the fields: before the segment is handed over to deep storage, I can see from the peons that the dimension values have shifted.

Does the parser have an enclosure (quoting) option for CSV? I couldn't find one in the documentation.

log.log (187 KB)

The CSV parser can only be configured with a list delimiter for multi-valued dimensions.
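For reference, a minimal CSV parseSpec looks roughly like this (column names here are placeholders for your own schema); listDelimiter is the only extra parsing option:

```
"parser": {
  "type": "hadoopyString",
  "parseSpec": {
    "format": "csv",
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "columns": ["timestamp", "dim1", "dim2", "metric1"],
    "dimensionsSpec": { "dimensions": ["dim1", "dim2"] },
    "listDelimiter": "|"
  }
}
```

There is no quote/enclosure setting, so values are matched to the columns list purely by position; if a row has fewer fields than the columns list expects, everything after the gap shifts, which would match what you are seeing.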

Have you moved from CSV to JSON now? Is everything working alright?