Error during large dataset indexing

Hello. I'm trying to index a 900 MB gzipped CSV file. At the end, the job fails with the error below:

2016-06-20T11:29:02,245 INFO [pool-19-thread-1] io.druid.segment.IndexMerger - Completed dimension[transactionEnd] in 425,079 millis.
2016-06-20T11:29:02,245 INFO [pool-19-thread-1] io.druid.segment.IndexMerger - outDir[var/tmp/base8250600194428777487flush/merged/v8-tmp] completed inverted.drd in 1,521,706 millis.
2016-06-20T11:29:02,249 INFO [pool-19-thread-1] io.druid.segment.IndexMerger - wrote metadata.drd in outDir[var/tmp/base8250600194428777487flush/merged/v8-tmp].
2016-06-20T11:29:02,260 INFO [Thread-21] org.apache.hadoop.mapred.LocalJobRunner - reduce task executor complete.
2016-06-20T11:29:02,411 WARN [Thread-21] org.apache.hadoop.mapred.LocalJobRunner - job_local1582446619_0001
java.lang.Exception: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529) [hadoop-mapreduce-client-common-2.3.0.jar:?]
Caused by: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
	at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:860) ~[?:1.7.0_101]
	at com.google.common.io.Files.map(Files.java:864) ~[guava-16.0.1.jar:?]
	at com.google.common.io.Files.map(Files.java:851) ~[guava-16.0.1.jar:?]
	at com.google.common.io.Files.map(Files.java:818) ~[guava-16.0.1.jar:?]
	at com.google.common.io.Files.map(Files.java:790) ~[guava-16.0.1.jar:?]
	at com.metamx.common.io.smoosh.FileSmoosher.add(FileSmoosher.java:114) ~[java-util-0.27.7.jar:?]
	at com.metamx.common.io.smoosh.Smoosh.smoosh(Smoosh.java:62) ~[java-util-0.27.7.jar:?]
	at io.druid.segment.IndexMerger.makeIndexFiles(IndexMerger.java:987) ~[druid-processing-0.9.0.jar:0.9.0]
	at io.druid.segment.IndexMerger.merge(IndexMerger.java:421) ~[druid-processing-0.9.0.jar:0.9.0]
	at io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:242) ~[druid-processing-0.9.0.jar:0.9.0]
	at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.mergeQueryableIndex(IndexGeneratorJob.java:519) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:686) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:469) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_101]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_101]
	at java.lang.Thread.run(Thread.java:745) ~[?:1.7.0_101]
2016-06-20T11:29:02,719 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Job job_local1582446619_0001 failed with state FAILED due to: NA


If I look into the tmp folder (var/tmp/base8250600194428777487flush/merged/v8-tmp) I see some large files, the largest being inverted.drd at 2.4 GB. I believe a file that large really cannot be memory-mapped. Is there anything I can do to work around this issue?

Thank you in advance...
Nikita

Hi Nikita,
There is a 2 GB maximum byte size limit per column in Druid: columns are memory-mapped, and Java's FileChannel.map() cannot map a region larger than Integer.MAX_VALUE bytes, which is exactly the exception in your stack trace.

Try setting targetPartitionSize in your partitionsSpec to create smaller partitions, so each segment (and each of its columns) stays well under that limit.
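As a rough sketch, in the tuningConfig of a Hadoop batch ingestion spec it would look something like this (the partition type and row count below are only illustrative, tune them to your data):

```json
"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "targetPartitionSize": 5000000
  }
}
```

Note that targetPartitionSize is a row count, not a byte size, so pick a value that produces segments of a few hundred MB for your data.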

Thank you, will try this :slight_smile: