Big segment: merge is too slow.

We have a large data source, so I changed segmentGranularity to TEN_MINUTE, but now I see a single big segment whose merge has been running for a very long time and still has not finished. That segment is also blocking the other segments, whose merge and push are stuck waiting behind it.
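For context, the segmentGranularity is set in the granularitySpec of the realtime task spec, roughly like this (a sketch from memory, not our exact spec):

    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "TEN_MINUTE",
      "queryGranularity": "NONE"
    }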

The segment's directory is 4.9G:

4.9G ./2015-10-28T00:50:00.000+08:00_2015-10-28T01:00:00.000+08:00

Some of the log output:

2015-10-28T01:39:48,225 INFO [dumplog_bts-2015-10-28T00:50:00.000+08:00-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Adding hydrant[FireHydrant{index=null, queryable=io.druid.segment.ReferenceCountingSegment@13ea9964, count=554}]

2015-10-28T01:39:48,264 INFO [dumplog_bts-2015-10-28T00:50:00.000+08:00-persist-n-merge] io.druid.segment.IndexMerger - outDir[/dump/11/druid/persistent/task/dumplog_bts_realtime4/work/persist/dumplog_bts/2015-10-28T00:50:00.000+08:00_2015-10-28T01:00:00.000+08:00/merged/v8-tmp] completed index.drd in 5 millis.

2015-10-28T04:20:03,268 INFO [dumplog_bts-2015-10-28T00:50:00.000+08:00-persist-n-merge] io.druid.segment.IndexMerger - outDir[/dump/11/druid/persistent/task/dumplog_bts_realtime4/work/persist/dumplog_bts/2015-10-28T00:50:00.000+08:00_2015-10-28T01:00:00.000+08:00/merged/v8-tmp] completed dim conversions in 9,615,004 millis.

2015-10-28T04:20:19,372 INFO [dumplog_bts-2015-10-28T00:50:00.000+08:00-persist-n-merge] io.druid.segment.IndexMerger - outDir[/dump/11/druid/persistent/task/dumplog_bts_realtime4/work/persist/dumplog_bts/2015-10-28T00:50:00.000+08:00_2015-10-28T01:00:00.000+08:00/merged/v8-tmp] walked 500,000/500,000 rows in 16,092 millis.

2015-10-28T04:24:03,226 INFO [dumplog_bts-2015-10-28T00:50:00.000+08:00-persist-n-merge] io.druid.segment.IndexMerger - outDir[/dump/11/druid/persistent/task/dumplog_bts_realtime4/work/persist/dumplog_bts/2015-10-28T00:50:00.000+08:00_2015-10-28T01:00:00.000+08:00/merged/v8-tmp] walked 500,000/48,500,000 rows in 2,158 millis.

2015-10-28T04:24:18,088 INFO [dumplog_bts-2015-10-28T00:50:00.000+08:00-persist-n-merge] io.druid.segment.IndexMerger - outDir[/dump/11/druid/persistent/task/dumplog_bts_realtime4/work/persist/dumplog_bts/2015-10-28T00:50:00.000+08:00_2015-10-28T01:00:00.000+08:00/merged/v8-tmp] completed walk through of 48,638,641 rows in 254,820 millis.

2015-10-28T04:24:18,089 INFO [dumplog_bts-2015-10-28T00:50:00.000+08:00-persist-n-merge] io.druid.segment.IndexMerger - Starting dimension[call_info] with cardinality[22]

2015-10-28T04:24:51,698 INFO [dumplog_bts-2015-10-28T00:50:00.000+08:00-persist-n-merge] io.druid.segment.IndexMerger - Completed dimension[call_info] in 33,610 millis.

2015-10-28T04:24:51,698 INFO [dumplog_bts-2015-10-28T00:50:00.000+08:00-persist-n-merge] io.druid.segment.IndexMerger - Starting dimension[id] with cardinality[48,337,858]

As of now, dimension[id] has still not finished.

The jstack output for the merge thread is:

"dumplog_bts-2015-10-28T00:50:00.000+08:00-persist-n-merge" daemon prio=10 tid=0x00007fd330c79000 nid=0x21b87 runnable [0x000000004ac21000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:153)
    at java.lang.StringCoding.decode(StringCoding.java:193)
    at java.lang.String.<init>(String.java:416)
    at java.lang.String.<init>(String.java:481)
    at com.metamx.common.StringUtils.fromUtf8(StringUtils.java:39)
    at com.metamx.common.StringUtils.fromUtf8(StringUtils.java:51)
    at io.druid.segment.data.GenericIndexed$2.fromByteBuffer(GenericIndexed.java:333)
    at io.druid.segment.data.GenericIndexed$2.fromByteBuffer(GenericIndexed.java:323)
    at io.druid.segment.data.GenericIndexed$BufferIndexed._get(GenericIndexed.java:221)
    at io.druid.segment.data.GenericIndexed$BufferIndexed.get(GenericIndexed.java:189)
    at io.druid.segment.data.IndexedIterable$1.next(IndexedIterable.java:60)
    at io.druid.segment.QueryableIndexIndexableAdapter.<init>(QueryableIndexIndexableAdapter.java:71)
    at io.druid.segment.IndexMerger$2.apply(IndexMerger.java:226)
    at io.druid.segment.IndexMerger$2.apply(IndexMerger.java:222)
    at com.google.common.collect.Lists$TransformingRandomAccessList.get(Lists.java:573)
    at io.druid.segment.IndexMerger.makeIndexFiles(IndexMerger.java:840)
    at io.druid.segment.IndexMerger.merge(IndexMerger.java:336)
    at io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:218)
    at io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:207)
    at io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:444)
    at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:40)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

We use druid-0.8.1.

Hi,

4.9G for a single segment is quite large, and the cardinality of the id dimension is 48M.

We generally try to keep segment sizes around 5M rows.

To avoid creating one huge segment, try sharding your data into multiple segments. For your data volume, I think 10 shards per interval would be a good starting point; a rough sketch is below.
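For realtime ingestion, one way to do this is to run several tasks for the same dataSource and interval, each with its own partition number in a linear shardSpec in the tuningConfig. A rough sketch (the exact layout may differ slightly in 0.8.1):

    "tuningConfig": {
      "type": "realtime",
      "shardSpec": {
        "type": "linear",
        "partitionNum": 0
      }
    }

Run 10 such tasks with partitionNum 0 through 9 and partition the incoming events across them; each task then produces a shard roughly one tenth the current size.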

Another thing you can try is indexing the super-high-cardinality dimension with a hyperUnique aggregator, if all you need from it is the number of unique values; an example follows.
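For example, instead of ingesting id as a dimension, you could drop it from the dimensions list and add something like this to the metricsSpec (the output name here is just a placeholder):

    {
      "type": "hyperUnique",
      "name": "unique_ids",
      "fieldName": "id"
    }

At query time you read it back with a hyperUnique aggregator on unique_ids. The segment then stores a small sketch per row instead of a dictionary of ~48M distinct id values, which is what the dim-conversion step is grinding through in your stack trace.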