DataSketches Theta sketch ingestion: effect on segment size

Hello Druid Community,

Is there a way to reduce segment size when ingesting with a theta sketch metric?
I started with 125M raw records.
These were ingested into 4 segments; each segment rolled up to roughly 11M to 12M rows in Druid.
original data without theta sketch, avg segment size: 271 MB
theta sketch size=512, avg segment size: 602.03 MB
theta sketch size=1024, avg segment size: 602.39 MB
theta sketch size=2048, avg segment size: 602.65 MB
theta sketch size=4096, avg segment size: 602.80 MB
theta sketch size=8192, avg segment size: 602.87 MB
theta sketch size=16384, avg segment size: 602.94 MB

I don’t see a significant increase in segment size as the sketch size grows, but larger sketches require more reducer memory.
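For context, the sketch metric is declared roughly like this in the metricsSpec (illustrative snippet; the column names are hypothetical, and size is the value I varied above):

```json
{
  "metricsSpec": [
    {
      "type": "thetaSketch",
      "name": "unique_users",
      "fieldName": "user_id",
      "size": 4096
    }
  ]
}
```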

Is there a way to:

  • bring down the avg segment size?
  • decrease reducer memory usage?
  • speed up the ingestion process? (I am using Hadoop-based ingestion)

Thanks,
Ling

Hi Ling,

The general recommendation is around 5 million rows per segment, so you would probably want to set
targetRowsPerSegment or numShards, which should help bring down your segment size.
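For Hadoop-based ingestion that would go into the partitionsSpec of the tuningConfig, roughly like this (a minimal sketch assuming hashed partitioning; the 5M target is just the guideline above, so tune it for your data):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "hashed",
      "targetRowsPerSegment": 5000000
    }
  }
}
```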

One thing that significantly sped up our ingestion process with sketches was using the Hadoop combiner.

Try setting "useCombiner": true in the tuningConfig and see if it helps your use case.
You may also want to set "maxBytesInMemory": -1 to prevent the processes from continuously spilling to disk.
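That portion of the tuningConfig would look roughly like this (illustrative; merge it into your existing spec):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "useCombiner": true,
    "maxBytesInMemory": -1
  }
}
```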

We also noticed that our mappers were spending a lot of time spilling to disk when sorting.
You can look into tuning mapreduce.task.io.sort.mb to a higher value if you are seeing similar behavior.
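In a Druid Hadoop ingestion spec, Hadoop properties like that can be passed through jobProperties in the tuningConfig (a sketch; 512 MB is just an illustrative value):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.task.io.sort.mb": "512"
    }
  }
}
```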

I would be curious to know if these tweaks worked out for you as well.

Hello Samarth,
thanks for the suggestion.
I’ll try out useCombiner, maxBytesInMemory: -1, and increasing mapreduce.task.io.sort.mb.

Best,
Ling

Hello Samarth,
I’ve tested a combination of useCombiner, maxBytesInMemory: -1, and an increased mapreduce.task.io.sort.mb.
I didn’t see much of a difference from maxBytesInMemory: -1 or the larger mapreduce.task.io.sort.mb. However, enabling useCombiner=true decreased processing time by 22%.
Regarding average segment size, is there a way to tune the size increase caused by the sketch?

Thanks again.

Best,
Ling