We have a small Druid cluster that has been running for a couple of years now. It holds a lot of small hourly segments that we want to merge into more optimized segments. We tried this by setting
This does create merge tasks, but it created so many of them that they interfered with realtime tasks.
Is it possible to limit how many merge tasks can run at the same time? Or is it possible to create a new MiddleManager node and force these jobs onto it, to avoid interfering with realtime indexing tasks?
I think there isn’t a config to limit the number of merge tasks the coordinator will run, but it’d be a nice feature if you want to do a patch for it.
It would be cool if druid.coordinator.merge.on were documented. We also have about 20000 segments of 20-45MB each, indexed by Hadoop Index Tasks.
I was under the impression that it is relevant to realtime indexing only, but now I get the feeling that it would merge any smallish segments into bigger ones.
Both new ones and all the old ones, is that correct?
We could improve performance significantly …
The druid.coordinator.merge.on feature is kind of limited: it only works with “none” shard specs, it only merges “horizontally” (across time) not “vertically” (collapsing multiple segments for the same time range), and it is generally incompatible with realtime indexing.
Feel free to use it but be aware of the limitations.
We are working on a more friendly compaction scheme which you can read about here: https://github.com/druid-io/druid/issues/4479
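For anyone finding this thread later, the flag under discussion is a boolean coordinator property. A minimal fragment, assuming it goes in the coordinator's runtime.properties and that the limitations listed above are acceptable for your cluster:

```properties
# Coordinator runtime.properties (file location is an assumption; adjust to your deployment).
# Asks the coordinator to merge small segments into larger ones. Off by default.
druid.coordinator.merge.on=true
```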
Gian, we use Hadoop index tasks, so our segments have the “none” shard spec.
Question: if it merges horizontally across time AND also by size (512MB), will the time span of such a merged segment be arbitrary?
What is it going to look like in deep storage? If it reduces 25 000 segments of HOUR segment granularity to 2000 segments of 512MB each, are they going to have time ranges like 2017-11-1T00/2017-11-1T16 ?
Is the segment going to be stored to deep storage as “2017-11-1T00/2017-11-1T16” and Druid would drop the segments that have been merged?
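To make the question concrete, here is a rough Python sketch of what "merge across time up to a size cap" would imply for the resulting intervals. It is only an illustration of the idea, not Druid's actual merge logic: the greedy packing, the `group_segments` and `hourly` helpers, and the 32MB-per-hour figure are all assumptions made for the example, and the 512MB cap is simply the number from the question.

```python
from datetime import datetime, timedelta

# Assumed size cap, taken from the 512MB figure in the question above.
MERGE_BYTES_LIMIT = 512 * 1024 * 1024


def group_segments(segments, limit=MERGE_BYTES_LIMIT):
    """Greedily pack time-ordered (start, end, size_bytes) segments.

    Only adjacent segments are combined, so every result covers one
    contiguous time range. How long that range is depends on how much
    data each hour holds. Hypothetical helper, not a Druid API.
    """
    merged = []
    current = None  # [start, end, total_bytes, segment_count]
    for start, end, size in sorted(segments):
        contiguous = current is not None and current[1] == start
        if contiguous and current[2] + size <= limit:
            # Extend the open group to cover this segment as well.
            current[1] = end
            current[2] += size
            current[3] += 1
        else:
            # Close the open group (if any) and start a new one.
            if current is not None:
                merged.append(tuple(current))
            current = [start, end, size, 1]
    if current is not None:
        merged.append(tuple(current))
    return merged


def hourly(day, count, size_bytes):
    """Build `count` consecutive one-hour segments starting at `day`."""
    return [
        (day + timedelta(hours=h), day + timedelta(hours=h + 1), size_bytes)
        for h in range(count)
    ]


if __name__ == "__main__":
    # One day of hourly segments at an assumed 32MB each.
    segments = hourly(datetime(2017, 11, 1), 24, 32 * 1024 * 1024)
    for start, end, total, count in group_segments(segments):
        print(f"{start:%Y-%m-%dT%H}/{end:%Y-%m-%dT%H}  {count} segments, {total // 2**20} MB")
```

Under these assumptions the first merged interval comes out as 2017-11-01T00/2017-11-01T16 (16 hourly segments, 512 MB), so the time span is not fixed: it follows from the data volume per hour.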
Gian, I tried that to see for myself, but the druid.coordinator.merge.on feature is not compatible with segments indexed with s3n://s3Id:s3Key@bucket/path
Gian, I solved that and it works nicely. I can see a new 2017-10-08T00/2017-10-08T08 segment that merged 8 hourly segments.
Question: how is the information propagated to deep storage? If I now use insert-segment-to-db on an empty MySQL database, would it load only the new big segments and ignore the ones that were merged?
Yes, it works! It is taken care of in deep storage too. Fantastic news. I think this has extended our Druid cluster's lifetime by at least one more year, with even better performance than before.
Thanks, guys, for the great work on this feature!