We have a lambda-architecture-like pipeline (many ETLs) that is partitioned by HOUR, Druid included, and it has stopped scaling now that we need to query data going back to 2015.
So I’m forced to reindex the past 2 years of Druid data into DAY granularity/segments for it to scale.
But the problem is that Druid then becomes the only component with DAY partitioning, while all the other components in the pipeline keep HOUR partitioning.
So I have 3 options:
1) work around it by queuing 24 incoming hours of data and submitting one Hadoop indexing task to Druid with DAY segment granularity (a sketch of such a task is below),
2) figure out how to index data into Druid hourly while still getting the performance boost of DAY segments,
3) optimize Druid for queries that hit 17520 HOUR segments (2 years × 365 days × 24 hours).
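For 1), this is roughly what I have in mind: a minimal sketch, assuming a Hadoop batch task posted to the overlord’s /druid/indexer/v1/task endpoint; datasource name, host, paths and intervals are placeholders, and parser, metricsSpec and tuningConfig are omitted for brevity:

import json
import requests  # third-party HTTP library

# Hadoop indexing task with DAY segment granularity, fed by the 24
# queued hours of raw input. For the 2-year backfill the same task
# shape works with a "dataSource" inputSpec that reads the existing
# segments back in as input instead of raw files.
task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "events",  # placeholder datasource
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",   # one segment set per day
                "queryGranularity": "HOUR",    # keep hourly resolution inside it
                "intervals": ["2016-11-01/2016-11-02"]  # the queued day
            }
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {
                "type": "static",
                "paths": "/data/events/2016-11-01/*"  # placeholder path to the 24 queued hours
            }
        }
    }
}

# Submit to the overlord (host is a placeholder)
resp = requests.post(
    "http://overlord:8090/druid/indexer/v1/task",
    data=json.dumps(task),
    headers={"Content-Type": "application/json"},
)
print(resp.json())  # {"task": "<task id>"} on success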
I couldn’t figure out 2) because, as you say, the HOUR segments would overshadow one another within the DAY segment.
So I’m forced to go with hack 1), which is a problem because the pipeline uses partition introspection for dependency management, i.e. it uses the coordinator’s metadata API to know which hours are present in Druid …
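For context, the introspection is essentially this (a minimal sketch against the coordinator’s datasource intervals endpoint; host and datasource name are placeholders):

import requests  # third-party HTTP library

COORDINATOR = "http://coordinator:8081"
DATASOURCE = "events"  # placeholder

# Ask the coordinator which intervals the datasource currently serves.
url = COORDINATOR + "/druid/coordinator/v1/datasources/" + DATASOURCE + "/intervals"
intervals = requests.get(url).json()
# e.g. ["2016-11-01T05:00:00.000Z/2016-11-01T06:00:00.000Z", ...]

# With HOUR segments each entry is a single hour, so "is hour X ready?"
# is a simple membership test; with DAY segments the API only reports
# whole days and the hourly dependency checks no longer line up.
print(len(intervals), "intervals currently served")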
So I guess I will have to do 3): keep the HOUR segments and try to optimize the historical nodes so they can handle queries that hit 17520 segments.
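In case it helps, these are the historical runtime.properties knobs I plan to experiment with first (illustrative values only, not recommendations):

# historical runtime.properties - illustrative values only
druid.processing.numThreads=23              # roughly cores - 1
druid.processing.buffer.sizeBytes=536870912
druid.server.http.numThreads=40
druid.historical.cache.useCache=true        # per-segment result caching
druid.historical.cache.populateCache=true
druid.segmentCache.locations=[{"path":"/mnt/druid/segments","maxSize":500000000000}]
druid.server.maxSize=500000000000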
Thank you for responding, David.

Jakub