I’m running a medium-sized Druid 0.21.1 cluster with 1 master, 4 data servers, and 1 query server, using the example configs from the Druid Clustered Deployment article. Server sizes are as recommended by that article. I have a data source which contains about 11.6 GB of data in around 10,000 segments, or 82 million rows. I receive around a million new rows each day.
I asked about compaction and performance before and y’all were super helpful in figuring out what was going on, so I figured I’d continue the saga. Query performance has vastly improved, but for very large scans (say 30 million rows, filtered on a multi-value string dimension), queries still take over a minute. I feel like this could go way faster. My question is: where should I go next in optimizing this?
I’m running automated compaction tasks with partitioning type single_dim, and the partition dimension is the value I filter by most often. However, the Segment Size Optimization docs say I should aim for around 5M rows per segment, and I’m definitely not reaching that: according to the Datasources screen, my segments are between 0.008M and 0.170M rows. I think this is because the segment granularity is set to HOUR and there just isn’t more data to store – I get 24 segments per day, one per hour. Does it make sense here to coarsen the segment granularity (say, to DAY) in order to get bigger segments? If yes, I will basically lose the hour timestamps in all my rows, correct? I think I’d be willing to sacrifice that.
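For reference, here’s roughly what my auto-compaction config would look like with that change – a sketch, not my exact config. The partitionDimension is an assumption (clientID, since that’s what I filter by most), targetRowsPerSegment is just the 5M from the docs, and I believe granularitySpec.segmentGranularity is supported in auto-compaction as of 0.21:

```json
{
  "dataSource": "my-cool-data-source",
  "skipOffsetFromLatest": "P1D",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "single_dim",
      "partitionDimension": "clientID",
      "targetRowsPerSegment": 5000000
    }
  },
  "granularitySpec": {
    "segmentGranularity": "DAY"
  }
}
```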
I have a strong suspicion that my queries could be optimized. As noted before, rows in my Druid data have tags in a multi-value string dimension. I’ve noticed there are dedicated filtering options for these in Druid native queries, whereas handling them in SQL feels pretty clumsy. Here’s an example query – do you think rewriting it in native JSON could help?
SELECT "payload" as "xAxisValue", COUNT(DISTINCT clientUser) AS "yAxisValue" FROM "my-cool-data-source" WHERE clientID = 'SOME-ID' AND __time BETWEEN '2021-08-31T00:00:00Z' AND '2021-10-12T09:21:34Z' AND CONTAINS_STRING(payload, 'systemVersion:') GROUP BY 1 ORDER BY 2 DESC
I’m very much looking forward to your comments. I’m learning a lot in this community, and I’m super grateful. On a meta note: should I split these kinds of topics up like I would on Stack Overflow, or is one large question with lots of context better?