Are my segment sizes too small, or should I optimize my queries?

My Setup: Apache Druid 0.21.1

I’m running a medium-sized Druid 0.21.1 cluster with 1 master, 4 data servers, and 1 query server, using the example configs from the Druid Clustered Deployment article. Servers sizes are as recommended by that article.I have a data source which contains about 11.6Gb of data in around 10 000 segments or 82 million rows. I receive around a million new rows each day.

Where should I optimize next?

I asked about compaction and performance before and y’all were super helpful in figuring out what was going on, so I figured, I’d continue the saga. Query performance has vastly improved, but for very large datasets (say 30 million rows, and filtering a multi value string dimension), queries still run for over a minute or so. I feel like this could go way faster. My question is: Where should I go next in optimizing this?

Granularity and Segment Size

I’m running automated compaction tasks with partitioning type single_dim, and the dimension is the value I filter by the most. However, the Segment Size Optimization Docs say I should aim for a segment size of around 5M rows, and I definitely am not reaching that. According to the datasources screen, my segments are between 0.008M and 0.170M rows. I think this is because the granularity is set to hour, and there is just not more data to be stored – I get 24 segments per day, one per hour. Does it make sense here to reduce the granularity of the data in order to get bigger segments? If yes, I will basically lose the hours timestamp in all my rows, correct? I think I’d be willing to sacrifice that.

Native Queries instead of SQL Queries

I have a strong suspicion that my queries could be optimized. As noted before, rows in my Druid data have tags, in a multi value string dimension. I’ve noticed there are dedicated filtering options for these in the Druid Native Query Language, whereas with SQL , handling these feels pretty clumsy. Here’s an example query – do you think rewriting this in JSON could help?

SELECT
  "payload" as "xAxisValue", COUNT(DISTINCT clientUser) AS "yAxisValue"
FROM "my-cool-data-source"
WHERE clientID = 'SOME-ID'
AND __time BETWEEN '2021-08-31T00:00:00Z' AND '2021-10-12T09:21:34Z'

AND CONTAINS_STRING(payload, 'systemVersion:')

GROUP BY 1
ORDER BY 2 DESC

Thanks in advance

I’m very much looking forward to your comments. I’m learning a lot in this community, and I’m super grateful. On a meta note, should I split up these kinds of topics like I would on stackoverflow? Or is one large question with lots of context better?

I had quick glance at your questions and notice the following -

Huge Number of Tiny segments - 11.6Gb of data in around 10 000 segments or 82 million rows =>That means, on average ~1.16 MB per segment with ~8k rows per segment

On Granularity part - There are two types of granularity in druid

(1) segmentGranularity - * Time interval (for the time chunk containing the segment; this corresponds to the segmentGranularity specified at ingestion time).

(2) queryGranularity - The resolution of timestamp storage within each segment. This must be equal to, or finer, than segmentGranularity . This will be the finest granularity that you can query at and still receive sensible results.

E.g., a value of minute will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc)

I think you can run a compaction job with higher segmentGranularity leaving the QueryGranularity as its to finely query your data. Based on your explanation, If you are getting 24 segments a day, and each segment is of ~1MB in size (~8k rows), then for a complete month, it would be 24 * 30 * 1 ~720 MB or (24 * 30 * 8k~ 5.8 Million rows, excluding rollup ) which seems a reasonable segment size.

You can give a try running a compaction job with segmentGranularity=MONTH, As a test, you may try as below -

(1) Select an older month ( ley say - JULY 2021).
(2) Count # of segments for the JULY 2021
(3) Count the #total number of rows the JULY 2021
(4) Now, Run a compaction Job, something like -

{
  "type": "compact",
  "dataSource": "$DATASOURCE_NAME",
  "interval": "2021-07-01/2021-07-31",
"granularitySpec" : {
    "segmentGranularity" : "MONTH"
  }
}

(5) Once done, Do the same exercise as described in (1),(2), and (3) to observe the impact of compaction.

I just focused here on one dimension which is to optimize the existing segments as I feel that’s pretty much required in your case.

1 Like

Perfect, that’s just the type of reply I was hoping for! I didn’t realize there was a difference between segment granularity and query granularity, so that helps tremendously already!

I’ll try out compacting manually with a lesser segment granularity and report back :slight_smile:

So, here’s an update: I ran manual compaction jobs with a segment granularity of MONTH up to and including last September. These are in addition to the automatic compaction jobs that were running anyway, for those I changed the segmentGranularity to DAY.

I now have way less segments, and the rows per segments are higher: around 3400 segments containing 8.33Gb of Data, with an average row count per segment of 0.024M and a max of 5M.

I am getting a small improvement in loading times, but not a big leap like I was hoping for.

I did however notice two things:

  1. Suddenly the queries are no longer CPU constrained. Previously when I would launch a long-running query, all data servers would go to ~98% CPU for the duration of the query. Now, the CPUs are basically idling, even though my test query still takes a minute to run.

  2. The newest two days of my data are still very chaotic. Before my automatic compaction job only looks at data more than 25 hours old, I get literally thousands of new segments. I assume this is because the rows arrive out of order, with rows arriving up to 24 hours after their __time variable. The thing is: A query with __time BETWEEN '2021-08-01' AND '2021-09-30' runs in about the same time as a query with `__time BETWEEN ‘2021-09-01’ AND ‘2021-10-31’, even though the latter query touches way more segments. The latter query does spike the CPUs way more though.

Is there anything else I can investigate to improve performance? I’ll keep you posted :slight_smile: