Data Ops: Profiling segments with SYS tables

The sys tables in Druid are super useful for profiling segments, especially when you’re ingesting a new data set and you want to see what the hell is going on. Here are some queries that I’ve used - I welcome people chipping in with their own thoughts!

All of these can be executed in the console’s SQL interface:

What are the characteristics of segments inside each datasource?

This is really useful for getting a feel for what the segments are like in your Druid datastore. This query focuses on segments that have been pushed and published (i.e., not real-time).

SELECT "datasource",
  COUNT(*) as segments,
  SUM("size") / 1048576 as totalSizeMb,
  MIN("size") / 1048576 as minSizeMb,
  AVG("size") / 1048576 as averageSizeMb,
  MAX("size") / 1048576 as maxSizeMb,
  SUM("num_rows") as totalRows,
  MIN("num_rows") as minRows,
  AVG("num_rows") as averageRows,
  MAX("num_rows") as maxRows,
  (AVG("size") / AVG("num_rows"))  as avgRowSizeB
FROM sys.segments
WHERE is_available = 1 AND is_overshadowed = 0 AND is_realtime = 0
GROUP BY 1

Focusing on one day can be useful for comparisons between slimmed-down data sources (filtered, few dimensions, rolled up with approximations…) and raw data sources.
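
If you want to try that, here’s a sketch of the same profiling query narrowed to a single day - the "start" and "end" columns in sys.segments hold each segment’s interval as ISO timestamps, and the dates below are just placeholders for whichever day you’re comparing:

SELECT "datasource",
  COUNT(*) as segments,
  SUM("size") / 1048576 as totalSizeMb,
  SUM("num_rows") as totalRows,
  (AVG("size") / AVG("num_rows")) as avgRowSizeB
FROM sys.segments
WHERE is_available = 1 AND is_overshadowed = 0 AND is_realtime = 0
  AND "start" >= '2020-01-01T00:00:00.000Z'  -- placeholder day: adjust to the day you want to compare
  AND "end" <= '2020-01-02T00:00:00.000Z'
GROUP BY 1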

Druid picks up all the data it has available for the period of time that you are asking about. It’s like picking up all the books in a library written between two dates. Druid then starts to read each one (in parallel) to find the relevant data (filter) and to compute statistics (sum, average, count, etc) in groups (group by).
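
To make that concrete, a typical question to Druid looks something like the sketch below - the datasource and column names are made up for illustration. The WHERE clause is the “find the relevant data” part, the aggregates are the statistics, and the GROUP BY forms the groups:

SELECT "channel",                                 -- the groups
  COUNT(*) as events,                             -- statistics computed per group
  SUM("added") as totalAdded,
  AVG("delta") as averageDelta
FROM "wikipedia"                                  -- hypothetical datasource
WHERE "__time" >= TIMESTAMP '2020-01-01 00:00:00'
  AND "__time" < TIMESTAMP '2020-01-02 00:00:00'
  AND "countryName" = 'United Kingdom'            -- the filter
GROUP BY 1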

Making the books smaller reduces how long it takes to read each book, and might also reduce the number of books overall. This is like giving Druid a section of a library that’s focused on answering one set of questions, instead of asking it to look through the entire library. It’s a useful technique to swing the balance towards maximum return on investment, and is particularly useful when the same filters are almost always applied by end users.

Every “library book” needs to be read by someone when you ask Druid a question. If you have 100 people reading in parallel, it can be up to 100x faster than if you only have 1 person. By adding more “people” (cores), Druid has more book-reading capacity, so for faster initial dashboard draw times, more cores will be needed.

Which bucket does each segment fall into?

Let’s understand the variation in the size of your segments by breaking the stats into million-row buckets. Nice.

SELECT "datasource", ABS("num_rows" /  1000000) as "bucket",
  COUNT(*) as segments,
  SUM("size") / 1048576 as totalSizeMb,
  MIN("size") / 1048576 as minSizeMb,
  AVG("size") / 1048576 as averageSizeMb,
  MAX("size") / 1048576 as maxSizeMb,
  SUM("num_rows") as totalRows,
  MIN("num_rows") as minRows,
  AVG("num_rows") as averageRows,
  MAX("num_rows") as maxRows,
  (AVG("size") / AVG("num_rows"))  as avgRowSizeB
FROM sys.segments
WHERE is_available = 1 AND is_overshadowed = 0 AND is_realtime = 0
GROUP BY 1,2
ORDER BY 1,2

What is segment balance like over time?

Maybe you’re using manual compaction. Maybe you’re using auto compaction. Maybe you have variation in data volumes. Maybe you don’t like cake. If so, this query is for you.
One outcome may be to adjust skipOffsetFromLatest for auto-compaction so that more recent intervals are compacted. Another may be to amend maxSegmentsToMove so that the coordinator moves more data around to achieve balance…

Notice that the query uses LEFT to string-extract the time period from the segment ID, and the WHERE clause also uses LEFT to let you filter by datasource name. You may need to adjust the 39, for example, to make sure it extracts the right bit of text for your datasource’s name length. Out the back you get a server-by-server count of the number of segments on each server over time.

You don’t have to use COUNT, of course - some other stat may be more useful to you, like a sum of the number of rows on that server, to give a view of how busy that server would be if someone queried a particular time period (there’s a sketch of that variant after the query below).

SELECT
  LEFT("segment_id",39) as "interval",
  server,
  COUNT(*)
FROM sys.server_segments
WHERE LEFT("segment_id",{length-of-datasource-name}) = '{datasource-name}'
   AND server LIKE '%:8283'
GROUP BY 1,2
ORDER BY 1,2
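
As a sketch of that row-count idea (assuming your Druid version lets you join the sys tables, as the Druid docs show), you can join sys.server_segments to sys.segments, filter on the real datasource column instead of fiddling with LEFT offsets, and sum rows per server - the datasource name is a placeholder:

SELECT
  LEFT(segments.segment_id, 39) as "interval",   -- adjust the 39 as above
  server_segments.server,
  COUNT(*) as numSegments,
  SUM(segments.num_rows) as totalRows            -- rows per server as a "busyness" proxy
FROM sys.segments AS segments
INNER JOIN sys.server_segments AS server_segments
  ON segments.segment_id = server_segments.segment_id
WHERE segments.datasource = '{datasource-name}'  -- placeholder: your datasource name
  AND server_segments.server LIKE '%:8283'
GROUP BY 1, 2
ORDER BY 1, 2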

You can drop the results of this into a pivot chart in your favourite spreadsheet application, and use conditional formatting to highlight differences in each interval’s location across the servers. Here’s a sheet I used when ramping up the auto-compaction on a very fragmented dataset: I could watch as the number of segments dropped, day by day, across the cluster.

Tested on Druid 0.17
