Useful Apache Druid Metadata Queries

Schema

What is the schema?

Sometimes you just want to know what someone has actually ingested into Imply.

SELECT "ORDINAL_POSITION", "COLUMN_NAME", "IS_NULLABLE", "DATA_TYPE", "JDBC_TYPE"
FROM INFORMATION_SCHEMA.COLUMNS
WHERE "TABLE_NAME" = '{table-name}'

You could drop the output of the COLUMNS query into a spreadsheet to begin the conversation about refining the schema. Such a spreadsheet has one row per ingested column and headings like these:

Name, Foreign Key (Lookup), Use for new dimension value, Use to Filter, Use to Sort, Use to find Top Results, Use for Calculations, Use for Grouping, Use to Bucket, Make Numerical Stats, First or Last Value, Rolled Up, Do Set Operations, Estimate Quantiles, Estimate Cardinality, Druid Type Suggestion

For example, __time might be ticked for a few of the uses and given a Druid Type Suggestion of TIMESTAMP, strSOMETHING might simply get a suggestion of VARCHAR, and intSOMETHING might be left open for discussion.

This is useful not just to identify some quick wins (strings versus numbers) but also to provoke a conversation about the schema itself, knocking out columns that have low operational or enduring value to the business on the way.

Roll-up and Cardinality

See Ingestion: Rollup and Cardinality

Segment Profiling

See Data Ops: Profiling segments with SYS tables

Is auto-compaction working? What effect would compaction have?

The default trigger for automatic compaction (inputSegmentSizeBytes) is 419430400 bytes. But what should that figure be? And how would changing it help you reach an optimal segment size across the entire datasource? The query below will tell you.

To test the possible effects of different inputSegmentSizeBytes values, change the figure following HAVING SUM("size").

SELECT
  datasource,
  COUNT(*) AS "intervals",
  SUM(intervalSegments) AS "totalDsSegments",
  MIN(intervalSegments) AS "minDsIvlSegments",
  MAX(intervalSegments) AS "maxDsIvlSegments",
  SUM(intervalSize) AS totalDsSize,
  MIN(intervalSize) AS minDsIvlSize,
  MAX(intervalSize) AS maxDsIvlSize,
  AVG(intervalSize) AS avgDsIvlSize,
  SUM(intervalRows) AS totalDsRows
FROM (
  -- Profile every time interval of every datasource from the segment metadata
  SELECT
    datasource,
    -- The first CHAR_LENGTH(datasource) + 50 characters of a segment_id hold the
    -- datasource name plus the two ISO-8601 timestamps that bound its interval
    LEFT("segment_id", CHAR_LENGTH("datasource") + 50) AS "interval",
    COUNT(*) AS intervalSegments,
    -- Segments this interval would need at roughly 5 million rows per segment
    CEIL(SUM("num_rows") / 5000000) AS idealSegments,
    SUM("size") AS intervalSize,
    SUM("num_rows") AS intervalRows
  FROM sys.segments
  -- Only count served, non-overshadowed, non-realtime segments
  WHERE is_available = 1 AND is_overshadowed = 0 AND is_realtime = 0
  GROUP BY 1, 2
  -- Keep intervals above the compaction trigger that have more segments than they need
  HAVING SUM("size") > 419430400 AND COUNT(*) > CEIL(SUM("num_rows") / 5000000)
)
GROUP BY 1

When you set up compaction, the general advice is that each task can process about 2GB (2000000000 bytes) of input in half an hour. Above that value, it's worth using subtasks (that is, parallel compaction).
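
As a rough sketch of what that means for your own cluster, the following query (reusing the sys.segments filters from the query above, and assuming the 2GB-per-task figure holds for your data) estimates how many roughly-2GB chunks of segment data each datasource currently holds:

SELECT
  datasource,
  SUM("size") AS totalBytes,
  -- Assumes roughly 2GB (2000000000 bytes) of input per compaction (sub)task
  CEIL(SUM("size") / 2000000000.0) AS approxTwoGbChunks
FROM sys.segments
WHERE is_available = 1 AND is_overshadowed = 0 AND is_realtime = 0
GROUP BY 1

A datasource showing many such chunks is a candidate for parallel compaction via maxNumConcurrentSubTasks in the auto-compaction tuningConfig.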

:warning: There is a merge phase at the end of the compaction process, so with an inputSegmentSizeBytes of 60GB you may need to spawn 30 sub-tasks, and that means a lot of merging!
:warning: Increase inputSegmentSizeBytes GRADUALLY, or you will spawn a multitude of compaction tasks and overwhelm the cluster.
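
To answer the first part of this section's question (is auto-compaction working at all?), a minimal sketch, assuming a Druid version whose sys.tasks table exposes the type column, is to count compaction tasks by datasource and status:

SELECT
  "datasource",
  "status",
  COUNT(*) AS taskCount
FROM sys.tasks
-- Auto-compaction and manually submitted compaction tasks both have type 'compact'
WHERE "type" = 'compact'
GROUP BY 1, 2

No SUCCESS rows for a datasource you expect to be compacted is a hint that the auto-compaction settings, or the trigger threshold above, need another look.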