[druid-user] Druid Schema Change

Hi team,

I’ve recently started working on Druid and have a decent understanding of how it works. Currently I have one dataSource in Druid with around 25 fields; 2 of them are metrics and the rest are dimensions.

I’ve got a new requirement where I’ll have to change the schema of that dataSource significantly. The schema changes would be as follows:

  1. Addition of 10-12 new dimensions, all of which will have low cardinality

  2. Addition of 1-2 metrics (either thetaSketch or quantilesDoubleSketch)

This change doesn’t involve modifying any current field. Now I’m looking for answers to the questions below:

  1. How will this schema change affect the storage of existing data, or of new data being ingested for the existing fields?

  2. I believe the values of newly added dimensions or metrics will be null for existing rows, and this will create sparsity. Will this affect query performance?

  3. Based on the above, or any other points I might be missing, what would be the recommended way to go ahead with this schema change? Should the changes be made in the current dataSource’s schema itself, or should a new dataSource be introduced that combines the existing schema with the new changes, with new data that requires the new schema being ingested into that new dataSource?

Thank you in advance.

Hey!

Hopefully I can help a little here?

How will this schema change affect the storage of existing data, or of new data being ingested for the existing fields?
Each segment contains its own schema, so you will only see effects in new segments that you create: whether you are adding new data to the timeline, or re-indexing or updating the existing segments.
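To make that concrete: because old segments keep their old schema, adding the new columns is just a matter of extending the dimensionsSpec and metricsSpec in the ingestion spec used for new data. A hedged sketch of the relevant fragment (all field names here are hypothetical; note that the DataSketches quantiles aggregator type is spelled `quantilesDoublesSketch`):

```json
{
  "dimensionsSpec": {
    "dimensions": [
      "existing_dim_1",
      "existing_dim_2",
      "new_low_cardinality_dim_1",
      "new_low_cardinality_dim_2"
    ]
  },
  "metricsSpec": [
    { "type": "doubleSum", "name": "existing_metric", "fieldName": "value" },
    { "type": "thetaSketch", "name": "users_theta", "fieldName": "user_id" },
    { "type": "quantilesDoublesSketch", "name": "latency_sketch", "fieldName": "latency_ms", "k": 128 }
  ]
}
```

Segments created before this change simply won’t contain the new columns, and queries over them return nulls for those columns.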

I believe the values of newly added dimensions or metrics will be null for existing rows, and this will create sparsity. Will this affect query performance?
As the data is columnarized, I do not believe the sparsity would be significant for your new dimensions, especially if you have employed indexes. I would expect some increase in the size of your segments, which may increase scan times a little. It sounds like you are using roll-up, however? In that case, imagine doing a GROUP BY on your incoming data: will the new dimensions reduce the roll-up ratio? If so, more rows will be emitted, so it is not just the width of the table but also the length of your new data that may have an impact.
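To illustrate the roll-up point, here is a tiny stand-alone sketch (plain Python with made-up events, not Druid code) of how adding a dimension to the grouping key can increase the number of stored rows:

```python
from collections import Counter

# Hypothetical incoming events, already truncated to query granularity:
# (minute, country) pairs.
events = [
    ("2024-01-01T00:00", "US"),
    ("2024-01-01T00:00", "US"),
    ("2024-01-01T00:00", "IN"),
    ("2024-01-01T00:00", "IN"),
]

# Roll-up on time only: all four events collapse into one stored row.
rows_before = len(Counter(minute for minute, _ in events))

# Roll-up on time plus the new dimension: the same events now
# produce two stored rows, halving the roll-up ratio.
rows_after = len(Counter(events))

print(rows_before, rows_after)  # prints: 1 2
```

The same GROUP BY intuition applies to Druid’s roll-up at ingestion time: low-cardinality dimensions keep this row growth modest, which is the favourable case you describe.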

Should the changes be made in the current dataSource’s schema itself, or should a new dataSource be introduced that combines the existing schema with the new changes, with new data that requires the new schema being ingested into that new dataSource?

That’s a tricky one…! Do you need those dimensions to be back-filled, or just populated going forward? How big is the data ingestion likely to be? Do you need to update any apps sitting on top of Druid to use the new table? Adding new dimensions to an existing table from a point in time onward is certainly safe. Adding them historically will mean a new ingestion over the old intervals, which will safely create new versions of all the underlying segments (so you can keep querying the old table while it happens). But then a new table appeals to me because it is all nice and neat, and you can switch from one to the other once you know it’s working. You could also do performance comparisons and so on. It is a difficult one to give an (a) or (b) answer…
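For reference, the historical re-ingestion mentioned above is typically a native batch job that reads the existing dataSource back in via the `druid` input source. A minimal sketch of the ioConfig (the dataSource name and interval are placeholders for your own values):

```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "druid",
    "dataSource": "my_existing_datasource",
    "interval": "2023-01-01/2024-01-01"
  }
}
```

Because Druid versions segments, the old data stays queryable until the new versions are published for those intervals.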

I hope this is not too confusing!


Hi Peter,

Thanks for your detailed answer, first of all. I’ll try to answer the questions you raised in response to my third question.

Do you need those dimensions to be back-filled or just going forward?

=> All newly added dimensions / metrics will be populated going forward only; there will be no back-filling.

How big is the data ingestion likely to be?

=> Right now only 4-5 events per second are being ingested into that dataSource, but this count is going to increase soon; I’m currently not sure to what number it will grow.

Do you need to update any apps sitting on the top of Druid to use the new table?

=> A Superset dashboard is powered by this dataSource, and time-series analysis is done on top of it.

Thank you!