Storing many schemas within a single datasource vs multiple datasources

Assume I have 2 schemas that need to be stored in Druid and rolled up.

Schema 1:

Dimensions: D1, D2, D3

Metrics: M1, M2, M3

Schema 2:

Dimensions: D2, D3, D4

Metrics: M2, M3, M4

One way of organizing this is to put each schema in its own datasource.

Schema 1 in Datasource1

Schema 2 in Datasource2

Question 1: Does it make sense to store both schemas in a single datasource, with an identifier dimension (say “D0”) holding a flag value of “S1” or “S2” to denote which schema/datatype each row belongs to?

For example,

Schema 3:

Dimensions: D0, D1, D2, D3, D4

Metrics: M1, M2, M3, M4
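
To make the combined layout concrete, here is a minimal Python sketch (plain Python, not Druid code; the `to_combined` helper and the sample values are made up for illustration) of how a row from each schema would map onto Schema 3, with columns that do not exist in the source schema left as null:

```python
# Column layout of the combined "Schema 3" described above.
DIMENSIONS = ["D0", "D1", "D2", "D3", "D4"]
METRICS = ["M1", "M2", "M3", "M4"]


def to_combined(row, schema_flag):
    """Map a Schema 1 or Schema 2 row onto Schema 3.

    Hypothetical helper: D0 carries the schema flag, and any column the
    source schema does not have is filled with None (null).
    """
    combined = {"D0": schema_flag}
    for name in DIMENSIONS[1:] + METRICS:
        combined[name] = row.get(name)  # None when the column is absent
    return combined


# Made-up sample rows, one per schema.
s1_row = {"D1": "a", "D2": "b", "D3": "c", "M1": 1, "M2": 2, "M3": 3}
s2_row = {"D2": "b", "D3": "c", "D4": "d", "M2": 5, "M3": 6, "M4": 7}

combined_s1 = to_combined(s1_row, "S1")
combined_s2 = to_combined(s2_row, "S2")

print(combined_s1)
print(combined_s2)
```

In this layout, S1 rows always have null D4 and M4, and S2 rows always have null D1 and M1, which is the situation Question 2 asks about.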

Question 2: Will this affect performance, roll-up, or disk space, considering that some dimensions/metrics will be null or empty for rows that belong to the other schema?

Question 3: Does Druid have a limit or best practice for the number of dimensions or metrics in a single datasource? Is it OK to run into the thousands?

Hi Kiran,

Thousands of columns is OK, although you might need to move some tuning configs away from their defaults. For example, you might need to set maxRowsInMemory and maxRowsPerSegment lower in your tuningConfig. You might also need more direct memory for the indexing tasks, which is set through druid.indexer.runner.javaOpts on your middleManager (direct memory usage increases as you add more columns).
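
As a rough sketch of where those settings live, the Python snippet below builds a tuningConfig fragment and a middleManager property line. Every number here is a placeholder chosen for illustration, not a recommendation, and the `kafka` tuningConfig type is an assumption; use the type that matches your ingestion method and size the values against your own data.

```python
import json

# tuningConfig fragment of an ingestion spec.
# Assumption: Kafka indexing service; all values are illustrative placeholders.
tuning_config = {
    "type": "kafka",
    "maxRowsInMemory": 25000,      # lowered so wide rows are persisted sooner
    "maxRowsPerSegment": 1000000,  # lowered so each segment stays manageable
}

# JVM options passed to indexing tasks launched by the middleManager.
# -XX:MaxDirectMemorySize is the standard JVM flag for the direct memory limit;
# the heap and direct memory sizes below are placeholders.
java_opts = "-server -Xmx2g -XX:MaxDirectMemorySize=4g"
middle_manager_line = "druid.indexer.runner.javaOpts=" + java_opts

print(json.dumps({"tuningConfig": tuning_config}, indent=2))
print(middle_manager_line)
```

The JSON goes into the ingestion spec, and the property line goes into the middleManager's runtime.properties.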