Druid for Variable Schema

I came across druid when I was analyzing multiple databases and much excited to work on it . Before doing further analysis I want to understand if druid caters my requirements.

Top requirements For the database I am looking

  1. Ability to run reports on realtime data(possible with druid)

  2. Support for multi tenancy(druid supports multitenancy)

  3. Dynamically added more dimensions to an existing row. As segments are immutable is it possible to do .?

  4. Each row in DB may not have same dimensions .Dimensions vary by transaction. As we introduce new transactions , new dimensions and measures are created .(Should I use schemaless dimensions ?)

  5. Can we add new measures / edit measures on an existing row.

  6. Each dimension should have key and value. Key will never change and value will change. How to define dimension spec to have key and value.

In brief, I need a database which is realtime ,schemaless and can be updated. Please help with the above questions .

Regards,

Vamsi

Hi Vamsi,

I’ll try to answer your questions in-line

  1. Ability to run reports on realtime data(possible with druid)

This is a very common use case of Druid. You will want to look into http://druid.io/docs/latest/ingestion/stream-ingestion.html

  1. Support for multi tenancy(druid supports multitenancy)

Druid is built from the ground up to excel at this.

  1. Dynamically added more dimensions to an existing row. As segments are immutable is it possible to do .?

This is not possible to do directly. One of the reasons that Druid is so fast is because of rollup (http://druid.io/docs/latest/design/index.html#roll-up) as such the original concept of a row is lost when segments are created. You will need to recreate the segment (re-index the data for that timeframe) or define your dimension in a derived way. It is common that people set up query time lookups (http://druid.io/docs/latest/querying/lookups.html) to dynamically map between data ids and variable data labels. Could you shed more light on your use-case?

  1. Each row in DB may not have same dimensions .Dimensions vary by transaction. As we introduce new transactions , new dimensions and measures are created .(Should I use schemaless dimensions ?)

Sounds like you need to setup Schemaless dimension ingestion.

  1. Can we add new measures / edit measures on an existing row.

Same as the answer to (3) you can not create a new column without reindexing the data but you can create very elaborate derived measures.

For example if you had a dimension “country” and a measure “revenue” you could add derived measures like SUM(revenue WHERE country = “USA”) [in PlyQL syntax]

  1. Each dimension should have key and value. Key will never change and value will change. How to define dimension spec to have key and value

Can you elaborate on this? Do you mean each row?

Regards,

Vadim

Thanks Vadim for the reply. Please see my replies inline.

Hi Vamsi,

I’ll try to answer your questions in-line

  1. Ability to run reports on realtime data(possible with druid)

This is a very common use case of Druid. You will want to look into http://druid.io/docs/latest/ingestion/stream-ingestion.html

  1. Support for multi tenancy(druid supports multitenancy)

Druid is built from the ground up to excel at this.

  1. Dynamically added more dimensions to an existing row. As segments are immutable is it possible to do .?

This is not possible to do directly. One of the reasons that Druid is so fast is because of rollup (http://druid.io/docs/latest/design/index.html#roll-up) as such the original concept of a row is lost when segments are created. You will need to recreate the segment (re-index the data for that timeframe) or define your dimension in a derived way. It is common that people set up query time lookups (http://druid.io/docs/latest/querying/lookups.html) to dynamically map between data ids and variable data labels. Could you shed more light on your use-case?

Lets take an example. We are fetching transactions by customer dimension.After few days we introduced a new dimension called location and need to query transactions by location. As per your answer we need to recreate segment everytime new introduce a new dimension.

  1. Each row in DB may not have same dimensions .Dimensions vary by transaction. As we introduce new transactions , new dimensions and measures are created .(Should I use schemaless dimensions ?)

Sounds like you need to setup Schemaless dimension ingestion.

  1. Can we add new measures / edit measures on an existing row.

Same as the answer to (3) you can not create a new column without reindexing the data but you can create very elaborate derived measures.

For example if you had a dimension “country” and a measure “revenue” you could add derived measures like SUM(revenue WHERE country = “USA”) [in PlyQL syntax]

  1. Each dimension should have key and value. Key will never change and value will change. How to define dimension spec to have key and value

Can you elaborate on this? Do you mean each row?

For example lets take customer dimension. Each customer will have Id and name(id:1,name:john). Name of the customer may change.But id remains same. How to save both

id and name of dimension in the segment ?

Inline.

Thanks Vadim for the reply. Please see my replies inline.

Hi Vamsi,

I’ll try to answer your questions in-line

  1. Ability to run reports on realtime data(possible with druid)

This is a very common use case of Druid. You will want to look into http://druid.io/docs/latest/ingestion/stream-ingestion.html

  1. Support for multi tenancy(druid supports multitenancy)

Druid is built from the ground up to excel at this.

  1. Dynamically added more dimensions to an existing row. As segments are immutable is it possible to do .?

This is not possible to do directly. One of the reasons that Druid is so fast is because of rollup (http://druid.io/docs/latest/design/index.html#roll-up) as such the original concept of a row is lost when segments are created. You will need to recreate the segment (re-index the data for that timeframe) or define your dimension in a derived way. It is common that people set up query time lookups (http://druid.io/docs/latest/querying/lookups.html) to dynamically map between data ids and variable data labels. Could you shed more light on your use-case?

Lets take an example. We are fetching transactions by customer dimension.After few days we introduced a new dimension called location and need to query transactions by location. As per your answer we need to recreate segment everytime new introduce a new dimension.

No, all new segments will have location going forward, and for older segments where this dimension did not exist, Druid will simply return null if you issue a query for a non-existing dimension. If you want to update of your old data to answer add in the location dimension, you will need to reindex data.

  1. Each row in DB may not have same dimensions .Dimensions vary by transaction. As we introduce new transactions , new dimensions and measures are created .(Should I use schemaless dimensions ?)

Sounds like you need to setup Schemaless dimension ingestion.

  1. Can we add new measures / edit measures on an existing row.

Same as the answer to (3) you can not create a new column without reindexing the data but you can create very elaborate derived measures.

For example if you had a dimension “country” and a measure “revenue” you could add derived measures like SUM(revenue WHERE country = “USA”) [in PlyQL syntax]

  1. Each dimension should have key and value. Key will never change and value will change. How to define dimension spec to have key and value

Can you elaborate on this? Do you mean each row?

For example lets take customer dimension. Each customer will have Id and name(id:1,name:john). Name of the customer may change.But id remains same. How to save both

id and name of dimension in the segment ?

This is best done with Druid lookups. http://druid.io/docs/0.9.0/querying/lookups.html