Important: Evaluating Druid for a Realtime Analytics Platform

Hi Druid team,

First of all, thanks a lot for creating Druid.io and making it open source.

I have a few basic questions about Druid, as I'm evaluating it for my project: a realtime analytics platform over time-series data, with ad-hoc query and OLAP cube features as well.

  1. Can I define a schema for one datasource and then change the schema of that source, basically start sending data with more or fewer dimensions? Basically schema-less.

Can I achieve this with Druid without changing any config or bouncing any service?

And can I start querying on the new dimensions as well?

  2. In my system I might need to create tables for ad-hoc analytics.

These tables could be for rollup purposes only, or I might have different tables for different client systems.

What I understood is that Druid doesn't have a notion of tables, but it has notions of a Firehose and a DataSource.

So, is it possible to simulate or work around this if we need a table-like feature?

One workaround I can think of is creating a separate datasource for each table, but that might not be efficient, and it requires lots of manual steps.

I will wait for your response, as we are under tight time pressure to finalize the technology for the realtime analytics platform.

Regards,

Manish

Hi, answers inline.

  1. Can I define a schema for one datasource and then change the schema of that source, basically start sending data with more or fewer dimensions? Basically schema-less.

Yes, you can do this. You don't actually need to define a list of dimensions: if you define a timestamp column and your metric columns, Druid can figure out the dimension columns. If you add a new column that previous segments aren't aware of, querying old segments for this column will just return null, and querying new segments that contain the column will return data.
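
For concreteness, here is a minimal sketch of what such a schema-light dataSchema might look like. The datasource name "events" and the metric column "value" are made-up placeholders; the key point is that leaving the dimensions list empty tells Druid to treat every non-excluded column as a dimension, discovered from the incoming data:

{
  "dataSchema": {
    "dataSource": "events",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": {
          "dimensions": [],
          "dimensionExclusions": ["timestamp", "value"]
        }
      }
    },
    "metricsSpec": [
      { "type": "count", "name": "count" },
      { "type": "doubleSum", "name": "value_sum", "fieldName": "value" }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE"
    }
  }
}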

Can I achieve this with Druid without changing any config or bouncing any service?

And can I start querying on the new dimensions as well?

  2. In my system I might need to create tables for ad-hoc analytics.

These tables could be for rollup purposes only, or I might have different tables for different client systems.

What I understood is that Druid doesn't have a notion of tables, but it has notions of a Firehose and a DataSource.

A datasource in Druid is effectively a table in other databases. A firehose is an abstraction for a source of data. What table operations are you hoping to perform?
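To make the firehose idea concrete, here is a rough sketch of a realtime ioConfig that pulls from Kafka via the druid-kafka-eight extension; the topic name "events" and the connection settings are placeholders, not a definitive configuration:

"ioConfig": {
  "type": "realtime",
  "firehose": {
    "type": "kafka-0.8",
    "consumerProps": {
      "zookeeper.connect": "localhost:2181",
      "group.id": "druid-example"
    },
    "feed": "events"
  }
}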

Thanks a lot, Fangjin and Xavier!

Please see my comments inline; I have also given specific requirements for the system I'm building.

Regards,

Manish

Inline.

So, I need to build an OLAP system with the following features / requirements:

  • Query on different dimensions of time-series data.
  • Load = 100 million events/min.
  • Ad-hoc queries.
  • Realtime queries.
  • Defining and populating cubes.
  • Alerts based on cube results.
  • A UI to explore the data and write queries.
  • Schema changes at runtime, after which users can effectively query on the new dimensions.

This all sounds fine.

  • So over the same datasource we need to define multiple cubes and keep updating them.

I don’t really understand what you are trying to say here.

  • I cannot keep querying at runtime for every user-defined cube to generate graphs or alerts, as this will not be efficient in any system, including Druid (because a cube can have any group-by, filter, ordering, etc.).
  • So, I will store these cubes and keep populating them at some frequency, i.e. fetch the latest data and update the old numbers based on the cube definition.
  • Plus, as I need to provide an ad-hoc query feature, that should be handled by Druid, though I may need to optimize the queries for group-by clauses.
  • So, based on my requirements, do you think Druid would be a good choice or not?

I don’t really understand how you plan to store tables (data sources) in Druid. Can you provide some more information about what you are thinking?

Thanks, Fangjin!

Please see inline:

Inline.

Hmm… for example, if I have a table t =

{
  name string
  city string
  country string
  zip int
  continent string
  street string
  new_customer boolean
  id long
  income decimal
  taxable_income decimal
  is_job_employee boolean
}

Above is a sample table containing some user financial details.

Now, we will provide a dashboard where users can define their own charts, which are nothing but SQL queries, like:

query 1: select count(*) from t where t.city = 'NewYork' group by zip, street

The frequency for updating the result of this query would be, say, hourly.

query 2: select avg(income) from t group by country, city, zip, street order by income

The frequency for updating the result of this query would be, say, 30 minutes.

I understand your use cases now, and Druid is able to answer these queries in an ad-hoc fashion. You do not need to precompute or predefine your queries. You will need to define the columns of your data, though. If you use Druid's realtime ingestion, and your data set is constantly appending new events, you can receive updates in seconds.
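
As a sketch, query 1 could look roughly like the following native Druid groupBy. The interval is a placeholder, and it assumes a "count" metric was defined at ingestion (as in the dataSchema sketch earlier in the thread), so that a longSum over it recovers the raw event count even when rollup has combined rows:

{
  "queryType": "groupBy",
  "dataSource": "t",
  "granularity": "all",
  "intervals": ["2015-01-01/2015-07-01"],
  "dimensions": ["zip", "street"],
  "filter": { "type": "selector", "dimension": "city", "value": "NewYork" },
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" }
  ]
}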

These are the queries which I'm referring to as CUBES in the system.

Now, the other case is that users can also fire these types of queries as ad-hoc queries, which are not pre-defined and need to be evaluated at run-time.

So, I see two patterns for which I'm not sure whether Druid is best suited:

  1. Providing a system where users can define cubes and, instead of re-calculating everything each time, the backend just updates the counters for count or avg, either when an event is inserted or at the given frequency.

  2. I would have many use cases for group-by, and Druid.io recommends running multiple queries and then merging the results instead of running group-by queries.

GroupBy queries are flexible but slow; in my experience, most questions answered by groupBys can also be answered by iterated topNs, which should be faster.
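
For example, the avg(income) question could be approximated by issuing one topN per dimension of interest rather than one multi-dimension groupBy. Below is a rough single-dimension sketch (datasource name, interval, and threshold are placeholders); the average is computed as a post-aggregation dividing a sum by a count, and the query is ordered by that post-aggregated value:

{
  "queryType": "topN",
  "dataSource": "t",
  "granularity": "all",
  "intervals": ["2015-01-01/2015-07-01"],
  "dimension": "zip",
  "metric": "avg_income",
  "threshold": 1000,
  "aggregations": [
    { "type": "doubleSum", "name": "income_sum", "fieldName": "income" },
    { "type": "longSum", "name": "count", "fieldName": "count" }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "avg_income",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "fieldName": "income_sum" },
        { "type": "fieldAccess", "fieldName": "count" }
      ]
    }
  ]
}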

Thanks, Fangjin!

A few more questions :) as I need to understand more before doing a PoC:

  1. How is the performance of a realtime node when there is simultaneous push (ingest) and pull (query) load?

  2. Are there any specific design decisions that would help in this case?

  3. How does Druid.io compare to LinkedIn's Pinot: https://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot ?

  4. What about ZooKeeper becoming a bottleneck in a very big cluster? Kafka and Storm are trying to remove it, or use it minimally, to reduce its impact.

It's a bit of an open-ended question :), but I would like to hear from the experts about the design and future plans for ZK on the Druid.io platform.

Regards,

Manish

Hi Manish, see inline.

  1. How is the performance of a realtime node when there is simultaneous push (ingest) and pull (query) load?

On average, we see about 23k events/sec ingested by these nodes. The query load is on the order of < 100 requests per second.

  2. Are there any specific design decisions that would help in this case?

Do you mean improve performance?

  3. How does Druid.io compare to LinkedIn's Pinot: https://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot ?

Architecture-wise, Pinot appears to be very similar to Druid. The nodes are named the same, their shards are also called segments, and the internals of the segments also appear almost identical. In terms of features, Druid appears to support everything Pinot does, although Pinot has a built-in SQL-like query language, whereas you have to use an external query library to use SQL with Druid. Druid additionally adds autoscaling realtime ingestion, tiered data storage, pluggable computations, approximate histograms and quantiles, approximate cardinality estimation, and support for R, Python, Node.js, and other libraries. Pinot's public numbers around scale are 300x less than Druid's, based on what I last saw.

  4. What about ZooKeeper becoming a bottleneck in a very big cluster? Kafka and Storm are trying to remove it, or use it minimally, to reduce its impact.

It's a bit of an open-ended question :), but I would like to hear from the experts about the design and future plans for ZK on the Druid.io platform.

We would like to move entirely off of ZK in the future.

Thanks a lot, Fangjin!

I will share more information as I move forward on this project.

Regards,

Manish