Sending data into Multiple DataSources

Hi,
We are trying to use Druid for our real-time analytics needs. We are currently testing it out locally.

Druid version: 0.9.2

Tranquility: 0.8.0

Zookeeper: 3.4.6

Our event data is diverse and does not share the same set of dimensions and metrics across event types. For example, some events might have 10 dimensions while others have 3; similarly, some events might have 3 metrics while others have 15. These events will be coming in via Tranquility into Druid. We are yet to figure out a Tranquility-Kinesis link to get the data in.

Here are some references that we have gone through to derive our understanding. Let's consider the scale of raw events to be about 30 million per day. This is a rough estimate and can increase.

[1] Based on this, we resolved to have multiple data sources, as Fangjin Yang suggested for a use case slightly similar to ours in terms of diversity of event data.

[2] This talks about having one JVM per data source. Since that is expensive, we wanted to know if our understanding is correct.

[3] But as per this, we don’t need one JVM per data source.

Here is what we want to be able to achieve:

Preconditions: The event data is in an Amazon Kinesis stream.

When we are about to ingest an event, we want to be able to route it to a corresponding table (datasource) based on a few conditions.

This data source will have its own set of dimensions and metrics as defined by us.

Here are our queries:

  1. We are expecting around 100+ data sources based on the diversity of the data. For putting data into multiple data sources, will we require one JVM per data source? If yes, will it be one JVM per node?

  2. Is there an out-of-the-box solution for reading the event once and resolving which data source that event should go into?

Thank you

Hi! Can someone please reply to this?

Thanks

Would love to have a reply to the above query. It will help us move forward.

See inline



  1. We are expecting around 100+ data sources based on diversity of data. For putting data into multiple data sources will we require one JVM per data source? If yes, will it be one JVM per node?

Yes, each datasource needs at least 1 task/JVM for ingestion. One worker node, i.e. a middleManager, can run multiple tasks, configured via the druid.worker.capacity property (the default is CPU cores - 1).
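For reference, a minimal middleManager runtime.properties sketch (the capacity value here is an arbitrary example; tune it per node):

```properties
# middleManager runtime.properties (sketch)
# Maximum number of concurrent tasks (peon JVMs) this node will run.
# If unset, Druid defaults to (number of CPU cores - 1).
druid.worker.capacity=8
```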

  2. Is there an out-of-the-box solution for reading the event once and resolving which data source that event should go into?

No. Tranquility provides simple APIs to ingest data; you can easily build this routing on top of it in your ETL layer.
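A routing layer of that kind could be as simple as the sketch below, which posts to Tranquility Server's per-datasource HTTP endpoint (/v1/post/{dataSource}). The host/port, the "type" field, and the events_* naming scheme are all assumptions for illustration, not part of Tranquility:

```python
import json
import urllib.request

TRANQUILITY_URL = "http://localhost:8200/v1/post"  # assumed host/port


def resolve_datasource(event):
    """Pick a target datasource from the event payload.

    The "type" field and the "events_" naming scheme are assumptions;
    substitute whatever conditions you actually route on.
    """
    return "events_" + event.get("type", "unknown")


def send(event):
    """POST a single event to the Tranquility endpoint for its datasource."""
    url = "%s/%s" % (TRANQUILITY_URL, resolve_datasource(event))
    req = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

The same resolver can sit in a Kinesis consumer: read each record once, call resolve_datasource, and forward it to the matching endpoint.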

Hi Nishant,

Thanks for reply. Please let me add more details.

We have data coming in through a stream (Kinesis): about 150 million events per day today, which could grow to 300 million per day within a year. We generate events for multiple activities that happen on the site. An event could be a like on an item, a comment by a user, a page view, or even a click.

Since each of the event types has a different intent, we cannot use the same dimension spec for the datasource. I am not sure how to take this forward.

Yes, we can build a layer on top of Tranquility that picks event types and inserts them into different datasources, but we would then have 100-150 data sources, which sounds wrong since conceptually we really have only one datasource. Are there any downsides to having 150 data sources? From what I understand, we would need enough JVMs to support them.

If we go with one datasource, we need a single DimensionSpec & MetricSpec for it, which would not work, since I am interested in dimensions a, b, c for the like event and dimensions d, e, f for the comment event.

Hey Nishant,
Can you please reply to Gaurav’s query? It would help us a lot in moving forward. Thanks for the help in advance.

Hi Nishant,
Can you please reply? We really want to be able to use Druid in production, but we cannot move forward until Gaurav’s questions are resolved. Please do let us know.

Thanks

Hi Kamal/Gaurav,
You can think of a Druid datasource as a table with a schema that can change over time. In your use case, although the events are coming from the same source, they ideally belong to different datasources.

In your case I see the following possible options:

  1. Create separate datasources for separate event types. This gives you flexibility and can support different query SLAs, separate tiering, and separate retention rules per datasource; the downside is the resource requirement.

  2. Use a single datasource with a schema such as event_type, dim1, dim2…dimn, met1, met2…metn. Your queries will then filter on event type.

  3. A middle ground between the two above: group event types with similar schemas together, so that each datasource holds a bucket of similar event types.

My recommendation would be to go with option 1.
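To illustrate option 2, a query against the combined datasource would carry a selector filter on the event_type dimension. The datasource, metric, and interval names below are placeholders, shown here as a Python dict mirroring Druid's JSON query format:

```python
# Sketch of a Druid timeseries query against a single combined datasource
# (option 2). "all_events", "likes", "count", and the interval are
# placeholders for illustration.
query = {
    "queryType": "timeseries",
    "dataSource": "all_events",
    "granularity": "day",
    # Every query narrows to one event type via a selector filter.
    "filter": {
        "type": "selector",
        "dimension": "event_type",
        "value": "like",
    },
    "aggregations": [
        {"type": "longSum", "name": "likes", "fieldName": "count"},
    ],
    "intervals": ["2017-01-01/2017-02-01"],
}
```

The trade-off is that every event type's dimensions live in one wide schema, with rows leaving the irrelevant columns null.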

Thanks for the input Nishant.
I was also leaning towards option 1, but I came across the requirement of running 1 JVM per datasource.

That leads me to having about 150-200 JVMs, which I feel is a lot.

@Kamal, can you please add the reference that says we need 1 JVM per data source?

Hi Gaurav,
Yes, you need at least 1 JVM/peon process per datasource.

To save on some resources, you can map datasources with very few events to middleManagers with small JVM configs.

To achieve this, at present you can use task affinity to map some datasources to some predefined hosts - http://druid.io/docs/latest/configuration/indexing-service.html.
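For illustration, the affinity mapping described in the linked indexing-service docs takes roughly this shape (datasource and host names are placeholders, and the exact wrapper and where you submit it, overlord dynamic config vs. runtime properties, depends on the Druid version):

```json
{
  "selectStrategy": {
    "type": "fillCapacityWithAffinity",
    "affinityConfig": {
      "affinity": {
        "low_volume_datasource": ["small-mm-1.example.com:8091"]
      }
    }
  }
}
```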

Hi Nishant,

Sorry, I did not understand how that would reduce the number of JVMs required. I would still need 150 JVMs.

I agree I can make some JVMs larger than others based on the virtual datasources we are creating, but I do not understand how affinity helps reduce resources. Would I be able to map multiple datasources to a single JVM and then set affinity? I did not quite get that.

Thanks for your help!!

As I mentioned, you will still need 1 JVM per datasource; affinity gives you the ability to configure smaller JVMs for smaller datasources.

I have a very similar use case as yours. I am still in the initial phase of POC. Were you able to figure out a Tranquility-Kinesis link to get the data in?

Hi Rahul,

We went ahead with one data source, using a Kinesis plugin that we built in-house (no tests yet). You might want to try the one that has a pull request open on Tranquility:

https://github.com/GoshPosh/tranquility/tree/kinesis

Thank you for your response Gaurav. I will definitely check it out.

Rahul
