Druid for real-time analysis of ad platform data

Hey Guys,

Thanks to the whole community for open-sourcing Druid.

I was pretty amazed by the features Druid offers, and I am planning to shift to Druid for real-time data analysis. We are a small team working on integrating various advertising platforms (Facebook, Pinterest) with analytics platforms (Google Analytics, DAX) to help marketers generate meaningful insights from the data and optimize their ad performance. Currently we are pulling data from Facebook and Google and storing it in a Postgres DB. This happens roughly every 30 minutes (in the future it may be every 5 minutes or so).

Since data in Druid is stored in datasources rather than tables, I am defining a dummy datasource for my use case.

The fields of the datasource will be as follows:

{
  timestamp,
  id,
  transaction_revenue,
  spend,
  gender,
  city,
  likes,
  tags
}
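For reference, here is a rough sketch of how I am assuming these fields would map onto a Druid dataSchema. The datasource name and the dimension/metric split are just my guesses (I am treating likes as a dimension so it can be filtered on, and spend/transaction_revenue as metrics):

```json
{
  "dataSource": "ad_performance",
  "parser": {
    "type": "string",
    "parseSpec": {
      "format": "json",
      "timestampSpec": { "column": "timestamp", "format": "auto" },
      "dimensionsSpec": { "dimensions": ["id", "gender", "city", "likes", "tags"] }
    }
  },
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "doubleSum", "name": "spend", "fieldName": "spend" },
    { "type": "doubleSum", "name": "transaction_revenue", "fieldName": "transaction_revenue" }
  ],
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "NONE"
  }
}
```

Please correct me if this mapping does not make sense.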

  1. Say I want to run a multi-dimensional query like “What is the total spend and revenue for the period timestamp1 to timestamp2 where gender was X and city was Y and likes is Z”. What would the latency be if 100 such queries are executed simultaneously from the front-end, and if we increase the number of dimensions (3 in the query above) to around 10-15? A rough sketch of the kind of query I have in mind is below this list.

  2. Can we query across different datasources at once, i.e. in a single query?

  3. For now I want to import the data from my Postgres database into Druid. The batch ingestion tutorial requires Hadoop for this. I am not very comfortable with Hadoop, so is there an alternative way to batch ingest or import the data into Druid?

  4. Real-time ingestion requires Apache Kafka to be integrated with Postgres, so that changes in Postgres are produced to Kafka and a consumer then passes the newly added rows to Druid's real-time ingestion nodes. But I'm facing problems integrating Postgres with Kafka, since not many open-source tools are available and most of those are broken. I need help with that too. The Druid forum might not be the best place to ask how to integrate Kafka with Postgres, but any alternative solution from this experienced community would be a great help.
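Here is the sketch mentioned in question 1, written as a Druid timeseries query over the dummy datasource above. The datasource name, filter values, and interval are placeholders:

```json
{
  "queryType": "timeseries",
  "dataSource": "ad_performance",
  "granularity": "all",
  "intervals": ["2016-01-01/2016-02-01"],
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "gender", "value": "X" },
      { "type": "selector", "dimension": "city", "value": "Y" },
      { "type": "selector", "dimension": "likes", "value": "Z" }
    ]
  },
  "aggregations": [
    { "type": "doubleSum", "name": "total_spend", "fieldName": "spend" },
    { "type": "doubleSum", "name": "total_revenue", "fieldName": "transaction_revenue" }
  ]
}
```

With 10-15 dimensions the filter would simply have more entries in the "and" clause.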

Thanks a lot for the help. A lot more questions will be coming down the road once Druid is set up and integrated with Postgres.

Hi Shantanu,

Some thoughts on your questions.

  1. With proper tuning Druid should do those kinds of queries in less than a second.

  2. Some cross-datasource queries are possible. Druid has a “union” datasource, which lets you run the same query across multiple datasources with the same schema as if they were a single table. But it can’t do datasource-to-datasource joins. (There's a sketch of a union query below this list.)

  3. The Hadoop-based indexer is definitely the best way to load batch data in production. For small loads (~GB or less) you could use the non-Hadoop, plain index task: http://druid.io/docs/latest/misc/tasks.html (there's a sketch of one below this list too).

  4. I think the Kafka community can probably help you more here, but there are a couple of approaches you could take. The first is to insert directly into the Postgres DB and then export from that to Kafka; if you want to do that, this blog post looks like a good place to start: http://www.confluent.io/blog/2015/04/23/bottled-water-real-time-integration-of-postgresql-and-kafka/. The other approach is to insert your data directly into Kafka, and have consumers that export into Postgres, Hadoop, Druid, and whatever else you need to export to.
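To make #2 and #3 a bit more concrete, here are two rough, untested sketches based on the schema you posted; the datasource names, intervals, and paths are placeholders. A union query just wraps the datasources in a "union" dataSource, e.g.:

```json
{
  "queryType": "timeseries",
  "dataSource": {
    "type": "union",
    "dataSources": ["ad_performance_facebook", "ad_performance_google"]
  },
  "granularity": "all",
  "intervals": ["2016-01-01/2016-02-01"],
  "aggregations": [
    { "type": "doubleSum", "name": "total_spend", "fieldName": "spend" }
  ]
}
```

And a plain (non-Hadoop) index task, assuming you export rows from Postgres to newline-delimited JSON files on local disk, might look roughly like this (you'd POST it to the overlord, typically at http://OVERLORD_HOST:8090/druid/indexer/v1/task):

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "ad_performance",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["id", "gender", "city", "likes", "tags"] }
        }
      },
      "metricsSpec": [
        { "type": "doubleSum", "name": "spend", "fieldName": "spend" },
        { "type": "doubleSum", "name": "transaction_revenue", "fieldName": "transaction_revenue" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-01-01/2016-02-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/tmp/postgres-export",
        "filter": "*.json"
      }
    },
    "tuningConfig": { "type": "index" }
  }
}
```

Do check the details against the docs for your Druid version; the exact spec format has changed a bit between releases.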

Hope this helps!

Hey Gian,

Thank you for the awesome reply, especially for the third part. Will get back to you soon after testing it out.

Cheers