Join big fact data sources

Hi All,

I am new to Druid and it seems an interesting project! Thank you for the great work

We’re evaluating Druid’s suitability and performance, since our case requires to build an user-facing interface, where users can build audiences applying multiple filters and aggregations into various huge fact tables(>100m). We’ve been using Redshift, but this implementation hasn’t scaled well due to the high cardinality of the data, and the storage cost made us thought it wasn’t the best tool for the job performance limitations.

I know Druid does not support joins, but I would like to know how others are tackling that limitation. In our case, each data source has key IDs that can be used as joins, but they have different dimensions and are very granular.

Before deciding to use Druid, I need to be sure of the approach to use and I would really appreciate your help with. I’ve been considering these approaches(I hope they don’t sound crazy):

A) Use Joins in ETL: I read that some users suggest to do all the joins in the ETL stage, before ingesting to Druid. However, in my case it would mean doing a cartesian join, which is very expensive due the high granularity and differences between data sources. Am I wrong on this?

B) Additional application layer: Keep and query the data sources separate, when a user in the interface applies filters in a view of one data source the numbers of records would be smaller and consequently with the help of another application layer do the cross joins with other data sources. How that could be achieved?

C) **Union **data sources: One user suggested this approach, but I believe this can’t be achieved because each data source is very different.

Apologies for any obvious questions and thank you in advance for the response.

Regards,

Eberto

Hey Eberto,

It all depends on what kinds of joins you want to do. For fact-fact joins the only type that really makes sense at query time in Druid today is a semijoin (filtering a query based on another query result). Druid SQL supports those for “reasonable” matching-set sizes; search for “semi-join” on http://druid.io/docs/latest/querying/sql for details. There has also been some recent work in doing bigger, approximate semijoins using bloom filters. Check out https://github.com/apache/incubator-druid/pull/6502 for one of the patches in this series.

For any other kind of fact-fact join, you’d want to do it before ingesting data into Druid. The idea is you prejoin the data and load it after the join has been done. Of course this only works if you want to do the same join over and over again, not if you have lots of different join conditions you might want to use.