Join different data streams into one datasource

Hi there,

I want to join two data streams into one datasource. For example, data stream one has a schema like Id, columnA, columnB, columnC, and data stream two has a schema like Id, columnD, columnE, columnF.

Is there any way to handle this scenario without changing the source data of either stream? How should I design the Druid datasource schema?

Hope I've made it clear!

BR

Johnny

Hey Johnny,

By “join” do you mean joining by a key or do you mean just unioning everything together as if it were a single datasource?

If you mean joining by a key, Druid can’t do that natively, so people generally use a stream processor to do the join and generate a new, joined stream.
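As a rough illustration of what such a stream processor does, here is a minimal Python sketch. It assumes both streams arrive interleaved as (source, record) pairs and that each Id appears once per stream. The name `join_streams` and the source labels "one" and "two" are hypothetical. A real deployment would use a stream processing framework with windowing and state expiry rather than an unbounded in-memory dict.

```python
from typing import Dict, Iterable, Iterator, Tuple

Record = Dict[str, object]


def join_streams(events: Iterable[Tuple[str, Record]]) -> Iterator[Record]:
    """Join records from two streams on "Id".

    A record is buffered until the matching record from the other
    stream arrives, then the two are merged and emitted.
    """
    pending: Dict[object, Tuple[str, Record]] = {}  # Id -> (source, record)
    for source, record in events:
        key = record["Id"]
        if key in pending and pending[key][0] != source:
            # The other side is already buffered: merge and emit.
            _, other = pending.pop(key)
            yield {**other, **record}
        else:
            # First time this Id is seen (or same source again): buffer it.
            pending[key] = (source, record)


# Hypothetical interleaved input from the two streams.
events = [
    ("one", {"Id": 1, "columnA": "a1", "columnB": "b1", "columnC": "c1"}),
    ("two", {"Id": 2, "columnD": "d2", "columnE": "e2", "columnF": "f2"}),
    ("two", {"Id": 1, "columnD": "d1", "columnE": "e1", "columnF": "f1"}),
]
joined = list(join_streams(events))
```

The joined records carry all six columns plus Id, so they can be sent to Druid as a single stream with one schema.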

If you mean unioning everything together, then you could do that in a couple of ways. One is to union the streams externally to Druid and have Druid read the single unioned stream (for example: put them in the same Kafka topic, send them both with the same Tranquility configuration, etc). Another is to read the two streams into two different Druid dataSources and use Druid’s “union” query functionality to union them at query time. The query-time “union” treats multiple dataSources as if they were a single dataSource with a merged schema.
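For the query-time option, a native query can name a union dataSource in place of a single dataSource name. The Python sketch below only builds and prints such a query body. The dataSource names `stream_one` and `stream_two`, the interval, and the helper name `build_union_timeseries` are assumptions made for illustration.

```python
import json
from typing import Dict, List


def build_union_timeseries(data_sources: List[str], interval: str) -> Dict[str, object]:
    """Build a native timeseries query over a union of dataSources."""
    return {
        "queryType": "timeseries",
        # A union dataSource lists the underlying dataSources by name.
        "dataSource": {"type": "union", "dataSources": list(data_sources)},
        "intervals": [interval],
        "granularity": "all",
        # Count rows across both dataSources as a simple example aggregation.
        "aggregations": [{"type": "count", "name": "rows"}],
    }


# Hypothetical dataSource names and interval.
query = build_union_timeseries(["stream_one", "stream_two"], "2016-01-01/2016-01-02")
print(json.dumps(query, indent=2))
```

The printed JSON would be POSTed to the broker's native query endpoint (`/druid/v2/`). Check the union dataSource documentation for your Druid version, since the supported query types and schema requirements may differ.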


Hi,

We currently have the same requirement as stated above, namely ingesting data from different datacenters/regions separately, and have been looking into options.

UNION queries were our favorite option, but our tests have shown that they are very slow.

We don’t understand why Druid is executing a UNION query by sending separate queries to each datasource sequentially.

To my understanding, it should be possible to execute the queries in parallel and then merge the results. Is there a reason for the sequential execution?
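To make the idea concrete, here is a small Python sketch of the "query each datasource in parallel, then merge" pattern. It is not how the Druid broker is implemented. `run_query` is a hypothetical callable that returns per-timestamp counts for one datasource, and the merge step assumes results can simply be summed, which holds for counts and sums but not for every aggregation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def query_in_parallel(
    data_sources: List[str],
    run_query: Callable[[str], Dict[str, int]],
) -> Dict[str, int]:
    """Run one query per datasource concurrently and merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(data_sources))) as pool:
        # Each datasource is queried in its own worker thread.
        partials = list(pool.map(run_query, data_sources))

    merged: Counter = Counter()
    for partial in partials:
        # Counter.update adds values for matching keys (timestamps here).
        merged.update(partial)
    return dict(merged)


# Hypothetical per-datasource results keyed by timestamp bucket.
fake_results = {
    "dc_us": {"2016-01-01T00": 3, "2016-01-01T01": 1},
    "dc_eu": {"2016-01-01T00": 2},
}
totals = query_in_parallel(list(fake_results), fake_results.__getitem__)
```

A client could use the same pattern as a workaround: send one query per datasource to the broker concurrently and merge the results itself, as long as the aggregations involved can be combined this way.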

Furthermore, I think it would be nice if historical queries could contain segment lists in which the datasource name occurs in each segment reference. This way, a broker could tell a historical to scan segments from various datasources that have a compatible schema and already merge those results on the historical.

This feature could be used in many contexts. It would make UNION queries faster and could be used for datasource-delegation/OLAP kinds of queries that combine data from different cuboids. Would this be feasible to add as a feature?

Thanks