Druid Join of 2 Table DataSources using Druid SQL

Hi Experts ,
What does this exactly mean ?“In Druid 0.18.1, joins are implemented with a broadcast hash-join algorithm. This means that all tables other than the leftmost “base” table must fit in memory.” in https://druid.apache.org/docs/latest/querying/datasource.html#datasource-type

I am not sure on what does “must fit in memory” mean here? Doesnt it depend on the interval we query etc.

Could you please help explain whats the caveat if we want to JOIN 2 datasources with equal amount of data being ingested.

Thanks in advance!

Fo example
Select * from

(Select * from table1 WHERE __time BETWEEN ‘2020-05-10T00:00:00’ AND ‘2020-05-17T00:00:00’) as x

INNER JOIN

(SELECT * FROM table2 where __time BETWEEN ‘2020-05-10T00:00:00’ AND ‘2020-05-17T00:00:00’) as y

ON x.column1 = y.column1

AND x.column2 = y.column2

Will there be any issues if table2 months of data

Late reply, but I think the docs might be a little inconsistent here in use of “table” vs “datasource” (unless they’re not, and I’m wrong).

I’d hope that “all tables other than the leftmost “base” table must fit in memory” should really be “all datasources other than the leftmost “base” table must fit in memory”. Elsewhere in the docs, they say datasource is a more general term, and can be a table (regular datasource), inline query, lookup, etc.

Broadcast hash join does need to broadcast entire result sets to brokers for the join, but I would expect it’s the resultset, not the entire datasource table, which would be pretty bad. So, I expect it to load the results of your query to heap, not the base table datasource you queried on. I hope it just grabs the interval you asked for. I’d be surprised if it didn’t.