I have been working through some queries over the past few days and wanted to see if anyone has had any experience with CTAS and Druid. I see plenty advertised with respect of pulling data from Hive but looking for something inline with druid.
I believe temp tables would offer speed gains over nested queries in Druid. I am not certain on this based on Apache Calcite as the underlying agent. Excuse my ignorance if that is incorrect.
The use case is this:
I have flattened segments streaming in from a data collector on a site
When the data gets to the Druid layer it would resemble something like this
sessionId module name browser order_total
123456 products visitor1 Chrome [empty string] or 0
654321 accessories visitor2 Firefox 88.00
999999 [empty string] visitor3 Chrome 100.00
888888 products visitor4 Firefox [empty string] or 0
123456 accessories visitor1 Chrome 54.00
123456 [empty string] visitor1 Chrome [empty string]
Essentially I am trying to generate a report on the following
I would like to know all users that visited a module and had an order total greater than 0
Some users as the user visitor1 above could have the same sessionID since it persists for days. In the case above, only one record would be valid since not only did they visit a module but they ordered with a value greater than 0
**You can also see that visitor 1 had many other segments in their session, and in some cases, had no engagement with a module or even an order. These records are useless to the end goal in this query. **
**I have tried to generate a nested query but getting all sessionIDs with a module other than an empty string and then running a second query against all of those sessions that have an order total are not great. It excludes many segments or includes ones that it should not. **
**I think the performance of this will struggle so my thought was to create a temp table with the results and run a second query. I have not had luck in Superset or in Druid with the CTAS feature. **
Any ideas if this is the preferred way, or nested queries of queries is the better option. The challenge has been trying to join all of this data without the ability to use joins. I can share what I have but mostly it has been nested Where IN statements that kind of work but do not allow a full join of the data. Also the performance has not been stellar on data sets with (36 million) records. Not huge amounts of data but long term I am afraid of the scale.
Ideally my query would return the following:
Module Order total
Module accessories generated 142 dollars and the products module generated 0 dollars
**Thanks in advance. **