Query of Queries Help

I have run into a situation that may be directly related to how we are capturing click stream data. Essentially what I am trying to do is the following.

I want to query over all of my segments to generate a report. I need to run a query of queries I believe to achieve the desired results.

He is an example of what I am doing:

select “property_a”, “property_b”, “property_c”
from “datasource”
where “property_a” IN (
select “property_a”, “property_b”, “property_c”
from “same datasource as above”
where “property_b” IS NOT NULL
)
GROUP BY “property_a”
LIMIT 10000

I apologize if I butchered the syntax…I have modified it multiple times based on SQL syntax and the exceptions I get back lack details to debug.

Basically I am running my first query and getting back a bunch of data. From that data I want to run a second query where one column from the first query is checked against a second query of sessionId and a property that may or may not have value. Both queries are running against the same datasource.

I am assuming I could run two independent queries and then aggregate the data. I would like to avoid this if possible, rather I would opt for the query to handle both and aggregate it for me.

I also tried a UNION ALL but it is against the same datasource, not two different ones. I get strange errors back when doing this as well. Mostly stating Union is not supported.

I found this in the docs for UNION ALL

The “UNION ALL” operator can be used to fuse multiple queries together. Their results will be concatenated, and each
query will run separately, back to back (not in parallel). Druid does not currently support “UNION” without “ALL”

Thanks in advance. I am hoping this is just a syntax issue but I am not sure.

Hi Brian. I believe your issue is stemming from the fact that an IN statement can only have one column selected in the subquery. Something like this should work.

select “property_a”, “property_b”, “property_c”
from “datasource”
where “property_a” IN (
select “property_a”
from “datasource”
where “property_b” IS NOT NULL
)
GROUP BY “property_a”
LIMIT 10000

How much more complex is the query than the example above? I ask because this query can be written in a single statement and wouldn’t incur the overhead of having to broadcast the results of the IN subquery.

Cheers!

Jon - Imply Solutions Engineer

Hi Jon.

thanks for the quick reply. I did have luck with this and the performance was not too bad. As far as complex I am sure it will grow with time. Complexity is on the horizon but for now this will work. Can you share some ideas on merging?

The challenge I am running into is this. Simple dimensional queries have been super fast and very reliable. But we do have instances where we need to stitch the data points together. In these cases we always have one guaranteed defined sessionID. This sessionID is carried across the entire user session and tracked in each segment. So every segment we write out has a sessionID. Other properties/dimensions may or may not have data. I would like to use this sessionID has a link or primary key of sorts to stitch together data points to generate reports or impact real time queries on our sites.

I am running into challenges when some segments are populated that could impact other segments but the only way to aggregate is via a sessionID. I am trying to figure out the fastest and easiest way to perform these in Druid. I know we will need more complex reports or even real time queries that need to supply this data. As we grow our segments my concern is scale.

Here is one more use case.

I have certain modules on our site that I want to track. I have certain orders on a site that I want to track. These modules may be clicked on at some point during the session. Later in that session a user may complete an order. One guaranteed value across all of the segments during the session is sessionID. If I query in druid against sessionID with orderIDs I get a nice list. I would then need to query against the same data set of those sessionIDs to get all of the engagement of those earlier clicked on modules. A straight select on these dimensions without that query of queries does not seem to work based on the data being flattened out in our segments. the module data is disjointed from the order IDs without sessionIDs. So I need to get the sessions based on orders and then run a query to get the modules. In a normal SQL type setup I am confident on how to do this. Just a little hazy on Druid. I am probably missing something simple but I am not sure.

Not sure if this makes sense. I am still relatively new to Druid but so far have been very impressed. The performance is outstanding.

Thanks

Brian