Druid query help


I have a datasource inside Druid where I keep all type of events and profile in same data source. Now I want to get data from datasource as given below -

SELECT gid , event_action , event_category , email , current_subscriber_status , subscription_start_date , subscription_end_date,
platform , user_login_status_hit , user_login_status from Data_Prod where
(( event_category like ‘KKKK’) OR ( event_action like ‘LLL’))
AND gid in (SELECT distinct gid from Data_Prod where email is not null)

Issue: I get email as null since for this event_category and event_action email is null.
gid is common for all the events, email comes in profile events.

Can you please suggest how can I get the email too in the results?

Thanks and Regards
Amit Srivastava

Hi Amrit,
what happens when you query just the last part of your query SELECT distinct gid from Data_Prod where email is not null

In recent versions, you could do a join, but that table looks big, so that’s probably not going to perform well.
What about adding one more row to the results, if the profile event is one row per gid, by also selecting the row that contains
the email? (Then, if necessary, post-process to format or whatever’s needed for display.)

Hi Matt,

When I execute the following queries, Druid returns data -
Query 1: select distinct gid from Data_Prod where email is NOT null

Response: returns data

Query2: select count(distinct gid) from Data_Prod where email is NOT null
Response: As of now approx 1.0 Million (increasing daily)

Query3: select count(gid) from Data_Prod where email is NOT null
Response: As of now approx 2.5 Million (increasing daily)

Amit Srivastava

Hi Ben,

Yes, with the help of join results can be obtained, but data is increasing day by day approx (20 Million per day) so don’t know how long the join will work.
gid is not unique in events. Profile events are not one row per gid.
example. If an user does multiple activities or updates his/her profile will generate one row for each activity in the datasource with the same gid.


Amit Srivastava

It sounds as if the current method might be the best one, in that case, imo.

Hey Amit :slight_smile: I am late to the party!

Do you have the option of pushing the state data (ie. current email address) into a source that you can use as a JOIN lookup?
And secondly, what analytics will be calculated? I ask in case there is another way to get the statistics out that you are looking for.

  • Pete

Lookup sounds like it could be a great idea, so I retract my earlier comment. :slight_smile: Maybe just gid (key), dummy (value), so you can join with SELECT gid from lookup-table.