Help with one use case for Druid

Hey,

We are currently using an Oracle star schema as our analytics platform. As our business continues to grow, we are running into many issues, such as performance and data-loading problems.

We are seriously evaluating Druid as our future analytics platform. It works well for most of our use cases, but we have run into an issue with the following simple scenario.

Use case: calculate the average headcount for each quarter over the last two years.

Below is the query and its current output. fact_position is an effective-dated table with record effective start and end dates (rec_eff_strt_dt, rec_eff_end_dt); dim_day is a dimension table that holds calendar information.

Is there any way to support this in Druid? We are okay with de-normalizing our tables, but I feel de-normalization or a lookup join won't work here. Do we have to wait until Druid fully supports joins?

SELECT
    COUNT(1) / ( ( MAX(day.day_dt) - MIN(day.day_dt) ) + 1 ) AS "Average Headcount",
    day.quarter AS period
FROM
    fact_position pos,
    dim_day day
WHERE
    pos.status = 1
    AND day.day_dt BETWEEN pos.rec_eff_strt_dt AND pos.rec_eff_end_dt
    AND day.day_dt >= TO_DATE('2019-01-01', 'yyyy-MM-dd')
    AND day.day_dt <= TO_DATE('2020-01-08', 'yyyy-MM-dd')
GROUP BY
    day.quarter

Average Headcount    Period
-----------------    -------
14502                Q1-2019
14502                Q2-2019
14502                Q3-2019
14502                Q4-2019
14502                Q1-2020

Thanks in advance

Leon Smith

Hi Leon:

Good news: full JOIN support in Druid is on the roadmap: https://github.com/apache/druid/issues/8728

But why do you think LOOKUP would not work? How big is the dim_day table?

Thanks

Hey Ming,

A lookup won't solve this issue. This is an inequality join; lookups only work for some equality joins.
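For example, a registered lookup can only be applied as a key-to-value mapping on a single dimension, roughly like the sketch below (the dimension and lookup names here are made up):

{
  "type": "extraction",
  "dimension": "dept_id",
  "outputName": "dept_name",
  "extractionFn": { "type": "registeredLookup", "lookup": "dept_lookup" }
}

There is no place in that shape to express day_dt BETWEEN rec_eff_strt_dt AND rec_eff_end_dt.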

Hi Leon,

This sounds like a good fit for a timeseries query. If your datasource's ingestion spec has a queryGranularity of 'DAY' and rollup=true (rollup applies across however many dimensions you have), the query below should work. It would also let you summarize at the day, week, partial-week, or any other ISO-8601 period. In addition, if you're interested in the unique people that made up that headcount in each quarter, that is possible too with a ThetaSketch. Hope this helps.
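For reference, the relevant portion of the ingestion spec would look roughly like this sketch (the segmentGranularity value is only an assumption; pick whatever suits your data volume):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "MONTH",
  "queryGranularity": "DAY",
  "rollup": true
}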

{
  "queryType": "timeseries",
  "dataSource": "datasource_name",
  "intervals": "P2Y/2020-01-20",
  "granularity": "quarter",
  "filter": { "type": "selector", "dimension": "status", "value": "1" },
  "aggregations": [
    { "type": "longSum", "name": "total_headcount", "fieldName": "head_count" },
    { "type": "count", "name": "total_rows" }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "average_headcount",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "postAgg_total_headcount", "fieldName": "total_headcount" },
        { "type": "fieldAccess", "name": "postAgg_total_rows", "fieldName": "total_rows" }
      ]
    }
  ]
}
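For the unique-people variant, assuming you ingest a person identifier (person_id below is a hypothetical column name) as a theta sketch metric via the druid-datasketches extension, the quarterly distinct count would just be one more aggregator, roughly:

{ "type": "thetaSketch", "name": "unique_people", "fieldName": "person_id" }

Storing the column as a theta sketch at ingestion time keeps the distinct count accurate under rollup.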