Hi,
We have a monthly table tracking user hits with about 500K records per month. We want to determine if users who accessed a particular entity last month accessed the same entity the next month (stickiness). Since Druid doesn’t support an “exists” query, we are currently using a self join for this.
Query for last 6 months would be is of the form
SELECT
curr.datetime,
curr.entity,
(count (distinct prev.user)*1.0 / count (distinct curr.user)) as STICKINESS
FROM
(
SELECT DISTINCT
__time as datetime
, user
,entity
FROM
hits_monthly
WHERE
__time between time_parse(‘20221101’, ‘yyyyMMdd’) and time_parse(‘20230401’, ‘yyyyMMdd’)
) curr
LEFT JOIN
(
SELECT DISTINCT
__time as datetime
, user
,entity
FROM
hits_monthly
WHERE
__time between time_parse(‘20221001’, ‘yyyyMMdd’) and time_parse(‘20230301’, ‘yyyyMMdd’)
) prev
ON
curr.user = prev.user and curr.entity = prev.entity and prev.pterm = TIMESTAMPADD(month, -1, curr.pterm)
GROUP BY 1,2
Above query takes ~3.5 min. We tried increasing the buffer size from 500M to 1G but that did not make any difference to the execution time. So it looks like the join (and the inherent data shuffle) is what may be taking a lot of time.
What would be a better approach to extract this info? Would a multi-stage query work better? I did try to write one but wasn’t successful as the console kept throwing parse error for the keywords PARTITIONED & CLUSTERED.
Config:
Druid version - 24.0.2
4 Historical nodes - 2 with 16 CPU and 2 with 8 CPU.
MergeBuffers - 15
Buffer Size - 1G
druid.sql.planner.useApproximateCountDistinct - false → We need exact count
druid.sql.planner.useGroupingSetForExactDistinct - true → As we are getting count of different columns in same query