Hi, I tried to find cumulative sum , this works in small data set only, but it shows error in large data set. Could you please help?
select c1.*,Cum_Sum from “50mbcsvdata_auraa_net_staging” c1
(SELECT a.bnx_id as ID,a.tripduration as Student_Trip,SUM(b.tripduration) AS Cum_Sum
FROM “50mbcsvdata_auraa_net_staging” a, “50mbcsvdata_auraa_net_staging” b
WHERE a.bnx_id >= b.bnx_id
GROUP BY a.bnx_id,a.tripduration ORDER BY a.bnx_id) c2
on c2.ID = c1.bnx_id
Error is :
Error: Resource limit exceeded
Subquery generated results beyond maximum
you can set druid.server.http.maxSubqueryRows to a larger value in the broker run time properties. Having said that this query will perform poorly. There is no good of way of doing cumulative sums in druid at present
The problem is that the self join is a cartesian product, so as the dataset grows, the join result grows exponentially. Like @Vijay_Narayanan1 suggested “at present”, there is work going on to enable Window functions within the MSQ framework, but there is no defined release or date on that yet.
In the meantime, you might want to try the community extension -Moving Average Query · Apache Druid which provides some limited window capabilities resolved on the Broker. The advantage over your approach is that it does not use a self-join to do it, so it does not have the data explosion problem. This approach however is still throttled on the Broker in which a single thread finalizes the calculation, so resultset size coming back from historicals is still a concern.
Let us know if you try it and how that goes.