'First' and 'last' aggregations for string types

Hi,

I’m trying to find the first and last dimension value (dim3) when grouping by other dimension values (dim1 and dim2). Dim3 happens to be a string, and I couldn’t find a solution for this on the group.

I know there’s a pull request outstanding for ‘first’ and ‘last’ aggregation type, but that’s only for numeric values (unless I"m mistaken). I tried using the Javascript aggregation, but again it seems like the fnCombine and fnAggregate functions must return a numeric value. How do I achieve this for non-numeric dimensions? Say I have a list of events with ‘app’ and ‘user’ as dimensions, and I’d like to find the last ‘app’ accessed by each user along with some other metric like number of transactions. By ‘last’ I mean most recent timestamp. If fnAggregate, fnCombine and fnReset allowed returning arbitrary objects or arrays, I could return a timestamp with each value, and then compare timestamps during aggregation or combine stages.

Does anyone know of a good way to meet my requirement?

thanks

Nandan

+1

Hi Nandan, can you do this with multiple queries? Similar to the sub-table functionality in Pivot?

Yes, that is how I plan to solve this problem.

thanks

Hi!!

I have this exact same requirement. Were you able to solve this?

Thanks!

My approach was to run a second druid group-by query in addition to the one I was already running. So in my example, I had the requirement of finding the last value of ‘Y’ per ‘X’, in addition to other metrics per X. My second query involved adding ‘Y’ as the second group-by key. Additionally, in this second query, I switched to a finer granularity compared to the original query. For example, if my original query was over all time, I would drop granularity to some ‘acceptable’ finer granularity like hourly. So the results are effectively grouped by X, Y and time bucket - and sorted by time so I can find either the ‘first’ or ‘last’ value in time. So you can walk the result set of this second query and for each X, pick out the first or last value of Y and merge them back into the results of your primary query. The time granularity you pick in your second query is important, because it’s a tradeoff between the size of the result set and accuracy. For example, you could have multiple values of Y for a particular X within an hour, and you’d have to pick arbitrarily between them. Your choice of time granularity also depends on what time resolution your data is stored at.

Hope this helps. This was the best solution I could come up with. It’s not super-efficient, but acceptable in my use case.

Hi Nandan! I understood your answer, it was very clear.

I think this method you proposed works when you need just 1 dimension aside the one that you are grouping by. I need to group by one dimension and get the last value from three other dimensions. For example group by a user dimension, and get the last values of “dim1”, “dim2” and “dim3” for every grouped user.

Seems that it correctly covers your use case, I will have to keep digging for mine!

Thanks for your quick response and detailed explanation.

Regards,