Help with Druid ingestion

Hello guys,

Can you please help with the scenario below, where events arrive as follows:

{"timestamp":"2020-01-01T01:01:35Z","key":"1","value":20}
{"timestamp":"2020-01-01T02:01:35Z","key":"1","value":1}
{"timestamp":"2020-01-01T03:01:35Z","key":"1","value":35}
{"timestamp":"2020-01-01T04:01:35Z","key":"1","value":31}
{"timestamp":"2020-01-02T03:01:35Z","key":"1","value":35}
{"timestamp":"2020-01-02T05:01:35Z","key":"1","value":29}
{"timestamp":"2020-01-03T04:01:35Z","key":"1","value":31}

{"timestamp":"2020-01-01T03:01:35Z","key":"2","value":35}
{"timestamp":"2020-01-01T04:01:35Z","key":"2","value":31}
{"timestamp":"2020-01-02T03:01:35Z","key":"2","value":35}
{"timestamp":"2020-01-02T05:01:35Z","key":"2","value":29}
{"timestamp":"2020-01-03T04:01:35Z","key":"2","value":31}

The incoming stream contains multiple events on the same day for each key (1 and 2). We want to ingest only the event with the max timestamp per key per day, as below:

{"timestamp":"2020-01-01T04:01:35Z","key":"1","value":31}
{"timestamp":"2020-01-02T05:01:35Z","key":"1","value":29}
{"timestamp":"2020-01-03T04:01:35Z","key":"1","value":31}

{"timestamp":"2020-01-01T04:01:35Z","key":"2","value":31}
{"timestamp":"2020-01-02T05:01:35Z","key":"2","value":29}
{"timestamp":"2020-01-03T04:01:35Z","key":"2","value":31}
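
To make the desired result concrete, here is a minimal Python sketch of the selection rule: keep the event with the latest timestamp per key per UTC day. It only illustrates the logic, for example as a pre-processing step before the data reaches Druid. It is not a Druid feature, and the names `parse_ts` and `last_event_per_key_per_day` are made up for this example.

```python
import json
from datetime import datetime, timezone


def parse_ts(ts):
    # Timestamps in the sample look like "2020-01-01T01:01:35Z" (UTC).
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def last_event_per_key_per_day(events):
    """Return one event per (key, UTC day): the one with the max timestamp."""
    latest = {}  # (key, date) -> (parsed timestamp, event)
    for event in events:
        ts = parse_ts(event["timestamp"])
        bucket = (event["key"], ts.date())
        if bucket not in latest or ts > latest[bucket][0]:
            latest[bucket] = (ts, event)
    # Order the survivors by key, then by timestamp, for readable output.
    ordered = sorted(latest.values(), key=lambda pair: (pair[1]["key"], pair[0]))
    return [event for _, event in ordered]


# The key "1" events from the sample above.
events = [
    {"timestamp": "2020-01-01T01:01:35Z", "key": "1", "value": 20},
    {"timestamp": "2020-01-01T02:01:35Z", "key": "1", "value": 1},
    {"timestamp": "2020-01-01T03:01:35Z", "key": "1", "value": 35},
    {"timestamp": "2020-01-01T04:01:35Z", "key": "1", "value": 31},
    {"timestamp": "2020-01-02T03:01:35Z", "key": "1", "value": 35},
    {"timestamp": "2020-01-02T05:01:35Z", "key": "1", "value": 29},
    {"timestamp": "2020-01-03T04:01:35Z", "key": "1", "value": 31},
]

for event in last_event_per_key_per_day(events):
    print(json.dumps(event))
```

Note that this needs all events for a day to be available before the "last" one can be chosen, so it fits a batch step after the day has closed rather than a streaming filter.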

We could achieve the same result at query time, but we don’t want to store the unnecessary rows.

I have tried rollup with segment granularity set to day, together with the maxTime aggregator from druid-time-min-max. That gives me the max timestamp of the day, but it sets that max timestamp on all the events for that day. It doesn’t filter out the earlier events, and we only want to keep the event with the max timestamp for each day.

Please help me find the right strategy.

Thanks in advance
Kuldeep Gaur

The longLast aggregator works perfectly, but it can’t be used in the ingestion spec. It throws the following error during segment publication:

java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.druid.java.util.common.UOE: LongLastAggregatorFactory is not supported during ingestion for rollup

I don’t think this is currently possible out of the box, but I think I have seen someone implement a custom aggregator to do this. If you want to try this, here are some examples:

https://github.com/implydata/druid-example-extension

Is there a way for Druid to know that an event is the last one for the day in question? I.e., something like a filter condition?

If not, it sounds like Druid would only know once the day is finished? So you would be looking for a solution where Druid would look back at the data and then say - oh - that was the last event?

I’m wondering (I mean IMAGINING!!!): could you have a really short-lived datasource that contains all the incoming rows, and then at about 00:00:03 (!!!) use something like Airflow to schedule a Druid-to-Druid ingestion task that will do LAST on all the dims??? Hmm…