My most important use case for Druid is computing statistics on whether tasks are late or on time. Druid only accepts events that arrive within a fixed windowPeriod. To decide whether a task is on time, late, or incomplete, I need to look at two dates: the due date and the completed date. The due date is available in advance and can be batch loaded right before or after the actual due time. The problem is that completion can happen days, weeks, months, or more before or after the task is due.
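To make that concrete, the per-task classification I need is roughly the following. This is only a sketch in plain Python, not anything Druid-specific. The function name `task_status` is made up for illustration, and it rests on two assumptions: a task with no completion date counts as incomplete, and a task completed exactly at its due time counts as on-time.

```python
from datetime import datetime
from typing import Optional


def task_status(due: datetime, completed: Optional[datetime]) -> str:
    """Classify one task from its due date and (possibly missing) completion date."""
    if completed is None:
        # Assumption: no completion recorded yet means "incomplete",
        # regardless of whether the due date has already passed.
        return "incomplete"
    # Assumption: finishing exactly at the due time still counts as on-time.
    return "on-time" if completed <= due else "late"


due = datetime(2015, 3, 1, 12, 0)
print(task_status(due, datetime(2015, 2, 20, 9, 0)))  # completed before it was due
print(task_status(due, datetime(2015, 4, 2, 9, 0)))   # completed a month after it was due
print(task_status(due, None))                         # never completed
```

The point is that both dates have to be visible at once for each task, and the gap between them is unbounded in either direction.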
I thought about storing the completion as a separate event from the due time, but as far as I understand Druid, that makes it impossible to query efficiently. Druid queries by time segment and doesn’t really have a concept of a join, or of a lookup by id on top of its time-clustered index. I have no efficient way to ask for all of the due events in a given window together with their corresponding completion events.
If I put both dates on the same event, the window period will get unruly really quickly. Then I’m forced to keep reloading all segments of the data forever, which seems like a lot of scripting for me and a lot of wasted processing power. I know that’s how some people run their warehouses, but another system might handle either the ingest or the querying a lot better.
Do you think I’m missing something? Is there another/better way to handle this in Druid?