Ingesting Data Updates Into Druid In Real Time

I have a use case in which we want to ingest several thousand records into Druid in real time. I would like input from the user group on the best approach to follow with Druid.

Historical data would be ingested into Druid using batch ingestion (a Hadoop MR job). The user interface lets the user perform OLAP analysis on this huge data set, and it also has a view where the user can update a few records. Adjustments to some records trigger updates to other records, a kind of data chaining. We want to load these updated records into Druid in real time and make them available for OLAP analysis.

As I understand it, real-time ingestion is meant for data arriving in real time, which does not fit our use case because the updated data is still historical (it carries old timestamps). One approach might be to trigger deletion of the old segment (I am not sure how this can be done, though) and re-ingest the updated records together with the other records in that segment. I assume this would have to be done with the Hadoop MR job, but can it be done in real time (with a response time < 2 sec)?

Are there any other approaches?


I don’t think you’ll be able to do this update in under 2 seconds with an MR job, given the way Druid loads segments. You can look into using lookups for updates:

Store the mutable values in an external store and use Druid's lookup join capabilities to apply the update at query time.
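
To make the idea concrete, here is a minimal plain-Python sketch of what a query-time lookup does. It does not call Druid at all. The row layout, the `region_overrides` store, and the function names are all made up for illustration. It also assumes the adjustment changes a dimension value (here `region`), since Druid lookups map dimension values to other values rather than rewriting metrics.

```python
from collections import defaultdict

# Immutable rows, standing in for data already batch-ingested into segments.
# (Hypothetical schema, for illustration only.)
SEGMENT_ROWS = [
    {"timestamp": "2015-01-01", "record_id": "r1", "region": "east", "amount": 100.0},
    {"timestamp": "2015-01-01", "record_id": "r2", "region": "east", "amount": 250.0},
    {"timestamp": "2015-01-02", "record_id": "r3", "region": "west", "amount": 75.0},
]

# Mutable external store (hypothetical): record_id -> adjusted region.
# In a real setup this would live outside Druid and be exposed as a lookup.
region_overrides = {}


def lookup(record_id, default):
    """Return the adjusted value if one exists, else the stored value."""
    return region_overrides.get(record_id, default)


def total_by_region(rows):
    """Group-by query that applies the lookup while scanning the rows."""
    totals = defaultdict(float)
    for row in rows:
        region = lookup(row["record_id"], row["region"])
        totals[region] += row["amount"]
    return dict(totals)
```

Writing `region_overrides["r2"] = "west"` changes the result of the next `total_by_region(SEGMENT_ROWS)` call right away, without touching the stored rows. That is the property you are after: the segments stay as they are, and only the small external map is updated.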