Hadoop batch ingestion

We have raw data stored and continuously updated in an HDFS cluster, and we want to ingest that raw data into a Druid cluster for analytics.

The raw data in HDFS is stored in the path structure that Druid expects. To ingest it, we could simply run a batch ingestion task at some fixed cadence. Say we ingest every hour: we would submit a batch Hadoop ingestion task each hour, with the interval set to the current hour. The problem is delayed events: during the current hour, data belonging to any previous hour can also be added to HDFS, and a simple batch ingestion scoped to the current hour's interval won't pick that data up.
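For concreteness, the hourly task we have in mind is roughly the following index_hadoop spec (the dataSource name, schema, and HDFS paths are made-up placeholders):

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {"column": "timestamp", "format": "auto"},
          "dimensionsSpec": {"dimensions": ["country", "device"]}
        }
      },
      "metricsSpec": [{"type": "count", "name": "count"}],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "NONE",
        "intervals": ["2018-06-01T10:00:00Z/2018-06-01T11:00:00Z"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://namenode/data/events/2018-06-01-10/"
      }
    },
    "tuningConfig": {"type": "hadoop"}
  }
}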

Does Druid provide any mechanism for this kind of ingestion from HDFS?

Hey Subramani,

Old data being added shouldn’t cause the task to “not work”. Data that falls outside the interval in the ingestion spec will simply be thrown away afaik.

You could re-index the entire dataset for earlier intervals after late data has arrived, recreating the segments.
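For example, re-running the same static spec with the interval and input path pointed at the affected hour rebuilds and atomically replaces that hour's segments (placeholder paths again):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "HOUR",
  "queryGranularity": "NONE",
  "intervals": ["2018-06-01T09:00:00Z/2018-06-01T10:00:00Z"]
},
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "paths": "hdfs://namenode/data/events/2018-06-01-09/"
  }
}

Assuming your late events land under the original hour's folder, the re-index sees the complete data for that hour and the new segments supersede the old ones.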

Alternatively, if you're able to distinguish events that have already been ingested from new events (e.g. by piping late events to a different folder), you could take advantage of Druid's delta ingestion feature, although it's worth taking some care to avoid unbalanced segments. Docs here: http://druid.io/docs/latest/ingestion/update-existing-data.html
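Roughly, the delta ingestion ioConfig uses a "multi" inputSpec that combines a "dataSource" child (which reads back the rows already in Druid for the interval) with a "static" child pointing at the folder of late events. A sketch, with placeholder names and paths:

"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "multi",
    "children": [
      {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "events",
          "intervals": ["2018-06-01T09:00:00Z/2018-06-01T10:00:00Z"]
        }
      },
      {
        "type": "static",
        "paths": "hdfs://namenode/data/events/late/2018-06-01-09/"
      }
    ]
  }
}

Druid merges the rows read back from the existing segments with the new files and writes replacement segments for the interval.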

Best regards,

Dylan