Druid reindexing task

Hi everyone,

We’ve been trying out Druid at our organization. We launched a very simple cluster, the one given as an example in the docs (Coordinator and Overlord on an m3.xlarge, Historical and MiddleManager on an r3.2xlarge, and the Broker on another r3.2xlarge). We have a few questions about data ingestion:

We use S3 to store our data files. To ingest data into Druid, we’re trying out a recursive multi ingestion spec that ingests new files as they arrive (using a queue that holds the S3 path of each file), together with the already-ingested datasource. So every time a new file lands in S3, the system would kick off an indexing task that reads both the existing datasource and the new file; the ioConfig we have in mind looks roughly like the sketch below. Is this the optimal way to do this? We’re trying to consider all the options, but for now this seems to be the best fit for our scenario.
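(This is only a sketch of the Hadoop batch task’s "multi" inputSpec, not our exact spec; the datasource name, interval, and S3 path are made-up placeholders:)

    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "multi",
        "children": [
          {
            "type": "dataSource",
            "ingestionSpec": {
              "dataSource": "our_datasource",
              "intervals": ["2016-01-01/2016-07-01"]
            }
          },
          {
            "type": "static",
            "paths": "s3n://our-bucket/incoming/new-file.gz"
          }
        ]
      }
    }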

We’re guessing that the first indexing task, which has to cover our already-large dataset, will take quite a long time, but the goal is for subsequent indexing tasks to be much faster, since each one would only add a small new piece of data as it arrives.

Thank you very much.

If you can treat the data as a stream and use a streaming ingestion process, I think you’ll have a better time, and then use bulk batch ingestion as the fix-up part of a lambda architecture. A rough sketch of what that could look like is below.
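For instance, with the (experimental, as of 0.9.1) Kafka indexing service, a supervisor spec looks roughly like this; the datasource name, topic, broker address, and schema are placeholder assumptions, not something from your setup:

    {
      "type": "kafka",
      "dataSchema": {
        "dataSource": "our_datasource",
        "parser": {
          "type": "string",
          "parseSpec": {
            "format": "json",
            "timestampSpec": { "column": "timestamp", "format": "auto" },
            "dimensionsSpec": { "dimensions": [] }
          }
        },
        "metricsSpec": [ { "type": "count", "name": "count" } ],
        "granularitySpec": {
          "type": "uniform",
          "segmentGranularity": "HOUR",
          "queryGranularity": "NONE"
        }
      },
      "ioConfig": {
        "topic": "our_topic",
        "consumerProperties": { "bootstrap.servers": "kafka01:9092" },
        "taskCount": 1,
        "replicas": 1,
        "taskDuration": "PT1H"
      }
    }

You’d POST that to the Overlord at /druid/indexer/v1/supervisor, and it keeps Kafka-reading tasks running for you rather than you launching a task per file.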

You can look into delta ingestion as well: http://druid.io/docs/0.9.1.1/ingestion/update-existing-data.html
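That page covers both cases: plain reindexing of an existing datasource uses the "dataSource" inputSpec on its own, while delta ingestion combines it with new files via the "multi" inputSpec (as in the sketch in the first post). For reference, a reindex-only ioConfig looks roughly like this (datasource name and interval are placeholders again):

    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "our_datasource",
          "intervals": ["2016-01-01/2016-07-01"]
        }
      }
    }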