[druid-user] Re-Indexing Data from Realtime Jobs

Hi,
Druid uses MVCC: for any time period, queries are served from the segments with the latest available version.
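
To make the versioning idea concrete, here is a minimal Python sketch of "latest version wins". It is a simplified model and not Druid's actual code: it assumes every segment covers exactly the same interval as the one it replaces, whereas real Druid also deals with partially overlapping intervals and partitions. The Segment type, the visible_segments function and the sample dates are all made up for illustration.

```python
from collections import namedtuple

# Simplified model: a segment covers one interval and carries a version string.
Segment = namedtuple("Segment", ["interval", "version"])

def visible_segments(segments):
    """Return, per interval, the segment with the highest version."""
    latest = {}
    for seg in segments:
        current = latest.get(seg.interval)
        # Versions here are ISO-8601 timestamps, so string comparison
        # orders them chronologically.
        if current is None or seg.version > current.version:
            latest[seg.interval] = seg
    return latest

segments = [
    Segment("2015-08-01/2015-08-02", "2015-08-01T00:00:00.000Z"),  # e.g. realtime task
    Segment("2015-08-01/2015-08-02", "2015-08-10T12:00:00.000Z"),  # e.g. reindexing
    Segment("2015-08-02/2015-08-03", "2015-08-02T00:00:00.000Z"),
]

served = visible_segments(segments)
print(served["2015-08-01/2015-08-02"].version)
```

In this model the reindexed segment for the first day hides the older one, while the second day keeps its only segment.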

For reindexing, you have two options:

  1. Reindex from the data already present in Druid. You can use this to strip some dimensions or rows from your existing data based on filter conditions.

  2. Index your raw data again. This creates segments with a newer version, which override any segments with an older version.
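
For option 1, a rough sketch of what the ingestSegment firehose part of an index task could look like, built as a Python dict and printed as JSON. The field names are as I remember them from the Druid firehose docs, so please check them against the docs for your version. The datasource name, interval, dimension names and filter are hypothetical values.

```python
import json

# Hypothetical values: the datasource name, interval, dimension names and
# filter are illustrative, not taken from this thread.
firehose = {
    "type": "ingestSegment",
    "dataSource": "events",
    "interval": "2015-08-01/2015-08-02",
    # Only the dimensions listed here are carried over into the new segments.
    "dimensions": ["country", "device"],
    # Only rows matching this filter are kept.
    "filter": {"type": "selector", "dimension": "country", "value": "US"},
}

# The firehose sits inside the ioConfig of an index task.
io_config = {"type": "index", "firehose": firehose}
print(json.dumps(io_config, indent=2))
```

This is only the ioConfig fragment; a full task also needs a dataSchema.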

It looks like you want to index your raw data again with a modified schema (an added dimension). You can run a Hadoop index task to do this.
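
As a rough sketch, a Hadoop index task for this could be assembled like the Python below. The layout follows the batch ingestion spec as I recall it from the 0.7/0.8 era docs, so treat it as a starting point and verify it against your version. The function name, datasource, input path, timestamp column, dimension names and the single count metric are all assumptions for illustration. The important parts are that dataSource matches the realtime datasource and that intervals covers the period you want to replace.

```python
import json

def hadoop_reindex_task(data_source, interval, input_paths, dimensions):
    """Build an index_hadoop task spec (hypothetical helper, not a Druid API)."""
    return {
        "type": "index_hadoop",
        "spec": {
            "dataSchema": {
                # Same datasource as the realtime task, so the new segments
                # replace the old ones for the interval.
                "dataSource": data_source,
                "parser": {
                    "type": "hadoopyString",
                    "parseSpec": {
                        "format": "json",
                        "timestampSpec": {"column": "timestamp", "format": "auto"},
                        # Include the newly added dimension here.
                        "dimensionsSpec": {"dimensions": dimensions},
                    },
                },
                "metricsSpec": [{"type": "count", "name": "count"}],
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",
                    "queryGranularity": "NONE",
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "hadoop",
                "inputSpec": {"type": "static", "paths": input_paths},
            },
            "tuningConfig": {"type": "hadoop"},
        },
    }

task = hadoop_reindex_task(
    data_source="events",                        # hypothetical name
    interval="2015-08-01/2015-08-02",            # hypothetical interval
    input_paths="hdfs:///raw/events/2015-08-01", # hypothetical path
    dimensions=["country", "device", "new_dimension"],
)
print(json.dumps(task, indent=2))
```

The printed JSON can then be submitted to the overlord's task endpoint (/druid/indexer/v1/task).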

Hi Nishant,
thanks for the fast reply!
So if I reindex the raw data with a Hadoop index task and use the same dataSource as in the realtime task, the segments for the given interval are overridden with ones that contain the new dimensions?

Regards,
Constantin

Correct.

Quick question, Nishant. I am working with ingestSegment for the first time, and I was wondering: is there a reason the Index task cannot specify requiredCapacity, the way realtime tasks can via resource: requiredCapacity? I am finding it difficult to manage resources when an ingestSegment task has requiredCapacity set to 1 while using huge amounts of resources.

Thanks,

Paul Otto

I went ahead and added the ability to provide a resources node to Index tasks, and submitted a pull request. Take a look at PR #1630 (and a backport patch to 0.7.x - PR #1631).

  • Paul

Thanks Paul, we’ll take a look and respond.

Hi,

I have a very specific use case on a small single-machine setup (the data is at most 1 GB):

Every day a new batch file is ingested. Each time this data is ingested, we need to drop all old segments and create new ones.

We’ve tried the following three approaches:

  1. Ingest the file as a new task every day. This creates new segments each day, so the segments folder keeps growing. Older segment data is retained, which could cause disk space issues later on.

  2. Delete the “segments” and cache folders before ingesting, then run the indexing task. This created segments for all the data, but queries returned stale data. Probably something is cached somewhere, including segment details.

  3. Restart Druid and then ingest. This works fine but does not seem like a good option.

Could someone suggest the right approach to handle this scenario?

Thanks.

Jitendra