Ingesting historical/legacy data into an existing datasource

Hi,

We have been using Druid for some time to handle real-time ingestion of data, and that has been going well.

Now I need to add previously missing (prior) data for one of our development partners. We have taken their data and reformatted it to match the “real-time messages” we have been ingesting.

This old data:

  • is part of an existing datasource
  • spans several months
  • is not strictly ordered (not in 100% chronological order)
  • will only create segments prior to the existing data (no overlap with existing segments)
  • is stored as JSON files (each file is an array of entries)

Can someone please outline for me what needs to be done to index this data so that it is added to the existing datasource?

Best regards,

-Stefan

The simplest path is to use Hadoop jobs to ingest in batch.

1) Store the data as newline-delimited JSON files on a file system
that Hadoop can access. Since your files currently hold JSON arrays,
you will need to flatten them first (a sketch of this is below).
2) Run the HadoopDruidIndexer over the data (this can be done using
the CLI or by submitting a task); your spec should name the dataSource
that you are already using for real-time ingestion (an example spec
follows the conversion sketch).
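
Since your old data is stored as JSON arrays, each file can be
flattened to newline-delimited JSON with a few lines of script. A
minimal sketch in Python, assuming each file contains a single
top-level array of event objects (the paths are placeholders):

import json
import sys

def array_to_ndjson(src_path, dst_path):
    # Read a file whose entire contents are one JSON array of events.
    with open(src_path) as src:
        events = json.load(src)
    # Write one event object per line (newline-delimited JSON).
    with open(dst_path, "w") as dst:
        for event in events:
            dst.write(json.dumps(event) + "\n")

if __name__ == "__main__":
    # Usage: python flatten.py events.json events.ndjson
    array_to_ndjson(sys.argv[1], sys.argv[2])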
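
For step 2, here is a rough sketch of what a Hadoop indexing task spec
could look like. The exact layout varies between Druid versions, and
every value here (dataSource name, timestamp column, dimensions,
metrics, interval, input paths) is a placeholder to replace with your
own; the key point is that "dataSource" matches the one your real-time
ingestion writes to:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "your_existing_datasource",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["dim1", "dim2"] }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2013-01-01/2013-07-01"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://namenode/path/to/flattened/*.json"
      }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}

The intervals should cover the full span of the prior data, which ties
into the completeness caveat below.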

The data doesn't have to be in chronological order; the Hadoop job
will handle shuffling it into the right segments.

One thing to make sure of is that the old data is complete. That is,
when you run the Hadoop Druid indexer, it will not merge with any
data that already exists in segments; it takes the input data as it
is, assumes it is all of the data for the given time period, and adds
the resulting segments to the Druid system. (In your case this should
be fine, since the old data does not overlap any existing segments.)

Upon success of the job, the coordinator will start assigning the new
segments and the data will become available for querying.

--Eric

Thank you, Eric.