Incremental ingestion

We have a pretty long-lived Druid cluster and would like to add additional data as a backfill. Is there a way to avoid having to reload the original data? Right now, if we try to load these additional “rows”, they end up replacing the existing data, but we would like to do it incrementally. For example:

Existing data:

A1,B1,C1,1,0,0
A2,B2,C2,1,1,0
A3,B3,C3,1,0,1

And we want to add the following:

A4,B4,C4,1,0,1

So the following data in Druid is:

A1,B1,C1,1,0,0
A2,B2,C2,1,1,0
A3,B3,C3,1,0,1
A4,B4,C4,1,0,1

When we do it now using an “index” load from S3, the new load wipes the original data and the final set is just:

A4,B4,C4,1,0,1

Is there a way to load the new data as additional into the same source over the same timestamps?

I think I may have found the answer in the docs. It looks as if the Append Task is what I want. Will give it a shot and report back.

Turns out it didn’t quite work for me. Anyone have any experience doing this?

http://druid.io/docs/0.9.0-rc3/ingestion/update-existing-data.html

Hmm, it looks as if there’s nothing equivalent for 0.8.3, so this is as good a time as any to upgrade.

Thanks for sharing that link.

-Dan

So we successfully upgraded to Druid 0.9, and I tried messing around with this multi ingestion.

Our setup is a bit weird though: we’re doing realtime ingestion but using a static-s3 firehose, and I’m having a hard time figuring out the appropriate multi configuration to use.

My ideal scenario is being able to use the data already loaded (which seems to be what the example is doing) as well as another file on S3 (similar to how we’re currently doing the realtime/static-s3 ingestion). Is there an example of that being done?

Thanks,

Dan

Hi Daniel,
The multi inputSpec works with batch ingestion and is the recommended way to append to historical data.
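
Something like this is a rough sketch of what the ioConfig with a multi inputSpec might look like, based on the update-existing-data doc linked earlier in the thread; the datasource name, interval, and S3 path are placeholders, and the exact field names can differ between Druid versions, so double-check the doc for your release:

"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "multi",
    "children" : [
      {
        "type" : "dataSource",
        "ingestionSpec" : {
          "dataSource" : "<your datasource>",
          "intervals" : ["<interval to re-index>"]
        }
      },
      {
        "type" : "static",
        "paths" : "s3n://your-bucket/new-data.json"
      }
    ]
  }
}

The "dataSource" child re-reads the segments Druid already has for that interval, the "static" child reads the new file, and the indexing job writes out new segments for the interval that contain both.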

However, if you really want to do it via realtime ingestion and firehoses, you can try using a combining firehose to combine data from multiple firehoses; in your case the delegates would be an ingestSegment firehose and the static-s3 firehose.

I think your firehose would look something like this (with the interval filled in to cover the data you want to re-ingest):

"firehose" : {
  "type" : "combining",
  "delegates" : [
    {
      "type" : "ingestSegment",
      "dataSource" : "ds",
      "interval" : "<interval to re-ingest>"
    },
    {
      "type" : "static-s3",
      "uris" : [...]
    }
  ]
}

Docs for these firehoses can be found here:

http://druid.io/docs/latest/ingestion/firehose.html

Hope it helps with your use case.

Nishant - thanks for the reply and I’ll give it a shot!

For context, I inherited this project from someone and am just figuring out how Druid works. So far it still feels as if there are a lot of black boxes, but I’m starting to get the hang of it. One thing that’s confusing to me is why static-s3 fits into this idea of a realtime/firehose approach. If the data is coming from S3, shouldn’t it be non-realtime, i.e. batch?

I wasn’t able to find the docs for doing a batch load from S3, though.

-Dan

I agree that for most use cases, ingesting from S3 makes more sense as a batch job. I believe the static-s3 firehose is for people who do not have a Hadoop setup and want to ingest only small amounts of data using the index task.

For ingesting data from S3 in batch mode, Druid leverages the standard Hadoop S3 filesystem.

For example, in the tutorial at http://druid.io/docs/latest/tutorials/tutorial-loading-batch-data.html, just modifying the paths to “s3://bucket/sampleFile.json” and specifying your S3 keys and settings in the Hadoop config files should work.
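
If you don’t want to edit the Hadoop config files, I think you can also pass the keys through the task’s jobProperties. A rough sketch, with the bucket, file, and keys as placeholders; the exact property names depend on which Hadoop S3 filesystem (s3, s3n, or s3a) you are using:

"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "static",
    "paths" : "s3n://your-bucket/sampleFile.json"
  }
},
"tuningConfig" : {
  "type" : "hadoop",
  "jobProperties" : {
    "fs.s3n.awsAccessKeyId" : "<your access key>",
    "fs.s3n.awsSecretAccessKey" : "<your secret key>"
  }
}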

In case you also want to configure S3 as your deep storage, the required settings are documented here: http://druid.io/docs/latest/dependencies/deep-storage.html

Yes, we already have that working, but it’s loading from S3 via the stream ingestion rather than batch mode. It seems that we never got the Hadoop piece set up, so we’re using the firehose approach. Ideally we’d switch to Hadoop, but in the short term I need to augment a bunch of the data that was ingested via the S3 firehose. I will take a look at Nishant’s approach.

Thanks.

Great news! I was able to use Nishant’s example of a combining firehose to get exactly what I wanted. Thanks for the help.

-Dan