Replay

What are the recommendations for managing replays (segment rebuilds)?
Assume I have realtime data coming in, and I also need to rebuild segments (to fix older data).

I would assume this is a realistic case for most users; if so, I would be interested in understanding how others have handled it.

Hi,

I have seen multiple use cases where replay is achieved via batch ingestion.

Assuming the data used for replay is sitting on HDFS/S3, all you have to do is schedule a periodic Hadoop index task that creates new segments with the new data.

Once the task finishes, Druid will automatically detect the new segments and start serving the new data.
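For reference, here is a rough sketch of what such a Hadoop index task could look like, submitted to the overlord over HTTP. The datasource name, HDFS path, dimensions, and interval below are made-up placeholders, and the field names are taken from the 0.9.x batch-ingestion docs, so please check them against your Druid version (the snippet assumes the Python requests library).

import json
import requests  # assuming the requests library is available

# Hadoop batch index task that rebuilds one hour of a (hypothetical)
# "events" datasource from files sitting on HDFS.
task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "events",
            "parser": {
                "type": "hadoopyString",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "timestamp", "format": "auto"},
                    "dimensionsSpec": {"dimensions": ["country", "device"]}
                }
            },
            "metricsSpec": [{"type": "count", "name": "count"}],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "HOUR",
                "queryGranularity": "NONE",
                # interval to rebuild; segments in this interval are replaced
                "intervals": ["2016-08-01T00:00:00Z/2016-08-01T01:00:00Z"]
            }
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {"type": "static", "paths": "hdfs://namenode/replay/2016-08-01-00/"}
        },
        "tuningConfig": {"type": "hadoop"}
    }
}

# Tasks are posted to the overlord (default port 8090).
resp = requests.post("http://overlord:8090/druid/indexer/v1/task",
                     data=json.dumps(task),
                     headers={"Content-Type": "application/json"})
print(resp.text)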

Please let me know if you have further questions.

Assuming the data used for replay is sitting on HDFS/S3, all you have to do is schedule a periodic Hadoop index task that creates new segments with the new data.

My data would not be on HDFS/S3, so I am wondering whether I still need to go ahead and build systems to move data from Kafka to HDFS to Druid. Can batch ingestion be made to work directly off Kafka, handling duplicates etc.?

Well, sorry, I misread your question.

So first, can you please explain whether the newly arriving data is a delta, or whether it is the same old data plus some modifications?

Also, what do you mean by Kafka handling duplicates?

Hi Slim,

The new data arriving is a delta.

However, in case of replays, I could:

a) get the complete data for specific time segments to be replaced

b) get partial data to be merged with data that was ingested earlier. The earlier data is part of a durable Kafka topic, so I can look it up there, merge it with the partial data, and create an HDFS file for batch ingestion.

Also, what do you mean by Kafka handling duplicates?

I meant: can batch ingestion work directly out of Kafka, without the intermediate HDFS/S3? What would it take to support additional sources for batch ingestion?

Thanks

Guru

Hi Guru,

Hi Slim,

The new data arriving is a delta.

Well, if the data is a delta that arrives late and is written to Kafka, then you can use Kafka realtime ingestion. The new Kafka supervisor added to Druid is no longer limited by the window period, so it can be used in your case without writing data to HDFS.

Please read this and let me know if you have more questions.

http://druid.io/docs/0.9.1.1/development/extensions-core/kafka-ingestion.html
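To make that concrete, a minimal Kafka supervisor spec might look roughly like the sketch below. The topic, broker address, datasource, and dimensions are placeholders, and the exact fields are described in the kafka-ingestion doc linked above, so double-check them for your version. Supervisors go to the overlord's supervisor endpoint rather than the task endpoint (again assuming the Python requests library).

import json
import requests  # assuming the requests library is available

supervisor = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "events",                              # hypothetical datasource
        "parser": {
            "type": "string",
            "parseSpec": {
                "format": "json",
                "timestampSpec": {"column": "timestamp", "format": "auto"},
                "dimensionsSpec": {"dimensions": ["country", "device"]}
            }
        },
        "metricsSpec": [{"type": "count", "name": "count"}],
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "HOUR",
            "queryGranularity": "NONE"
        }
    },
    "tuningConfig": {"type": "kafka", "maxRowsPerSegment": 5000000},
    "ioConfig": {
        "topic": "events",                                    # hypothetical topic
        "consumerProperties": {"bootstrap.servers": "kafka01:9092"},
        "taskCount": 1,
        "replicas": 1,
        "taskDuration": "PT1H"
    }
}

# Supervisors are submitted to the overlord (default port 8090).
resp = requests.post("http://overlord:8090/druid/indexer/v1/supervisor",
                     data=json.dumps(supervisor),
                     headers={"Content-Type": "application/json"})
print(resp.text)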

Thanks Slim, a couple of points:
-> The Kafka indexing service is still called out as an "experimental feature", so does that leave me with "batch delta ingestion" for now? Is there another option?

-> What would be the cost of the infinite window? Too much fragmentation and compaction? Can I use this for a case where I always get hugely out-of-order events (basically writing to different time segments) at high frequency?

-> For replays: I cannot use the Kafka indexing service, as the data coming in is either of the two:

a) get the complete data for specific time segments to be replaced

b) get partial data to be merged with data that was ingested earlier. The earlier data is part of a durable Kafka topic, so I can look it up there, merge it with the partial data, and create an HDFS file for batch ingestion.

Thanks Slim, a couple of points:
-> The Kafka indexing service is still called out as an "experimental feature", so does that leave me with "batch delta ingestion" for now? Is there another option?

Well, it has been used in multiple use cases, so it is robust enough IMO. Again, I am not advocating against batch ingestion; it all depends on how easy/costly it is to set all this up.

-> What would be the cost of the infinite window? Too much fragmentation and compaction?

As you said, the cost is that segment sizes are not optimal, but you can enable the auto-merge option so that merging is done in the background.
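(If I remember correctly, that is the coordinator-side merging that is switched on with druid.coordinator.merge.on=true in the coordinator runtime.properties; please double-check the coordinator configuration docs for your version.)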

Can I use this for a case where I always get hugely out-of-order events (basically writing to different time segments) at high frequency?

It is hard to really estimate this one, but I guess you are right; at some point you don't want your task to have multiple open indexes.

-> For replays: I cannot use the Kafka indexing service, as the data coming in is either of the two:

a) get the complete data for specific time segments to be replaced

b) get partial data to be merged with data that was ingested earlier. The earlier data is part of a durable Kafka topic, so I can look it up there, merge it with the partial data, and create an HDFS file for batch ingestion.

Not sure what the question is here?

Thanks Slim.

W.r.t. this:

-> For replays: I cannot use the Kafka indexing service, as the data coming in is either of the two:

a) get the complete data for specific time segments to be replaced

b) get partial data to be merged with data that was ingested earlier. The earlier data is part of a durable Kafka topic, so I can look it up there, merge it with the partial data, and create an HDFS file for batch ingestion.

I understand this is mostly outside of Druid, but the solution would still closely depend on what Druid provides. So I want to understand which Druid capabilities can address these cases and how best to implement them.

This is a case of data replays (data fixes).

My original data is ingested into Kafka. Occasionally, I will get an upstream request for a "replay/data fix". The replay can be either:

a) a complete segment's worth of data (in Druid terms), in which case I will need to create a new HDFS file and ingest it into Druid, replacing the segments.

b) partial segment data. In this case, I will need to merge the data in Kafka (the original data) with the replay data, create a new HDFS file, and do a full segment replace in Druid.

Moving data from Kafka to HDFS (with the additional complexity of merging) is an additional moving part in the architecture, and I am wondering how I can either remove that requirement or reduce the risk of issues there. Any thoughts?

Thanks Slim.

W.r.t. this:

-> For replays: I cannot use the Kafka indexing service, as the data coming in is either of the two:

a) get the complete data for specific time segments to be replaced

b) get partial data to be merged with data that was ingested earlier. The earlier data is part of a durable Kafka topic, so I can look it up there, merge it with the partial data, and create an HDFS file for batch ingestion.

I understand this is mostly outside of Druid, but the solution would still closely depend on what Druid provides. So I want to understand which Druid capabilities can address these cases and how best to implement them.

This is a case of data replays (data fixes).

My original data is ingested into Kafka. Occasionally, I will get an upstream request for a "replay/data fix". The replay can be either:

a) a complete segment's worth of data (in Druid terms), in which case I will need to create a new HDFS file and ingest it into Druid, replacing the segments.

Well, it looks like in this case, if you have the entire data for a specific interval of time, then yes, you can push it to HDFS and then send a batch index task.

b) partial segment data. In this case, I will need to merge the data in Kafka (the original data) with the replay data, create a new HDFS file, and do a full segment replace in Druid.

In this case, if the data is a delta, you have two choices:

1. Use batch delta ingestion with the delta and the existing segments in Druid (see the sketch at the end of this message).

2. Merge the data yourself to create a complete data set, then send a batch index task.
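For option 1, the relevant piece is the ioConfig of the Hadoop index task: a "multi" inputSpec that combines the existing Druid segments for the interval with the new delta files, producing replacement segments. A rough sketch follows; the datasource name, interval, and HDFS path are placeholders, and the exact field names are in the updating-existing-data docs for your version.

import json

# "multi" inputSpec: reindex the existing segments for the interval together
# with the new delta files sitting on HDFS.
delta_io_config = {
    "type": "hadoop",
    "inputSpec": {
        "type": "multi",
        "children": [
            {
                # data already stored in Druid for the interval
                "type": "dataSource",
                "ingestionSpec": {
                    "dataSource": "events",
                    "intervals": ["2016-08-01T00:00:00Z/2016-08-01T01:00:00Z"]
                }
            },
            {
                # the late/corrected delta on HDFS
                "type": "static",
                "paths": "hdfs://namenode/replay/delta/2016-08-01-00/"
            }
        ]
    }
}

# This ioConfig slots into spec["ioConfig"] of an index_hadoop task like the
# one sketched earlier in this thread.
print(json.dumps(delta_io_config, indent=2))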