W.r.t. this:
-> For replays: I cannot use the Kafka indexing service, because the incoming data is one of two kinds:
a) the complete data for specific time segments, which is to be replaced outright
b) partial data to be merged with data that was ingested earlier. The earlier data lives in a durable Kafka topic, so I can look it up there, merge it with the partial data, and create an HDFS file for batch ingestion (sketched below).
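Roughly, the merge step I have in mind looks like this (a Spark batch sketch; the `event_id` field, broker, topic, and HDFS paths are placeholders, not my actual setup):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("replay-merge").getOrCreate()

# Batch-read the originally ingested events from the durable Kafka topic
# (needs the spark-sql-kafka connector on the classpath).
original = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
    # Keep only messages for the interval being fixed (Kafka message time).
    .filter(F.col("timestamp").between("2023-01-01", "2023-01-02"))
    .select(F.col("value").cast("string").alias("json"))
)

# Replay records delivered out of band; path is illustrative.
replay = (
    spark.read.text("hdfs:///replays/2023-01-01/")
    .withColumnRenamed("value", "json")
)

# Key both sides on event_id so replay rows win and original rows fill gaps.
key = F.get_json_object("json", "$.event_id").alias("event_id")
original_k = original.select(key, "json")
replay_k = replay.select(key, "json")
merged = original_k.join(replay_k, "event_id", "left_anti").unionByName(replay_k)

# Write the merged interval to HDFS as input for a Druid batch re-ingestion.
merged.select("json").write.mode("overwrite").text("hdfs:///druid/input/2023-01-01/")
```

The left-anti join makes replay rows win on `event_id`, while untouched original rows pass through from Kafka unchanged.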
I understand this is largely outside of Druid, but the solution will still depend closely on what Druid provides. So I want to understand which capabilities Druid offers that can address these cases, and how best to implement them.
This is a case of data replays (data fixes).
My original data is ingested into Kafka. Occasionally, I will get an upstream request for a "replay/data fix". The replay can be either:
a) a complete segment's worth of data (in Druid terms), in which case I will need to create a new HDFS file and ingest it into Druid as a replacement.
b) partial segment data, in which case I will need to merge the original data in Kafka with the replay data, create a new HDFS file, and do a full segment replace in Druid (a sketch of such a batch task follows).
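As I understand it, a classic index_hadoop task overwrites every existing segment inside the intervals listed in its granularitySpec, which is what would make the full replace work. A rough sketch of submitting such a task (datasource, columns, host, and paths are illustrative):

```python
import json
import requests

task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "events",
            "parser": {
                "type": "hadoopyString",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "timestamp", "format": "auto"},
                    "dimensionsSpec": {"dimensions": []},  # auto-discover dimensions
                },
            },
            "metricsSpec": [{"type": "count", "name": "count"}],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "HOUR",
                "queryGranularity": "NONE",
                # Every existing segment inside these intervals is overwritten
                # once the new segments publish -- this is the "full replace".
                "intervals": ["2023-01-01T00:00:00Z/2023-01-02T00:00:00Z"],
            },
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {"type": "static", "paths": "hdfs:///druid/input/2023-01-01/"},
        },
    },
}

# POST to the Overlord's task endpoint; the response carries the task id.
resp = requests.post(
    "http://overlord:8090/druid/indexer/v1/task",  # placeholder host
    data=json.dumps(task),
    headers={"Content-Type": "application/json"},
)
print(resp.json())
```

Because Druid versions segments, the old segments for those intervals should be overshadowed atomically once the new ones publish.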
Moving data from Kafka to HDFS (with the added complexity of merging) is an extra moving part in the architecture, and I am wondering how I can either remove that requirement or reduce the risk of issues there. Any thoughts?