Preventing Duplicate Data

Hey,

I was interested in getting some more information on how Druid handles duplicate data.

Scenario: I have two realtime nodes consuming from the same Kafka topic with different consumer group IDs. This data is not sharded in Druid. What would happen on hand-off to S3? Will this data be duplicated in historical segments?

Thanks,

Hi Nicholas,

Right now, Druid does not de-duplicate data on realtime ingestion, so it is possible that you will ingest duplicate events and that they will affect your query results. There is work underway to improve realtime ingestion so that it guarantees exactly-once delivery: no events dropped and no events duplicated. If you’re interested in seeing progress on this, it’s being tracked here:

For now, a common architecture that we use in production systems is a best-effort realtime ingestion pipeline, which makes data immediately available for querying, combined with a periodic batch indexing job (using a tool like Hadoop) that generates complete, de-duplicated segments. The batch-generated segments take precedence over the realtime-generated segments and provide a fully accurate view of your historical data. There’s some information on this here:

http://druid.io/docs/latest/ingestion/overview.html
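To make that concrete, here is a rough sketch of what a nightly Hadoop re-indexing task could look like. The data source name, schema, interval, and S3 path are all placeholders for illustration; check the batch ingestion docs for the exact fields supported by your Druid version.

    {
      "type": "index_hadoop",
      "spec": {
        "dataSchema": {
          "dataSource": "events",
          "parser": {
            "type": "hadoopyString",
            "parseSpec": {
              "format": "json",
              "timestampSpec": { "column": "timestamp", "format": "auto" },
              "dimensionsSpec": { "dimensions": ["page", "user"] }
            }
          },
          "metricsSpec": [
            { "type": "count", "name": "count" }
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "HOUR",
            "queryGranularity": "NONE",
            "intervals": ["2015-06-01/2015-06-02"]
          }
        },
        "ioConfig": {
          "type": "hadoop",
          "inputSpec": {
            "type": "static",
            "paths": "s3n://my-bucket/raw/2015-06-01/"
          }
        },
        "tuningConfig": {
          "type": "hadoop"
        }
      }
    }

Because the batch segments cover the same intervals as the realtime-generated ones at a newer version, they replace the realtime segments once loaded.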

Ahh, thanks for the links. So data will definitely be duplicated at realtime. But what about historical data? When realtime nodes hand off persisted segments to S3 for the coordinator/historical nodes, will this data still be duplicated? Is there some sort of coordination between historical nodes?

Hey Nicholas,

Actually, if the two realtime nodes both have the same shardSpec, they will be assumed to contain the same data and only one of them will be queried; so assuming no duplication happened before this point, you won’t get duplicate events. When the realtime nodes finalize and persist the segment to S3, again if they have the same shardSpec, the two identical segments will have the same ID, which will result in one of them overwriting the other, and you’ll still get a single copy of the data.

So Druid can handle intentionally replicated event streams as long as the realtime spec files are set up correctly to create segments with the same ID for the same set of data. If your shardSpecs were different and two differently named segments were created, then you would get duplicated data, as sketched below.
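For example (a minimal sketch; the shardSpec values are illustrative), if node A used

    "shardSpec": { "type": "linear", "partitionNum": 0 }

and node B used

    "shardSpec": { "type": "linear", "partitionNum": 1 }

Druid would treat them as two different partitions of the same interval, keep both segments, and count the replayed events twice. With the same partitionNum on both nodes, the segments share an ID and one simply overwrites the other.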

Oh ok, so I guess I do have some sort of replication set up. I have a bunch of realtime nodes with exactly the same spec file but different shard numbers. When persisting, segment IDs are the same, so segments will be overwritten, which I think is OK for now, since it is a workaround for fault tolerance.

What about replication/fault tolerance for the indexing service? I have been considering changing my ingestion architecture to use the overlord/middle manager because of some added flexibility. Will the indexing service scale peons based on ingestion volume? Can I specify which simple Kafka consumers my peons can ingest from?

Hi,

If you use the Kafka firehose, then due to the current limitations described in http://druid.io/docs/latest/ingestion/overview.html#ingest-from-apache-kafka , you can do either partitioning or replication, but not both. If you need both, then you would have to use Tranquility with the indexing service.
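As a very rough sketch only (this assumes the Tranquility Server style of JSON config; the property names task.partitions and task.replicants, and everything else here, should be double-checked against the Tranquility docs), asking for both partitioning and replication looks roughly like:

    {
      "dataSources": {
        "events": {
          "spec": {
            "dataSchema": { "dataSource": "events" },
            "tuningConfig": { "type": "realtime", "windowPeriod": "PT10M" }
          },
          "properties": {
            "task.partitions": "2",
            "task.replicants": "2"
          }
        }
      },
      "properties": {
        "zookeeper.connect": "localhost:2181"
      }
    }

Here the indexing service would run two partitions of each segment, each with two replicated tasks, so you get horizontal scaling and fault tolerance at the same time.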

Now, I guess you have 2 realtime nodes with different Kafka consumer groups. That means both are receiving and publishing the same data. You would indicate that to Druid by setting the “shardSpec” correctly; see druid.io/docs/latest/ingestion/realtime-ingestion.html#sharding .

In your case, both realtime nodes would set the shardSpec to…

 "shardSpec": {
        "type": "linear",
        "partitionNum": 0
    }

The same partition number indicates to Druid that they contain the same data, and at query time only one of them is queried.

– Himanshu