De-duplicate data

Is there any way for us to drop duplicate data when a new record has the same timestamp and the same dimension values? Thanks!

Not currently; a record with the same dimension values could very easily be a new event.

So, do you have any plans to support updates to existing records?

The direction forward I’ve seen regarding this is to move to “exactly once” processing using the new Kafka ingestion at http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html

The idea is that for append-only systems, where duplicates are the only concern, exactly-once stream processing (http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html + KIP-98) will eventually make them a non-issue.
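For a rough idea of what that looks like in practice, here is a sketch of submitting a Kafka indexing service supervisor spec to the Overlord. The datasource name, topic, broker address, and column names are placeholder assumptions; see the Kafka ingestion docs linked above for the full set of spec fields.

```python
import json
import requests  # assumed available; any HTTP client works

# Hypothetical datasource, topic, and schema for illustration only.
supervisor_spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "events",
        "parser": {
            "type": "string",
            "parseSpec": {
                "format": "json",
                "timestampSpec": {"column": "timestamp", "format": "auto"},
                "dimensionsSpec": {"dimensions": ["user_id", "country"]},
            },
        },
        "metricsSpec": [{"type": "count", "name": "count"}],
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "HOUR",
            "queryGranularity": "NONE",
        },
    },
    "ioConfig": {
        "topic": "events",
        "consumerProperties": {"bootstrap.servers": "kafka-broker:9092"},
        "taskCount": 1,
    },
}

# Submit the spec to the Overlord's supervisor endpoint (default port 8090).
requests.post(
    "http://overlord:8090/druid/indexer/v1/supervisor",
    data=json.dumps(supervisor_spec),
    headers={"Content-Type": "application/json"},
)
```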

For situations where you aren’t append-only and you really do want updates, the two main ways of doing that are bulk updates (rewriting entire time ranges at once using batch ingestion) or query-time lookups (suitable for modestly sized mappings where the values might change; most useful for mapping id -> name or similar).
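As a sketch of the lookup approach, the query below maps an id dimension to a name at query time using an inline map lookup extraction function. The datasource, interval, dimension names, and the id -> name mapping are all hypothetical; an inline map like this only makes sense for small mappings, which is exactly the case described above.

```python
import json
import requests  # assumed available; any HTTP client works

# Hypothetical topN query: group on user_id but report the mapped user_name.
query = {
    "queryType": "topN",
    "dataSource": "events",
    "intervals": ["2017-01-01/2017-02-01"],
    "granularity": "all",
    "dimension": {
        "type": "extraction",
        "dimension": "user_id",
        "outputName": "user_name",
        "extractionFn": {
            "type": "lookup",
            "lookup": {"type": "map", "map": {"1": "alice", "2": "bob"}},
            "retainMissingValue": True,
        },
    },
    "threshold": 10,
    "metric": "count",
    "aggregations": [{"type": "longSum", "name": "count", "fieldName": "count"}],
}

# Send the query to the Broker (default port 8082).
requests.post(
    "http://broker:8082/druid/v2/",
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
)
```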

One approach to handling this problem is to run an offline deduplication job and batch index the results back into Druid. That is a classic Lambda Architecture pattern, and AFAIK Druid is designed around it.
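A minimal sketch of such an offline job, assuming CSV input and hypothetical column names: rows are keyed on the timestamp plus every dimension value, so exact duplicates collapse to a single row before reindexing.

```python
import csv

# Hypothetical dimension columns; together with the timestamp they form the
# deduplication key.
DIMENSIONS = ["user_id", "country"]

def deduplicate(input_path, output_path):
    seen = set()
    with open(input_path, newline="") as src, open(output_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            key = (row["timestamp"],) + tuple(row[d] for d in DIMENSIONS)
            if key not in seen:
                seen.add(key)
                writer.writerow(row)

# The deduplicated file can then be re-ingested with a Druid batch indexing
# task covering the affected time range, replacing the old segments.
deduplicate("events_raw.csv", "events_deduped.csv")
```

For larger datasets the same idea applies with a distributed job (e.g. a Hadoop or Spark dedup step) feeding Druid's batch indexer.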

Yes, that works too.