How to deal with similar (not duplicate) data?

I see from this thread that Druid merges duplicate data in a certain time period: https://groups.google.com/forum/#!searchin/druid-user/duplicate/druid-user/HMWnt66wqqQ/Kg45ZUjK9A4J

I’m ingesting visitor data for a site that gets a huge volume of traffic (thousands of events per second). The data we’re tracking, however, has a high probability of being similar… for example:

referrer: ‘google.com’, device: ‘mobile’, slug: ‘test-post’

What is the preferred way to make sure no data gets dropped for being “duplicate”? This is just web traffic data, so there’s no real unique index I could use.

Hi,

You can add a count aggregator to the metricsSpec; it records how many times each event was seen.

That way, when identical events are merged during rollup, the number of events that went into each merged row is preserved.

Also see “Counting the number of ingested events” here: http://druid.io/docs/latest/ingestion/schema-design.html
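
To make the idea concrete, here is a rough Python sketch of what rollup with a count metric does. It is only a simulation, not Druid code: the sample events, the minute granularity, and the `rollup` helper are assumptions made up for illustration. The two dicts at the top mirror the JSON you would put in the ingestion spec and in a query; the output names "count" and "events" are arbitrary choices.

```python
from collections import Counter

# Ingestion-time metric (goes in metricsSpec): counts raw events per rolled-up row.
COUNT_METRIC = {"type": "count", "name": "count"}
# Query-time aggregator: sum the stored counts to get the number of ingested events.
EVENTS_AGG = {"type": "longSum", "name": "events", "fieldName": "count"}

# Hypothetical events; dimension names follow the example in the question.
events = [
    {"timestamp": "2015-06-01T12:00:01Z", "referrer": "google.com", "device": "mobile", "slug": "test-post"},
    {"timestamp": "2015-06-01T12:00:07Z", "referrer": "google.com", "device": "mobile", "slug": "test-post"},
    {"timestamp": "2015-06-01T12:00:42Z", "referrer": "google.com", "device": "mobile", "slug": "test-post"},
    {"timestamp": "2015-06-01T12:00:15Z", "referrer": "google.com", "device": "desktop", "slug": "test-post"},
    {"timestamp": "2015-06-01T12:01:03Z", "referrer": "google.com", "device": "mobile", "slug": "test-post"},
]

def rollup(events, dims=("referrer", "device", "slug")):
    """Merge events that share the same minute and dimension values, keeping a count."""
    counts = Counter()
    for e in events:
        minute = e["timestamp"][:16]  # truncate the ISO timestamp to the minute
        key = (minute,) + tuple(e[d] for d in dims)
        counts[key] += 1
    columns = ("minute",) + dims
    return [dict(zip(columns, key), count=n) for key, n in counts.items()]

rows = rollup(events)
# Equivalent of a longSum over the "count" column at query time.
total_events = sum(r[EVENTS_AGG["fieldName"]] for r in rows)
```

Five events collapse into three stored rows, but summing the count column still gives five. That is also why, at query time, you would sum the stored count (longSum) rather than use a count aggregator, which would only count the rolled-up rows.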