Using Druid with high-cardinality multi-valued fields

I’m evaluating Druid as a possible replacement for our current clickstream summarizer/query tool. We have a real-time stream of events from our sites (clicks, pageviews, etc.). Each event is associated with zero or more A/B test variants, and has a bunch of associated metadata, such as the type of page, the visitor’s ID, etc.

We’re currently using a third-party vendor to store our historical clickstream summaries. Their tech is based on InfluxDB, and it is struggling to handle the number of unique time series that our clickstream generates. The main problem is that we have about 200 test variants running at any given moment, and that number will only increase over time. When the test variants are multiplied across all of our other tags (display brand, language, event name, event category, etc.), the ‘unique metrics’ cardinality is pretty huge and Influx is struggling.
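To put a rough number on that multiplication, here is a back-of-the-envelope sketch in plain Python. Only the 200 test variants comes from our actual setup; every other cardinality below is a made-up placeholder, and the result is an upper bound, since not every tag combination actually occurs.

```python
from math import prod

# Only "test_variant" reflects a real figure (about 200 running at once).
# Every other cardinality here is a hypothetical placeholder.
tag_cardinalities = {
    "test_variant": 200,
    "display_brand": 5,
    "language": 10,
    "event_name": 50,
    "event_category": 8,
}


def max_unique_series(cardinalities):
    """Upper bound on distinct tag combinations: the product of per-tag cardinalities."""
    return prod(cardinalities.values())


upper_bound = max_unique_series(tag_cardinalities)
print(f"at most {upper_bound:,} unique series")
```

The bound grows linearly with the number of variants, so every new test adds another full set of brand/language/event combinations.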

We need to be able to slice all of our event counts by any given test variant, so we can ask questions like, “Did test xyz variant 0 cause users to click on profile images more frequently than test xyz variant 1? Did it affect English-speaking users more than French-speaking users? Was there any difference in this effect between brand A and brand B?” and so on.
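As a rough illustration of the slicing I mean, here is a plain-Python sketch (not Druid or Influx code). The field names, variant labels, and events are all invented for the example; the point is that an event carrying several variants is counted once under each of them.

```python
from collections import Counter

# Hypothetical events: field names and values are invented for illustration.
events = [
    {"event": "profile_image_click", "language": "en", "brand": "A", "variants": ["xyz:0", "abc:1"]},
    {"event": "profile_image_click", "language": "fr", "brand": "A", "variants": ["xyz:1"]},
    {"event": "profile_image_click", "language": "en", "brand": "B", "variants": ["xyz:0"]},
    {"event": "pageview", "language": "en", "brand": "A", "variants": []},
]


def count_by_variant(events, event_name, *dims):
    """Count matching events per (variant, *dims).

    An event is counted once for each variant it carries, so events
    with no variants do not appear in the result.
    """
    counts = Counter()
    for e in events:
        if e["event"] != event_name:
            continue
        for variant in e["variants"]:
            counts[(variant,) + tuple(e[d] for d in dims)] += 1
    return counts


# Profile-image clicks per test variant, broken down by language.
clicks = count_by_variant(events, "profile_image_click", "language")
for key, n in sorted(clicks.items()):
    print(key, n)
```

Swapping "language" for "brand", or passing both, gives the other breakdowns mentioned above.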

I’ve been reading through this message group and the Druid documentation, and it seems like Druid would have less trouble with this cardinality explosion than Influx. Does this sound like a reasonable expectation to the experts here?

Thank you for your time!

Hi Kevin,

Yes, this is a use case we’ve seen some notable companies handle at fairly large scale. This type of slice-and-dice analysis of metrics over large event streams is what Druid was designed for.

If you are getting started with a PoC, I’d recommend starting with

The Druid getting started docs are based off of the Imply docs, but you might have a much easier time with the above quickstart.