Questions on Efficient Usage of a Druid Cluster in Production

Hi Folks,

We are in the process of setting up Druid in our production environment. Along the way, we have come up with a few questions, and the answers will help us manage our production cluster efficiently.

Use Case: We are setting up Druid to ingest all the user events occurring on our platform. The events need to be streamed into Druid in real time. To do this, we plan to keep a separate datasource for each event type, i.e. one datasource per event. We will run ad-hoc and analytical queries on the ingested data.

Adopted Solution: We plan to use the Kafka Indexing Service (KIS) for the ingestion described above. As we understand it, KIS needs a separate supervisor, and therefore at least one worker task, for each datasource. In our configuration, each supervisor spec consumes messages from a topic that contains only events of one type (for example, login events). With this configuration, the number of workers running on the Middle Manager nodes will be at least the number of event types, i.e. almost 100 in our case (this also depends on the taskCount and replicas properties of the supervisor spec; please assume both are 1). Note also that not every event is generated at a high rate, which means many of the workers will sit idle most of the time.
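
To make the setup concrete, here is a rough Python sketch of how we would generate one supervisor spec per event type and count the resulting ingestion tasks. The event names, the `<event>_events` topic and datasource naming, the `kafka:9092` broker address, and the granularities are placeholders, and the field layout follows our reading of the Kafka supervisor spec, so it may need adjusting for a given Druid version.

```python
import json

# Placeholder event types; our real list has roughly 100 entries.
EVENT_TYPES = [f"event_{i}" for i in range(100)]


def build_supervisor_spec(event, task_count=1, replicas=1):
    """Build a Kafka supervisor spec (as a dict) for a single event type."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": f"{event}_events",  # assumed naming scheme
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": []},  # left empty for brevity
                "granularitySpec": {
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                },
            },
            "ioConfig": {
                "topic": f"{event}_events",  # one topic per event type
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": "kafka:9092"},
                "taskCount": task_count,
                "replicas": replicas,
                "taskDuration": "PT1H",
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


def reading_tasks(specs):
    """Total number of reading tasks: sum of taskCount * replicas over all supervisors."""
    return sum(
        s["spec"]["ioConfig"]["taskCount"] * s["spec"]["ioConfig"]["replicas"]
        for s in specs
    )


specs = [build_supervisor_spec(event) for event in EVENT_TYPES]
print(reading_tasks(specs))  # prints 100

# Each spec would be serialized and submitted to the Overlord separately.
first_payload = json.dumps(specs[0], indent=2)
```

So even with taskCount and replicas both at 1, we need about 100 task slots just for reading, and possibly more while earlier tasks are still publishing their segments.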

**Questions:**

  • Is the solution described above recommended for this use case?
  • If not, should we ingest all of the data into a single datasource, with the dimensions of every event defined in the supervisor spec, and then reindex the segments into new per-event datasources by filtering on the event’s ‘name’ dimension?
  • Is this possible with a reindexing task?
  • If it is possible, is it recommended for production use?
  • Note: With this approach, since we have 100 events with, say, 20 dimensions each, the resulting datasource will have almost 2000 dimensions. Since Druid is a columnar data store, we assume this shouldn’t be much of a concern. Still, please let us know if anyone has faced issues with a similar configuration.
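
For reference, this is roughly the reindexing task we have in mind for the second option, sketched in Python. It assumes a Druid version that supports the `druid` input source and a `transformSpec` filter in native batch ingestion (older versions would need a different mechanism). The `all_events` datasource, the `login` event, its dimension list, and the interval are made-up examples.

```python
import json


def build_reindex_spec(source, event_name, dimensions, interval):
    """Build a native batch task that copies one event type out of a combined datasource."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": f"{event_name}_events",  # assumed target naming
                # When reading existing segments, the timestamp is in __time.
                "timestampSpec": {"column": "__time", "format": "millis"},
                # Keep only the ~20 dimensions that belong to this event.
                "dimensionsSpec": {"dimensions": dimensions},
                "granularitySpec": {
                    "segmentGranularity": "DAY",
                    "queryGranularity": "NONE",
                    "intervals": [interval],
                },
                # Keep only rows whose 'name' dimension matches this event.
                "transformSpec": {
                    "filter": {
                        "type": "selector",
                        "dimension": "name",
                        "value": event_name,
                    }
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {
                    "type": "druid",
                    "dataSource": source,
                    "interval": interval,
                },
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }


spec = build_reindex_spec(
    source="all_events",
    event_name="login",
    dimensions=["user_id", "device", "country"],
    interval="2024-01-01/2024-01-02",
)
print(json.dumps(spec, indent=2))
```

We would need one such task per event type and interval, so about 100 reindexing tasks per period for our 100 events.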

Also, please let us know if we are missing something or if there is a better approach for this use case.

Thanks!