Realtime ingestion of variable numbers of datasources

Hi all!

I have several questions about realtime data ingestion.

Conditions of my task:

  1. I dont use java. I have various scripts (more than 500) that produce data.

  2. These producers generate events, that i can describe as one datasource, but generally they can have different timestamp (one of them realtime, others can go with day delay or something like this). Also this events from different producers not related each other logically. Moreover, one can produce 10k per second, others less than 100. So it is seem good idea to store events from different producers in different datasources.

My questions:

  1. How could i implement realtime ingestion of various dynamic numbers of datasources?
    I see 3 options:
    - Generate new druid.realtime.specFile each time when new producer (new datasource) appears and restart realtime node.
    - Launch realtime index task for each datasource. But i didn’t manage it due to bug:!searchin/druid-development/DruidServerMetadata$20/druid-development/XwOWCw7ac9U/pduNG6vwijYJ
    - Use one datasource with extra dimension “producer_type” and tune segments shards by this dimension

  2. As i understood, submitting realtime index task turns middle manager node into realtime node. Does it mean, that if i have only one middle manager node i can submit only one realtime index task?

  3. How can i tune realtime ingestion to provide sharding by dimension? Should i use tuningConfig.partitionsSpec? Will it provide storing events in each segment shard only with one value of specific dimension?

  4. As i don’t use java, i can’t use tranquillity. Right now i just want set up working prototype. But in future I will probably wish implement my own push scheme. But there is no information about EventReceiverFirehose in documentation and examples. Is this approach deprecated? Where can i get description of right usage of it?


Hi Eugene,
See Inline

Hi Eugene,

In addition to what Nishant said, I’m wondering a couple of things: How many of your producers will produce realtime data, and how many will produce delayed data? Also, how many are producing higher rates and how many are producing lower rates? The reason I ask is because currently, Druid’s realtime indexing is not well suited for delayed data (we’re kicking around some ideas to address this) and also Druid’s realtime indexing, when done through the indexing service, needs at least one task (JVM) per datasource. This is meant to provide operational isolation between datasources, although potentially makes it tough to efficiently run a lot of small datasources.

Assuming the majority of your producers are producing realtime data, and you have a lot of small producers, it probably makes sense to use one datasource with an extra “producer_type” dimension. With realtime indexing, you can’t specifically tell Druid how to shard the data, but you can do it on the producer side. If you’re ingesting directly from Kafka, you can provide a Partitioner to the Kafka producer. If you’re using Tranquility, you can do the same thing by providing a beamMergeFn to the DruidBeams builder. If you’re using HTTP, you can do it by controlling which task you send which data to (this is what the beamMergeFn does in tranquility).

For the producers that aren’t producing realtime data, you can either set the realtime “windowPeriod” long enough to catch the delayed data, or you can skip realtime ingestion altogether and use batch ingestion.

Thanks a lot, guys for your answers!

To Nishant: You sad “Set the firehose to receiver”. I just want to specify. Does it mean this firehose should be setted in realtime index task?

To Gian: You are right: most of producers are realtime. Some of new one can be delayed and at last they catch up and became realtime too. And yes, only 5% produce a half of events. And i have almost the same conclusion. You persuaded me follow this way.