Druid native ingestion

Hello everybody

I have a data source ingested from Kafka, and it is working fine (with a granularity of one hour). I call this data source X.

I want to create another data source with a coarser granularity (one day). I want to take the segments from data source X, apply a new roll-up, and produce a new data source named Y, without touching anything in data source X (it has to stay as it is, since it is used by many people).

Using Druid native ingestion:

  • I guess this is possible: I can create a new data source (Y) without touching data source X?

  • And is it also possible to apply a different roll-up? I guess so.

  • My main issue is setting the interval. I want to include all existing segments of data source X in the ingestion for data source Y, and keep data source Y constantly fed by new segments as they arrive in data source X.

The interval field lets me set a fixed interval, but it has no option like “include future”, which exists in the “retention rules” section.

What can I do to have data source Y constantly ingesting segments from data source X?

Relates to Apache Druid 0.20.0

Hi Mostafa,
One possibility would be to also ingest from Kafka for data source Y and simply apply the different roll-up in that ingestion. This approach will keep both data source X and data source Y up to date in real time. It will require extra work on the Middle Managers, along with the CPU/memory usage that goes with it.
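
Something along these lines is what I have in mind (a minimal sketch only; the topic name, broker address, columns, and metrics are placeholders you would copy from data source X's existing supervisor spec, changing just the data source name and the granularities):

```python
import requests

# Hypothetical second Kafka supervisor spec that feeds data source Y.
# Everything except dataSource and granularitySpec should mirror X's spec.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "Y",
            "timestampSpec": {"column": "ts", "format": "auto"},
            "dimensionsSpec": {"dimensions": ["dim1", "dim2"]},
            "metricsSpec": [
                {"type": "count", "name": "count"},
                {"type": "longSum", "name": "value_sum", "fieldName": "value"},
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",  # X uses HOUR; Y rolls up by day
                "queryGranularity": "DAY",
                "rollup": True,
            },
        },
        "ioConfig": {
            "topic": "events",  # same topic that feeds data source X
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
            "useEarliestOffset": False,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submit the supervisor to the Overlord.
resp = requests.post(
    "http://overlord:8090/druid/indexer/v1/supervisor",
    json=supervisor_spec,
)
resp.raise_for_status()
print(resp.json())
```

Once that supervisor is running, Y fills up from the same topic on its own, completely independently of X.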

I don’t know of an automatic mechanism to do this in incremental batches. But once you have the ingestion spec for ingesting from data source X into data source Y, it should be fairly simple to use some external scripting to parameterize the ingestion and increment the interval each time it runs.
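
For example, a rough sketch of that scripting (untested; the Overlord address, file names, and the {INTERVAL} placeholder convention in the spec template are all assumptions), run from cron once a day:

```python
import json
from datetime import date, timedelta
from pathlib import Path

import requests

OVERLORD = "http://overlord:8090"            # assumed Overlord address
STATE_FILE = Path("last_ingested_day.txt")   # hypothetical watermark file, e.g. "2021-01-01"
SPEC_TEMPLATE = Path("reindex_x_to_y.json")  # reindex spec with an {INTERVAL} placeholder


def next_interval() -> str:
    """Build the one-day ISO-8601 interval that follows the last ingested day."""
    last = date.fromisoformat(STATE_FILE.read_text().strip())
    start = last + timedelta(days=1)
    return f"{start.isoformat()}/{(start + timedelta(days=1)).isoformat()}"


def submit(interval: str) -> str:
    """Fill the interval into the spec template and submit it as a native batch task."""
    spec = json.loads(SPEC_TEMPLATE.read_text().replace("{INTERVAL}", interval))
    resp = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=spec)
    resp.raise_for_status()
    return resp.json()["task"]  # the Overlord returns the task id


if __name__ == "__main__":
    interval = next_interval()
    print(f"submitted task {submit(interval)} for {interval}")
    # Advance the watermark only after a successful submission.
    STATE_FILE.write_text(interval.split("/")[0])
```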


Hi @mostafatalebi, the easiest way to do this would be to populate data source Y by using “reindex from Druid” (with data source X as the input source).

In order to have this updated in real time, you could handle the interval logic offline and submit an ingestion spec to the cluster via an API call on a periodic basis.
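
For reference, a rough sketch of what that reindex spec could look like on 0.20, using a native index_parallel task with the druid input source reading from X and rolling it up by day into Y (column names, metrics, and the interval are placeholders):

```python
# Hypothetical native-batch reindex spec: read one day of segments from data
# source X and write them, rolled up to daily granularity, into data source Y.
reindex_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "druid",
                "dataSource": "X",
                "interval": "2021-01-01/2021-01-02",  # parameterize per run
            },
        },
        "dataSchema": {
            "dataSource": "Y",
            "timestampSpec": {"column": "__time", "format": "millis"},
            "dimensionsSpec": {"dimensions": ["dim1", "dim2"]},
            "metricsSpec": [
                # X is already rolled up, so sum its existing count/metric
                # columns rather than recounting raw rows.
                {"type": "longSum", "name": "count", "fieldName": "count"},
                {"type": "longSum", "name": "value_sum", "fieldName": "value_sum"},
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "DAY",
                "rollup": True,
            },
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}
```

The spec is submitted to the Overlord like any other native batch task (POST /druid/indexer/v1/task), which is where the periodic interval logic mentioned above comes in.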


Thank you guys for your answers.

IMHO the simplest option is to go with another supervisor consuming from the same Kafka topic but populating a different table. But I suppose it depends on how many pennies you have in the bank to pay for the extra cores that will watch the topic :smiley: