[druid-user] One time processing of data per day

Dear all,

I am new to Druid. We currently use the Kafka indexing service to load data that arrives at 01:30 every day. But the task runs continuously, because we have configured the task duration to be 24 hours.

We want the task to load the data once it’s available and then complete successfully. How can we configure an event-driven task that loads data once a day?

Please help or guide us in this regard.

Thanks & Regards,
Shalin Shah

Welcome Shalin! And welcome to Druid :smiley:

Just to be super clear - and feel free to correct!

Kafka is your source data
Data is only pushed into the Kafka topic at 01:30
You have set up a supervisor ingestion that monitors that topic
The supervisor is generating subtasks that have a duration of 24 hours

It sounds as though you are looking for a - kind of - “batch” ingestion from Kafka that you can submit to Druid. In other words, kind of like you would submit an ETL job to just consume from Kafka, and then just finish and stop afterwards?

AFAIK the design of Druid is such that it plays to the strengths of event hubs - i.e., data flows continuously. Well… at least frequently! That means Druid supervisors run forever until you stop them. The subtasks a supervisor creates run for a period, publish their data to the historical servers, and then stop. Then the supervisor creates a new series of tasks - and so on and so on and so on.

Maybe you could use the APIs to give you more programmatic control over the supervisor?
For example, maybe you could post your supervisor spec at, say, 01:00, then wait for a couple of hours, and then shut it and all its tasks down. See API reference · Apache Druid
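As a rough sketch of that idea (not a tested recipe), the snippet below builds the two Overlord calls involved: posting a supervisor spec, and terminating the supervisor later. The `OVERLORD` address, the helper names, and the use of the datasource name as the supervisor id are assumptions for illustration; check them against your own cluster before scheduling anything.

```python
import json
import urllib.request

# Assumption: address of the Overlord (or a Router in front of it).
OVERLORD = "http://localhost:8090"


def supervisor_request(path, payload=None):
    """Build (but do not send) a POST request to the supervisor API."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else b""
    return urllib.request.Request(
        url=f"{OVERLORD}/druid/indexer/v1/supervisor{path}",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(spec):
    """Request that creates (or updates) a supervisor from a spec dict."""
    return supervisor_request("", spec)


def terminate(supervisor_id):
    """Request that terminates a supervisor and the tasks it manages."""
    return supervisor_request(f"/{supervisor_id}/terminate")


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

An external scheduler such as cron could then call `send(submit(spec))` shortly before 01:30 and `send(terminate("abc"))` a couple of hours later, where `abc` stands in for your supervisor id.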

If it were me, however, I would probably set your subtask periods back to an hour (just because that’s nice and neat) and allow the supervisor to continue running. This could set you up for a future when data comes in more frequently.
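For what it's worth, the task period lives in the supervisor spec's `ioConfig` as `taskDuration`, written as an ISO-8601 period. Here is a minimal sketch of switching a spec from 24 hours to one hour; the helper name and the trimmed-down example spec are hypothetical, and the code assumes `ioConfig` sits either at the top level of the spec or under a `spec` wrapper.

```python
import copy
import json


def with_task_duration(supervisor_spec, duration="PT1H"):
    """Return a copy of the spec with ioConfig.taskDuration set."""
    spec = copy.deepcopy(supervisor_spec)
    # Assumption: ioConfig is at the top level or nested under "spec".
    root = spec.get("spec", spec)
    root.setdefault("ioConfig", {})["taskDuration"] = duration
    return spec


# Hypothetical, heavily trimmed spec; a real one also needs dataSchema,
# consumerProperties and so on.
daily = {
    "type": "kafka",
    "ioConfig": {"topic": "example-topic", "taskDuration": "PT24H"},
}

hourly = with_task_duration(daily)
print(json.dumps(hourly["ioConfig"], indent=2))
```

Re-posting the modified spec to the supervisor endpoint should replace the running supervisor's configuration.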

So many more questions… :smiley:

Thanks Peter. You understood our requirement perfectly. We have now configured the task to run on an hourly basis. There is another issue that we are observing. At 01:30 we received the data from Kafka, which was picked up by the task for ingestion, and the data was available in real time. The data size was very small: around 50K rows and 24 MB in total.

But we could see that segment handoff is taking a lot of time and eventually reaches the completion timeout. Hence we are losing data, as it is not copied to the historical and the task is interrupted due to the completion timeout.

Please see the logs below. Could you please guide us on this?

2021-03-26T03:00:35,405 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.realtime.plumber.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for Segments : [[SegmentDescriptor{interval=2021-03-26T02:00:00.000Z/2021-03-26T03:00:00.000Z, version='2021-03-26T02:15:52.505Z', partitionNumber=0}]]

Hi Shalin -

You might check whether any of these apply - https://druid.apache.org/docs/latest/ingestion/faq.html#my-stream-ingest-is-not-handing-segments-off

Hopefully it helps.

Thanks Ben. On further analysis we found that, for a period of time, no segments are loaded on the historicals even though data is being ingested continuously. Any idea about this?

See the timestamps in the log lines below. The point is that there are no segment-loading log entries on either of our historical nodes for that period, even though data was being ingested. Moreover, we can see the historical nodes dropping data every hour, or moving it from the hot tier to the cold tier, according to the configured rules. So there doesn’t seem to be any issue with the historical nodes themselves, as they are performing all other processes normally.

FYI - we are using Druid version 0.18.

2021-03-26T13:25:10,009 INFO [ZKCoordinator--5] org.apache.druid.server.coordination.SegmentLoadDropHandler - Loading segment abc_2021-03-26T13:00:00.000Z_2021-03-26T14:00:00.000Z_2021-03-26T13:03:46.727Z_5
2021-03-26T13:31:01,068 INFO [ZKCoordinator--0] org.apache.druid.server.coordination.SegmentLoadDropHandler - Loading segment abc_2021-03-26T13:00:00.000Z_2021-03-26T14:00:00.000Z_2021-03-26T13:03:46.727Z_6
2021-03-26T21:34:10,307 INFO [ZKCoordinator--2] org.apache.druid.server.coordination.SegmentLoadDropHandler - Loading segment xyz_2021-03-26T00:00:00.000Z_2021-03-27T00:00:00.000Z_2021-03-26T01:35:09.679Z
2021-03-26T21:34:10,316 INFO [ZKCoordinator--0] org.apache.druid.server.coordination.SegmentLoadDropHandler - Loading segment xyz_2021-03-26T21:00:00.000Z_2021-03-26T22:00:00.000Z_2021-03-26T21:01:52.839Z
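To make the gap easier to spot in a longer log file, a small script along these lines could list the quiet periods between consecutive "Loading segment" entries. It is only a sketch: the function names and the one-hour threshold are arbitrary choices, and it assumes every log line starts with a timestamp in the format shown above. The sample lines are shortened copies of the four entries from this post.

```python
import re
from datetime import datetime, timedelta

# Matches the leading timestamp, e.g. 2021-03-26T13:25:10,009
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})")


def load_times(lines):
    """Timestamps of all 'Loading segment' log lines, in sorted order."""
    times = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and "Loading segment" in line:
            times.append(
                datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S,%f")
            )
    return sorted(times)


def loading_gaps(lines, threshold=timedelta(hours=1)):
    """(start, end, length) for each gap between loads above threshold."""
    times = load_times(lines)
    return [(a, b, b - a) for a, b in zip(times, times[1:]) if b - a > threshold]


sample = [
    "2021-03-26T13:25:10,009 INFO [ZKCoordinator--5] ... - Loading segment abc_..._5",
    "2021-03-26T13:31:01,068 INFO [ZKCoordinator--0] ... - Loading segment abc_..._6",
    "2021-03-26T21:34:10,307 INFO [ZKCoordinator--2] ... - Loading segment xyz_...",
    "2021-03-26T21:34:10,316 INFO [ZKCoordinator--0] ... - Loading segment xyz_...",
]

for start, end, length in loading_gaps(sample):
    print(f"no segment loads between {start} and {end} ({length})")
```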

Investigating this further, we could see that the coordinator itself shows no load rule activity during this period. No LoadRule entry appears in the coordinator logs between 13:31 and 21:34, even though data was being ingested continuously. Any idea why this can happen?

Also, I might be asking a silly question: in the coordinator logs we always see Coordinator-Exec--0 and cannot find --1, --2, etc. Does that mean it is running only a single thread?
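In case it helps anyone checking the same thing, the coordinator can also be asked directly about its retention rules and pending segment loads instead of relying on the logs. The sketch below only builds the URLs and provides a small fetch helper; the `COORDINATOR` address and the function names are assumptions, and the responses depend on your cluster.

```python
import json
import urllib.parse
import urllib.request

# Assumption: address of the Coordinator (or a Router in front of it).
COORDINATOR = "http://localhost:8081"


def rules_url(datasource):
    """URL listing the retention rules configured for one datasource."""
    name = urllib.parse.quote(datasource, safe="")
    return f"{COORDINATOR}/druid/coordinator/v1/rules/{name}"


def load_queue_url():
    """URL summarising segments queued to load or drop on each server."""
    return f"{COORDINATOR}/druid/coordinator/v1/loadqueue?simple"


def load_status_url():
    """URL reporting how many segments per datasource are not yet loaded."""
    return f"{COORDINATOR}/druid/coordinator/v1/loadstatus?simple"


def get_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, calling `get_json(load_status_url())` repeatedly during the quiet period would show whether the coordinator believes segments are still waiting to be loaded.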

Sorry to be late replying - did you track this down?