Batch ingestion: waiting until all segments are published and available

We are doing batch ingestion to support different time-period rollups (i.e. daily/monthly/yearly) on the same data, storing them under _daily, _weekly, and _monthly datasources.

We implemented this as a kind of workflow: first the daily batch ingestion task is submitted, and once it completes, the monthly task is submitted (it feeds from the _daily datasource).

What we have observed is that there is a delay between when a Druid task returns SUCCESS and when the segments it produced are actually available, as seen when querying the /datasources/tmp.test_segment_daily/intervals/ API right afterwards. As a result, the next task does not pick up the interval updated by the previous one.

Is there a way to force a task to only return SUCCESS once all the segments it produced are available? Or is there an API we can query to make sure these segments are available?

Otherwise, the only solution I can think of is to put an arbitrary wait between these tasks.
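A slightly more robust alternative to a fixed sleep is to poll an availability check in a loop until it reports the data fully loaded, with a timeout as a safety net. Here is a minimal sketch in Python; the polling logic is generic, and the `fetch_load_percent` callable is a stand-in for whatever availability endpoint your Druid version exposes (not a specific Druid API):

```python
import time

def wait_for_load(fetch_load_percent, timeout_s=600, poll_s=5):
    """Poll until fetch_load_percent() reports 100% loaded, or time out.

    fetch_load_percent: callable returning a float in [0, 100], e.g. a
    wrapper around a Coordinator load-status check for the interval the
    previous task just wrote.
    Returns True when fully loaded, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fetch_load_percent() >= 100.0:
            return True
        time.sleep(poll_s)
    return False
```

The workflow driver would call this between the daily and monthly tasks, and only submit the monthly task when it returns True.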

We would appreciate any help or suggestions.

Regards, Amit

So, first question… why have the different datasources? Druid is very performant at rollup, so you really only need the lowest level of rollup and can run all your queries on that. Definitely a different mindset than traditional databases.

But to your actual question… you want to be alerted when the new segments are available for query? Is that correct?

Cheers,

Rachel

I think keeping different rollups in the same table may give us a problem. For example, we keep daily rolled-up data for 180 days; rolling the last 6 months of that daily data up by month at query time may not give us the desired performance. Some of our dimensions have comparatively high cardinality and can result in significant data volume even after rolling up on time.

Second, yes. From the task logs I can see that the task finishes after putting the segments in deep storage and handing them over to the Overlord to load those segments and make them available. Between those two points (after the hand-off to the Overlord and before the Overlord has loaded those segments), I am not able to find any API that gives me details about those segments.

Here are the logs (some messages redacted):

2020-07-31T17:38:18,990 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.FiniteFirehoseProcessor - Pushed segments[[DataSegment{binaryVersion=9, id=tmp.test_segment_2020-07-01T00:00:00.000Z_2020-07-02T00:00:00.000Z_2020-07-31T17:38:06.435Z, loadSpec={redacted}]
2020-07-31T17:38:18,994 INFO [publish-0] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Performing action for task[index_parallel_tmp.test_segment_2020-07-31T17:38:06.431Z]: SegmentTransactionalInsertAction{segmentsToBeOverwritten=null, segments=[DataSegment{binaryVersion=9, id=tmp.test_segment_2020-07-01T00:00:00.000Z_2020-07-02T00:00:00.000Z_2020-07-31T17:38:06.435Z, loadSpec={redacted}
2020-07-31T17:38:18,999 INFO [publish-0] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Submitting action for task[index_parallel_tmp.test_segment_2020-07-31T17:38:06.431Z] to overlord: [SegmentTransactionalInsertAction{segmentsToBeOverwritten=null, segments=[DataSegment{binaryVersion=9, id=tmp.test_segment_2020-07-01T00:00:00.000Z_2020-07-02T00:00:00.000Z_2020-07-31T17:38:06.435Z, loadSpec={redacted}].
2020-07-31T17:38:19,135 INFO [publish-0] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Published segments.
2020-07-31T17:38:19,135 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Shutting down...
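One way to observe segments sitting in the gap described above is Druid's `sys.segments` system table, which exposes per-segment `is_published` and `is_available` flags: a segment that is published but not yet available is exactly one that has been handed to the Overlord but not yet loaded by a Historical. A sketch in Python, assuming the standard Broker SQL endpoint `/druid/v2/sql` and the default Broker port (both assumptions about your deployment):

```python
import json
import urllib.request

DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql"  # assumption: default Broker port

def pending_segments_query(datasource):
    """SQL listing segments that are published but not yet queryable."""
    return (
        'SELECT segment_id, "start", "end" FROM sys.segments '
        "WHERE datasource = '%s' AND is_published = 1 AND is_available = 0"
        % datasource
    )

def count_pending(datasource, url=DRUID_SQL_URL):
    """Return how many segments of `datasource` are still waiting to load."""
    body = json.dumps({"query": pending_segments_query(datasource)}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return len(json.load(resp))
```

Polling until `count_pending("tmp.test_segment_daily")` reaches zero would tell you the daily segments have been loaded.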

After the index_parallel task is done, the Historicals start downloading the segments. I believe you can monitor the ZooKeeper and Coordinator/Overlord logs to see what message exchanges take place. As for APIs, I am not too sure, but I believe there was one for segment status.

I believe this is addressed by issue #5721. There is a new API for this in Druid 0.19: https://github.com/apache/druid/pull/9965
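As I understand it, the Druid 0.19 API from that PR is a per-datasource Coordinator `loadstatus` endpoint taking `forceMetadataRefresh` and an optional `interval`; treat the exact endpoint shape and parameters below as assumptions to verify against the docs for your version. A small Python sketch of calling it:

```python
import json
import urllib.parse
import urllib.request

COORDINATOR = "http://localhost:8081"  # assumption: default Coordinator port

def load_status_url(datasource, interval, coordinator=COORDINATOR):
    """Build the Coordinator loadstatus URL for one datasource and interval."""
    params = urllib.parse.urlencode({
        "forceMetadataRefresh": "true",  # consult the metadata store, not a cache
        "interval": interval,            # e.g. "2020-07-01/2020-07-02"
    })
    return "%s/druid/coordinator/v1/datasources/%s/loadstatus?%s" % (
        coordinator, urllib.parse.quote(datasource), params)

def load_status(datasource, interval, coordinator=COORDINATOR):
    """Return the Coordinator's load-percentage response for the interval."""
    url = load_status_url(datasource, interval, coordinator)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

The workflow would submit the monthly task only once this reports the daily interval as 100% loaded.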