I am using the Python API to run ingestion tasks (in the indexer).
It runs great, and I check the status of the task before continuing with further processing. However, I found that the ingested data are not immediately available after the ingestion task completes successfully.
I was/am looking for a way to check that the data (segments) are available, so I can start querying them. For now I know it takes approximately 60 seconds for the segments to become available, so I added a 300-second pause before continuing. But of course I would like to start as soon as possible, and also avoid the risk that the data (which are growing in volume) will one day take longer than 300 seconds to load, which would break my post-processing.
Any idea how to make sure, using some API, that the data are available after an ingestion task?
Thanks in advance!
You can query the loadstatus endpoints on the coordinator to
check whether all of the data is loaded:
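For example, a minimal sketch of checking that endpoint (the coordinator host/port and the datasource name are placeholders you would replace with your own; `GET /druid/coordinator/v1/loadstatus` returns a JSON object mapping each datasource to the percentage of its segments that are loaded):

```python
import json
import urllib.request

COORDINATOR = "http://localhost:8081"  # placeholder: your coordinator host:port


def load_status(coordinator=COORDINATOR):
    """Fetch the percentage of segments loaded per datasource.

    GET /druid/coordinator/v1/loadstatus returns JSON like
    {"my_datasource": 100.0, ...}.
    """
    url = coordinator + "/druid/coordinator/v1/loadstatus"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())


def fully_loaded(status, datasource):
    """True once the coordinator reports 100% of segments loaded."""
    return status.get(datasource) == 100.0
```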
Fwiw, the 60 seconds you are experiencing is a function of the
loop delay on the coordinator, not primarily a function of the size
of the data. Even as the data set grows, the coordinator will
assign the segments, and the time to load them depends on the
number of historicals you have. I.e., more historicals spread the
load across more machines and may well keep loading within your
sleep window. Still, the loadstatus endpoints are your best bet.
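Polling that endpoint with a deadline, instead of a fixed sleep, might look like this (a sketch; `fetch_status` stands in for whatever function you use to GET the coordinator's loadstatus JSON, so the helper itself makes no network calls):

```python
import time


def wait_for_load(fetch_status, datasource, timeout=300.0, interval=5.0):
    """Poll until the datasource reports 100% of segments loaded.

    `fetch_status` is any callable returning the coordinator's
    loadstatus JSON as a dict (datasource -> percent loaded).
    Returns the elapsed seconds on success; raises TimeoutError
    if the deadline passes first.
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        if fetch_status().get(datasource) == 100.0:
            return time.monotonic() - start
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"{datasource!r} not fully loaded after {timeout}s")
        time.sleep(interval)
```

This way post-processing starts as soon as the segments are queryable, and a genuine failure surfaces as a TimeoutError rather than silently broken downstream results.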
Thanks for the quick and excellent answer! Superb … that is exactly what I was looking for.
Great, and thanks again!!!
I created a story in the Druid backlog for this a while back, at the request of a Druid developer: https://github.com/apache/druid/issues/5721. I had hoped I might get a chance to work on it, but I haven't. We poll the loadstatus endpoint, and it works OK for us.