Update Datasource - ('old Timewindow' in the past)

Hi all, I'm trying my first 'update' test. (This is on the latest 0.14.0 release.)

The data has been streamed in via the Kafka indexing task, and that task is still running, processing today's data. (There is no active data streaming at the moment, but that is a detail; in a real deployment there would be.)

Now I want to 'update' some data in the past. (I have created a data file with rows covering a 15-minute window.)

My spec file is shown below…
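(The spec didn't come through above, so here is a minimal sketch of what a native batch 'index' task overwriting one day typically looks like on 0.14.0. The file path, filename, and dimension list are assumptions; with appendToExisting set to false, the task replaces all existing segments for the listed interval at the given segmentGranularity, so a file covering only 15 minutes would wipe the rest of that day.)

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "test_1day",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": [] }
        }
      },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2019-05-01/2019-05-02"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/path/to/updates",
        "filter": "updates.json"
      },
      "appendToExisting": false
    },
    "tuningConfig": { "type": "index" }
  }
}
```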

When I run the indexing task:

./bin/post-index-task --file updates-overwrite-index.json
Beginning indexing data for test_1day
Task started: index_test_1day_2019-05-24T10:12:59.856Z
Task log: http://localhost:8090/druid/indexer/v1/task/index_test_1day_2019-05-24T10%3A12%3A59.856Z/log
Task status: http://localhost:8090/druid/indexer/v1/task/index_test_1day_2019-05-24T10%3A12%3A59.856Z/status
Task index_test_1day_2019-05-24T10:12:59.856Z still running…
Task index_test_1day_2019-05-24T10:12:59.856Z still running…
Task finished with status: SUCCESS
Completed indexing data for test_1day. Now loading indexed data onto the cluster…
test_1day loading complete! You may now query your data

This didn't appear to load my 15-minute file, as the new/updated data is not there.

The data I was trying to load is for 2019-05-01, and the data file contains rows only for that 15-minute time window.

I'm guessing (and somewhat fearing) that to update a datasource that is being streamed in, I will need to:

  1. First ‘kill’ the streaming task

  2. Run the ‘update’ task with the file containing updates.

  3. Resume the original streaming task?
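If pausing streaming did turn out to be necessary, the steps above could be sketched with the Overlord's supervisor API (suspend/resume endpoints are available as of 0.13.0; the supervisor id test_1day is an assumption and should be whatever /druid/indexer/v1/supervisor lists):

```shell
# 1. Suspend the Kafka supervisor (its indexing tasks stop gracefully)
curl -X POST http://localhost:8090/druid/indexer/v1/supervisor/test_1day/suspend

# 2. Run the batch update task
./bin/post-index-task --file updates-overwrite-index.json

# 3. Resume the supervisor so streaming ingestion continues
curl -X POST http://localhost:8090/druid/indexer/v1/supervisor/test_1day/resume
```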

Will this approach work? Is there a better approach?

Hi Daniel,

The only time you should have to do the steps you mentioned (stopping streaming ingestion before running a batch job on the same datasource) is if you get a lot of late data and the stream task and batch task are trying to modify the same time interval. This is because tasks have to acquire a lock on the interval they are generating segments for and there would be frequent lock contention.

But this does not seem to be the case, at least for your test where you have no new data coming in, so you should be able to run your update job while the streaming task is still running.

Could you provide some details on what you're trying to accomplish with the update job? Does the update file contain the same rows of data that were ingested from the stream for that time window, perhaps augmented with additional or modified fields?

Also, if you could provide the task log from index_test_1day_2019-05-24T10:12:59.856Z, it may provide some hints as to why the data isn’t being loaded.
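For reference, the full log can be pulled from the Overlord with the task-log URL already printed by post-index-task (the output filename here is arbitrary):

```shell
# Fetch the task log and scan it for failures
curl -o task.log "http://localhost:8090/druid/indexer/v1/task/index_test_1day_2019-05-24T10%3A12%3A59.856Z/log"
grep -iE "error|exception" task.log
```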



This has been resolved.

My update file had bad JSON in it. (I'm still not sure why that left me unable to get any task info afterward for a really long time.)