We’re experimenting with native batch ingestion on our cluster for the first time, using a custom firehose that reads files saved to GCS by Secor, plus a custom InputRowParser.
There was a period of a week where the data source had no data. We ran batch ingestion (index_parallel) over one particular hour (5am-6am on December 16th) and it successfully ingested that hour — a segment showed up in the coordinator, it could be queried, etc. (Our segment granularity is HOUR.)
Then we ran it again on the entire 24 hours of December 16th. It ran 24 subtasks (our firehose divides up by hour) and ingested the full day, yay!
Except when we look in the coordinator, it now lists 2 segments with identical sizes for the 5am hour that we first tested with. Both of them are also listed with the same version, which is the one from the first batch ingestion, not the version that the other 23 segments have from the second batch ingestion.
We did not explicitly specify appendToExisting in our ioConfig, but I believe the default is false, and looking at the expanded task payload it is indeed false.
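For reference, here's a minimal sketch of what our ioConfig looks like (the firehose type name and its properties are placeholders standing in for our custom firehose, not the actual spec):

```json
{
  "type": "index_parallel",
  "firehose": {
    "type": "our-custom-gcs-firehose"
  },
  "appendToExisting": false
}
```

With appendToExisting false, my understanding is that each run should produce a new, higher version that overshadows the older segments for the intervals it covers.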
Are we doing something wrong if our goal is to replace existing segments?
The bad hour (I hope inline images are OK for this list):
A good hour: