Reload data based on a timestamp column different from the ingestion spec timestamp

We sync our data to cloud storage (S3/Azure Blob) based on the timestamp at which it was synced (sync_time), which is different from the event_time. However, we load data into Druid based on the event_time in our data so that queries can be on event_time rather than sync_time. When we need to correct our data, we typically want to reload an entire day of data (by sync_time) after cleaning up the corresponding data in Druid. I understand that segments are created based on the ingestion spec timestamp column (in this case event_time), so we cannot drop a segment in its entirety, as it may contain data from other sync_times. How do we handle a scenario like this, where sync_time and event_time are different, when reloading data into Druid? Any help will be much appreciated.

Regards,

-Anand

Hey Anand,

You could do this by ‘widening’ your inputs a bit relative to your output interval. For example, if you want to load data for 2019-04-10, you could specify files for both 2019-04-09 and 2019-04-10 in your ioConfig, but specify only “2019-04-10/P1D” as the interval in the granularitySpec. In that case, all files for both days will be read, but only rows whose event times fall in 2019-04-10 will be loaded (your Druid data for 2019-04-09 will be untouched). This works because the job interval acts as a sort of filter.
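
For illustration, a native batch ingestion spec along these lines could express that idea. The exact field names depend on your Druid version and ingestion method (older releases use a firehose where newer ones use an inputSource), and the bucket layout, datasource name, and column names below are only placeholders:

    {
      "type": "index_parallel",
      "spec": {
        "ioConfig": {
          "type": "index_parallel",
          "inputSource": {
            "type": "s3",
            "prefixes": [
              "s3://example-bucket/events/sync_date=2019-04-09/",
              "s3://example-bucket/events/sync_date=2019-04-10/"
            ]
          },
          "inputFormat": { "type": "json" },
          "appendToExisting": false
        },
        "dataSchema": {
          "dataSource": "events",
          "timestampSpec": { "column": "event_time", "format": "iso" },
          "dimensionsSpec": { "dimensions": ["sync_time", "user_id", "action"] },
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "day",
            "queryGranularity": "none",
            "intervals": ["2019-04-10/P1D"]
          }
        },
        "tuningConfig": { "type": "index_parallel" }
      }
    }

Rows read from either day's files whose event_time falls outside 2019-04-10 are simply dropped by the interval filter, so only that day's segments are rewritten.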

Gian

Hi Gian,

Thanks a lot for your suggestion. I still don't understand how to delete data from the existing segments, since data for a single sync_date can be spread across multiple segments based on event_time. Also, please let me know whether we can delete data based on a time column other than the segment timestamp.

Thanks,

-Anand

Hi Gian,

Do you have any suggestions on this question? It would be really helpful if you could suggest a way to delete data that spans multiple segments.

Thanks,

-Anand

Hey Anand,

> I still don't understand how to delete data from the existing segments, since data for a single sync_date can be spread across multiple segments based on event_time.

Based on your original question, the cleanest way to repair data is to reingest the original data in ‘overwrite’ mode (the default for batch ingestion). So if you use the ingestion approach I suggested, it will overwrite the existing data and the correction should be accomplished.
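
To make the ‘overwrite’ part explicit in terms of the sketch above (same assumed field names and paths): for native batch, it is controlled by the ioConfig's appendToExisting flag, which defaults to false.

    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": [
          "s3://example-bucket/events/sync_date=2019-04-09/",
          "s3://example-bucket/events/sync_date=2019-04-10/"
        ]
      },
      "inputFormat": { "type": "json" },
      "appendToExisting": false
    }

With explicit intervals in the granularitySpec, the task publishes a new set of segments for 2019-04-10 that atomically replaces the existing ones for that interval.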

> Also, please let me know whether we can delete data based on a time column other than the segment timestamp.

Not with the segment drop APIs; those only work based on the primary time column.
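
For reference, interval-based deletion (for example, a kill task) is expressed against the primary __time column, which is why it cannot target sync_time directly. A minimal kill task could look like the following; the datasource name is a placeholder, and note that a kill task permanently removes only those segments within the interval that have already been marked unused:

    {
      "type": "kill",
      "dataSource": "events",
      "interval": "2019-04-10/2019-04-11"
    }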