Hi Druid devs/users!
We want to use Druid as a fast data store for querying millions to billions of data points. We have thousands of files in S3, amounting to hundreds of gigabytes, generated every day. However, event data for previous days sometimes arrives late, and we want to find a way to handle these adjustments/late arrivals. So we have the following questions:
- What are the usual ways to handle adjustments? We don’t want to reingest a whole day’s worth of event data every time late data comes in; that would be a waste of computing resources.
- An idea we are considering:
  - Store the late data in a separate Druid table called late_events, populated from the files that arrived late.
  - Each row will then have two timestamps:
    - the actual event time (late)
    - the time it was ingested into Druid
  - Segments will be created using the ingestion timestamp (no. 2).
  - The actual event time will exist as a dimension, allowing us to query day-level adjustments.
- However, the JSON records in the files do not contain the timestamp at which the file was created.
- So the question is: is there a way to specify a fixed value for a dimension in the ingestion spec? E.g., specifying a value of “2017-06-05 01:00:00” for all data ingested in one batch. Or is there a way to use the S3 object’s properties (namely the last-modified datetime)?
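For the fixed-value dimension, one possible approach (a sketch, assuming a Druid version that supports transformSpec, i.e. 0.12 or later; the field name druid_ingest_time is our own) is an expression transform that emits a constant string per ingestion batch:

```json
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "druid_ingest_time",
      "expression": "'2017-06-05T01:00:00Z'"
    }
  ]
}
```

The new field would also need to be listed in the dimensionsSpec so it is stored as a queryable dimension. As far as we know there is no built-in way to read S3 object metadata (such as last-modified) during ingestion, so that value would have to be looked up beforehand and substituted into the spec, or stamped into each record by a small preprocessing step before ingestion.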