Druid Ingestion and handling adjustments

Hi Druid devs/users!

We want to use Druid as a fast data store to query millions to billions of data points. We have thousands of files in S3, amounting to hundreds to thousands of gigabytes, generated every day. However, there are cases where event data for previous days arrives late, and we want to find a way to handle these adjustments/late arrivals. So we have the following questions:

  1. What are the usual ways to handle adjustments? We don’t want to reingest a whole day’s worth of event data every time late data comes in; that would be a waste of computing resources.
  2. An idea we are considering (a rough spec sketch follows this list):
  • Store the late data in another Druid table called late_events, built from the files that came in late.
  • Each row will then have two timestamps:
  1. the actual event time (which is late)
  2. the time it was ingested into Druid
  • Segments will be partitioned on timestamp no. 2, the time of ingestion into Druid.
  • The actual event time will exist as a regular field, so that we can query day-level adjustments.
  • However, the JSON records in the files do not contain the timestamp at which the file was created.
  • The question is: is there a way to specify a fixed value for a dimension in the ingestion spec? E.g. a value of “2017-06-05 01:00:00” for all ingested rows. Or is there a way to use the S3 object properties (namely the last-modified datetime)?
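Roughly, the dataSchema we have in mind would look something like this. A minimal, abbreviated sketch: the ingest_time, event_time, and account_id column names are placeholders, and it assumes the ingestion timestamp can somehow be stamped into each record, which is exactly the open question above:

```json
{
  "dataSchema": {
    "dataSource": "late_events",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "ingest_time",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": ["event_time", "account_id"]
        }
      }
    },
    "metricsSpec": [],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE"
    }
  }
}
```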

Thanks! :smiley:

Cheers,

YC

In Druid 0.10.0 you can set `"appendToExisting": true` in your index tasks to avoid re-reading the entire day of data when you just want to insert a batch of late-arriving events. However, if you do this too often, you can get fragmentation that hurts your query performance, and you might want to reindex the whole day anyway to get rid of that fragmentation. But that reindexing can be done by reading from Druid and writing back to Druid, so you don’t have to hit the original raw data.
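For concreteness, the relevant part is the ioConfig of the index task. A minimal sketch, assuming a local firehose with placeholder paths (swap in whatever firehose reads your files):

```json
"ioConfig": {
  "type": "index",
  "firehose": {
    "type": "local",
    "baseDir": "/data/late-arrivals",
    "filter": "*.json"
  },
  "appendToExisting": true
}
```

The periodic defragmenting reindex can read from Druid itself via the ingestSegment firehose, e.g. (dataSource and interval are placeholders):

```json
"ioConfig": {
  "type": "index",
  "firehose": {
    "type": "ingestSegment",
    "dataSource": "events",
    "interval": "2017-06-04/2017-06-05"
  },
  "appendToExisting": false
}
```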

Hi Gian,

Firstly, thank you for the response. I forgot to say that the reason we don’t want to reindex the whole day’s data is that it will be used for billing, so it is important that we keep as much information as possible to see day-level adjustments.

Not sure if you saw the second question: is there any way to get Druid to ingest a fixed value for a dimension that doesn’t exist in the data?

Thanks again!

YC

I would also like to understand whether there is a way to set static fields during batch ingestion.
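For what it’s worth, later Druid releases (0.12.0 and up) added a transformSpec to the dataSchema, and an expression transform whose expression is a string literal effectively gives you a static field. A minimal sketch, with batch_time as a hypothetical dimension name (it would also need to be listed in the dimensionsSpec):

```json
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "batch_time",
      "expression": "'2017-06-05T01:00:00Z'"
    }
  ]
}
```

As far as I know, no built-in firehose exposes S3 object metadata such as the last-modified time as an ingestible column, so on 0.10.x the practical route is a preprocessing step that stamps the desired value into each record before ingestion.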