Historical DAY granularity and realtime MINUTE granularity

Hi all,

Is it possible to have 1 datasource with different granularities for historical and real-time? What will happen if we do that?

Thanks, Vadim.

Yes, you can have that. Datasource granularities are only required to be consistent within a single segment.

The only thing you’ll want to make sure of is that you don’t run batch jobs prematurely while realtime tasks are still going (the indexing service has some guards against this). For example, don’t try to batch index 2016-01-05Z while you’re still running realtime tasks for 2016-01-05T23:00:00Z. Also be aware that late data may still be coming in, so simply waiting for UTC rollover is not enough; you’ll have to either manually add a safety buffer to your timing or rely on the indexing service’s locking to ensure that the realtime tasks are done before the batch job fires off for the day.
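As a rough sketch of the manual safety-buffer option (the function name and the 3-hour buffer below are purely illustrative, not a Druid API):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helper: only fire the daily batch job once the day has
# ended AND a safety buffer for late-arriving data has elapsed.
def safe_to_batch_index(day_start: datetime, buffer: timedelta,
                        now: datetime) -> bool:
    day_end = day_start + timedelta(days=1)
    return now >= day_end + buffer

day = datetime(2016, 1, 5, tzinfo=timezone.utc)
now = datetime(2016, 1, 6, 2, 30, tzinfo=timezone.utc)
# 02:30 UTC on the 6th is past rollover but inside the 3h buffer:
print(safe_to_batch_index(day, timedelta(hours=3), now))  # False
```

The locking-based approach is generally safer since it doesn’t depend on guessing how late your data can arrive.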

What will happen if we have a segment for a day and afterwards try to ingest a segment for an hour of the same day?

In general that’s a bad idea, but here’s how it will “work”:

The query planner will look at the available segments and the date ranges they cover, and will split up the query so that only the most up-to-date (lexicographically last version) segment is used for each interval. For example, if you have a DAY segment covering all of 2016-01-05Z and an HOUR segment covering 2016-01-05T23:00:00Z/PT1H, then the day segment will get a query for the interval T00Z (inclusive) through T23Z (exclusive), and the hour segment will get T23Z (inclusive) through T24Z (i.e. the next day, exclusive).
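A toy sketch of that “newest version wins per interval” behavior (this is an illustration, not Druid’s actual planner code; the version strings are made up):

```python
from datetime import datetime, timedelta

# Two overlapping segments for 2016-01-05: a DAY segment and a newer
# HOUR segment for T23. Versions compare lexicographically.
segments = [
    {"start": datetime(2016, 1, 5, 0),  "end": datetime(2016, 1, 6, 0), "version": "2016-01-05T00"},
    {"start": datetime(2016, 1, 5, 23), "end": datetime(2016, 1, 6, 0), "version": "2016-01-06T01"},
]

def owner_of(hour_start):
    # Among segments covering this hour, the highest version wins.
    covering = [s for s in segments if s["start"] <= hour_start < s["end"]]
    return max(covering, key=lambda s: s["version"])

day_hours = [datetime(2016, 1, 5, 0) + timedelta(hours=h) for h in range(24)]
owners = [owner_of(h)["version"] for h in day_hours]
print(owners[0])   # T00 is served by the day segment
print(owners[23])  # T23 is served by the newer hour segment
```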

Here’s where it gets tricky.

If the QUERY granularity is set to HOUR for both segments, then the hour segment will OVERWRITE/REPLACE the corresponding hour of the day segment at query time.

If the QUERY granularity for the day segment is set to DAY, and the QUERY granularity of the hour segment is set to HOUR, then the hour segment effectively APPENDS to the day segment (assuming you query for a full day’s worth of data).

The reason for this is that the QUERY granularity means “treat all events in this segment as if they occurred at the start of the granularity bucket”. So for the day segment with DAY query granularity, all events effectively occur at T00:00:00.000Z of that day, and for the hour segment with HOUR query granularity, all events are treated as if they occurred at TXX:00:00.000Z of their hour.
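A minimal sketch of that timestamp truncation (a simplified model of query granularity, not Druid internals): with DAY granularity every event in the day segment lands on the T00 bucket, so an HOUR-granularity T23 segment produces a different bucket and appends; if both segments used HOUR granularity, both would produce a T23 bucket and the newer segment’s rows would replace the older ones.

```python
from datetime import datetime

def truncate(ts: datetime, gran: str) -> datetime:
    # Map an event timestamp to the start of its granularity bucket.
    if gran == "DAY":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if gran == "HOUR":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(gran)

event = datetime(2016, 1, 5, 23, 47, 12)
print(truncate(event, "DAY"))   # 2016-01-05 00:00:00
print(truncate(event, "HOUR"))  # 2016-01-05 23:00:00
```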

If you are looking to append data, then you’ll want to follow https://groups.google.com/d/msg/druid-development/kHgHTgqKFlQ/fXvtsNxWzlMJ which discusses better ways to handle appending data.

Hopefully that helps.

How about if we:

batch index hourly, i.e. submit a Hadoop indexing task for every hour, but want DAY segment and query granularity?

I tried to do that: http://pastebin.com/raw/TayiGhRG

The indexing task takes ~4x longer, and I suspect that it:

  1. checks whether a DAY segment already exists in deep storage:

     if so, it downloads it and merges the HOUR segment currently being indexed into it;

     if not, it just uploads the HOUR segment to deep storage

  2. uploads the merged result back to deep storage as a DAY segment
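The suspected flow above can be sketched like this (an illustration of the guess, not verified against Druid’s code; all names are made up):

```python
# Toy model: deep storage is a dict keyed by day, segments are lists of rows.
def index_hour(hour_segment, deep_storage, day_key):
    existing = deep_storage.get(day_key)        # 1. check for an existing DAY segment
    if existing is not None:
        merged = existing + hour_segment        #    download and merge in the new HOUR data
    else:
        merged = hour_segment                   #    first hour of the day: nothing to merge
    deep_storage[day_key] = merged              # 2. upload the result back as a DAY segment

store = {}
index_hour(["h0"], store, "2016-01-05")
index_hour(["h1"], store, "2016-01-05")
print(store["2016-01-05"])  # ['h0', 'h1']
```

If this is what happens, the repeated download/merge/upload of an ever-growing day segment would explain the slowdown.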

Is this the reason why the indexing takes ~4x longer?