Daily batch Hadoop indexing job when events can be present across day boundaries

I want to run a daily scheduled batch indexing job to index all the data that arrived on the previous day, but the problem is that the data that arrived yesterday will contain a few events belonging to the previous 2-3 days.
Hence I am not sure what the value of the “intervals” field of the “granularitySpec” should be.

Should it be yesterday’s date, or a larger interval covering the entire time period for which events could be present in the data?

What would happen to events which fall outside the interval? Will they be dropped?

If I am using a segment granularity of “DAY”, will even a single event belonging to an older day create a new segment for that day and override the existing data for that older day?



Hey Rohit,

The “intervals” in the job spec act like a filter: any data outside those intervals will be dropped. So, a good way to index all the data for a day (even if some of it might be in files marked for other days) is to include those extra files in your inputSpec, but set “intervals” to only the specific day you want to index.
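
To illustrate, here is a minimal sketch of what that could look like in a Hadoop ingestion spec. The dataSource name, file paths, and dates are made up for the example; the idea is that the inputSpec covers files from several days, while “intervals” restricts the indexed output to a single day:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-06-01/2016-06-02"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "/data/2016-06-01/*,/data/2016-06-02/*,/data/2016-06-03/*"
      }
    }
  }
}
```

With this spec, late-arriving events for 2016-06-01 found in the 06-02 and 06-03 files are indexed into the 06-01 segment, and any events in those files with timestamps outside the interval are dropped rather than creating or overwriting segments for other days.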