What are the proper values for windowPeriod and segmentGranularity?

Hi all,

For my business, users can query events by minute or by day.

I want to set segmentGranularity to hour and queryGranularity to minute.

The problem is that events may arrive at my system several days late (up to 15 days), and I don't know the proper value for windowPeriod.

This is from the official documentation:

The normal, expected use cases have the following overall constraints: queryGranularity < intermediatePersistPeriod <= windowPeriod < segmentGranularity
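For concreteness, here is a hypothetical realtime ingestion spec fragment that satisfies the constraint above for your case (the ISO 8601 period values PT5M and PT10M are illustrative assumptions, not recommendations):

```json
{
  "granularitySpec": {
    "segmentGranularity": "HOUR",
    "queryGranularity": "MINUTE"
  },
  "tuningConfig": {
    "intermediatePersistPeriod": "PT5M",
    "windowPeriod": "PT10M"
  }
}
```

Here MINUTE < PT5M <= PT10M < HOUR, so the constraint holds. Note that no windowPeriod shorter than segmentGranularity can cover data arriving 15 days late, which is exactly why the hybrid setup below is recommended.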

Generally for handling delayed data we recommend a hybrid realtime/batch setup. The way that would work is that your ‘on time’ (relative to windowPeriod) data is indexed in realtime, but you’re also saving a copy of all your data to S3/HDFS. Then you’d have a scheduled job that reindexes some sliding window of data, which will get all of the late data in that window loaded up.
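To illustrate what "on time relative to windowPeriod" means, here is a minimal sketch (not Druid's actual code) of the acceptance rule a realtime node applies: an event is indexed only if its timestamp falls within windowPeriod of the current time, and everything else is dropped:

```python
from datetime import datetime, timedelta, timezone

# Assumed windowPeriod of PT10M, matching a segmentGranularity of HOUR.
WINDOW_PERIOD = timedelta(minutes=10)

def is_on_time(event_ts: datetime, now: datetime) -> bool:
    """Return True if the event would be indexed in realtime,
    False if it arrives too late (or too early) and is dropped."""
    return abs(now - event_ts) <= WINDOW_PERIOD

now = datetime(2015, 7, 16, 12, 0, tzinfo=timezone.utc)
on_time = is_on_time(now - timedelta(minutes=5), now)   # within the window
late = is_on_time(now - timedelta(days=3), now)         # 3 days late: dropped
```

An event that is 15 days late will always fail this check for any reasonable windowPeriod, which is why the late copy kept on S3/HDFS and the scheduled reindex job are needed.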

Two pipelines that can handle the batch path are Kafka -> Camus -> HDFS -> Druid Hadoop indexer, or Kafka -> Secor -> S3 -> Druid Hadoop indexer.

Thanks, Gian.

However, I can't find resources about Camus or Secor. Could you share some links/web pages about them?

On Thursday, July 16, 2015 at 12:10:38 AM UTC+8, Gian Merlino wrote:

Oh, I found something. Haha.

On Friday, July 31, 2015 at 10:21:42 AM UTC+8, 何文斌 wrote:

Can I have a setup where the indexing service only needs to work with the delta?
For example, if there were Y events that arrived late and were dropped by realtime, they would get logged to some other location, and then a scheduled job would look at just those Y events and reindex them.

The pipeline would look something like

Kafka -> Druid realtime -> dropped from Druid -> stored at some location -> Druid Hadoop indexer only for the latecomers
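The routing step in that pipeline could be sketched like this (a hypothetical illustration, not a Druid feature: names like route_event and late_sink are made up for this example). On-time events go to the realtime indexer; late ones are logged to a side location that the batch/delta reindex job would later read:

```python
from datetime import datetime, timedelta, timezone

# Assumed windowPeriod; anything older than this is "late".
WINDOW_PERIOD = timedelta(minutes=10)

def route_event(event, now, realtime_sink, late_sink):
    """Append on-time events to realtime_sink; log late ones to
    late_sink (in practice, a file on S3/HDFS) for later reindexing."""
    if now - event["timestamp"] <= WINDOW_PERIOD:
        realtime_sink.append(event)
    else:
        late_sink.append(event)

realtime, late = [], []
now = datetime(2015, 7, 31, tzinfo=timezone.utc)
route_event({"timestamp": now - timedelta(minutes=1)}, now, realtime, late)
route_event({"timestamp": now - timedelta(days=2)}, now, realtime, late)
```

The reindex job then only has to process the contents of the late sink for the affected intervals, rather than re-reading the full raw copy of the data.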

With the current version of Druid, no, but an upcoming release will support delta ingestion (it’s currently in the works).