We want Druid to contain only one segment for each granularity period and datasource. Is there any way to do that? We could set intermediatePersistPeriod to a very large period. Is that a good hack?
Could you explain your use case in a bit more detail? I’m not quite sure what you’re trying to achieve or avoid. Druid will only create one segment per granularity period unless you tell it to generate multiple partitions. The intermediatePersistPeriod controls how often data is spilled from memory to disk, which avoids running out of memory and preserves data in case of a crash. All of the intermediate chunks get merged together into a single segment before being handed off to deep storage.
We want to query the count of unique events (rows in the segment) using a count aggregator at query time. It works perfectly on historical nodes but gives us higher values on realtime nodes.
And does it give you a correct value if you set your intermediatePersistPeriod very high? I’m not clear how the two issues are related.
What do you mean by getting the “count of unique events(rows in segment)”? Most people run into issues when they use a count aggregator at query time trying to get the total number of events ingested but instead get the number of rolled up rows, and they should be using a longSum aggregator instead (see: http://druid.io/docs/latest/design/index.html).
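To make the count vs. longSum distinction concrete, here is a rough sketch of a timeseries query that asks for both numbers side by side. The datasource name "events", the interval, and the output names are made up for illustration, and it assumes a count metric named "count" was defined at ingestion time; adjust to your own schema.

```python
import json

# Hypothetical names: the "events" datasource and the ingest-time metric
# "count" are assumptions, not taken from the poster's setup.
query = {
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "hour",
    "intervals": ["2016-01-01/2016-01-02"],
    "aggregations": [
        # Query-time count: number of stored (rolled-up) rows scanned.
        {"type": "count", "name": "rolled_up_rows"},
        # longSum over the ingest-time count metric: number of raw events.
        {"type": "longSum", "name": "ingested_events", "fieldName": "count"},
    ],
}

# JSON body that would be POSTed to the broker.
print(json.dumps(query, indent=2))
```

If the two results differ, rollup is collapsing events, and rolled_up_rows is the figure that depends on how the data happens to be stored.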
It gives us the right figures after the segment is moved to a historical node. And yes, we need the number of rolled-up rows (unique combinations of dimension values).
You could set maxRowsInMemory and intermediatePersistPeriod both very high and then you’d get what you want. But, this is kind of a nasty hack, because it prevents Druid from persisting data out of heap onto disk. So you run the risk of OOMEs and GC problems.
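As a toy illustration of why the realtime number can come out higher, assume each intermediate persist is rolled up on its own and a query-time count adds up the rows of every chunk it scans. This is a simplified model in plain Python, not Druid code, and the dimension values are invented.

```python
from collections import Counter

def roll_up(events):
    """Collapse events with identical dimension values into one row each."""
    return Counter(events)

def count_rows(chunks):
    """Row count seen when every chunk is rolled up and scanned separately."""
    return sum(len(roll_up(chunk)) for chunk in chunks)

# Five events with only two distinct (country, device) combinations.
events = [("US", "web"), ("US", "web"), ("DE", "app"), ("US", "web"), ("DE", "app")]

# Realtime: pretend an intermediate persist happens after every 2 events.
chunks = [events[i:i + 2] for i in range(0, len(events), 2)]
realtime_count = count_rows(chunks)

# Historical: all chunks have been merged into a single segment.
historical_count = count_rows([events])

print(realtime_count, historical_count)
```

Under this model the same dimension combination is counted once per chunk it appears in, so the realtime count is an upper bound that drops to the true value once the chunks are merged.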
One other approach is a “byRow” cardinality query, which gives an approximate answer to your question and works even if you let the realtime system persist to disk more often. See “cardinality by row” here: http://druid.io/docs/0.8.3/querying/aggregations.html
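A minimal sketch of such a query follows, using the cardinality aggregator shape from the 0.8.3 docs ("fieldNames" plus "byRow"). The datasource and the dimension names "country" and "device" are placeholders; to estimate distinct dimension combinations you would presumably list every dimension of your datasource there.

```python
import json

# Hypothetical datasource and dimension names, for illustration only.
dimensions = ["country", "device"]

query = {
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "hour",
    "intervals": ["2016-01-01/2016-01-02"],
    "aggregations": [
        {
            "type": "cardinality",
            "name": "distinct_dimension_combinations",
            "fieldNames": dimensions,
            # byRow=true estimates distinct combinations of the listed
            # dimensions rather than distinct individual values.
            "byRow": True,
        }
    ],
}

# JSON body that would be POSTed to the broker.
print(json.dumps(query, indent=2))
```

The result is an estimate rather than an exact count, so expect small deviations from the figure you see on historical nodes.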