Questions about event counting and dimensions

Hi - while learning about Druid I put together a little example that runs Druid broker, coordinator, historical and overlord nodes in Docker containers, along with an example application that generates random events and sends them to Druid via Tranquility:

https://github.com/Banno/druid-docker

A random event is generated roughly every 100 msec, and batches of 10 events are sent at once via Tranquility. So there should be roughly 600 events per minute going to Druid. Index granularity is 1 minute and segment granularity is 1 hour.

If I include eventId (just a random UUID) as a dimension, then a timeseries query at minute granularity with a count aggregation [1] consistently returns counts of 575 or 576 for each minute interval. I would expect this to be 600. Any ideas why there would be 24-25 fewer events counted each minute than expected? It does take 1.5-2 msec to do the send to Tranquility, but I don't think that accounts for 25 missing events every minute.

However, if I don't include eventId as a dimension [2], then the same timeseries query [1] only ever returns between 15 and 21 events for each minute interval. This is nowhere near the expected 600 events per minute. Any ideas why this would be? Does Druid need some kind of unique ID dimension to be able to count individual events?
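
For reference, the query in [1] is just a plain timeseries count at minute granularity; the exact spec is in the linked file, but it looks roughly like this:

    {
        "queryType": "timeseries",
        "dataSource": "random",
        "granularity": "minute",
        "intervals": [ "2015-03-01/2015-04-01" ],
        "aggregations": [
            { "type": "count", "name": "count" }
        ]
    }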

Thanks,

Zach

[1] https://github.com/Banno/druid-docker/blob/master/query/random-counts.json

[2] https://github.com/Banno/druid-docker/blob/master/random-tranquility/src/main/scala/Main.scala#L44

This got posted as a direct response instead of a forum response, so let me try this again :slight_smile:

Hi Zach, excellent questions.

Your segments should contain a count metric defined at ingestion time, something like this:

    {
        "name" : "count",
        "type" : "count"
    }

Which you do in https://github.com/Banno/druid-docker/blob/master/twitter-tranquility/src/main/scala/config.scala#L54

Then you can do a longSum on that metric in the query like this:

    "aggregations": [
        { "type": "count",   "name": "row_count" },
        { "type": "longSum", "name": "total_count", "fieldName": "count" }
    ]

That will hopefully clear up what's happening in the second case (no event-id). I suspect that Druid is rolling up the rows based on the query granularity, so a single stored row can represent many ingested events.
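
To make that concrete: rollup is controlled by the granularitySpec in the ingestion schema. In native Druid spec form it would look roughly like the sketch below; the values simply mirror the hour/minute granularities described above (with Tranquility the equivalent settings are supplied in code rather than as raw JSON):

    "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "hour",
        "queryGranularity" : "minute"
    }

With queryGranularity at minute, all events that land in the same minute and share identical dimension values are collapsed into a single stored row, so a unique eventId dimension defeats rollup entirely, while dropping it lets many events collapse into just a few rows.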

I use the 1GB TPCH dataset for most of my internal testing:

    $ wc -l /my/directory/structure/lineitem.tbl
    6001215 /my/directory/structure/lineitem.tbl

and this query:

    {
        "queryType" : "timeseries",
        "dataSource" : "tpch_year",
        "granularity" : "all",
        "intervals" : [ "1970-01-01T00:00:00.000/2019-01-03T00:00:00.000" ],
        "aggregations" : [
            { "type": "count",   "name": "row_count" },
            { "type": "longSum", "name": "total_count", "fieldName": "count" }
        ]
    }

yields the correct result:

    [
        {
            "result": {
                "row_count": 6001215,
                "total_count": 6001215
            },
            "timestamp": "1992-01-02T00:00:00.000Z"
        }
    ]

And this data is set up such that there are no rollups/optimizations available. You should be able to forcefully disable the optimizations with a query granularity of ALL.

As for the ingestion case where event_id is present, I'll have to defer to someone with more experience on the ingestion side.

Cheers,

Charles Allen

Thanks for the advice, Charles! Now, when I don't include eventId as a dimension and use a query like this, which includes a longSum aggregation on the count field:

    {
        "queryType": "timeseries",
        "dataSource": "random",
        "intervals": [ "2015-03-01/2015-04-01" ],
        "granularity": "minute",
        "aggregations": [
            { "type": "count",   "name": "row_count" },
            { "type": "longSum", "name": "total_count", "fieldName": "count" }
        ]
    }

I typically get results like this:

    {
        "timestamp" : "2015-03-12T21:23:00.000Z",
        "result" : {
            "row_count" : 18,
            "total_count" : 576
        }
    }

The total_count result is what I would expect.
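
If I understand rollup correctly, those 18 rows are rolled-up aggregates: all events in a given minute that share the same dimension values get collapsed into one stored row, and the count metric records how many raw events each row absorbed. So the rows for that minute would look something like this (the dimension name and values here are made up, just to illustrate):

    { "timestamp": "2015-03-12T21:23:00.000Z", "someDim": "a", "count": 33 }
    { "timestamp": "2015-03-12T21:23:00.000Z", "someDim": "b", "count": 31 }
    { "timestamp": "2015-03-12T21:23:00.000Z", "someDim": "c", "count": 29 }

with 18 such rows whose count values sum to 576, which is why the plain count aggregator returns 18 while the longSum over count returns 576.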

Is there any documentation you could point me to so I can learn more about what exactly rows in segments are? That was the source of my confusion. I'll make sure to longSum the count field in queries from now on.

If anyone else has thoughts on where those other 24-25 events per minute are going (which amounts to about 4% of the data missing), I would really appreciate it.

Thanks,

Zach

I'm guessing that the queries showing only around 575 events per minute is due to a timing issue: the example code is probably just not quite generating 600 events per minute. For example, if each iteration of the generator loop actually takes about 104 msec instead of exactly 100 msec, that works out to roughly 60000 / 104 ≈ 577 events per minute, right in line with what the queries showed. I did some tests on a prototype of our real system, where we generate events more realistically, and was able to see both 600 events per minute and 1200 events per minute almost exactly in Druid queries. So there's probably nothing wrong there to look into further.