Help with ingesting JSON and mixing single- and multi-value dimensions into Druid via Kafka

  1. Imagine I ingest 4 rows into Kafka that are read by Druid's Kafka firehose. If I aggregate "count" with "all" granularity, I get 4 as a result:

```json
{"ts":"2015-01-01","count":1,"dim":"dimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal2"}
{"ts":"2015-01-01","count":1,"dim":"dimVal3"}
{"ts":"2015-01-01","count":1,"dim":"dimVal4"}
```

  2. Is this equivalent to ingesting the following single row, and would I still get 4 as a result? This would already save me quite a few messages.

```json
{"ts":"2015-01-01","count":1,"dim":["dimVal1","dimVal2","dimVal3","dimVal4"]}
```

  3. If 2) works, can I also mix multi-value and single-value rows and still get 4 as a result?

```json
{"ts":"2015-01-01","count":1,"dim":"dimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal2"}
{"ts":"2015-01-01","count":1,"dim":["dimVal3","dimVal4"]}
```

  4. What about this? Is count = 4 + (4*2) = 12?

```json
{"ts":"2015-01-01","count":1,"dim":["dimVal1","dimVal2","dimVal3","dimVal4"]}
{"ts":"2015-01-01","count":1,"dim":["dimVal1","dimVal2","dimVal3","dimVal4"],"otherDim":["otherDimVal1","otherDimVal2"]}
```

because it actually is equivalent to:

```json
{"ts":"2015-01-01","count":1,"dim":"dimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal2"}
{"ts":"2015-01-01","count":1,"dim":"dimVal3"}
{"ts":"2015-01-01","count":1,"dim":"dimVal4"}
{"ts":"2015-01-01","count":1,"dim":"dimVal1","otherDim":"otherDimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal1","otherDim":"otherDimVal2"}
{"ts":"2015-01-01","count":1,"dim":"dimVal2","otherDim":"otherDimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal2","otherDim":"otherDimVal2"}
{"ts":"2015-01-01","count":1,"dim":"dimVal3","otherDim":"otherDimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal3","otherDim":"otherDimVal2"}
{"ts":"2015-01-01","count":1,"dim":"dimVal4","otherDim":"otherDimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal4","otherDim":"otherDimVal2"}
```
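The arithmetic behind the 4 + (4*2) = 12 guess can be sketched as a cross-product expansion. Note this is a toy Python sketch of the expansion the question *assumes*, not of Druid's actual behavior:

```python
from itertools import product

# The two multi-value rows from the question above.
row1 = {"ts": "2015-01-01", "count": 1,
        "dim": ["dimVal1", "dimVal2", "dimVal3", "dimVal4"]}
row2 = {"ts": "2015-01-01", "count": 1,
        "dim": ["dimVal1", "dimVal2", "dimVal3", "dimVal4"],
        "otherDim": ["otherDimVal1", "otherDimVal2"]}

# Expanding row1 gives one event per dim value: 4 events.
expanded1 = [{"ts": row1["ts"], "count": 1, "dim": d} for d in row1["dim"]]

# Expanding row2 gives the cross product of dim and otherDim: 4 * 2 = 8 events.
expanded2 = [{"ts": row2["ts"], "count": 1, "dim": d, "otherDim": o}
             for d, o in product(row2["dim"], row2["otherDim"])]

print(len(expanded1) + len(expanded2))  # 12
```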

(It's all about saving messages in Kafka: right now I inflate events into single-value rows, which blows up the system and feels like a bad path to go down.)

Thank you. :)

Hi Olaf, please see inline.

  1. Imagine I ingest 4 rows into Kafka that are read by Druid's Kafka firehose. If I aggregate "count" with "all" granularity, I get 4 as a result:

```json
{"ts":"2015-01-01","count":1,"dim":"dimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal2"}
{"ts":"2015-01-01","count":1,"dim":"dimVal3"}
{"ts":"2015-01-01","count":1,"dim":"dimVal4"}
```

  2. Is this equivalent to ingesting the following single row, and would I still get 4 as a result? This would already save me quite a few messages.

```json
{"ts":"2015-01-01","count":1,"dim":["dimVal1","dimVal2","dimVal3","dimVal4"]}
```

A query that uses Druid's count aggregator returns the number of Druid rows for your time interval. It does not count the number of unique values. There are other ways to obtain the number of unique values in a dimension (e.g. you can return an approximate count using the hyperUnique aggregator or the cardinality aggregator).
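The difference can be illustrated with a toy Python simulation (plain Python mirroring the semantics described above, not actual Druid behavior): the row count depends on how many rows were ingested, while the distinct-value count is the same either way:

```python
# Toy model: the same 4 dimension values ingested as 4 single-value
# rows vs. 1 multi-value row.
single = [{"ts": "2015-01-01", "count": 1, "dim": "dimVal%d" % i}
          for i in range(1, 5)]
multi = [{"ts": "2015-01-01", "count": 1,
          "dim": ["dimVal1", "dimVal2", "dimVal3", "dimVal4"]}]

def row_count(rows):
    # What a "count" aggregator returns: the number of rows.
    return len(rows)

def distinct_dim(rows):
    # What hyperUnique/cardinality approximate: distinct dimension values.
    values = set()
    for r in rows:
        v = r["dim"]
        values.update(v if isinstance(v, list) else [v])
    return len(values)

print(row_count(single), row_count(multi))        # 4 1
print(distinct_dim(single), distinct_dim(multi))  # 4 4
```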

  3. If 2) works, can I also mix multi-value and single-value rows and still get 4 as a result?

```json
{"ts":"2015-01-01","count":1,"dim":"dimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal2"}
{"ts":"2015-01-01","count":1,"dim":["dimVal3","dimVal4"]}
```

  4. What about this? Is count = 4 + (4*2) = 12?

```json
{"ts":"2015-01-01","count":1,"dim":["dimVal1","dimVal2","dimVal3","dimVal4"]}
{"ts":"2015-01-01","count":1,"dim":["dimVal1","dimVal2","dimVal3","dimVal4"],"otherDim":["otherDimVal1","otherDimVal2"]}
```

because it actually is equivalent to:

```json
{"ts":"2015-01-01","count":1,"dim":"dimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal2"}
{"ts":"2015-01-01","count":1,"dim":"dimVal3"}
{"ts":"2015-01-01","count":1,"dim":"dimVal4"}
{"ts":"2015-01-01","count":1,"dim":"dimVal1","otherDim":"otherDimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal1","otherDim":"otherDimVal2"}
{"ts":"2015-01-01","count":1,"dim":"dimVal2","otherDim":"otherDimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal2","otherDim":"otherDimVal2"}
{"ts":"2015-01-01","count":1,"dim":"dimVal3","otherDim":"otherDimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal3","otherDim":"otherDimVal2"}
{"ts":"2015-01-01","count":1,"dim":"dimVal4","otherDim":"otherDimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal4","otherDim":"otherDimVal2"}
```

(It's all about saving messages in Kafka: right now I inflate events into single-value rows, which blows up the system and feels like a bad path to go down.)

I think your use case is around a count distinct of a dimension, and if approximate results are sufficient, you should look into Druid's approximate hyperUnique aggregation. It'll tremendously reduce the amount of storage you require and queries will be significantly faster.
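For reference, an approximate distinct count can be requested with the cardinality aggregator, which works directly on a dimension at query time. A hedged sketch of such a query spec, built as a Python dict; the datasource name and interval here are placeholders, not values from this thread:

```python
import json

# Hypothetical timeseries query using Druid's count and cardinality
# aggregators. "my_datasource" and the interval are placeholders.
query = {
    "queryType": "timeseries",
    "dataSource": "my_datasource",
    "granularity": "all",
    "intervals": ["2015-01-01/2015-01-02"],
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "cardinality", "name": "distinct_dim", "fields": ["dim"]},
    ],
}
print(json.dumps(query, indent=2))
```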

Hello Fangjin,

thanks for your answer. I do not need the hyperUnique aggregation in this case. I need each value of the dimension for queries.

What I originally have is this:

```json
{"timestamp":"2015-01-01","dim":["dimVal1","dimVal2","dimVal3","dimVal4"]}
```

I transform it to 4 single events:

```json
{"ts":"2015-01-01","count":1,"dim":"dimVal1"}
{"ts":"2015-01-01","count":1,"dim":"dimVal2"}
{"ts":"2015-01-01","count":1,"dim":"dimVal3"}
{"ts":"2015-01-01","count":1,"dim":"dimVal4"}
```
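The 1-to-4 transform described above can be sketched in a few lines of Python (field names taken from the events in this thread):

```python
# One original multi-value event.
event = {"timestamp": "2015-01-01",
         "dim": ["dimVal1", "dimVal2", "dimVal3", "dimVal4"]}

# Flatten it into one single-value event per dim value.
flat = [{"ts": event["timestamp"], "count": 1, "dim": d}
        for d in event["dim"]]

print(len(flat))  # 4
```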

And I send this to Druid. Then I run queries like:

```sql
SELECT count(*) FROM druid where dim = dimVal1 OR dim = dimVal
```

But making 4 events out of 1 original event is overkill when I send it over Kafka. That is why I was wondering if it makes sense to just send:

```json
{"timestamp":"2015-01-01","count":1,"dim":["dimVal1","dimVal2","dimVal3","dimVal4"]}
```

In the data format section of the documentation, lists are mentioned but not really described.

Replying to myself: the more I think about it, the clearer it gets that what I want to do makes no sense. If an event occurs, it occurs together with these dimensions. Still, I think the Druid docs could be clearer about list values; they are only explicitly mentioned for CSV/TSV, near the bottom of the page.

Given that your query is:

```sql
SELECT count(*) FROM druid where dim = dimVal1 OR dim = dimVal
```

If you use the following event:

```json
{"timestamp":"2015-01-01","dim":["dimVal1","dimVal2","dimVal3","dimVal4"]}
```

The query will return a result of 1, as Druid will determine there is one row that matches the WHERE clause.
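A toy Python sketch of this matching behavior (an assumption-level model of how a selector filter treats a multi-value dimension, not Druid code):

```python
# One multi-value row, as in the event above.
rows = [{"timestamp": "2015-01-01",
         "dim": ["dimVal1", "dimVal2", "dimVal3", "dimVal4"]}]

def matches(row, dim, value):
    # Models a selector filter: a multi-value dimension matches
    # if ANY of its values equals the target.
    v = row.get(dim)
    return value in v if isinstance(v, list) else v == value

result = sum(1 for r in rows
             if matches(r, "dim", "dimVal1") or matches(r, "dim", "dimVal2"))
print(result)  # 1 -- one matching row, even though two values match
```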

Agreed that the docs could be clearer about list values and their intended purpose. It is a great chance to make a contribution if you are interested in that. :)