Ingesting integer dimensions?

I’m ingesting simple netflow data in CSV format like this:

corresponding to:

“src_addr”, “dst_addr”, “dst_port”, “ip_proto”, “policy_id”, “group_id”, “pkts_sent”, “bytes_sent”, “avg_pkt_len”

with the metrics being the last 3, the others being dimensions.

(a timestamp gets prepended as well)

This is through kafka.

These are all integer values. The sizes of the fields (in bytes) are:

4, 4, 2, 2, 2, 2, 4, 4, 4

for 28 bytes of data.
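
As a rough check of that arithmetic, here is a minimal Python sketch that packs one record into fixed-width binary fields. The format string, helper name, and sample values are illustrative assumptions, not anything Druid or Kafka actually uses; the field widths are the ones listed above.

```python
import struct

# Field order follows the list above; the sample values below are made up.
FIELDS = ["src_addr", "dst_addr", "dst_port", "ip_proto", "policy_id",
          "group_id", "pkts_sent", "bytes_sent", "avg_pkt_len"]

# "<" = little-endian with no padding; I = 4-byte unsigned, H = 2-byte unsigned.
RECORD_FORMAT = "<IIHHHHIII"


def pack_record(values):
    """Pack one netflow record into its fixed-width binary form."""
    return struct.pack(RECORD_FORMAT, *values)


sample = (167772161, 167772162, 443, 6, 12, 3, 10, 5200, 520)
packed = pack_record(sample)
csv_line = ",".join(str(v) for v in sample)

print("binary bytes:", len(packed))
print("csv bytes:", len(csv_line))
```

The same record rendered as decimal text is already longer than its binary form before any indexing is added.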

Instead, each message seems to take around 50 bytes in Druid with dimension compression off, and turning on lz4 compression gets that down to about 40 bytes.

Is it because dimensions are always strings and so the cardinality estimator isn’t able to reduce the data down well?

Is it possible to just have all metrics, or will that mess things up for queries?

(Thanks a ton again for the quick responses, you guys have been great!)

Hey Ron,

I would guess part of this is due to the conversion of the dimensions to strings, and part of it is the addition of indexes (each dimension gets a bitmap index used for boolean filtering). You probably do want to keep those fields as dimensions, since anything that you want to be able to filter or group on needs to be a dimension. Ultimately it would be cool to support non-string dimensions in Druid, but that is not currently possible.
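
To make the indexing overhead a bit more concrete, here is a toy Python sketch of a string dimension stored as a dictionary of distinct values, a column of integer ids, and one bitmap per value. This is only an illustration of the general idea, not Druid's actual segment format; the function names and the sample column are made up, and plain Python integers stand in for compressed bitmaps.

```python
def build_column(values):
    """Dictionary-encode a string column and build one bitmap per distinct value."""
    dictionary = sorted(set(values))               # distinct string values
    ids = {v: i for i, v in enumerate(dictionary)}
    encoded = [ids[v] for v in values]             # one small integer id per row
    bitmaps = [0] * len(dictionary)
    for row, value_id in enumerate(encoded):
        bitmaps[value_id] |= 1 << row              # bit `row` set when that row holds the value
    return dictionary, encoded, bitmaps


def rows_matching(dictionary, bitmaps, wanted):
    """Return the row numbers whose value is in `wanted`, by OR-ing bitmaps."""
    combined = 0
    for value, bitmap in zip(dictionary, bitmaps):
        if value in wanted:
            combined |= bitmap
    return [row for row in range(combined.bit_length()) if (combined >> row) & 1]


# Hypothetical dst_port values, already converted to strings.
dst_port = ["443", "80", "443", "22"]
dictionary, encoded, bitmaps = build_column(dst_port)
print(dictionary, encoded)
print(rows_matching(dictionary, bitmaps, {"443", "22"}))
```

In a layout like this, each dimension carries its dictionary and its bitmaps in addition to the per-row ids, which is the kind of extra storage on top of the raw values that filtering relies on.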

How well does Druid handle dimensions that are integers? I thought I had read somewhere that it looks for such things and handles them better - so “1000000000” doesn’t require 10 bytes of data… Is that true?

Hey Ron,