I’m ingesting simple NetFlow data in a CSV format like this:
"src_addr", "dst_addr", "dst_port", "ip_proto", "policy_id", "group_id", "pkts_sent", "bytes_sent", "avg_pkt_len"
with the last three fields being metrics and the rest dimensions.
(a timestamp gets prepended as well)
This is all ingested through Kafka.
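For context, the producer side looks roughly like this (a simplified sketch using kafka-python; the topic name and broker address are placeholders, not my actual setup):

```python
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(bootstrap_servers="localhost:9092")  # placeholder broker

# One flow record: six dimensions followed by the three metrics.
row = (3232235777, 3232235778, 443, 6, 10, 2, 120, 15000, 125)
line = ",".join(str(v) for v in row).encode("utf-8")

# The timestamp gets prepended separately before this reaches Druid.
producer.send("netflow", value=line)  # "netflow" topic is a placeholder
producer.flush()
```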
These are all integer values; the sizes of the fields (in bytes) are:
4, 4, 2, 2, 2, 2, 4, 4, 4
for 28 bytes of data per record.
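Just to double-check that arithmetic, packing one record with Python's struct module at those field widths gives exactly 28 bytes:

```python
import struct

# src_addr, dst_addr: 4-byte; dst_port, ip_proto, policy_id, group_id: 2-byte;
# pkts_sent, bytes_sent, avg_pkt_len: 4-byte. '=' means standard sizes, no padding.
FLOW_FORMAT = "=IIHHHHIII"

print(struct.calcsize(FLOW_FORMAT))  # 28
```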
Instead, each message seems to end up in Druid at around 50 bytes (with dimension compression off), and turning on LZ4 compression only gets that down to about 40 bytes.
Is it because dimensions are always stored as strings, and so the cardinality estimator isn’t able to reduce the data well?
Is it possible to just make everything a metric, or will that mess things up for queries?
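For what it's worth, the alternative I was considering was declaring these fields as numeric dimensions rather than metrics, assuming the Druid version we're on supports long-typed dimensions; roughly this fragment, shown as the Python dict I'd serialize into the ingestion spec's JSON:

```python
import json

# Hypothetical dimensionsSpec fragment; assumes a Druid version with
# long-typed dimension support. Field names match the CSV header above.
dimensions_spec = {
    "dimensions": [
        {"type": "long", "name": name}
        for name in ("src_addr", "dst_addr", "dst_port",
                     "ip_proto", "policy_id", "group_id")
    ]
}

print(json.dumps(dimensions_spec, indent=2))
```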
(Thanks a ton again for the quick responses, you guys have been great!)