Partitioning data by dimension

Hi,

I’m using Druid to store time-series data which I’m pulling in through the Kafka indexing service. Is there a way I can partition the segments by a dimension in addition to the timestamp? I understand the granularitySpec is for partitioning the data by timestamp, but I additionally need to partition on an ID dimension in order to enable deletion for the required elements.
I’ve also looked into Tranquility’s implementation of the Kafka firehose, but the docs are quite limited with respect to what I’m trying to achieve.

Please help.

Hey Swapneel,

The best ways to partition data ingested from Kafka right now are to use upstream partitioning in Kafka (which Druid will respect – each Kafka partition’s data is only going into one Druid segment at a time) or to use reindexing (i.e. run a background batch job periodically to repartition data for a time chunk).

Hi,

Thanks for the prompt response. Both of your approaches seem like viable options. However, I’m fairly new to Druid, so are there any relevant links or resources I can refer to?

Hey Swapneel,

For upstream partitioning in Kafka, just set up Kafka partitioning using the Kafka producer APIs. For reindexing in Druid, you can use a native indexing task with an “ingestSegment” input: http://druid.io/docs/latest/ingestion/firehose.html.
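
On the Kafka side, partitioning by your ID is just a matter of keying each record by that ID when you produce it; Kafka’s default partitioner hashes the key, so all events for a given ID land in the same partition. On the Druid side, a reindexing task using the “ingestSegment” firehose might look roughly like the sketch below. The dataSource name and intervals are placeholders, and I’ve left out the parser and metricsSpec, which should just match your existing schema:

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2017-01-01/2017-01-02"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "my_datasource",
        "interval": "2017-01-01/2017-01-02"
      }
    }
  }
}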

Hi Gian,

That looks promising. I’ll have a look at it tomorrow, thanks for helping!

It’s worth noting that (as far as I understand) Kafka partition-based partitioning doesn’t actually reduce the number of segments that need to be consulted to answer a query. I’m also not certain what you mean by “in order to enable deletion for the required elements”. I don’t think Druid supports deletion without reingesting all partitions of data for an interval. That said, the ingestSegment firehose works pretty well for deletion based on a dimension (just use its filter property).
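
For example, to drop every row with a particular ID while reindexing a time chunk, the firehose portion of the task could look something like this (the dataSource, interval, dimension name, and ID value are all placeholders):

"firehose": {
  "type": "ingestSegment",
  "dataSource": "my_datasource",
  "interval": "2017-01-01/2017-01-02",
  "filter": {
    "type": "not",
    "field": {
      "type": "selector",
      "dimension": "id",
      "value": "id-to-delete"
    }
  }
}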

Hi David,

According to the docs, deletion by segment is possible. However, if I don’t have my data partitioned on the ID dimension in addition to the time dimension, the segments will contain a mixed bag of IDs, and deleting by ID would then only be possible with reindexing, which I’m trying to avoid. If possible, I’d like Druid to generate segments based on the ID dimension too, so that it can just look up the segment for a given ID within a time interval and delete it.
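
For reference, the deletion I have in mind is the one from the docs: disable the segments for the interval and then run a kill task, something like this (dataSource and interval are placeholders):

{
  "type": "kill",
  "dataSource": "my_datasource",
  "interval": "2017-01-01/2017-01-02"
}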

Ah, you’re saying that you’ll want to delete everything from a given partition at once? I guess you can do it that way, though if there are a large number of ID values you’ll end up with a ton of tiny segments, which probably won’t perform well.

Hi guys,

Thanks for your valuable suggestions. I was able to push 1.7 million records spanning a year into Druid and reindex the data using filters to exclude the given ID; the reindexing took Druid 39 seconds with a segment granularity of one day. I haven’t partitioned by ID yet; I’ll need to run benchmarks with that in place.