Sharding along time and a dimension

Hello,

I am evaluating Druid for our time series analysis. Here is a brief overview of the data set.

We receive check data from restaurants in real time (Kafka/Storm). All of the analysis revolves around check time (mostly daily, with some hourly). 99% of queries require time series analysis for a single restaurant location.

As I was reading through the Druid documentation, I saw that segments are created on time windows (daily, hourly, etc.). I read about sharding along dimensions on top of the time slice, but could not find enough documentation about it.

Can somebody point me to documentation that explains how this is possible, or give me some leads?

Thanks,

Satish

Hi Satish, with batch-based ingestion, you can shard first on the time dimension, then further partition on an additional dimension: http://druid.io/docs/latest/ingestion/batch-ingestion.html (see single-dimension partitioning).
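In case a concrete example helps, here is a minimal sketch of what that looks like in the tuningConfig of a Hadoop batch ingestion spec. The dimension name "location" and the target partition size are placeholders for your own values, and exact field names can vary between Druid versions:

"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "dimension",
    "partitionDimension": "location",
    "targetPartitionSize": 5000000
  }
}

With this, each time chunk is split into segments by ranges of the location dimension, so a query filtered on a single location only needs to touch the segments covering that location's range.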

I hope this helps, but if not, I’d love to learn more about what your requirements are.

Best,

FJ

Hi Fangjin,

Good to know that it is possible using batch. Our use case is a little unusual. We get data from thousands of restaurants (we call each one a location). Most of the restaurants stream us data that fits the realm of real time. However, a significant few are laggards that send us data for the last few days all at once. I wanted to know if Druid can isolate this by partitioning on time and then location, so that one location does not step on another in the time series.

On this very topic, I had one more question. There are cases where we bring a customer on board, and this on-boarding can be done using batch indexing. Is there a way to delete or deactivate a segment (ours will be based on time and location)? I read in one of the discussions that we can deactivate the segment in MySQL, but the segment in deep storage will be orphaned. In this case, would the segment information carry information on time and, most importantly, the location (the dimension) on which we defined the segment? This use case is more along the lines of rolling back a batch.

I understand that deleting is sub-optimal in any pre-aggregated dataset or pure data warehouse, but Druid could solve 95% of our use cases. I am trying to find the best (or a less-than-perfect) solution for the cases above.

Thanks,

Satish

Hi Satish, see inline.

Hi Fangjin,

Good to know that it is possible using batch. Our use case is a little unusual. We get data from thousands of restaurants (we call each one a location). Most of the restaurants stream us data that fits the realm of real time. However, a significant few are laggards that send us data for the last few days all at once. I wanted to know if Druid can isolate this by partitioning on time and then location, so that one location does not step on another in the time series.

We see this use case fairly often with other types of data as well. The current solution is to run a lambda architecture. Use Druid’s realtime ingestion capabilities to query events as they come in, and store a copy of the raw data so that at the end of the day (or several days later), you can run a batch indexing job that includes all the laggard events. The batch-indexed segments are the golden copy of the data and replace the realtime segments.
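To make the batch half of that concrete, here is a rough sketch of a Hadoop indexing task (the datasource name "checks", the interval, and the HDFS paths are placeholders, and I have left out the parser and metricsSpec for brevity). When the task finishes, the segments it builds for those intervals become the authoritative version and replace the realtime segments for the same days:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "checks",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "intervals": ["2016-01-01/2016-01-04"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://example/raw/checks/2016-01-01,hdfs://example/raw/checks/2016-01-02,hdfs://example/raw/checks/2016-01-03"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "dimension",
        "partitionDimension": "location",
        "targetPartitionSize": 5000000
      }
    }
  }
}

The intervals just need to cover the days your laggard locations reported late; everything inside them is rebuilt from the raw copy, so the late data and the realtime data end up merged in the same segments.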

On this very topic, I had one more question. There are cases where we bring a customer on board, and this on-boarding can be done using batch indexing. Is there a way to delete or deactivate a segment (ours will be based on time and location)?

You can set rules to do this automatically. More info here: http://druid.io/docs/latest/operations/rule-configuration.html
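As a sketch, the rules for a datasource are an ordered list you can post to the coordinator (e.g. via /druid/coordinator/v1/rules/<yourDataSource>). Something like the following, with an illustrative interval, would drop the segments for the days you want to roll back while keeping everything else loaded:

[
  {
    "type": "dropByInterval",
    "interval": "2016-01-01/2016-01-08"
  },
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 2 }
  }
]

Keep in mind that drop rules only control what the historical nodes keep loaded; the segment metadata and the files in deep storage stay where they are, which matches the behavior you read about.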