Druid for non-timeseries data?

Hi,
I’m evaluating Druid as the backend datastore and filtering engine for a data visualization project.

I’ve gone through the documentation, the paper and tried out a few examples.

The only problem I have is the dependence on the Timestamp dimension.

I don’t expect my datasets to have any Timestamp dimension. Is Druid still the right choice for such datasets? I want the following features:

  1. Sub-second querying on large datasets (fast filtering).

  2. Fast binned aggregation over different attributes.

All my queries would be categorical, range-based, or (maybe) geospatial.

We want to be doing the usual slicing, dicing, and drilling down in our datasets.

Would Druid be able to work with such datasets? Are there any “hacks” we could use (for example, appending a proxy timestamp attribute to each record)?

I know that the timestamp dimension is used to shard the data. How much of a performance hit would using a fake timestamp dimension incur? Some questions:

  • Are there better ways of doing this? I might be missing something basic, since I haven’t used Druid much.

  • If people are using it for non-timeseries data, then I’d be interested in knowing about it.

  • If there are alternate systems that do this and would be a better fit for me, then I’d like to know about them too.

Hi,

Timestamp is very important for many reasons, for example:

  • Data will be partitioned using time ranges; I am not sure what that would look like in your case.

  • Queries are sent against a given time interval in order to narrow down the amount of I/O.

  • …

Not sure how this will work for you, but you can always give it a try.
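For example, narrowing a native Druid query to a one-month interval looks roughly like this (the dataSource and aggregator names here are made up for illustration); only segments whose time chunks overlap the interval get scanned:

```json
{
  "queryType": "timeseries",
  "dataSource": "my_datasource",
  "granularity": "all",
  "intervals": ["2016-01-01/2016-02-01"],
  "aggregations": [
    { "type": "count", "name": "rows" }
  ]
}
```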

Hmm, yeah, I realized that, but I was wondering if people are using it without a timestamp, and what kinds of performance hits they’re getting.

Also, I was wondering if some other candidate attribute (X) could act as a proxy for the timestamp attribute, so that all the sharding and partitioning you mentioned happens on X instead of the timestamp.

Also, since Druid advertises itself as “a high-performance, column-oriented, distributed data store” rather than a “timeseries database”, I guess this should be a valid use case.

There are many instances of Druid being used for non-timeseries data. You don’t need a timestamp column to write data into Druid. We purposefully avoid calling Druid a ‘timeseries’ database because of the associated connotations.

Hi Ganesh,

I think that Druid is a very powerful datastore even if you are using it on data that does not have a timestamp column.

There are a lot of people that use Druid without a time column.

There is an undocumented feature where, in the ingestion spec’s timestampSpec, you can set missingValue: xxx and have Druid ingestion auto-fill your missing timestamp column with some constant (see the sketch below).
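A minimal sketch of that, assuming the feature behaves as described (the column name and the constant are just placeholders):

```json
"timestampSpec": {
  "column": "timestamp",
  "format": "auto",
  "missingValue": "2010-01-01T00:00:00Z"
}
```

Every ingested row that lacks a timestamp then gets that constant as its time value.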

Since Druid uses the timestamp column as the primary shard key, once you go to production you would still want to put something into the timestamp column to make the sharding more effective (usually a hash of some dimension that you actually want to shard on); see the sketch after this paragraph.
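To make that concrete, here is a minimal sketch (my own illustration, not an official recipe) of deriving a fake timestamp from the dimension you actually want to shard on; the epoch, bucket count, and field names are all assumptions:

```python
import hashlib
from datetime import datetime, timedelta

# Arbitrary epoch and bucket count; one "day" per bucket would line up
# with a DAY segmentGranularity.
EPOCH = datetime(2000, 1, 1)
NUM_BUCKETS = 365

def fake_timestamp(shard_key: str) -> str:
    # Use a stable hash (Python's built-in hash() is salted per process),
    # so the same key always lands in the same time bucket.
    digest = hashlib.md5(shard_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % NUM_BUCKETS
    return (EPOCH + timedelta(days=bucket)).isoformat()

# e.g. attach it to each record before ingestion:
record = {"user_id": "u-12345", "country": "IN"}
record["timestamp"] = fake_timestamp(record["user_id"])
```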

As for querying Druid, the Druid API gives a lot of special treatment to the time column, which you can simply ignore in your case.

In a project that I am working on, Plywood (http://plywood.imply.io/), the query API abstraction is designed to be symmetric, with no special treatment given to the primary time column.

Similarly, in PlyQL (https://github.com/implydata/plyql), a MySQL-like API to Druid built on top of Plywood, there is no special treatment given to the time column, as is the SQL way.

Finally, in Pivot (http://pivot.imply.io/), a data-vis UI for Druid built on top of Plywood, there is an option to not have time as a primary dimension (one that always needs to be filtered on). In fact, that is the default state.

As you can see, I am quite passionate about using Druid for non-time-based data, and I have it on good authority that future versions of the Druid API will also place less special emphasis on time.

Notwithstanding FJ’s and Vadim’s comments, I do want to point out that ingestion architectures with Druid (especially streaming ingestion) are much more flexible if you have a timestamp as one of your columns. This is due to the fact that different areas of the timeline can be updated independently and that Druid makes it really easy to partition based on time. Also, Druid has a number of time-based query performance optimizations that don’t apply if you use some other column for partitioning. So while non-time-oriented datasets are possible, people often end up generating fake timestamps just to make those things easier. These fake timestamps could just be a hash of another column that is the one you really want to partition on.

None of that is meant to contradict FJ’s comment that people do find value in Druid for non-time-oriented data, or Vadim’s comment that the query APIs are trending towards time being less of a special thing at the API level. Those are both true, although time is and will likely remain important at the storage and query engine levels.

Thanks Fangjin, Vadim and Gian! :smiley:

@Fangjin: Great, yeah I noticed that, hence asked here :slight_smile:

@Vadim: Yes, I did look at Plywood and Pivot, and all that tooling around Druid made me more confident about using it as my backend datastore!
@Gian: Thanks, will look into it. I wanted to get a general consensus on how people were using it. The fake-timestamp idea was something I was looking at, but it’s good to know that it’s actually being used.

There was a section in the FAQ (probably, I’m not sure) which said that all records must have a “timestamp” attribute. I can’t find it anymore, so I guess the non-dependence on the timestamp attribute is making its way into the docs too :slight_smile:

Thanks again!