I am debating whether to use Cassandra or Druid for storing time series information.
My use case involves storing unique identifiers for around a billion entities, and querying time series data per entity per day.
In Cassandra my schema could look like:
UUID --> primary key
Columns could be --> Time, Count
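To make the schema concrete, here is a minimal sketch of it as CQL wrapped in Python; the table and column names (`entity_counts`, `uuid`, `day`, `ts`, `count`) and the day-bucketing scheme are my assumptions, not anything fixed:

```python
from datetime import datetime, timezone

# Hypothetical CQL for the schema above: partition by (uuid, day) so each
# entity/day pair lands in one bounded partition, clustered by timestamp.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS entity_counts (
    uuid  text,
    day   text,
    ts    timestamp,
    count bigint,
    PRIMARY KEY ((uuid, day), ts)
)
"""

def day_bucket(ts: datetime) -> str:
    """Derive the day part of the partition key from an event timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
```

Bucketing by day keeps partitions bounded and lines up with the per-entity-per-day query pattern, at the cost of making range aggregations span many partitions.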
The problem then would be aggregations over a time range, which Druid would simplify for us; on the other hand, querying by primary key in Cassandra would be super fast.
I am not sure how Druid performs on such high-cardinality data.
Would really love to hear suggestions on this.
Druid does index its string columns so it can do fast filtering by any of them. The speed of this filtering is somewhere between true key-value systems (which Druid is not) and most of the popular analytical columnar stores today (which often scan through data without using indexes). So you may find that Druid has the sweet spot of record retrieval and aggregation performance that you are looking for.
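To illustrate the kind of indexed filtering described above, a Druid native timeseries query filtering on a high-cardinality string dimension could be built like this (the datasource, dimension, and metric names are assumptions for the example):

```python
import json

def entity_timeseries_query(entity_uuid: str, start: str, end: str) -> dict:
    """Build a Druid native timeseries query filtered on the UUID dimension;
    Druid serves the selector filter from its bitmap index on that column."""
    return {
        "queryType": "timeseries",
        "dataSource": "entity_counts",  # assumed datasource name
        "granularity": "day",
        "intervals": [f"{start}/{end}"],
        "filter": {"type": "selector", "dimension": "uuid", "value": entity_uuid},
        "aggregations": [
            {"type": "longSum", "name": "count", "fieldName": "count"}
        ],
    }

# This JSON body would be POSTed to the Broker's /druid/v2 endpoint.
payload = json.dumps(entity_timeseries_query("abc-123", "2015-01-01", "2015-02-01"))
```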
I see… and is there any way I can just copy all the data from Druid into Cassandra for faster lookups?
I am not sure if there is a distributed way of querying Druid.
To elaborate, let me start with an example:
I store some time series info for about 500 million UUIDs, for which Druid is great.
I then have a service built on the Druid REST API, and it is good for getting data per UUID.
Now I want to ship this data for all UUIDs into a Cassandra-like store which is optimized for quick lookups.
Question is: how do I achieve a data dump of that kind?
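One way such a dump could work (a sketch under assumptions: a Druid `scan`-style query with `resultFormat` set to `compactedList`, and the hypothetical `uuid`/`count` columns from earlier in the thread) is to page through Druid one time chunk at a time and turn each response batch into rows for a Cassandra insert:

```python
def scan_rows_to_inserts(scan_response: list) -> list:
    """Flatten a Druid scan-query response (a list of batches, each with
    'columns' and positional 'events', i.e. compactedList format) into
    (uuid, timestamp, count) tuples suitable for a Cassandra INSERT."""
    out = []
    for batch in scan_response:
        cols = batch["columns"]
        for event in batch["events"]:
            row = dict(zip(cols, event))
            out.append((row["uuid"], row["__time"], row["count"]))
    return out

# The surrounding loop (not shown runnable here) would POST a
# {"queryType": "scan", ...} body per day-sized interval to the Broker,
# then execute a prepared INSERT for each tuple via the Cassandra driver.
```

Chunking the dump by day-sized intervals keeps each Druid response small and makes the job easy to restart or parallelize per interval.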