Request for comments/thoughts regarding non-timeseries data in Druid

Hi everyone,

I have a specific use case and am wondering if anyone else has dealt with a similar situation and could share some thoughts.

I am planning to use Druid for loading a variety of datasets – their schemas may be completely different and dynamic. So, imagine different types of timeseries data with different dimensions and metrics. The problem is: what happens when a dataset doesn’t have a timestamp as one of its columns? If we use a fake fixed timestamp value, will it result in too much performance degradation in Druid?

About 75% of my cases are covered by datasets that have a timestamp as one of the columns, but the rest may not, so I am wondering what the best way to deal with this is. Druid seems to provide several benefits, such as a column-oriented store, compression, and the ability to filter fast using bitmap indexes – but without time-based segmentation, will performance be affected? If yes, by how much?

Any thoughts/comments would be greatly appreciated!

Thanks!

Hi Raj, please see inline.

Hi everyone,

I have a specific use case and am wondering if anyone else has dealt with a similar situation and could share some thoughts.

I am planning to use Druid for loading a variety of datasets – their schemas may be completely different and dynamic. So, imagine different types of timeseries data with different dimensions and metrics. The problem is: what happens when a dataset doesn’t have a timestamp as one of its columns? If we use a fake fixed timestamp value, will it result in too much performance degradation in Druid?

About 75% of my cases are covered by datasets that have a timestamp as one of the columns, but the rest may not, so I am wondering what the best way to deal with this is. Druid seems to provide several benefits, such as a column-oriented store, compression, and the ability to filter fast using bitmap indexes – but without time-based segmentation, will performance be affected? If yes, by how much?

You can tell Druid to load such data by using the “missingValue” field in the timestampSpec. This lets Druid create a dummy timestamp column with a fixed date that you choose. The performance penalty is that Druid will scan all segments of the datasource for every query; how large that penalty is will depend on how many segments you have.
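For illustration, a timestampSpec using “missingValue” might look roughly like this (the column name and the fixed date here are just placeholders – pick whatever dummy date suits you):

```json
{
  "timestampSpec": {
    "column": "timestamp",
    "format": "auto",
    "missingValue": "2010-01-01T00:00:00Z"
  }
}
```

Rows that have no timestamp value are assigned the missingValue date, so with a fully timestamp-less dataset every row lands in the same time chunk.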

Thanks Fangjin for the reply.

Hi Fangjin,

I know this is an old topic, but it’s interesting for my use case. We have a use case where we would probably need to use the “missingValue” approach for the timestampSpec. However, my question is: how would this affect lookups?

Basically we have a User with a gender, an age, some interests, and a registration to an event. We would like to get data for an event with dimensions on age, gender, and interests, and the metric would be a count of those users.

As you can imagine, when a user changes their interests we would need to update those. Would using the “missingValue” approach allow us to update the interests for that user across all their registrations without hitting issues?

Regards

Mark

Hi Mark, the missingValue flag only creates a dummy timestamp field where all the timestamps are the same. It should not impact lookups, and you should still be able to make updates to an external table and join it with Druid.
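As a sketch of what that can look like (exact fields vary by Druid version, and the dimension name and map contents below are made up for illustration), a query-time map lookup in a dimension spec is roughly:

```json
{
  "type": "lookup",
  "dimension": "user_id",
  "outputName": "interests",
  "lookup": {
    "type": "map",
    "map": {
      "user-1": "sports",
      "user-2": "music"
    }
  }
}
```

Because the mapping lives outside the segments, you can update a user’s interests in the lookup without reindexing the registration data in Druid.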

Thanks Fangjin, I will have to look into how to do it. Do you know of any code fragments or further documentation for doing the lookups, apart from http://druid.io/docs/0.9.0/querying/lookups.html ?

There’s documentation in the lookup extensions.