[druid-user] Looks for some advises in schema

Hi all,
We’re pretty new in Druid, and have general question about this use case:
source: __time, id, api, duration

We need to support these cases:
select * from source where duration>1 order by __time desc
select * from source where duration>1 order by duration desc

In order to filter by duration correctly, we can’t store duration as string. For this case, should we put duration in dimension or metrics? Based on our quick test, we don’t see much performance differences. If we’ve a huge amount of real time data, what is your recommendation here?
Thanks
|
|

Hi,

Could you please share a bit more about the format of duration? I know of at least one use case where ingesting a column as a non-string really degraded the query performance.

Best,

Mark

Hi Mark, Thanks for your reply!
Duration is a long integer, such as 123435500.

{“name”: “duration”, “type”: “long”}

Best,

Meili


Hi Meili,

Thanks for that, and apologies for not thinking about this more from a Druid perspective from the beginning. Let me back up a bit:

We’re pretty new in Druid

When designing your schema, there are lots of things to think about. One that I’ve seen come up many times is: what types of queries will your end users be running? Imply (my employer) has a free course on Ingestion and Data Modeling that might be worth your time.

For this case, should we put duration in dimension or metrics?

If you were able to ingest duration as a string you could do both:

If you are using schema-less dimensions, the best practice here is to include the same column twice, once as a dimension, and as a hyperUnique metric.

Regarding

Based on our quick test, we don’t see much performance differences.

Numeric dimensions aren’t indexed, so they can be slower to filter on and require experimentation to find the optimal choice for your use case.

Finally, you might look at partitioning. If you look at the dimensionsSpec doc, there’s the following bit of guidance:

As a best practice, put the most frequently filtered dimensions at the beginning of the dimensions list. In this case, it would also be good to consider partitioning by those same dimensions.

Hopefully any of this is helpful.

Best,

Mark

Hi Mark, Thanks for your explanation!
Yeah, I’ve already taken Imply’s course Ingestion and Data Modeling.

bests,

Meili

Boot-framework
Boost framework () :mperf = com.qualcom.qti.performance@4550754
19089|
. …<|>10:-05 /¹8:23:39.838(½⅓¼⅛)