I am new to Druid, and as I was exploring it, I was wondering if there is any way to store dynamic fields in Druid, as we can in Lucene. Any insights would be helpful. Thanks in advance.
Can you describe a bit more what you mean by dynamic fields? Druid supports schema-less dimensions, so you are not bound by the dimensions specified when you set up your ingestion or indexing; new dimensions are added automatically.
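As a rough sketch, schema-less dimensions are enabled by leaving the `dimensions` list empty in the ingestion spec's `dimensionsSpec` (the data source and field names below are illustrative, not from this thread). Druid then treats every incoming field as a dimension, except the timestamp, metric inputs, and anything listed in `dimensionExclusions`:

```json
{
  "dataSchema": {
    "dataSource": "events",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": {
          "dimensions": [],
          "dimensionExclusions": ["timestamp"]
        }
      }
    }
  }
}
```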
We just started playing with Druid for a POC.
My observation was this: if we leave the dimensions as an empty array in the realtime.spec (a dimensionless schema), we are able to add new dimensions at runtime.
However, if I have a set of dimensions already specified and then the dimensions in the data change at runtime, those changes are not reflected in queries.
Is this expected behavior? Or is there some config we might have missed?
Any pointers will be really helpful. Thanks!!
Yes, the behavior you mentioned is expected.
Related question: in schema-less mode, is there a way to keep dimensions from being consumed by metrics? Say I want a HyperLogLog of a dim but also want to keep the original dim.
I don’t think so. Aggregator dependency columns are excluded from the dimension list.
However, does your use case really need that? If you put that dimension in the dimension list, the HLL aggregator wouldn’t really aggregate multiple values at indexing time anyway. In that case, you can just remove it from indexing and use the cardinality aggregator at query time to find unique counts.
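For reference, a query-time cardinality aggregation looks roughly like this (data source, interval, and the `user_id` field are placeholders for your own schema); it computes an approximate distinct count over a dimension that was ingested as-is:

```json
{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "all",
  "intervals": ["2015-01-01/2015-02-01"],
  "aggregations": [
    { "type": "cardinality", "name": "unique_users", "fieldNames": ["user_id"] }
  ]
}
```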
Thanks Himanshu !!
So basically, Druid does not adjust to dynamic dimension changes unless the schema is dimensionless.
Does having a dimensionless schema have any performance or storage implications for queries? For any particular kind of query, such as timeseries or groupBy?
Also, if the dimensions are expected to change rarely, would it be recommended to update the dimensions and restart the realtime nodes instead of using a dimensionless schema ?
Huh, I was under the impression that aggregating at index time would be a performance boost over the cardinality aggregator at query time; is this not the case if I leave the dimension in? Basically, I’ve got a high-cardinality ID field that I was hoping to get cardinalities from quickly in a UI, while also being able to pull the IDs themselves out (which I realize will be a much slower process).
If you add that field as a dimension, then those rows will not get merged at index time. Each row’s HLL column will only have a single value represented inside it.
Yes, adjustment is made only if you don’t specify the dimensions explicitly. There shouldn’t be any performance/storage implications in general, but IIRC there are some corner cases where two rows won’t get merged even when they should be.
In general, it is a good idea to specify your dimensions explicitly, if you can’t do that then use the dimension-less feature.
BTW, I typed too soon on the cardinality aggregator. Even if you put the ID in the dimension columns, HLL will be able to do the merges for rows with the same IDs (and other dimension values), so in theory it will still be faster than the cardinality aggregator.
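To make the index-time option concrete, a `hyperUnique` metric can be declared in the `metricsSpec` and then merged at query time (the names here are illustrative; `user_id` is assumed to be the high-cardinality field from the discussion):

```json
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "hyperUnique", "name": "user_id_sketch", "fieldName": "user_id" }
]
```

At query time, the pre-built sketches are combined with an aggregator of the same type referencing the metric name, e.g. `{ "type": "hyperUnique", "name": "unique_users", "fieldName": "user_id_sketch" }`.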
Thanks for the response, Himanshu. Yeah, we’re basically looking at either schema-less OR an index-time HLL to solve one of a couple of problems, and I was going to say that even with such a high-cardinality field as user IDs, HLL is performing a lot better than the cardinality aggregator.
*while keeping the user ids
I think that if you specify a column as a metric and leave the list of dimensions empty, that same column won’t be included as a dimension by default. One workaround is to have a duplicated column in your raw data with a different column name, although I think we should implement a fix on the Druid side.
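The workaround amounts to emitting the same value twice under different names in the raw data, so one copy feeds the metric and the other survives as a dimension. A sketch, with entirely hypothetical field names:

```json
{
  "timestamp": "2015-06-01T00:00:00Z",
  "user_id": "u-123",
  "user_id_dim": "u-123"
}
```

Here `user_id` would be consumed by a `hyperUnique` metric in the `metricsSpec`, while `user_id_dim` is picked up automatically as a schema-less dimension because nothing references it as a metric input.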
@Fangjin, is the fix available for this now? I want to use the same column in metrics as well as dimension.