Multiple dimension transformations in Druid

Hi there,
I was searching for a way in Druid to transform multiple dimensions/columns with one function, but I couldn’t find it. In SQL terms, I would like to do a simple thing like this:

SELECT x + y AS z …

And possibly even further:

… GROUP BY z

I found dimensionSpec in the documentation: http://druid.io/docs/latest/querying/dimensionspecs.html

for “transforming dimension values”, but I can only specify a single dimension there, not multiple of them.
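For reference, this is roughly what a transformation of a single dimension looks like with an extraction dimensionSpec (a sketch; the dimension name and function are made up):

  {
    "type" : "extraction",
    "dimension" : "x",
    "outputName" : "x_prefix",
    "extractionFn" : {
      "type" : "javascript",
      "function" : "function(x) { return x.substring(0, 3); }"
    }
  }

But I see no way to feed two dimensions into one such function.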

Is such a thing possible in Druid?

Thanks for your answers!

Krzysiek

Hi Krzysztof, instead of stating the SQL you would like to emulate, can you state the problem or question you’re trying to answer?

Hi Krzysztof Zarzycki, this is a post-aggregation operation in Druid; you can check it out here: http://druid.io/docs/latest/querying/post-aggregations.html

Note, though, that post-aggregations only work with metrics.

for example:

SQL:

SELECT (unique_users / rows) as average_users_per_row FROM …

DRUID:

  "aggregations" : [{
    {"type" : "count", "name" : "rows"},
    {"type" : "hyperUnique", "name" : "unique_users", "fieldName" : "uniques"}
  }],
  "postAggregations" : [{
    "type"   : "arithmetic",
    "name"   : "average_users_per_row",
    "fn"     : "/",
    "fields" : [
      { "type" : "hyperUniqueCardinality", "fieldName" : "unique_users" },
      { "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" }
    ]
  }]

Cheers.

Charles, I meant to use SQL as a medium that most of us understand.
What I would like to achieve is the ability to transform not a single dimension, but multiple dimensions together with a single function. For example, I would like to calculate the ratio of two dimensions or metrics, and then run a group-by query where the calculated ratio is my group-by dimension.

Alejandro, thanks for your suggestion. However, post-aggregations run after grouping, and I would like to apply the function before grouping. I could exploit the fact that Druid allows passing a group-by query as a datasource: first run a group-by that does little more than my transforming post-aggregation, and then run my real group-by query on its result. But I still consider that a bit cumbersome.
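A sketch of what I mean (hypothetical column names; whether the outer query can group on the inner post-aggregated column may depend on the Druid version):

  {
    "queryType" : "groupBy",
    "dataSource" : {
      "type" : "query",
      "query" : {
        "queryType" : "groupBy",
        "dataSource" : "my_datasource",
        "granularity" : "all",
        "intervals" : [ "2016-01-01/2016-02-01" ],
        "dimensions" : [ "some_dim" ],
        "aggregations" : [
          { "type" : "longSum", "name" : "x_sum", "fieldName" : "x" },
          { "type" : "longSum", "name" : "y_sum", "fieldName" : "y" }
        ],
        "postAggregations" : [ {
          "type" : "arithmetic",
          "name" : "ratio",
          "fn" : "/",
          "fields" : [
            { "type" : "fieldAccess", "fieldName" : "x_sum" },
            { "type" : "fieldAccess", "fieldName" : "y_sum" }
          ]
        } ]
      }
    },
    "granularity" : "all",
    "intervals" : [ "2016-01-01/2016-02-01" ],
    "dimensions" : [ "ratio" ],
    "aggregations" : [ { "type" : "count", "name" : "rows" } ]
  }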

Thanks for your answers.

Krzysztof

Thanks Krzysztof!

For Druid, dimensions are strings, and metrics are things that are not strings (and can usually be combined with other not-string things in interesting ways).

I think you’re asking about the ability to group by the results of some calculation on metrics. A simple example is to calculate the total number of successful events divided by the total number of events and bucket that result into ranges, thus grouping by some transformation of the raw data (e.g. <10%, 10~50%, >50%). Does that sound similar to what you’re looking for?

Charles,
Yes, that is one thing I would like to achieve (at least in terms of functionality). Is it possible with Druid?

Although I can imagine transforming several dimensions as well. I know they are strings to Druid, but they mean more to me :) In JavaScript I could interpret them any way I like. Is that supported by Druid?

Thanks!

Krzysztof

The way we handle ranges internally is to have them specified during ETL prior to loading into Druid. So a double (or float) value would be bucketed during ETL, and instead of storing (or in addition to storing) the original value, the bucketed version is stored as a string dimension. Then you can do topN, groupBy or whatever else over the buckets.
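(As an aside: if the raw value is already stored as a string dimension, bucketing at query time is also possible with a javascript extraction function in a dimensionSpec. A sketch with made-up names:

  {
    "type" : "extraction",
    "dimension" : "response_time",
    "outputName" : "response_time_bucket",
    "extractionFn" : {
      "type" : "javascript",
      "function" : "function(v) { v = parseFloat(v); return v < 10 ? '<10' : (v <= 50 ? '10~50' : '>50'); }"
    }
  }

This only works on dimensions, though, not on metrics.)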

Druid does not currently have a way to turn a metric into a dimension that I’m aware of. The having clause for group-by is about the closest thing.

Meaning that, at a minimum, you could issue multiple queries with different having clauses and make a composite on the client end.
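For example, the 10~50% bucket might look roughly like this (a sketch with made-up names; I believe the numeric having specs can reference a post-aggregator, but verify that against your version):

  {
    "queryType" : "groupBy",
    "dataSource" : "events",
    "granularity" : "all",
    "intervals" : [ "2016-01-01/2016-02-01" ],
    "dimensions" : [ "user_id" ],
    "aggregations" : [
      { "type" : "longSum", "name" : "successes", "fieldName" : "success" },
      { "type" : "count", "name" : "total" }
    ],
    "postAggregations" : [ {
      "type" : "arithmetic",
      "name" : "success_ratio",
      "fn" : "/",
      "fields" : [
        { "type" : "fieldAccess", "fieldName" : "successes" },
        { "type" : "fieldAccess", "fieldName" : "total" }
      ]
    } ],
    "having" : {
      "type" : "and",
      "havingSpecs" : [
        { "type" : "greaterThan", "aggregation" : "success_ratio", "value" : 0.1 },
        { "type" : "lessThan", "aggregation" : "success_ratio", "value" : 0.5 }
      ]
    }
  }

The other buckets would be the same query with different bounds.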

I’m trying to achieve the same thing as Krzysztof.

Is there any news on this? Is it not possible at all?

Thanks,

Multiple-dimension transformations aren’t currently supported, but there is some work being done in that direction, e.g.:

https://github.com/druid-io/druid/pull/3091

https://github.com/druid-io/druid/issues/3378

We are using an expression virtual column to address that kind of problem (see https://github.com/druid-io/druid/issues/3021 if interested), something like:

{
  "queryType" : "groupBy",
  "dataSource" : "testing",
  "intervals" : [ "1970-01-01T00:00:00.000Z/2020-01-01T00:00:00.000Z" ],
  "granularity" : "DAY",
  "virtualColumns" : [ {
    "type" : "expr",
    "expression" : "nvl(dim, 'NA')",
    "outputName" : "dim_nvl"
  } ],
  "dimensions" : [ "dim_nvl" ],
  "aggregations" : [
    { "type" : "sum", "name" : "sum_of_array", "fieldName" : "array", "inputType" : "array.long" },
    { "type" : "min", "name" : "min_of_array", "fieldName" : "array", "inputType" : "array.long" },
    { "type" : "max", "name" : "max_of_array", "fieldName" : "array", "inputType" : "array.long" }
  ],
  "having": { "expression": "max > min * 3" },
  "limitSpec" : {
    "windowingSpecs" : null
  }
}

But I don't expect it will ever be committed to the master branch.

Hi Navis,

We understand your frustration, but please understand that our own PRs have gone months without review. We hope that the other Druid committers can help do careful reviews of pull requests so we can reduce the backlog, but without assistance, we have to prioritize PR reviews. As part of https://github.com/druid-io/druid/pull/2511, concerns were raised with the implementation that were not addressed. As Druid committers, we strive to implement solutions in a way that is general and extensible for the future.

There is a proposal around https://github.com/druid-io/druid/issues/3378 that I think will support some of these use cases in a more general way. Did you have a chance to review that proposal?

Best,

FJ

I didn't mean to show frustration; it's rather a concern about the cost of rebasing our patches on the master branch in the near future. We've made 100+ patches on top of druid-0.9.1 already, and some recent work cannot even be back-ported (or PRed) to master. And I've seen cases where PRs we submitted were reconstructed in a more "generalized" way. I can wholeheartedly understand that, but it results in huge conflicts in our code base and makes us hesitant to make further proposals.

As for https://github.com/druid-io/druid/issues/3378, I've read it and concluded it would be better to stick with our approach, because our queries are already written and published to customers in that form. So generally it looks good, but it's not a path we can follow. And it still cannot cover some of the use cases we rely on.

Thanks,
Navis

Hi Navis, we definitely understand the concern. I think the best way to avoid this problem is to submit proposals similar to https://github.com/druid-io/druid/issues/3378 before code has been written to let the community know what is being worked on so everyone can coordinate. If there are disagreements with the proposal, or alternative architectures to consider, we highly encourage bringing them up in the proposal so that we can openly discuss issues. Overall, we strive to follow the ASF model of reviewing proposals and PRs.

For example, see https://github.com/druid-io/druid/issues/3147. There was a full discussion on the scope of changes being made, and the final PR https://github.com/druid-io/druid/pull/3148 incorporated the feedback that was given from the proposal.

We are working on a plan to best determine how to merge all of your PRs into trunk. I think the most important thing for us is to understand which proposals are high priority and critical for you guys, and which proposals are less critical. It also helps us a lot to know which PRs have been validated in production, and which ones have not.

For https://github.com/druid-io/druid/issues/3378, it would be great to understand, from an architecture and implementation perspective, your concerns on the direction and proposed implementation. I think as Druid committers we should strive to make the best architecture decisions possible and ensure the project follows a good direction. For your particular set of changes related to #3378, we can also discuss the best way that we can incorporate both sets of work so you can continue to use your contribution in that proposal.

I understand that this process is a bit frustrating. The main problem is that an extremely small number of committers actually review code, as everyone else wants to build features instead. I hope that as committers, we can divide time equally between code review and feature development.

Does this sound reasonable?

Best,

FJ