How to get all dimensions that were at least once indexed together with a particular dimension?

Hey,

I’m using this http://druid.io/docs/latest/querying/datasourcemetadataquery.html to identify suspicious dimensions by cardinality and size.

However I cannot figure out how to associate these dimensions to client dimensions for us to be able to report only relevant dimensions to a client.

I guess I have to issue an extra query for each client dimension value and collect all dimensions that it has ever indexed.

Please, what is the most efficient way to do that?

I think you meant http://druid.io/docs/latest/querying/segmentmetadataquery.html instead.
I am not sure if i understood the question correctly, Could you elaborate more on exactly what you want to achieve by some example ?

Hi Nishant,

no, I meant datasourcemetadataquery, it gives me what I need, cardinality of dimensions in a certain interval.

The example :

campaign “foo” has dynamic properties “d_bar, d_baz”,

campaign “xxx” has dynamic properties “d_fuz”.

^^ All are dimensions.

I simply want to identify properties with high cardinality which I can get from Data Source Metadata Queries,

but I don’t know how to effectively do : List all properties of a campaign “foo”.

sounds like an interesting question.

I’m not quite sure what you mean by “dynamic properties” or by “client dimensions”.

If you want to analyze which data within your datasource rolls up nicely and wich doesn’t you could do the following:

In your ingestion specs you could include the following measure:

"metricsSpec": [
   {"type": "count", "name": "count"},

This adds measure which contains the number of ingested records that went into each internal record stored in Druid. Usually data rolls up during ingestion because data is grouped on the dimensions you define and if several input records have the same dimensions, they get merged into a single record within Druid.

Once you have this measure, you can do three things with it:

  • you can use a longSum aggregator over this measure to get the number of input records that meet whatever query scenario you are in
  • you can use a count aggregator over this measure to get the number of records in druid for the given query scenario
  • you can form the ratio of the two (longSum / count) to get the rollup, a measure of how well the data rolled up

if you are using pivot, you could add the following measures to a datasource’s measures section of your pivot config file:

- name: ingestedRecordCount
  title: Ingested Record Count
  format: '0,0.00a'
  formula: $main.sum($count)

- name: cuboidCount
  title: Cuboid Count
  format: '0,0.00a'
  formula: $main.count()

- name: rollupRatio
  title: Rollup Ratio
  format: '0,0.00'
  formula: $main.sum($count) / $main.count()

If you are working with plyql, then the expressions to be used should be similar to the "formala" sections above.

You could also issue native Druid queries instead in which case you would define aggregators over the count measure and use longSum and count as the aggregation types.

The screenshot below shows what this would look like. You could define your query scenario and drill down into data that doesn't roll up nicely


The other two things we do is

  • issue cardinality queries to learn about which dimensions correlate and which don’t. This gives you an idea of how hurtful it is to the resulting overall datavolume to have added a new dimension.

Cardinality aggregators are described here: http://druid.io/docs/latest/querying/aggregations.html

Use the byRow=true variant

  • we parse the broker query logs which contain the incoming queries in json format. We then parse the dimensions and measures contained in each query out of the json. This is a bit of work because dimensions can appear everywhere, in groupbys/splits, in filter conditions and also in aggregator expressions.

so we’d end up with a table in which each query is a row and the columns represent which dimensions and which measures the query touched.

This can then allow you to analyze query patterns. It will give you a sense of which dimensions people query and which dimensions get queried together. Maybe this is what you were trying to achieve? If so, I believe there is no builtin way to get this. You need to process the query logs that Druid emits. You could even try to parse them and ingest the result back into Druid as a datasource.