Namespace lookup with extranctionFn question

I’ve been trying to use the namespaced lookup functionality with 0.8.1. After some configuration issues I struggled with (due to wrong example in the doc), it’s working now.

Basically, I added namespace lookup configuration with mysql table in broker, historical nodes (no real time node in my setup) and I set extractionFn in the query to specify lookup namespace. Druid replaces “id” dimension value with “name” in the db table. so far, so good.

But what I really want to do with this is to get both “id” and “name” in the query result so that I can use “id” to filter in the subsequent queries. The problem we have in our data set is that “id” is unique but “name” is not necessarily unique so we want to use “name” for display only but still use “id” for filtering.

Is there any way to get both “id” and “name” for dimension value in topN query ?

Hi Jaebin, what docs were wrong, are you able to submit a PR to fix them?

Regarding returning two “dimensions” in a topN (raw dimension and lookup dimension) that is not currently possible. It is possible to filter on the raw ID and use a lookup for the topN dimension that is returned.

So the question now becomes how do you get the desired association between the “id” and the “name” without having to know about the raw data.

Unfortunately the dataset you’ve described is a non-injective mapping you are wanting to treat as injective. So I see one of two solution groups:

  1. Make it injective
  2. If you are able to breakup the namespaced lookup into multiple injective lookups (I don’t know if this is possible with your dataset) that might solve your issues.
  3. You have the option of adding fun things like non-printing UTF-8 characters (like the zero-width-space) to the "name"s to make then unique if they only have to “look” the same.
  4. Find some sort of distinguishing characteristic for which you can guarantee injective properties. Example: If you know that IDs map to unique names within a particular metro zone, then any particular query that only covers one metro zone is injective and can use the big lookup with the raw name. For queries which cover multiple metro zones you can tell to use a different lookup definition, maybe one that pulls from a table (or view) where the metro zone is appended to the name.
  5. Change the query
  6. Do a group-by query, which will be much slower but will return any number of dimensions you want.
  7. Get the IDs back and handle the mapping in some sort of second query (either by querying the DB or issuing some kind of second query to Druid)… not really ideal

IMHO being able to return the original dimension value with the extractionFn modification would be a neat feature, but is not currently implemented.

Sorry I don’t have a good out-of-the-box solution, and those are just some initial thoughts.

Cheers,

Charles Allen

Thanks Charles for thoughtful answer.

Regarding the doc, I created an issue and b-slim created PR (https://github.com/druid-io/druid/pull/1854). It seems it got already merged.

Regarding the workaround you suggested, I’ve thought about the similar approach listed in “Make it injective” category (I don’t want to change the queries since it has performance implication) but it would be quirky workaround and make UI side complicated. (I was thinking to create “name” with concatenating “id” and real “name” and UI parses). Knowing that there is no way to get “id” and “name” together currently, I think it would be better to have this logic in API layer (proxy layer between UI and Druid).

Thanks for confirming this.

Jaebin