namespace lookup csv/tsv best practice (0.8.2)

Hi,

We’re going to use the namespace lookup functionality via TSV files read from S3. (At least at first; we will probably switch to a JDBC source at some point.)

For our use case, we have a single dimension used as a key that gets mapped to a dozen or so lookup columns. So we have a single monolithic TSV with columns:

Something, SomethingElse, Key, L1, L2, L3, L4, … L12

(We are actually using a TSV, but I’m writing it as a CSV here because that’s easier in a forum post.)
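For concreteness, a made-up header and data row of the actual tab-separated file would look like:

Something	SomethingElse	Key	L1	L2	…	L12
foo	bar	key-123	v1	v2	…	v12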

So then we have a dozen different lookups set in config:

druid.query.extraction.namespace.lookups=[\
{"type":"uri","namespace":"L1_lookup","uri":"s3://bucket/","namespaceParseSpec":{"format":"tsv","columns":["Something","SomethingElse","Key","L1","L2","L3",...,"L12"],"keyColumn":"Key","valueColumn":"L1"},"pollPeriod":"PT5M","versionRegex":".*\\\\.tsv"},\
{"type":"uri","namespace":"L2_lookup","uri":"s3://bucket/","namespaceParseSpec":{"format":"tsv","columns":["Something","SomethingElse","Key","L1","L2","L3",...,"L12"],"keyColumn":"Key","valueColumn":"L2"},"pollPeriod":"PT5M","versionRegex":".*\\\\.tsv"},\
…]


I'm wondering how this behaves when Druid reads the TSV. Will it read the entire TSV once for each of the 12 different lookups, scanning/parsing the file 12 times in order to pull out each different valueColumn? And is there any performance concern as the TSV grows (more rows and/or more columns)?

I can imagine creating a separate TSV for each lookup, containing only the key and one value column (sketch below), but I would rather not have to maintain a dozen extra files.
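For example (hypothetical file layout), a two-column L1-only file with just Key and L1, and a correspondingly smaller config entry per lookup (note that the versionRegex would also have to distinguish the per-lookup files if they share a bucket prefix):

{"type":"uri","namespace":"L1_lookup","uri":"s3://bucket/","namespaceParseSpec":{"format":"tsv","columns":["Key","L1"],"keyColumn":"Key","valueColumn":"L1"},"pollPeriod":"PT5M","versionRegex":"L1_.*\\\\.tsv"}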

(We are using 0.8.2 but will likely upgrade to 0.9.0 before too long.)

Lookups are undergoing a lot of improvements and feature additions now and in the immediate future, so you’re asking the right kinds of questions.

The current implementation does exactly what you suspect: it scans the file N times for N different lookups.

There have been suggestions in the community (see https://github.com/druid-io/druid/issues/2523 and https://github.com/druid-io/druid/pull/2524, for example) that a single scan of a data source should be able to populate multiple lookups.

Right now it does multiple scans; in the near future there should be a way to do one scan and get multiple lookups, but that feature is still relatively early in development.

Please document your use case in https://github.com/druid-io/druid/issues/2523

So, I just got a surprising S3 bill. Is it the case that when Druid polls an S3 lookup source (as governed by the “pollPeriod” config value), it does a full GET on the lookup files rather than just checking the timestamp? I can’t find any indication either way in the Druid logs, but based on the evidence of our S3 bill, that sure seems to be what is happening.
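To put rough numbers on it: with 12 lookups each polling every 5 minutes, that is 12 × 12 = 144 full-file GETs per hour, or about 3,456 per day. If the file were, say, 100 MB (purely an illustrative size), that would be roughly 340 GB pulled from S3 every day for a file that may rarely change.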

If so, all the more reason to avoid multiple scans of the same file. Or better still, fix it to actually use the timestamps!
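For reference, S3 itself supports exactly this: a GET can carry an If-Modified-Since constraint, so an unchanged object costs one request and no data transfer. Here is a minimal sketch of what I mean using the AWS SDK for Java (bucket and key names are placeholders; this is not what Druid does today, just what “use the timestamps” could look like):

import java.util.Date;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class ConditionalPoll {
  public static void main(String[] args) throws Exception {
    AmazonS3 s3 = new AmazonS3Client(); // default credential chain
    Date lastSeen = new Date(0L);       // in practice, remembered from the previous poll

    // Only return the object if it changed since the last poll.
    GetObjectRequest req = new GetObjectRequest("my-bucket", "lookups/lookup.tsv")
        .withModifiedSinceConstraint(lastSeen);

    // getObject() returns null when the modified-since constraint is not met,
    // so an unchanged file costs a single request and zero bytes of transfer.
    S3Object obj = s3.getObject(req);
    if (obj == null) {
      System.out.println("Unchanged; keep the cached lookup map.");
    } else {
      try {
        // Re-parse obj.getObjectContent() and remember the new timestamp here.
        lastSeen = obj.getObjectMetadata().getLastModified();
        System.out.println("Changed at " + lastSeen + "; re-parse the file.");
      } finally {
        obj.close();
      }
    }
  }
}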