Storing string in Druid

Hello Druid community,

I am wondering if it is a best practice to store dimensions as strings in Druid.

For our dimensions, we’ve been storing numerical IDs (not sure why) that are associated to strings stored in various MySQL tables (i.e. for users, domain names, countries, etc).

After making Druid queries, we join the the dimension IDs to their respective names taken from MySQL.

Clearly, this isn’t the best solution and it is very limiting, as it needs to be done in out PHP layer.

I know there are inefficiencies with storing strings in Druid, so what would be the best solution in this case?

I would love to know how others handle this issue!

Thanks in advance!

  • Rafael Abbondanza

Hi Rafael,
You can achieve this use case via lookup approach.
Depending on how big is your SQL table, you can cache all keys-value inside the lookup cache.
Let me know if you need more help

Hey Slim,

Thanks for the response.

I can certainly look into this and see if it is doable with the data we have (which is a lot).

Just wondering, would you know the consequences of storing strings directly in Druid?

Best,

Rafael

In addition, it is important that we are able to sort these strings, is that possible with the lookup?

Yes! http://druid.io/docs/latest/querying/lookups.html has some more info

Storing the mapping in a lookup or storing in the raw data is best decided by the following:

Is the lookup mutable? and Do I always only want the most recent version of the lookup?

If the answer to both questions is “yes” then lookups are a good option.

If the answer to either question is “no” then you’ll want to consider storing the data in druid with the raw data.

The lookup makes druid treat the new value as if it were always the new value. The old value is masked. For example, numeric IDs translated to alpha-numeric descriptions will appear as if they were ALWAYS alpha-numeric descriptions… With the caveat that you cannot ask “What was the alpha-numeric description 6 months ago? Don’t look at what it is now.”

The down-side of storing the alpha-numeric description as part of the raw data is that it increases the size of the raw data slightly (due to the alpha-numeric strings being placed in every segment, usually along side the numeric ID as strings). This should only be an issue if you are running at the very threshold of disk space for your machines (in which case you may need more machines or larger drives). More likely you will be cpu or memory (page cache i/o) bound.

Cheers,

Charles Allen

ah… That makes sense!

This means that the sorting will occur AFTER we have looked up the values, right?

And, the sorting will happen on the values and not the keys?

Also, we have some immutable dimensions (Country IDs) and some mutable dimensions (Domain IDs), so would it be ideal to use a mixture of raw data and lookups?

Thanks a lot!

  • Rafael

ah… That makes sense!

This means that the sorting will occur AFTER we have looked up the values, right?

And, the sorting will happen on the values and not the keys?

Yes. This behavior should hold true for all dimension extraction functions.

Mixing is perfectly acceptable. You can also populate a domain dimension and an id dimension at ETL time, and elect at query time weather you want to use the mutable lookups via dimension extraction of id–>domain or use the original domain in the segment (one would be a query against an id dimension with a dimension extraction, the other would be against a domain dimension)

Great! I appreciate the answers…

You have been very helpful!