I am interested in using lookups in Druid, but I’m concerned about the scalability of the lookup tables. I am planning to serve the data over HTTP and reference it in my lookup definition. I did some searching, but I couldn’t find any information on the size limitations of the lookup file. I saw that issue #3996 on GitHub from 2017 mentioned a 10 MB file working - has this been improved to support larger files? I was also wondering whether the number of rows matters. How would a large number of entries (100,000 to 1,000,000) affect performance, and what would the maximum limit be?
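For context, the kind of definition I have in mind is a globally cached lookup that polls a CSV over HTTP - roughly like the sketch below (the URL, column names, and poll period are placeholders, not my real values):

```json
{
  "type": "cachedNamespace",
  "extractionNamespace": {
    "type": "uri",
    "uri": "http://example.com/lookup-data.csv",
    "namespaceParseSpec": {
      "format": "csv",
      "columns": ["key", "value"],
      "keyColumn": "key",
      "valueColumn": "value"
    },
    "pollPeriod": "PT30M"
  }
}
```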
Lookups are loaded into heap memory (RAM), so the maximum lookup size depends on the amount of RAM available to your nodes. The Druid cluster tuning documentation suggests adding 2 × the total size of all loaded lookups on top of the heap your nodes would otherwise need without lookups.
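As a rough illustration of that sizing heuristic (the file sizes and base heap figure below are made-up example numbers, not recommendations):

```python
# Heap-sizing sketch for on-heap lookups, using the 2x rule of thumb above.
# All figures here are hypothetical examples.
base_heap_gb = 8                      # heap you would provision without lookups
lookup_file_sizes_mb = [50, 120, 30]  # on-disk size of each loaded lookup

lookup_total_mb = sum(lookup_file_sizes_mb)   # 200 MB of lookup data
extra_heap_mb = 2 * lookup_total_mb           # 400 MB extra, per the 2x heuristic
total_heap_gb = base_heap_gb + extra_heap_mb / 1024

print(f"Provision roughly {total_heap_gb:.2f} GB of heap")
```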
Hope that helps!
Thanks for your reply - that’s exactly what I was looking for!
I just had a couple of follow-up questions:
- Is the lookup table replicated on each node, or is it stored in a single location? The documentation for globally cached lookups says that they draw from the same cache pool. Does this mean the lookup tables are stored in just a single location?
- How does Druid calculate the size of a lookup table? For example, I am thinking of storing the lookup data in a CSV with a key column and a value column - if a line in the CSV I am referencing has 8 characters, how many bytes would that occupy once loaded?
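To make that second question concrete, here is the kind of accounting I mean (the CSV row is a made-up example, and the note about JVM overhead is just my understanding, not something I found in the Druid docs):

```python
# Hypothetical lookup CSV row: one key column, one value column.
csv_line = "key1,val1"

# On disk this line is plain ASCII, so it is 1 byte per character
# (plus a trailing newline).
on_disk_bytes = len(csv_line.encode("utf-8"))

# After loading, the row becomes a key/value pair in an in-memory map.
# My assumption is that the in-memory footprint is larger than the raw
# byte count, since each string carries per-object JVM overhead and the
# map adds per-entry bookkeeping - which is what I'd like to confirm.
key, value = csv_line.split(",")
print(key, value, on_disk_bytes)
```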