[0.9.1.1] add/update globally cached lookup -> multiple delete/reload of all lookups on historicals

Running Druid 0.9.1.1, we’re seeing behavior that seems like it might be a bug - or, if not a bug, something that is begging to be optimized.

The behavior is similar to that described here: https://groups.google.com/forum/#!msg/druid-user/2WOGWojpJTc/JZlX6N43CAAJ

Say we have previously established three dynamic lookups. If we update, the logs on a historical node look like this:

2017-09-07T15:51:11,662 INFO [qtp1878677242-36] io.druid.server.lookup.namespace.cache.NamespaceExtractionCacheManager - Deleting namespace [6b885cb8-638a-4ba7-91da-3226acf3600f]
2017-09-07T15:51:11,662 INFO [qtp1878677242-36] io.druid.server.lookup.namespace.cache.NamespaceExtractionCacheManager - Deleting namespace [719fa0cf-8113-440d-bb63-3142d7c5fad1]
2017-09-07T15:51:11,663 INFO [qtp1878677242-36] io.druid.server.lookup.namespace.cache.NamespaceExtractionCacheManager - Deleting namespace [9fd1ee88-3b4a-4d68-be41-f4a009e13239]
2017-09-07T15:51:22,486 INFO [NamespaceExtractionCacheManager-0] io.druid.server.lookup.namespace.JDBCExtractionNamespaceCacheFactory - Finished loading 96539 values for namespace[5b4919f5-89cd-4c03-a5da-fb92c825d34a]
2017-09-07T15:51:22,777 INFO [NamespaceExtractionCacheManager-0] io.druid.server.lookup.namespace.JDBCExtractionNamespaceCacheFactory - Finished loading 6 values for namespace[4d5a63f5-5355-4b78-a17e-c8d3860947e9]
2017-09-07T15:51:35,528 INFO [qtp1878677242-39] io.druid.server.lookup.namespace.cache.NamespaceExtractionCacheManager - Deleting namespace [5b4919f5-89cd-4c03-a5da-fb92c825d34a]
2017-09-07T15:51:35,528 INFO [qtp1878677242-39] io.druid.server.lookup.namespace.cache.NamespaceExtractionCacheManager - Deleting namespace [4d5a63f5-5355-4b78-a17e-c8d3860947e9]
2017-09-07T15:51:35,529 INFO [qtp1878677242-39] io.druid.server.lookup.namespace.cache.NamespaceExtractionCacheManager - Deleting namespace [40ed9546-eb62-486b-9fc9-3fbf6577672b]
2017-09-07T15:51:35,529 ERROR [qtp1878677242-39] io.druid.query.lookup.LookupReferencesManager - Error closing [attr_def:192]:[io.druid.query.lookup.NamespaceLookupExtractorFactory@419d6610]
2017-09-07T15:51:36,110 INFO [NamespaceExtractionCacheManager-0] io.druid.server.lookup.namespace.JDBCExtractionNamespaceCacheFactory - Finished loading 96066 values for namespace[40ed9546-eb62-486b-9fc9-3fbf6577672b]
2017-09-07T15:51:46,103 INFO [NamespaceExtractionCacheManager-0] io.druid.server.lookup.namespace.JDBCExtractionNamespaceCacheFactory - Finished loading 96539 values for namespace[b9484f58-3098-4a03-a43b-035905800946]
2017-09-07T15:51:46,320 INFO [NamespaceExtractionCacheManager-0] io.druid.server.lookup.namespace.JDBCExtractionNamespaceCacheFactory - Finished loading 6 values for namespace[d1007354-79e5-4fb7-9bcc-7b0ac647ee34]
2017-09-07T15:51:50,782 INFO [qtp1878677242-53] io.druid.server.lookup.namespace.cache.NamespaceExtractionCacheManager - Deleting namespace [b9484f58-3098-4a03-a43b-035905800946]
2017-09-07T15:51:50,782 INFO [qtp1878677242-53] io.druid.server.lookup.namespace.cache.NamespaceExtractionCacheManager - Deleting namespace [d1007354-79e5-4fb7-9bcc-7b0ac647ee34]
2017-09-07T15:51:50,782 INFO [qtp1878677242-53] io.druid.server.lookup.namespace.cache.NamespaceExtractionCacheManager - Deleting namespace [b2dc4f23-1aa7-48a8-b28c-a0f52ff410df]
2017-09-07T15:51:50,782 ERROR [qtp1878677242-53] io.druid.query.lookup.LookupReferencesManager - Error closing [attr_def:192]:[io.druid.query.lookup.NamespaceLookupExtractorFactory@6a5dfb5f]
2017-09-07T15:51:56,662 INFO [NamespaceExtractionCacheManager-0] io.druid.server.lookup.namespace.JDBCExtractionNamespaceCacheFactory - Finished loading 96066 values for namespace[b2dc4f23-1aa7-48a8-b28c-a0f52ff410df]
2017-09-07T15:52:07,537 INFO [NamespaceExtractionCacheManager-0] io.druid.server.lookup.namespace.JDBCExtractionNamespaceCacheFactory - Finished loading 96539 values for namespace[c75990ae-2d57-491b-8ab9-eff8815d641e]
2017-09-07T15:52:07,748 INFO [NamespaceExtractionCacheManager-0] io.druid.server.lookup.namespace.JDBCExtractionNamespaceCacheFactory - Finished loading 6 values for namespace[be8825db-088f-4df2-b17f-840ef07582e9]
2017-09-07T15:52:18,100 INFO [NamespaceExtractionCacheManager-0] io.druid.server.lookup.namespace.JDBCExtractionNamespaceCacheFactory - Finished loading 96066 values for namespace[8ce75c48-7097-464d-b8ad-420ae666d815]

This log example is not entirely clean or clear… but from what we can tell this is what happens:

  1. All namespaces are deleted.

  2. All namespaces are loaded.

  3. It looks like some deletes happen before the targeted namespace is loaded, resulting in an “Error closing” message - look at 40ed9546-eb62-486b-9fc9-3fbf6577672b in the example.

  4. This cycle can occur multiple times after a single add or update. (perhaps because of #3 - does that trigger a delete/reload when encountered?)

We observe pretty much the same thing when we add a new namespace - if we add a fifth namespace, we see four deletes followed by five loads.

The consequence of this behavior is that queries targeting a namespace after it has been deleted but before it has been reloaded fail with “NamespaceVanished” exceptions. That would be tolerable if it happened only to the one namespace being updated. But because it happens across all namespaces, the window can last tens of seconds even with the small set we have so far. We expect to have hundreds of namespaces once we scale this feature across all our customers, so we’re looking at possibly multiple minutes of “NamespaceVanished” exceptions. This is a severe enough problem that if we can’t figure out how to mitigate it in some way, we may not be able to use this feature as planned.

If each namespace were deleted and then immediately reloaded, rather than all deletes being done followed by all loads, the window would be much smaller. And of course, it seems like there should be no reason at all to delete/reload the namespaces that weren’t actually touched by the update or insert.
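A toy timeline makes the difference concrete. The constant per-namespace load time and the serial, single-threaded loader are assumptions for illustration only - the logs above show real loads taking roughly 10-20 seconds each, and all of them running on one NamespaceExtractionCacheManager thread.

```python
# Toy model: how long is each namespace unavailable under the two orderings?
LOAD_TIME = 10  # assumed seconds to reload one namespace
namespaces = ["a", "b", "c"]

def all_deletes_then_all_loads(ns):
    # Every namespace is deleted up front at t=0; namespace i only comes
    # back when the i-th serial reload finishes.
    return {n: (0, (i + 1) * LOAD_TIME) for i, n in enumerate(ns)}

def delete_then_reload_each(ns):
    # Namespace i is deleted just before its own reload starts, so it is
    # missing only for the duration of that single reload.
    return {n: (i * LOAD_TIME, (i + 1) * LOAD_TIME) for i, n in enumerate(ns)}

for label, strategy in [("all-then-all", all_deletes_then_all_loads),
                        ("per-namespace", delete_then_reload_each)]:
    windows = strategy(namespaces)
    worst = max(end - start for start, end in windows.values())
    print(f"{label}: worst unavailability window = {worst}s")
```

With three namespaces the worst-case window shrinks from 30s to 10s; with hundreds of namespaces, the all-then-all ordering is what turns one update into minutes of “NamespaceVanished” exceptions.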

Guessing here, but in the log output the UIDs change every time we go through that cycle. Does an insert or update cause all the IDs to be recreated, so that it appears that all previous namespaces were dropped?

Apologies for the bump… but any ideas about this? If it’s really a bug, I’d be happy to log it.

One last bump to see if anyone has anything to say…

Another implication of the way this works is that you cannot effectively stagger the polling periods of different lookups. I.e., if I set one lookup with a 30-minute polling period, and then 15 minutes later add another one, also with a 30-minute polling period, I’d like to see their polls happen 15 minutes apart. But because the first lookup is torn down and replaced, they’ll all end up polling on the same schedule. Multiplied by the number of Druid nodes in play, that can be quite a hammering every time the polling period comes due.
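A minimal sketch of that schedule collapse, using the 30-minute pollPeriod from the example above; the arithmetic model (poll ticks restarting from the moment a lookup is created or recreated) is an assumption about the behavior, not confirmed internals.

```python
# Times are minutes. A lookup polls every PERIOD minutes, with the phase
# set by when its schedule (re)started.
PERIOD = 30

def poll_times(start, horizon=120):
    """Poll ticks for a lookup whose schedule (re)starts at `start`."""
    return list(range(start, horizon, PERIOD))

# What we'd like: lookup B, added 15 minutes after A, keeps its own phase,
# so the external source is hit every 15 minutes instead of twice at once.
a_wanted = poll_times(0)    # [0, 30, 60, 90]
b_wanted = poll_times(15)   # [15, 45, 75, 105]

# What the delete/reload behavior gives us: adding B tears down and
# recreates A, so both schedules restart at t=15 and every subsequent
# poll lands at the same moment.
a_actual = poll_times(15)
b_actual = poll_times(15)
print("overlapping polls:", sorted(set(a_actual) & set(b_actual)))
```

Under the desired behavior the two poll sets never overlap; under the observed behavior they are identical, and every historical in the cluster hits the backing store at the same instant.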

If your data source is robust, that won’t matter. But if your lookups come from some external service that is not scaled appropriately to the volume of all the lookup requests happening at once… you may be sad, especially if you release to production, start setting lookups, and knock over a service that other people are also relying on for production products. Um, in theory anyway…
