Cached Lookups - Cannot Update/Delete Bad Configs

Hi, I am working on a flavor of Globally Cached Lookups located at: maha/druid-lookups at master · yahoo/maha · GitHub

In our local flavor, some interesting behavior occurs:

  • When a lookup starts failing, a new lookup won’t replace the failed lookup definition unless the host’s Druid service is restarted.
    • ex: If you POST a JDBC-based lookup with a good config, update it via API or UI to a config with bad login credentials, and then try to POST the old working config, the “bad credentials entered” error remains on the affected Druid hosts until each host’s Druid service is restarted.
    • ex: If you have a lookup with a bad config repeatedly polling and failing, and you issue a DELETE for that lookup, the lookup is removed from the list of lookups on that tier via both API and UI, but the Druid logs continue to show the lookup attempting to run.
  • Repeatedly posting lookup updates can orphan the old lookup threads without killing them, even when a DELETE is posted. Over time this hogs all threads on the host (see the sketch after this list).
    • ex: Running Druid locally with bin/start-micro-quickstart and repeatedly updating the same lookup on the Historical or Broker tier (via API POST at config/ and config/[tier]/[lookup], as well as via the UI) will eventually cause the Broker and Historical logs to stop outputting anything and the services to fail.
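
To make the orphaned-thread case concrete, here is a minimal sketch of what I suspect is happening (illustrative class and method names, not the actual Maha or Druid code): if the ScheduledFuture for a lookup’s poll task is never cancelled on update or DELETE, the old task keeps firing on its poll period until the JVM restarts.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the suspected leak; pool size and names are illustrative.
public class LookupPollSketch
{
  private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
  private final Map<String, ScheduledFuture<?>> tasks = new ConcurrentHashMap<>();

  public void startOrUpdate(String lookupName, Runnable poll, long pollPeriodMs)
  {
    ScheduledFuture<?> old = tasks.put(
        lookupName,
        scheduler.scheduleAtFixedRate(poll, 0, pollPeriodMs, TimeUnit.MILLISECONDS)
    );
    // If the old future is not cancelled here, the replaced task keeps
    // running (and logging failures) on every poll period until restart.
    if (old != null) {
      old.cancel(true);
    }
  }

  public void delete(String lookupName)
  {
    ScheduledFuture<?> task = tasks.remove(lookupName);
    // Without this cancel, DELETE removes the config from the lookup list
    // but the scheduled poll thread lives on, matching what we observe.
    if (task != null) {
      task.cancel(true);
    }
  }
}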

On my test bench I’ve used Druid 0.17 and 0.21.1, getting identical results from both.

Is there anything in recent Druid updates, or in our old flavor of the Lookup Extractor replicating this six-year-old behavior, that could cause this to occur?

We override some lookup behavior for our own purposes, which means we still rely on some of the old behavior, as in:
MahaLookupExtractorFactory.java

Which resembles this class in Druid:
NamespaceLookupExtractorFactory.java

Which was updated here:

Example of an intentionally faulty config that I removed via the DELETE endpoint but that is still being logged as active 5 days later (this exception posts once per 15-minute (PT15M) poll period; a reconstruction of the config follows the log):

2022-07-18T19:11:26,491 ERROR [MahaNamespaceExtractionCacheManager-72] com.yahoo.maha.maha_druid_lookups.server.lookup.namespace.cache.MahaNamespaceExtractionCacheManager - Failed update namespace [JDBCExtractionNamespace{connectorConfig=DbConnectorConfig{createTables=false, connectURI='jdbc:oracle:thin:@test', user='test_user', passwordProvider=org.apache.druid.metadata.DefaultPasswordProvider, dbcpProperties=null}, table='test.woeid', tsColumn='date', pollPeriod=PT15M, columnList=[woeid, value], primaryKeyColumn='woeid', cacheEnabled=true, lookupName='woeid_lookup_test', firstTimeCaching=true, previousLastUpdateTimestamp=null, kerberosProperties=null, tsColumnConfig=null, kerberosPropertiesEnabled=false}]
org.skife.jdbi.v2.exceptions.UnableToObtainConnectionException: java.sql.SQLRecoverableException: IO Error: Unknown host specified 
	at org.skife.jdbi.v2.DBI.open(DBI.java:230) ~[jdbi-2.63.1.jar:2.63.1]
	at org.skife.jdbi.v2.DBI.withHandle(DBI.java:279) ~[jdbi-2.63.1.jar:2.63.1]
	at com.yahoo.maha.maha_druid_lookups.server.lookup.namespace.JDBCExtractionNamespaceCacheFactory.lambda$getCachePopulator$3(JDBCExtractionNamespaceCacheFactory.java:75) ~[maha-druid-lookups-6.75.jar:?]
	at com.yahoo.maha.maha_druid_lookups.server.lookup.namespace.cache.MahaNamespaceExtractionCacheManager$3.run(MahaNamespaceExtractionCacheManager.java:332) [maha-druid-lookups-6.75.jar:?]
	at com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:582) [guava-16.0.1.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_282]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_282]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_282]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_282]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: java.sql.SQLRecoverableException: IO Error: Unknown host specified 
	at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:774) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:688) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:39) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:691) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at java.sql.DriverManager.getConnection(DriverManager.java:664) ~[?:1.8.0_282]
	at java.sql.DriverManager.getConnection(DriverManager.java:247) ~[?:1.8.0_282]
	at org.skife.jdbi.v2.DBI$3.openConnection(DBI.java:140) ~[jdbi-2.63.1.jar:2.63.1]
	at org.skife.jdbi.v2.DBI.open(DBI.java:212) ~[jdbi-2.63.1.jar:2.63.1]
	... 11 more
Caused by: oracle.net.ns.NetException: Unknown host specified 
	at oracle.net.resolver.HostnameNamingAdapter.resolve(HostnameNamingAdapter.java:209) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at oracle.net.resolver.NameResolver.resolveName(NameResolver.java:131) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:489) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:660) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at oracle.net.ns.NSProtocol.connect(NSProtocol.java:286) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1438) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:518) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:688) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:39) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:691) ~[ojdbc8-12.2.0.1.jar:12.2.0.1.0]
	at java.sql.DriverManager.getConnection(DriverManager.java:664) ~[?:1.8.0_282]
	at java.sql.DriverManager.getConnection(DriverManager.java:247) ~[?:1.8.0_282]
	at org.skife.jdbi.v2.DBI$3.openConnection(DBI.java:140) ~[jdbi-2.63.1.jar:2.63.1]
	at org.skife.jdbi.v2.DBI.open(DBI.java:212) ~[jdbi-2.63.1.jar:2.63.1]
	... 11 more
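
For reference, the lookup definition that produced the log above would have looked roughly like the following. This JSON is reconstructed from the fields in the error message; the “type” value and the password field are placeholders for illustration, not our exact config:

{
  "type": "mahajdbc",
  "lookupName": "woeid_lookup_test",
  "connectorConfig": {
    "createTables": false,
    "connectURI": "jdbc:oracle:thin:@test",
    "user": "test_user",
    "password": "<redacted>"
  },
  "table": "test.woeid",
  "columnList": ["woeid", "value"],
  "primaryKeyColumn": "woeid",
  "tsColumn": "date",
  "pollPeriod": "PT15M",
  "cacheEnabled": true
}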

Attempts to reproduce with a “good” config have failed, so this appears to occur only when the lookup itself is updated into a failing state.

Any help would be greatly appreciated!

Thanks,
Ryan Wagner

Welcome @ryankwagner! Taking a pretty general stab at this, but 0.22.0 introduced:

# Support using MariaDB connector with MySQL extensions

The Druid MySQL extension now supports using the MariaDB connector library as an alternative to the MySQL connector. This can be done by setting druid.metadata.mysql.driver.driverClassName to org.mariadb.jdbc.Driver, and it includes full support for the JDBC URI parameter whitelists used by JDBC lookups and SQL-based ingestion.
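
For example, assuming the MySQL metadata extension is loaded, that would be a single line in common.runtime.properties:

# common.runtime.properties (snippet)
druid.metadata.mysql.driver.driverClassName=org.mariadb.jdbc.Driver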

You might have better luck asking your question in the Apache Druid workspace.

Best,

Mark

@ryankwagner: Thanks for posting a detailed explanation of the issue. I recently saw similar symptoms with lookups. The root cause in the case I observed was:

  1. At one point in time, there were a lot of bad lookup configs (~100) in the cluster. For all those lookups, waitForFirstRunMs was a decent amount of time (at least 10 minutes) and the refresh period for all lookups was around 45 minutes. Updating those lookups to correct configs took more than 2 days, which seemed wrong (see the estimate after this list).
  2. Updating even the good configs was taking too much time because all the lookup-loading threads were hogged by failed loads and retries over bad lookup configs.
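
As a back-of-envelope check, assuming the 2 default loading threads mentioned below: ~100 lookups × ~45 minutes per refresh cycle ÷ 2 threads ≈ 37.5 hours for a single full pass, before adding the waitForFirstRunMs delays and failure retries, so multi-day convergence is plausible.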

To fix that, I made the following change in 0.23: Fail fast incase a lookup load fails by rohangarg · Pull Request #12397 · apache/druid · GitHub
Does your problem seem similar? There may well be more gaps in the code that you are running into.

Regarding the DoS you mention from repeatedly posting updates to a lookup: that is not something I ran into in my exposure. My observation was that the loading/update threads in NamespaceLookupExtractorFactory defaulted to 2, which bounded the parallelism of updates.

I hope that helps!
