Druid 0.11.0 - Middle Managers - NullPointerException in Jackson

Hi guys,

We are running Druid 0.11.0 in production. Every hour we run approximately 200 tasks on the middle managers.

Every few hours, one or two real-time indexing tasks fail with the same error:

2017-12-19T12:01:09,813 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking start method[public void io.druid.curator.discovery.ServerDiscoverySelector.start() throws java.lang.Exception] on object[io.druid.curator.discovery.ServerDiscoverySelector@718fd7c1].
2017-12-19T12:01:09,814 INFO [main-SendThread(172.30.103.97:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established to 172.30.103.97/172.30.103.97:2181, initiating session
2017-12-19T12:01:09,837 INFO [main-SendThread(172.30.103.97:2181)] org.apache.zookeeper.ClientCnxn - Session establishment complete on server 172.30.103.97/172.30.103.97:2181, sessionid = 0x160573708191987, negotiated timeout = 30000
2017-12-19T12:01:09,841 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
2017-12-19T12:01:09,936 ERROR [main] io.druid.cli.CliPeon - Error when starting up. Failing.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_112]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_112]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_112]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_112]
at io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:414) ~[java-util-0.11.0.jar:0.11.0]
at io.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:311) ~[java-util-0.11.0.jar:0.11.0]
at io.druid.guice.LifecycleModule$2.start(LifecycleModule.java:156) ~[druid-api-0.11.0.jar:0.11.0]
at io.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:101) [druid-services-0.11.0.jar:0.11.0]
at io.druid.cli.CliPeon.run(CliPeon.java:283) [druid-services-0.11.0.jar:0.11.0]
at io.druid.cli.Main.main(Main.java:108) [druid-services-0.11.0.jar:0.11.0]
Caused by: java.lang.NullPointerException
at org.codehaus.jackson.JsonFactory.createJsonParser(JsonFactory.java:604) ~[jackson-core-asl-1.9.13.jar:1.9.13]
at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1973) ~[jackson-mapper-asl-1.9.13.jar:1.9.13]
at org.apache.curator.x.discovery.details.JsonInstanceSerializer.deserialize(JsonInstanceSerializer.java:86) ~[curator-x-discovery-4.0.0.jar:?]
at org.apache.curator.x.discovery.details.ServiceCacheImpl.addInstance(ServiceCacheImpl.java:193) ~[curator-x-discovery-4.0.0.jar:?]
at org.apache.curator.x.discovery.details.ServiceCacheImpl.start(ServiceCacheImpl.java:96) ~[curator-x-discovery-4.0.0.jar:?]
at org.apache.curator.x.discovery.details.ServiceProviderImpl.start(ServiceProviderImpl.java:75) ~[curator-x-discovery-4.0.0.jar:?]
at io.druid.curator.discovery.ServerDiscoverySelector.start(ServerDiscoverySelector.java:132) ~[druid-server-0.11.0.jar:0.11.0]
… 10 more

I checked the overlord in the hope of finding more information about why this task failed, but it was a dead end: there were no errors related to this task on the overlord.

N.B.: Here is my classpath: $JAVA_OPTS -classpath /opt/druid/lib/*:/etc/druid/middleManager:/etc/druid/_common

/opt/druid/lib/* contains the jars

/etc/druid/middleManager contains the middle manager configuration

/etc/druid/_common contains the common configuration
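For completeness, the middle manager itself is launched roughly like this (the io.druid.cli.Main server middleManager entry point is the standard one; the paths are just our setup from above):

java $JAVA_OPTS -classpath "/opt/druid/lib/*:/etc/druid/middleManager:/etc/druid/_common" io.druid.cli.Main server middleManager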

Any help or pointers would be appreciated!

Thanks

Unfortunately I don't have a solution here, but I can confirm that we see the very same issue with Druid 0.11.0; in fact we have seen this same problem since Druid 0.9.2.

We're running about 500 indexing tasks per hour and see roughly 5 failures, so a failure rate of about one in a hundred sounds right. I tried being explicit with the classpath and putting the Jackson jars at the start of the list to see if that helped, but it did not. My suspicion is that occasionally the Jackson jars in hadoop-client are being preferred and are then incompatible with the Druid libs (the probe sketched below is one way to check that). The closest group post I have seen to this historically is:

https://groups.google.com/forum/#!topic/druid-user/0yda77Y3Va0

but the resolution there is not helpful for those running a stock 0.11.0. I wonder whether dropping the hadoop extensions config would help, but that would affect our batch ingestions. This unreliability is the current biggest barrier to Druid adoption for us, so I would be more than happy to assist with resolving this if someone could suggest some practical ways forward, please?!
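A quick way to test that classpath suspicion (a throwaway probe of my own, nothing Druid-specific) is to run something like the following with the exact classpath the peon gets, and see which jar each Jackson class is actually loaded from:

public class WhichJackson {
    public static void main(String[] args) {
        // The stack trace above fails in the old codehaus Jackson; probe
        // the fasterxml one too, since both end up on the peon classpath.
        String[] names = {
                "org.codehaus.jackson.JsonFactory",
                "com.fasterxml.jackson.core.JsonFactory"
        };
        for (String name : names) {
            try {
                Class<?> clazz = Class.forName(name);
                System.out.println(name + " -> "
                        + clazz.getProtectionDomain().getCodeSource().getLocation());
            } catch (ClassNotFoundException e) {
                System.out.println(name + " -> not on the classpath");
            }
        }
    }
}

If the printed locations ever point into the hadoop client jars rather than /opt/druid/lib, that would confirm the ordering theory.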

Thanks and best wishes,

Daniel

Hi Daniel,

How did you come to the conclusion that the Jackson jars from the hadoop dependencies are being used? In our case we do not import any hadoop dependencies, but I might have overlooked something.

Thanks,

Karim

Karim,

The hadoop link was a hunch based on the presence of Jackson jars with different versions in that space.

However, about 10 hours ago I set:

druid.extensions.searchCurrentClassloader=false

on one middleManager and have seen no failures since on that node. I would normally have expected to see a couple of failures by now. So it's not entirely conclusive, but it looks like setting this to false might help.
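(In case anyone wants to repeat the experiment: I put the property in that node's runtime.properties, i.e. the middleManager config directory on the classpath; exact placement may differ in your setup.)

# middleManager runtime.properties (placement illustrative)
druid.extensions.searchCurrentClassloader=false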

I should know more certainly whether this solves this problem entirely within the next day or so.

Best wishes,

Daniel

No, sadly that was a false hope and the exceptions are still occurring. I merely got lucky for a while it seems…

Best wishes,

Daniel

This seems similar:

(but with no solution…)

Best wishes,

Daniel

My main hope for this issue is that some work for:

might magically make the problem go away. Failing that, the Druid Kafka indexing service seems less susceptible to the problem, albeit for certain workloads Tranquility seems to be more performant.

Best wishes,

Daniel

Wanted to share an in-flight fix here. We're also running around one thousand tasks per hour, and were seeing the exact same error trace as the OP.

After a bit of digging, we localized the failure to a possible race condition within Curator service discovery. From what we can tell, access to the PathChildrenCache used by ServiceCacheImpl may not be fully thread-safe, and in rare cases entries in the cache can be wiped to null by one thread and then read by another. A null-check on the data from the cache before use should handle this condition more gracefully. We've got a PR out on the Curator project: https://github.com/apache/curator/pull/261
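To make the failure mode concrete, here is a minimal self-contained sketch (names like cachedData and deserialize are illustrative stand-ins, not Curator's actual internals). The unguarded read is exactly what surfaces as the NullPointerException inside JsonFactory.createJsonParser:

import java.util.concurrent.atomic.AtomicReference;

public class CacheNullRaceSketch {
    // Stands in for the payload slot of one PathChildrenCache entry.
    private static final AtomicReference<byte[]> cachedData =
            new AtomicReference<>("{\"name\":\"overlord\"}".getBytes());

    // Stands in for JsonInstanceSerializer.deserialize(byte[]); handing
    // the parser a null payload is what produces the NPE in the trace.
    private static String deserialize(byte[] bytes) {
        return new String(bytes); // throws NullPointerException on null
    }

    public static void main(String[] args) {
        // Simulate the racing update wiping the cached entry between the
        // "child added" event and the reader's use of it.
        cachedData.set(null);

        byte[] snapshot = cachedData.get();
        // The guard, in spirit: skip the entry rather than letting the
        // deserializer blow up and kill the peon during startup.
        if (snapshot == null) {
            System.out.println("payload not ready, skipping this instance");
            return;
        }
        System.out.println("instance: " + deserialize(snapshot));
    }
}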

We’ve got the fix applied on our cluster, and haven’t seen failed tasks since.

Hope this helps a bit. This issue seems to be incredibly rare, given how little discussion there seems to be around it.

Jason

Hi Jason,

Thank you for sharing. These failing tasks were getting annoying for us (especially because we are scaling up right now so the race condition is becoming more common).

I saw that the Curator maintainers merged a different patch for this bug (https://github.com/apache/curator/commit/cc30b67c8dbe24babd31ed654e9536fe776b8a18).

So on Sunday I cloned the Curator repository, switched to the 4.0.0 branch, manually applied this patch, and compiled the project. I then swapped the curator-x-discovery jar on our middle managers, and we have seen no errors since (Druid 0.11.0).
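For anyone who wants to do the same, the steps were roughly these (branch and module names reproduced from memory, so double-check them against the Curator repo):

git clone https://github.com/apache/curator.git
cd curator
git checkout 4.0.0   # the release branch mentioned above; the exact ref may differ
git cherry-pick cc30b67c8dbe24babd31ed654e9536fe776b8a18
mvn -pl curator-x-discovery -am -DskipTests package
# then replace the curator-x-discovery jar under /opt/druid/lib/ on each
# middle manager and restart it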

I also tried the jars built from the master branch, but those failed the indexing task at the very end.

Thank you for identifying this and submitting a fix on curator repo.

I hope my feedback will help others.

I have hit the same issue, but only very occasionally.