Druid Native Tasks Fail Occasionally

Hi all,

Our native tasks fail occasionally with the following exception:

2019-06-05T17:52:44,338 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Encountered exception in BUILD_SEGMENTS.
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.io.IOException: java.lang.NullPointerException
at org.apache.druid.data.input.impl.prefetch.Fetcher.checkFetchException(Fetcher.java:200) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.data.input.impl.prefetch.Fetcher.openObjectFromLocal(Fetcher.java:221) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.data.input.impl.prefetch.Fetcher.next(Fetcher.java:174) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.data.input.impl.prefetch.PrefetchableTextFilesFirehoseFactory$2.next(PrefetchableTextFilesFirehoseFactory.java:223) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.data.input.impl.prefetch.PrefetchableTextFilesFirehoseFactory$2.next(PrefetchableTextFilesFirehoseFactory.java:209) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.data.input.impl.FileIteratingFirehose.getNextLineIterator(FileIteratingFirehose.java:90) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.data.input.impl.FileIteratingFirehose.hasMore(FileIteratingFirehose.java:67) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:997) ~[druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.indexing.common.task.IndexTask.run(IndexTask.java:466) [druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:421) [druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:393) [druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: java.lang.NullPointerException
at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_212]
at java.util.concurrent.FutureTask.get(FutureTask.java:206) ~[?:1.8.0_212]
at org.apache.druid.data.input.impl.prefetch.Fetcher.checkFetchException(Fetcher.java:188) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
… 14 more
Caused by: java.io.IOException: java.lang.NullPointerException
at org.apache.druid.data.input.impl.prefetch.FileFetcher.download(FileFetcher.java:109) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.data.input.impl.prefetch.Fetcher.fetch(Fetcher.java:135) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.data.input.impl.prefetch.Fetcher.lambda$fetchIfNeeded$0(Fetcher.java:111) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
… 4 more
Caused by: java.lang.NullPointerException
at org.apache.druid.data.input.impl.prefetch.FileFetcher.lambda$download$0(FileFetcher.java:97) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:86) ~[java-util-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:125) ~[java-util-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.data.input.impl.prefetch.FileFetcher.download(FileFetcher.java:95) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.data.input.impl.prefetch.Fetcher.fetch(Fetcher.java:135) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
at org.apache.druid.data.input.impl.prefetch.Fetcher.lambda$fetchIfNeeded$0(Fetcher.java:111) ~[druid-api-0.13.0-incubating.jar:0.13.0-incubating]
… 4 more
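For readers following the trace, here is a minimal, self-contained sketch of how a chain with this shape can arise. It is not Druid's code. The class `ExceptionChainSketch` and its methods `download`, `runAndCapture`, and `rootCause` are hypothetical names made up for illustration, and the sketch omits the retry loop. It only assumes what the trace itself shows: a NullPointerException thrown inside a download step on a background fetch thread is rewrapped as an IOException, surfaces through `Future.get()` as an ExecutionException, and is finally rethrown as a RuntimeException on the task thread.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExceptionChainSketch {

    // Hypothetical stand-in for a download step: any non-IOException thrown
    // by the body is rewrapped as an IOException.
    static void download(Callable<Void> body) throws IOException {
        try {
            body.call();
        } catch (IOException e) {
            throw e;
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    // Runs the download on a background thread, then surfaces the failure on
    // the calling thread. Returns null if nothing failed.
    static RuntimeException runAndCapture() throws InterruptedException {
        ExecutorService fetchExecutor = Executors.newSingleThreadExecutor();
        try {
            Future<Void> future = fetchExecutor.submit(() -> {
                download(() -> {
                    // Deliberate failure standing in for the unknown null
                    // dereference inside the real download lambda.
                    throw new NullPointerException();
                });
                return null;
            });
            try {
                future.get();
                return null;
            } catch (ExecutionException e) {
                return new RuntimeException(e);
            }
        } finally {
            fetchExecutor.shutdownNow();
        }
    }

    // Walks the cause chain down to the innermost exception.
    static Throwable rootCause(Throwable t) {
        while (t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    public static void main(String[] args) throws Exception {
        // Prints each exception class in the chain, outermost first.
        for (Throwable t = runAndCapture(); t != null; t = t.getCause()) {
            System.out.println(t.getClass().getName());
        }
    }
}
```

The point of the sketch is that the outer three exceptions are only wrappers. The frame worth investigating is the innermost one, which in the trace above is `FileFetcher.lambda$download$0(FileFetcher.java:97)`.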

Our MiddleManagers run only native tasks. We have 7 instances, each with 8 cores and 32GB RAM, and we run 6 workers per host.
We get this exception for 1%-2% of the tasks we run locally (native), which is pretty significant for us.

Druid version: 0.13.0-incubating

Let me know if you need more info regarding this issue.
Any help will be greatly appreciated.

Thanks,
Shachar

Forgot to mention: we found a correlation between this error and the MiddleManagers' CPU usage.

It happens only when a MiddleManager reaches high CPU usage (95%-100%).

Our MiddleManagers are heavily loaded for some reason, even though we run only 6 processes on each of them.

Any help will be appreciated.

Shachar

Still having this issue.

Would appreciate any kind of help regarding this issue.

Hi Shachar,

I’m looking at this issue. Would you please tell me where that task tried to read input files from?

Jihoon

Hi Jihoon,

The task tried to read its input from S3.

Thanks for your help,

Shachar

We’re also facing this issue.