Tasks fail with SocketTimeoutException

Hi,

I’m running Druid 0.13, and every now and then tasks fail with a SocketTimeoutException.

It only happens to tasks we index using EMR, it affects around 2% of our tasks, and the issue started a few days ago.

The tasks never actually start on EMR; they fail beforehand. It looks like the indexing task fails to submit the job to EMR and gets a timeout, even though everything is accessible and other tasks start successfully.

2019-06-15T09:01:17,322 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.HadoopIndexTask - Encountered exception in HadoopIndexGeneratorInnerProcessing.
java.lang.RuntimeException: java.net.SocketTimeoutException: Read timed out
        at org.apache.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:226) ~[druid-indexing-hadoop-0.13.0-incubating.jar:0.13.0-incubating]
        at org.apache.druid.indexer.JobHelper.runJobs(JobHelper.java:376) ~[druid-indexing-hadoop-0.13.0-incubating.jar:0.13.0-incubating]
        at org.apache.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:96) ~[druid-indexing-hadoop-0.13.0-incubating.jar:0.13.0-incubating]
        at org.apache.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessingRunner.runTask(HadoopIndexTask.java:612) [druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_212]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_212]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_212]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_212]
        at org.apache.druid.indexing.common.task.HadoopIndexTask.runInternal(HadoopIndexTask.java:380) [druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
        at org.apache.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:232) [druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
        at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:421) [druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
        at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:393) [druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_212]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_212]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_212]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_212]
        at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) ~[?:1.8.0_212]
        at sun.security.ssl.InputRecord.read(InputRecord.java:503) ~[?:1.8.0_212]
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975) ~[?:1.8.0_212]
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367) ~[?:1.8.0_212]
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395) ~[?:1.8.0_212]
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379) ~[?:1.8.0_212]
        at org.apache.http.conn.ssl.SSLSocketFactory.createLayeredSocket(SSLSocketFactory.java:573) ~[httpclient-4.5.3.jar:4.5.3]
        at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:557) ~[httpclient-4.5.3.jar:4.5.3]
        at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:414) ~[httpclient-4.5.3.jar:4.5.3]
        at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180) ~[httpclient-4.5.3.jar:4.5.3]
        at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144) ~[httpclient-4.5.3.jar:4.5.3]
        at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:134) ~[httpclient-4.5.3.jar:4.5.3]
        at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610) ~[httpclient-4.5.3.jar:4.5.3]
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445) ~[httpclient-4.5.3.jar:4.5.3]
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835) ~[httpclient-4.5.3.jar:4.5.3]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.3.jar:4.5.3]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.3.jar:4.5.3]
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334) ~[jets3t-0.9.0.jar:?]
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281) ~[jets3t-0.9.0.jar:?]
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:942) ~[jets3t-0.9.0.jar:?]
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2148) ~[jets3t-0.9.0.jar:?]
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2075) ~[jets3t-0.9.0.jar:?]
        at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1093) ~[jets3t-0.9.0.jar:?]
        at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:548) ~[jets3t-0.9.0.jar:?]
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174) ~[hadoop-aws-2.7.3.jar:?]
        at sun.reflect.GeneratedMethodAccessor55.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_212]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_212]
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) ~[hadoop-common-2.7.3.jar:?]
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) ~[hadoop-common-2.7.3.jar:?]
        at org.apache.hadoop.fs.s3native.$Proxy213.retrieveMetadata(Unknown Source) ~[?:?]
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476) ~[hadoop-aws-2.7.3.jar:?]
        at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57) ~[hadoop-common-2.7.3.jar:?]
        at org.apache.hadoop.fs.Globber.glob(Globber.java:252) ~[hadoop-common-2.7.3.jar:?]
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676) ~[hadoop-common-2.7.3.jar:?]
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]
        at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]
        at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_212]
        at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_212]
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) ~[hadoop-common-2.7.3.jar:?]
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]
        at org.apache.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:206) ~[druid-indexing-hadoop-0.13.0-incubating.jar:0.13.0-incubating]
        ... 15 more


I tried changing the jobProperties to increase the client timeout, but it didn’t help.
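For example, I bumped the Hadoop-side S3 retry settings in the tuningConfig along these lines (the values are illustrative; as far as I can tell the jets3t socket timeout itself is read from a `jets3t.properties` file on the classpath, e.g. `httpclient.socket-timeout-ms`, rather than from Hadoop configuration):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "fs.s3.maxRetries": "10",
      "fs.s3.sleepTimeSeconds": "30"
    }
  }
}
```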

Any suggestions on how to resolve this issue would be appreciated.

Thanks,

Shachar

Hey Shachar,

The stack trace is in “RestStorageService” code, which is talking to S3. It is probably some transient flakiness. You might find that things are more reliable if you switch to s3a instead of s3n. (s3a uses the official Amazon S3 SDK rather than jets3t.)
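As a rough sketch, switching would mean changing the path scheme in the inputSpec and pointing Hadoop at the s3a filesystem via jobProperties, something like the following (your credential setup may differ — on EMR you’d normally rely on instance roles rather than keys, and the key values here are placeholders):

```json
{
  "ioConfig": {
    "type": "hadoop",
    "inputSpec": {
      "type": "static",
      "paths": "s3a://your-bucket/path/to/data/*.gz"
    }
  },
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.s3a.access.key": "YOUR_ACCESS_KEY",
      "fs.s3a.secret.key": "YOUR_SECRET_KEY"
    }
  }
}
```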

Hi Gian,

Thanks for your help!

Unfortunately, EMR does not support s3a.

Best regards,

Shachar