[0.9.1.1] S3 read timeouts during batch ingestion tasks

Hello,

In our batch ingestions we experience read timeouts when communicating with S3. This is pretty much the only way we see ingestion tasks fail.

My understanding is that some timeouts are to be expected when communicating with S3. Maybe the peons could do a better job handling them, but we can tolerate some low rate of failure - we have a daemon that watches for failed tasks and restarts them.
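
Conceptually the daemon is nothing fancy; roughly it does something like the hypothetical sketch below. The overlord endpoint paths and response field names here are assumptions from memory, so verify them against your Druid version before taking this literally.

// Hypothetical sketch only: submit index task specs, poll the overlord's task
// API, and resubmit any spec whose task comes back FAILED. Endpoint paths and
// response field names are assumptions - check them against your version.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TaskBabysitter
{
  private static final String OVERLORD = "http://overlord.example.com:8090"; // placeholder host
  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static void main(String[] args) throws Exception
  {
    // args[0] is a directory of index task spec JSON files we want run to completion.
    for (Path spec : Files.newDirectoryStream(Paths.get(args[0]))) {
      String taskJson = new String(Files.readAllBytes(spec), StandardCharsets.UTF_8);
      String taskId = submit(taskJson);
      while (true) {
        Thread.sleep(60_000);
        String status = status(taskId);
        if ("SUCCESS".equals(status)) {
          break;
        } else if ("FAILED".equals(status)) {
          taskId = submit(taskJson); // blind restart on failure
        }
      }
    }
  }

  private static String submit(String taskJson) throws Exception
  {
    HttpURLConnection conn =
        (HttpURLConnection) new URL(OVERLORD + "/druid/indexer/v1/task").openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/json");
    try (OutputStream out = conn.getOutputStream()) {
      out.write(taskJson.getBytes(StandardCharsets.UTF_8));
    }
    // The overlord replies with something like {"task":"<taskId>"} (assumed field name).
    return MAPPER.readTree(conn.getInputStream()).get("task").asText();
  }

  private static String status(String taskId) throws Exception
  {
    URL url = new URL(OVERLORD + "/druid/indexer/v1/task/" + taskId + "/status");
    JsonNode resp = MAPPER.readTree(url.openStream());
    // Assumed layout: {"task":"<id>","status":{"status":"RUNNING|SUCCESS|FAILED",...}}
    return resp.get("status").get("status").asText();
  }
}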

However, we’re trying to scale up our indexer capacity to deal with backfill situations, and observationally it appears that as we increase the number of peons running on a single VM, the rate of failures due to S3 read timeouts also increases. I can’t say this for sure or give you hard numbers - at this point it’s just “anecdata” - but, for example, I’ve been experimenting with running 20 peons on a 40-CPU m4.10xlarge, and out of 27 tasks I’ve had 14 failures.

Is there anything that can be done to reduce these failures? If not, does anyone have a sense of what a “healthy” number of workers to run in parallel would be?
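
One thing I’ve been looking at but haven’t verified yet is tuning the jets3t client itself, since as far as I can tell the S3 firehose goes through jets3t, and jets3t will pick up a jets3t.properties file from the classpath. Something along these lines - the property names are from the jets3t configuration docs as I remember them, and the values are guesses rather than proven fixes:

# jets3t.properties, placed on the peon/middleManager classpath.
# Property names per the jets3t configuration docs (from memory); values are
# examples only, not a verified fix for these timeouts.

# per-request socket read timeout (default 60000 ms, I believe)
httpclient.socket-timeout-ms=120000

# connection establishment timeout (default 60000 ms, I believe)
httpclient.connection-timeout-ms=60000

# size of the HTTP connection pool inside each peon JVM (default is around 20)
httpclient.max-connections=50

# how many times jets3t retries retriable request failures (default 5)
httpclient.retry-max=10

If anyone can confirm whether these knobs actually matter here, I’d appreciate it.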

java.lang.IllegalStateException: java.net.SocketTimeoutException: Read timed out
	at org.apache.commons.io.LineIterator.hasNext(LineIterator.java:106) ~[commons-io-2.4.jar:2.4]
	at io.druid.data.input.impl.FileIteratingFirehose.hasMore(FileIteratingFirehose.java:52) ~[druid-api-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.common.task.IndexTask.generateSegment(IndexTask.java:389) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:221) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_60]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_60]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_60]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60]
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_60]
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_60]
	at java.net.SocketInputStream.read(SocketInputStream.java:170) ~[?:1.8.0_60]
	at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_60]
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) ~[?:1.8.0_60]
	at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593) ~[?:1.8.0_60]
	at sun.security.ssl.InputRecord.read(InputRecord.java:532) ~[?:1.8.0_60]
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973) ~[?:1.8.0_60]
	at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930) ~[?:1.8.0_60]
	at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) ~[?:1.8.0_60]
	at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198) ~[httpcore-4.4.3.jar:4.4.3]
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178) ~[httpcore-4.4.3.jar:4.4.3]
	at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137) ~[httpclient-4.5.1.jar:4.5.1]
	at org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:78) ~[jets3t-0.9.4.jar:0.9.4]
	at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:146) ~[jets3t-0.9.4.jar:0.9.4]
	at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_60]
	at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_60]
	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_60]
	at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_60]
	at java.io.BufferedReader.fill(BufferedReader.java:161) ~[?:1.8.0_60]
	at java.io.BufferedReader.readLine(BufferedReader.java:324) ~[?:1.8.0_60]
	at java.io.BufferedReader.readLine(BufferedReader.java:389) ~[?:1.8.0_60]
	at org.apache.commons.io.LineIterator.hasNext(LineIterator.java:95) ~[commons-io-2.4.jar:2.4]
	... 9 more
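
Reading the trace, the timeout hits while the firehose is pulling lines out of the response body (the LineIterator over the jets3t stream), i.e. after the GET itself has already succeeded, which I assume is why jets3t’s own request-level retries don’t save us. If the peon were going to handle this better, I imagine it would be something along the lines of the purely hypothetical sketch below: remember how many bytes were consumed and reopen the object with an HTTP Range header after a read timeout. It’s written against a pre-signed URL so S3 request signing can be left out, and resuming mid-line (or mid-gzip) in the real firehose would obviously be messier.

// Purely hypothetical: resume an interrupted object read from the last good
// byte offset using an HTTP Range request. Assumes a pre-signed URL so S3
// auth can be ignored; a real fix would live inside the firehose / S3 client.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;

public class ResumingReader
{
  public static void copyWithResume(URL presignedUrl, OutputStream sink) throws IOException
  {
    long offset = 0;
    int attempts = 0;
    while (true) {
      HttpURLConnection conn = (HttpURLConnection) presignedUrl.openConnection();
      conn.setReadTimeout(60_000);
      if (offset > 0) {
        // Ask S3 for only the remainder of the object.
        conn.setRequestProperty("Range", "bytes=" + offset + "-");
      }
      try (InputStream in = conn.getInputStream()) {
        byte[] buf = new byte[64 * 1024];
        int n;
        while ((n = in.read(buf)) != -1) {
          sink.write(buf, 0, n);
          offset += n; // remember how far we got
        }
        return; // clean end of object
      } catch (SocketTimeoutException e) {
        if (++attempts > 5) {
          throw e; // give up eventually instead of looping forever
        }
        // otherwise fall through and reopen from `offset`
      }
    }
  }
}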

Bump… we’ve had some success scaling up with small instances running only a single peon each. But even there, it seems that once we cross some threshold of total workers, everything starts failing with S3 timeouts.

We’ve also seen that happen at times when our configuration hasn’t changed at all - we were chugging along with 7 single-peon indexers, then one hour had roughly a 75% failure rate, the next hour a 100% failure rate, and then, without any changes on our side, it went back down to ~5% for a while.

Does anyone have any insight, or a suggestion for a more robust place than S3 to keep our data?

Hey Ryan,