Historical node fails to load segments from S3

Hi,

After upgrading to 0.7.1.1, I’m seeing a weird, non-descriptive exception on the historical node, on a completely new cluster (no old data). Basically, the node fails to load segments from S3. See the stack trace below. I’ve verified that the files exist on S3, access rights are correct, etc. The realtime node is able to write the files on S3.

The error messages are not helping: “io.druid.segment.loading.SegmentLoadingException: S3 fail!” and “org.jets3t.service.S3ServiceException: Request Error.” I took a look at jets3t code where the error gets generated and as far as I can tell, the error message is empty if S3 doesn’t return an XML in the body (so, an empty response?).

If any has any ideas how to debug and look further into this, help is much appreciated!

2015-04-16T15:57:45,869 ERROR [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[clicks2_2015-04-16T15:55:00.000Z_2015-04-16T15:56:00.000Z_2015-04-16T15:55:00.000Z], segment=DataSegment{size=30890, shardSpec=NoneShardSpec, metrics=[count], dimensions=[omitted], version=‘2015-04-16T15:55:00.000Z’, loadSpec={type=s3_zip, bucket=druid-test1, key=clicks2/2015-04-16T15:55:00.000Z_2015-04-16T15:56:00.000Z/2015-04-16T15:55:00.000Z/0/index.zip}, interval=2015-04-16T15:55:00.000Z/2015-04-16T15:56:00.000Z, dataSource=‘clicks2’, binaryVersion=‘9’}}
io.druid.segment.loading.SegmentLoadingException: Exception loading segment[clicks2_2015-04-16T15:55:00.000Z_2015-04-16T15:56:00.000Z_2015-04-16T15:55:00.000Z]
at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:138) ~[druid-server-0.7.1.1.jar:0.7.1.1]
at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:163) [druid-server-0.7.1.1.jar:0.7.1.1]
at io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:42) [druid-server-0.7.1.1.jar:0.7.1.1]
at io.druid.server.coordination.BaseZkCoordinator$1.childEvent(BaseZkCoordinator.java:125) [druid-server-0.7.1.1.jar:0.7.1.1]
at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:516) [curator-recipes-2.7.0.jar:?]
at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:510) [curator-recipes-2.7.0.jar:?]
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:92) [curator-framework-2.7.0.jar:?]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [guava-16.0.1.jar:?]
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:83) [curator-framework-2.7.0.jar:?]
at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:507) [curator-recipes-2.7.0.jar:?]
at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-2.7.0.jar:?]
at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:759) [curator-recipes-2.7.0.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [?:1.7.0_25]
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [?:1.7.0_25]
at java.util.concurrent.FutureTask.run(FutureTask.java:166) [?:1.7.0_25]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [?:1.7.0_25]
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [?:1.7.0_25]
at java.util.concurrent.FutureTask.run(FutureTask.java:166) [?:1.7.0_25]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_25]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_25]
at java.lang.Thread.run(Thread.java:724) [?:1.7.0_25]
Caused by: io.druid.segment.loading.SegmentLoadingException: S3 fail! Key[s3://druid-test1/clicks2/2015-04-16T15:55:00.000Z_2015-04-16T15:56:00.000Z/2015-04-16T15:55:00.000Z/0/index.zip]
at io.druid.storage.s3.S3DataSegmentPuller.isObjectInBucket(S3DataSegmentPuller.java:161) ~[?:?]
at io.druid.storage.s3.S3DataSegmentPuller.getSegmentFiles(S3DataSegmentPuller.java:69) ~[?:?]
at io.druid.segment.loading.OmniSegmentLoader.getSegmentFiles(OmniSegmentLoader.java:137) ~[druid-server-0.7.1.1.jar:0.7.1.1]
at io.druid.segment.loading.OmniSegmentLoader.getSegment(OmniSegmentLoader.java:93) ~[druid-server-0.7.1.1.jar:0.7.1.1]
at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:150) ~[druid-server-0.7.1.1.jar:0.7.1.1]
at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:134) ~[druid-server-0.7.1.1.jar:0.7.1.1]
… 20 more
Caused by: org.jets3t.service.S3ServiceException: Request Error.
at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1775) ~[jets3t-0.9.3.jar:0.9.3]
at io.druid.storage.s3.S3Utils.isObjectInBucket(S3Utils.java:84) ~[?:?]
at io.druid.storage.s3.S3DataSegmentPuller$2.call(S3DataSegmentPuller.java:155) ~[?:?]
at io.druid.storage.s3.S3DataSegmentPuller$2.call(S3DataSegmentPuller.java:151) ~[?:?]
at com.metamx.common.RetryUtils.retry(RetryUtils.java:38) ~[java-util-0.26.15.jar:?]
at io.druid.storage.s3.S3Utils.retryS3Operation(S3Utils.java:77) ~[?:?]
at io.druid.storage.s3.S3DataSegmentPuller.isObjectInBucket(S3DataSegmentPuller.java:149) ~[?:?]
at io.druid.storage.s3.S3DataSegmentPuller.getSegmentFiles(S3DataSegmentPuller.java:69) ~[?:?]
at io.druid.segment.loading.OmniSegmentLoader.getSegmentFiles(OmniSegmentLoader.java:137) ~[druid-server-0.7.1.1.jar:0.7.1.1]
at io.druid.segment.loading.OmniSegmentLoader.getSegment(OmniSegmentLoader.java:93) ~[druid-server-0.7.1.1.jar:0.7.1.1]
at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:150) ~[druid-server-0.7.1.1.jar:0.7.1.1]
at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:134) ~[druid-server-0.7.1.1.jar:0.7.1.1]
… 20 more
Caused by: org.jets3t.service.impl.rest.HttpException
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:423) ~[jets3t-0.9.3.jar:0.9.3]
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:277) ~[jets3t-0.9.3.jar:0.9.3]
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:1038) ~[jets3t-0.9.3.jar:0.9.3]
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2250) ~[jets3t-0.9.3.jar:0.9.3]
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2179) ~[jets3t-0.9.3.jar:0.9.3]
at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:2574) ~[jets3t-0.9.3.jar:0.9.3]
at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1773) ~[jets3t-0.9.3.jar:0.9.3]
at io.druid.storage.s3.S3Utils.isObjectInBucket(S3Utils.java:84) ~[?:?]
at io.druid.storage.s3.S3DataSegmentPuller$2.call(S3DataSegmentPuller.java:155) ~[?:?]
at io.druid.storage.s3.S3DataSegmentPuller$2.call(S3DataSegmentPuller.java:151) ~[?:?]
at com.metamx.common.RetryUtils.retry(RetryUtils.java:38) ~[java-util-0.26.15.jar:?]
at io.druid.storage.s3.S3Utils.retryS3Operation(S3Utils.java:77) ~[?:?]
at io.druid.storage.s3.S3DataSegmentPuller.isObjectInBucket(S3DataSegmentPuller.java:149) ~[?:?]
at io.druid.storage.s3.S3DataSegmentPuller.getSegmentFiles(S3DataSegmentPuller.java:69) ~[?:?]
at io.druid.segment.loading.OmniSegmentLoader.getSegmentFiles(OmniSegmentLoader.java:137) ~[druid-server-0.7.1.1.jar:0.7.1.1]
at io.druid.segment.loading.OmniSegmentLoader.getSegment(OmniSegmentLoader.java:93) ~[druid-server-0.7.1.1.jar:0.7.1.1]
at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:150) ~[druid-server-0.7.1.1.jar:0.7.1.1]
at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:134) ~[druid-server-0.7.1.1.jar:0.7.1.1]

	... 20 more

The actual exception is buried in the stack trace:

Caused by: io.druid.segment.loading.SegmentLoadingException: S3 fail! Key[s3://druid-test1/clicks2/2015-04-16T15:55:00.000Z_2015-04-16T15:56:00.000Z/2015-04-16T15:55:00.000Z/0/index.zip]

A few questions:

  1. Do you see this error constantly or is it transient? S3 is sometimes flaky

  2. If the error is constant, does it always fail for the same segment or many different segments?

  3. If the error fails for the same segment, does the segment in question exist in s3 at the location provided?

Hi Fangjing,

Thanks for the reply.

  1. Yes. The error is constant and happens on a fresh cluster with no data as well as on a cluster with segments indexed and stored (on S3)

  2. It fails on all segments (ie. multiple segments indexed & stored, fails on all of them)

  3. The segments exist in S3

Hi, is it possible that the keys you are using only have upload but not download access?

Hi,

I’ve been going through the setup, testing stuff and I’ve managed to get the historical node working again. How I got it working was to change the S3 bucket location from Frankfurt to US Standard.

I don’t understand why but from what I’ve seen from various sources and error messages I’ve come across, there seems to be a different authentication mechanism between the locations. I saw and error that said one should use “AWS4-HMAC-SHA256” for authentication when trying to access the buckets/objects in Frankfurt. I still don’t quite understand what the difference between the two regions are and why the difference is causing such an error. Any insights to this?

That all being said, our production data will need to be stored in Frankfurt (other locations are not an option for us).

This has been fixed already and will be part of the next release. It’s due to Amazon having different requirements for newer locations.

If you simply rebuild Druid with the latest version of jets3t it should work in all locations.

Xavier

Thanks Xavier and Fangjin. Will try with snapshot.

Is there an issue on Github to refer to?

I believe Xavier was referring to this issue: https://github.com/druid-io/druid/pull/1205