Failure in loading datasource from S3

Hi,

We have a Druid setup that reads data from S3 when an indexing job is executed.
It is unclear to me where the issue is happening. The error hints that it fails to read or find the S3 file, but I have reconfirmed that the file exists.

Any pointers?

-Vishal

INFO 2015-04-23 00:17:29,404 [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator: Completed request [LOAD: ignite_nmfnl_cube_t35_2015-01-11T00:00:00.000Z_2015-01-12T00:00:00.000Z_2015-04-23T00:06:33.062Z]

ERROR 2015-04-23 00:17:29,404 [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator: Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[ignite_nmfnl_cube_t35_2015-01-11T00:00:00.000Z_2015-01-12T00:00:00.000Z_2015-04-23T00:06:33.062Z], segment=DataSegment{size=420996, shardSpec=NoneShardSpec, metrics=[allocation_cnt, registration_cnt, provided_mop_cnt, completed_signup_cnt, p1_possibly_completed_cnt, p2_possibly_paid_cnt, vol_cancel_cnt, p2_vol_cancel_cnt, invol_cancel_cnt, p2_invol_cancel_cnt, current_member_cnt, retained_cnt, realized_revenue, realized_squared_revenue, active_subscrn_revenue, potential_subscrn_revenue, streamed_gt_0_hrs_allocation_cnt, streamed_ge_1_hrs_allocation_cnt, streamed_ge_5_hrs_allocation_cnt, streamed_ge_10_hrs_allocation_cnt, streamed_ge_20_hrs_allocation_cnt, streamed_ge_40_hrs_allocation_cnt, streamed_ge_80_hrs_allocation_cnt], dimensions=[is_browser, is_other_exclusion, has_completed_signup, mop_custom_category_id, is_fraud, is_former_member, subregion_sk, is_p1_possibly_completed, test_id, current_price_plan_id, custom_filter_1, custom_filter_2, custom_filter_3, signup_price_plan_id, is_signup_price_plan_migrated, is_follow_redirect, test_cell_nbr, promo_custom_category_id, allocation_device_sk], version='2015-04-23T00:06:33.062Z', loadSpec={type=s3_zip, bucket=netflix-dataoven-prod-users, key=druid/datasource/ignite_nmfnl_cube_t35/ignite_nmfnl_cube_t35/20150111T000000.000Z_20150112T000000.000Z/2015-04-23T00_06_33.062Z/0/index.zip}, interval=2015-01-11T00:00:00.000Z/2015-01-12T00:00:00.000Z, dataSource='ignite_nmfnl_cube_t35', binaryVersion='9'}}

io.druid.segment.loading.SegmentLoadingException: Exception loading segment[ignite_nmfnl_cube_t35_2015-01-11T00:00:00.000Z_2015-01-12T00:00:00.000Z_2015-04-23T00:06:33.062Z]

at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:138)

at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:163)

at io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:42)

at io.druid.server.coordination.BaseZkCoordinator$1.childEvent(BaseZkCoordinator.java:125)

at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:516)

at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:510)

at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:92)

at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)

at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:83)

at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:507)

at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35)

at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:759)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)

at java.util.concurrent.FutureTask.run(FutureTask.java:262)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)

at java.util.concurrent.FutureTask.run(FutureTask.java:262)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

Caused by: io.druid.segment.loading.SegmentLoadingException: S3 fail! Key[s3://netflix-dataoven-prod-users/druid/datasource/ignite_nmfnl_cube_t35/ignite_nmfnl_cube_t35/20150111T000000.000Z_20150112T000000.000Z/2015-04-23T00_06_33.062Z/0/index.zip]

at io.druid.storage.s3.S3DataSegmentPuller.isObjectInBucket(S3DataSegmentPuller.java:162)

at io.druid.storage.s3.S3DataSegmentPuller.getSegmentFiles(S3DataSegmentPuller.java:70)

at io.druid.segment.loading.OmniSegmentLoader.getSegmentFiles(OmniSegmentLoader.java:137)

at io.druid.segment.loading.OmniSegmentLoader.getSegment(OmniSegmentLoader.java:93)

at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:143)

at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:134)

... 18 more

Caused by: org.jets3t.service.S3ServiceException: Request Error. -- ResponseCode: 400, ResponseStatus: Bad Request

at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1630)

at io.druid.storage.s3.S3Utils.isObjectInBucket(S3Utils.java:84)

at io.druid.storage.s3.S3DataSegmentPuller$2.call(S3DataSegmentPuller.java:156)

at io.druid.storage.s3.S3DataSegmentPuller$2.call(S3DataSegmentPuller.java:152)

at com.metamx.common.RetryUtils.retry(RetryUtils.java:22)

at io.druid.storage.s3.S3Utils.retryS3Operation(S3Utils.java:77)

at io.druid.storage.s3.S3DataSegmentPuller.isObjectInBucket(S3DataSegmentPuller.java:150)

... 23 more

Caused by: org.jets3t.service.impl.rest.HttpException

at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:415)

at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:264)

at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:912)

at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2107)

at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2036)

at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:2429)

at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1628)

... 29 more

INFO 2015-04-23 00:17:29,404 [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator: Node[/druid/loadQueue/ec2-23-20-225-195.compute-1.amazonaws.com:7103/ignite_nmfnl_cube_t35_2015-01-11T00:00:00.000Z_2015-01-12T00:00:00.000Z_2015-04-23T00:06:33.062Z] was removed

Hi Vishal, are you sure the keys you are using have both read and write access, and not just write access?

The keys are not specifically set for these buckets, so given that some other jobs are running fine, I'm assuming this one should also have the correct key.
I'm checking further.

Hi,

The permissions on that S3 file seem to be OK.

Where in Druid do I see the key being used for access?

-Vishal

If you are on 0.7.x, the S3 credentials are in common.runtime.properties.

If you are on 0.6.x, they should be in the runtime.properties associated with the historical.
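For reference, the S3 settings in that file typically look something like the lines below. This is just a minimal sketch with placeholder values (the bucket, base key, and credentials here are made up, not taken from your setup):

# placeholder values -- substitute your own bucket, base key, and credentials
druid.storage.type=s3
druid.storage.bucket=your-segment-bucket
druid.storage.baseKey=druid/segments
druid.s3.accessKey=YOUR_ACCESS_KEY
druid.s3.secretKey=YOUR_SECRET_KEY

The historicals use druid.s3.accessKey/druid.s3.secretKey when pulling the index.zip for a segment, so that is the key to double-check against the bucket in your error.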

BTW, is the error occurring for all recently generated segments or just one in particular?

Hi Fangjin,

Looks like some of the historical nodes are not updating the key; some do have the right key.

It seems like awscredentialsprovider.java isn't running and hence isn't updating the key.

Do you know how I can restart/resync the key using the above class? What runs it?

-Vishal

Are you guys providing the S3 credentials via a file? If you are setting druid.s3.fileSessionCredentials, this is probably the case. I'm not incredibly familiar with this code, as some other folks at Netflix originally contributed it. If the credentials in the file seem correct, perhaps the sync is slow or there is a bug in the logic. Restarting a historical node should provide the most immediate workaround, although some further investigation seems warranted for this case.
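In case it helps: as far as I remember, the file referenced by druid.s3.fileSessionCredentials is a plain Java properties file that the credentials provider re-reads periodically, roughly along these lines (the property names are from memory and the values are placeholders, so treat this as a sketch rather than the exact format in your version):

# hypothetical session-credentials file; names and values are placeholders
accessKey=TEMPORARY_ACCESS_KEY
secretKey=TEMPORARY_SECRET_KEY
sessionToken=TEMPORARY_SESSION_TOKEN

If whatever process refreshes that file stops, or the provider stops re-reading it, the historicals keep using a stale session token, and an expired token is one common way to end up with the 400 Bad Request you see on the HEAD request in your stack trace.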

We restarted Tomcat on that node, which refreshed the key. Things are looking fine now.

I'll check more on why it stopped and whether it retries.

Thanks for your help.

-Vishal

Great to hear things got resolved!