Indexing task not shutting down on time

In my Druid setup there are 2 hosts running MiddleManagers, each with a capacity of 4, and the Overlord runs on one of these boxes.
On one of these MiddleManagers I am observing that the indexing tasks never get shut down, even 10 hours past their shutdown time.

From the task logs I can observe that the task started the “merge and persist” routine at 14:31 (where 14:30 was the task shutdown time):

2015-08-10T14:31:04,947 INFO [mergedTest1-2015-08-10T13:30:00.000Z-persist-n-merge] io.druid.indexing.common.actions.RemoteTaskActionClient - Performing action for task[index_realtime_mergedTest1_2015-08-10T13:30:00.000Z_0_0]: SegmentInsertAction.....

And then the rest of the log lines look like this:

2015-08-10T14:32:00,004 INFO [mergedTest1-overseer-0] io.druid.segment.realtime.plumber.RealtimePlumber - Starting merge and push.
2015-08-10T14:32:00,005 INFO [mergedTest1-overseer-0] io.druid.segment.realtime.plumber.RealtimePlumber - Found [1] sinks. minTimestamp [1970-01-01T00:00:00.000Z]
2015-08-10T14:32:00,005 WARN [mergedTest1-overseer-0] io.druid.segment.realtime.plumber.RealtimePlumber - [2015-08-10T13:30:00.000Z] < [1970-01-01T00:00:00.000Z] Skipping persist and merge.
2015-08-10T14:32:00,005 INFO [mergedTest1-overseer-0] io.druid.segment.realtime.plumber.RealtimePlumber - Found [0] sinks to persist and merge
2015-08-10T14:33:00,005 INFO [mergedTest1-overseer-0] io.druid.segment.realtime.plumber.RealtimePlumber - Starting merge and push.
2015-08-10T14:33:00,005 INFO [mergedTest1-overseer-0] io.druid.segment.realtime.plumber.RealtimePlumber - Found [1] sinks. minTimestamp [1970-01-01T00:00:00.000Z]
2015-08-10T14:33:00,005 WARN [mergedTest1-overseer-0] io.druid.segment.realtime.plumber.RealtimePlumber - [2015-08-10T13:30:00.000Z] < [1970-01-01T00:00:00.000Z] Skipping persist and merge.

With an occasional log line in between mentioning the disappearance of the server:

2015-08-10T14:46:30,204 INFO [ServerInventoryView-0] io.druid.client.BatchServerInventoryView - Server Disappeared[DruidServerMetadata{name='dp2013.XXX.XXX.XXX.XXX:8100', host='dp2013.XXX.XXX.XXX.XXX:8100', maxSize=0, tier='_default_tier', type='realtime', priority='0'}]
2015-08-10T14:46:37,112 INFO [ServerInventoryView-0] io.druid.curator.inventory.CuratorInventoryManager - Created new InventoryCacheListener for /druid/segments/dp2013.XXX.XXX.XXX.XXX:8100
2015-08-10T14:46:37,112 INFO [ServerInventoryView-0] io.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='dp2013.XXX.XXX.XXX.XXX:8100', host='dp2013.XXX.XXX.XXX.XXX:8100', maxSize=0, tier='_default_tier', type='realtime', priority='0'}]

What could be wrong here??

PS: I am using Druid 0.7.3.

Thanks

Rohit

Hi Rohit,

After creating a segment and adding it to the metadata storage, the realtime task waits for the segment to be handed off to a historical node before it shuts down.
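One way to check whether that handoff is actually completing is to ask the coordinator for the load status of your datasources, for example (assuming the coordinator is listening on its default port 8081; substitute your own host):

curl http://<coordinator-host>:8081/druid/coordinator/v1/loadstatus

If mergedTest1 stays below 100% there, the historicals are not picking the segment up and the realtime task will keep waiting.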

Do you have the coordinator and historical nodes running, and is there free capacity on the historical nodes to load segments?

Are there any exceptions in the coordinator/historical logs related to segment loading?
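For the capacity check, the relevant settings are the historical's announced size and its segment cache locations. A rough sketch of the historical runtime.properties entries (the path and sizes below are placeholders, not your actual values):

# Maximum total size of segments this historical announces it can serve
druid.server.maxSize=10000000000
# Local cache locations the historical pulls segments into before serving them
druid.segmentCache.locations=[{"path": "/var/druid/segment-cache", "maxSize": 10000000000}]

If maxSize is 0 or the cache locations are full, the coordinator won't assign the segment and the realtime task will keep waiting for handoff.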

Hello Nishant,
I couldn't find any relevant error/exception in the coordinator logs, but I could find these exceptions in the logs of the historical node:

2015-08-10T14:34:25,152 ERROR [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[mergedTest1_2015-08-10T13:30:00.000Z_2015-08-10T13:31:00.000Z_2015-08-10T13:30:14.148Z], segment=DataSegment{size=1414033, shardSpec=LinearShardSpec{partitionNum=0}, metrics=XXX, version='2015-08-10T13:30:14.148Z', loadSpec={type=local, path=/tmp/druid/localStorage/mergedTest1/2015-08-10T13:30:00.000Z_2015-08-10T13:31:00.000Z/2015-08-10T13:30:14.148Z/0/index.zip}, interval=2015-08-10T13:30:00.000Z/2015-08-10T13:31:00.000Z, dataSource='mergedTest1', binaryVersion='9'}}

io.druid.segment.loading.SegmentLoadingException: Exception loading segment[mergedTest1_2015-08-10T13:30:00.000Z_2015-08-10T13:31:00.000Z_2015-08-10T13:30:14.148Z]
at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:138) ~[druid-server-0.7.3.jar:0.7.3]
at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:163) [druid-server-0.7.3.jar:0.7.3]
at io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:42) [druid-server-0.7.3.jar:0.7.3]
at io.druid.server.coordination.BaseZkCoordinator$1.childEvent(BaseZkCoordinator.java:125) [druid-server-0.7.3.jar:0.7.3]
at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:516) [curator-recipes-2.7.0.jar:?]
at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:510) [curator-recipes-2.7.0.jar:?]
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:92) [curator-framework-2.7.0.jar:?]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [guava-16.0.1.jar:?]
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:83) [curator-framework-2.7.0.jar:?]
at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:507) [curator-recipes-2.7.0.jar:?]
at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-2.7.0.jar:?]
at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:759) [curator-recipes-2.7.0.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_40]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_40]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_40]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_40]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_40]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_40]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_40]
Caused by: java.lang.IllegalArgumentException: Instantiation of [simple type, class io.druid.segment.loading.LocalLoadSpec] value failed: [/tmp/druid/localStorage/mergedTest1/2015-08-10T13:30:00.000Z_2015-08-10T13:31:00.000Z/2015-08-10T13:30:14.148Z/0/index.zip] does not exist
at com.fasterxml.jackson.databind.ObjectMapper._convert(ObjectMapper.java:2774) ~[jackson-databind-2.4.4.jar:2.4.4]
at com.fasterxml.jackson.databind.ObjectMapper.convertValue(ObjectMapper.java:2700) ~[jackson-databind-2.4.4.jar:2.4.4]
at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegmentFiles(SegmentLoaderLocalCacheManager.java:140) ~[druid-server-0.7.3.jar:0.7.3]
at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:93) ~[druid-server-0.7.3.jar:0.7.3]
at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:150) ~[druid-server-0.7.3.jar:0.7.3]
at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:134) ~[druid-server-0.7.3.jar:0.7.3]
... 18 more
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Instantiation of [simple type, class io.druid.segment.loading.LocalLoadSpec] value failed: [/tmp/druid/localStorage/mergedTest1/2015-08-10T13:30:00.000Z_2015-08-10T13:31:00.000Z/2015-08-10T13:30:14.148Z/0/index.zip] does not exist
at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.wrapException(StdValueInstantiator.java:405) ~[jackson-databind-2.4.4.jar:2.4.4]
at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromObjectWith(StdValueInstantiator.java:234) ~[jackson-databind-2.4.4.jar:2.4.4]
at com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.build(PropertyBasedCreator.java:167) ~[jackson-databind-2.4.4.jar:2.4.4]
at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:398) ~[jackson-databind-2.4.4.jar:2.4.4]
at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1064) ~[jackson-databind-2.4.4.jar:2.4.4]
at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:264) ~[jackson-databind-2.4.4.jar:2.4.4]
Caused by: java.lang.IllegalArgumentException: [/tmp/druid/localStorage/mergedTest1/2015-08-10T13:30:00.000Z_2015-08-10T13:31:00.000Z/2015-08-10T13:30:14.148Z/0/index.zip] does not exist
at com.google.api.client.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:119) ~[google-http-client-1.15.0-rc.jar:?]
at com.google.api.client.util.Preconditions.checkArgument(Preconditions.java:69) ~[google-http-client-1.15.0-rc.jar:?]
at io.druid.segment.loading.LocalLoadSpec.&lt;init&gt;(LocalLoadSpec.java:49) ~[druid-server-0.7.3.jar:0.7.3]
at sun.reflect.GeneratedConstructorAccessor41.newInstance(Unknown Source) ~[?:?]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_40]
at java.lang.reflect.Constructor.newInstance(Constructor.java:422) ~[?:1.8.0_40]
at com.fasterxml.jackson.databind.introspect.AnnotatedConstructor.call(AnnotatedConstructor.java:125) ~[jackson-databind-2.4.4.jar:0.7.3]
at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromObjectWith(StdValueInstantiator.java:230) ~[jackson-databind-2.4.4.jar:2.4.4]
at com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.build(PropertyBasedCreator.java:167) ~[jackson-databind-2.4.4.jar:2.4.4]

I am not sure if the indexing task's erratic behaviour is causing these exceptions on the historical node, or vice versa.

Thanks

Rohit

Ah, it seems you are using local disk as deep storage while the MiddleManager and historical nodes are on different machines, which is causing this.
FWIW, when running on more than one host, the deep storage needs to be shared across Druid nodes.

You can find more info on supported deep storage options and how to configure them here:

http://druid.io/docs/latest/dependencies/deep-storage.html
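For example, you could either point local deep storage at a directory that every node shares (e.g. an NFS mount), or switch to HDFS/S3. A rough sketch of the common runtime properties (the paths, hostname and extension version below are placeholders to adapt, not a drop-in config):

# Option 1: keep "local" deep storage, but on a directory shared by all nodes
druid.storage.type=local
druid.storage.storageDirectory=/mnt/shared/druid/deep-storage

# Option 2: HDFS deep storage (needs the druid-hdfs-storage extension)
druid.extensions.coordinates=["io.druid.extensions:druid-hdfs-storage:0.7.3"]
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode:8020/druid/deep-storage

With plain local disk, each peon writes its index.zip under its own /tmp, which is why a historical on another box cannot find the file.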

Nishant,
Yes, I am using local disk as deep storage.

So my current setup is like this:

Node 1: Coordinator, Historical Node, MiddleManager, Broker

Node 2: MiddleManager

The issue I reported was seen on the tasks running on Node 1, which is co-located with the historical node and the deep storage, whereas tasks running on Node 2 seem to run fine. Hence I am trying to understand how things would have worked for Node 2 when it doesn't have access to the deep storage.

I was under the assumption that Peons would communicate with the historical nodes via RPC and hence would not need shared storage with them.

Thanks for your help :)

Hi Rohit,
See inline.