[druid-user] Re: the middleManager realtime task java process not stop

Hey Jianran,

This might be a capacity issue. Make sure the sum of druid.worker.capacity across your middleManagers is at least 2 * #partitions * #replicas (see here: https://github.com/druid-io/tranquility/blob/master/docs/overview.md#druid-setup). If it's lower than this, tasks will block and not be able to start, yielding the kind of errors you're seeing.
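
For example (hypothetical numbers, purely to illustrate the arithmetic): with 2 partitions and 2 replicas you need 2 * 2 * 2 = 8 task slots in total, which you could cover with, say, two middleManagers each configured like this in runtime.properties:

druid.worker.capacity=4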

My configuration is as follows. druid.worker.capacity is greater than 2 * #partitions * #replicas, but the tasks still wouldn't stop (the check is worked out after the BeamFactory below).

middleManager runtime.properties

druid.worker.capacity=9

BeamFactory

DruidBeams
  .builder((openOrderDO: OpenOrderDO) => openOrderDO.timestamp)
  .curator(curator)
  .discoveryPath(discoveryPath)
  .location(DruidLocation(DruidEnvironment(indexService), dataSource))
  .rollup(DruidRollup(SpecificDruidDimensions(dimensions), aggregators, QueryGranularity.MINUTE))
  .tuning(
    ClusteredBeamTuning(
      segmentGranularity = Granularity.HOUR,
      windowPeriod = new Period("PT10M"),
      partitions = 1,
      replicants = 1
    )
  )
  .buildBeam()
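
With the tuning above, the minimum from that formula is 2 * partitions (1) * replicants (1) = 2 task slots, so druid.worker.capacity=9 on a single middleManager is already well above it.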

On Tuesday, July 19, 2016 at 1:16:53 AM UTC+8, Gian Merlino wrote:

Ah, okay. If your tasks are never exiting, then what’s probably going on is handoff is not working. Check out this doc for some tips: https://github.com/druid-io/tranquility/blob/master/docs/trouble.md#my-tasks-are-never-exiting

Try double-checking that your coordinator and historicals are running properly. The coordinator log might have some hints about what’s going on.
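
One quick way to see whether handoff targets are actually being loaded is the coordinator's load-status endpoint (a sketch only; the host below is a placeholder for your coordinator, and 8081 is its default port):

import scala.io.Source

// Returns, per datasource, the percentage of used segments that historicals have loaded;
// a datasource stuck below 100.0 suggests handoff is not completing.
val loadStatus = Source.fromURL("http://coordinator-host:8081/druid/coordinator/v1/loadstatus").mkString
println(loadStatus)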

Does the following log mean the coordinator isn't running properly?

From logs/coordinator.log:

2016-07-19T02:05:10,301 WARN [Coordinator-Exec--0] io.druid.server.coordinator.rules.LoadRule - Not enough [_default_tier] servers or node capacity to assign segment[openOrder_2016-07-16T16:00:00.000Z_2016-07-16T17:00:00.000Z_2016-07-17T00:01:10.743+08:00]! Expected Replicants[2]

2016-07-19T02:05:10,302 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[xxx.xxx.xxx.xxx:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 13,506,318 bytes served.

On Tuesday, July 19, 2016 at 9:00:50 AM UTC+8, Gian Merlino wrote:

Hi there,

I ran into a similar situation over the weekend. I was running the Kafka indexing service with an ingestion spec for 2 different topics when my peons started running into OOM errors (one of the ingestion tasks had been running fine for a couple of weeks; the other had had OOM problems that I thought I had resolved on Friday).

When I came in on Monday, the web console showed a single task running, but when I clicked on the log link it said it couldn't find the log file. Looking at the host, I had a dozen java processes each taking up 20 GB+ of virtual memory but not using any CPU. I restarted the middle manager, coordinator, overlord, and historical nodes, but the processes didn't go away, so I ended up killing them manually.

Now I am in a state where the coordinator is giving me the same error as below; however, according to the web console I have only 1 task running (but no data is being indexed and it doesn't have a log file) and 8 slots free with no pending tasks.

Additionally in the middle manager log I see a few of these exceptions:

com.fasterxml.jackson.databind.JsonMappingException: Could not resolve type id 'index_kafka' into a subtype of [simple type, class io.druid.indexing.common.task.Task]

at [Source: [B@6e841513; line: 1, column: 2]

    at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:148) ~[jackson-databind-2.4.6.jar:2.4.6]

    at com.fasterxml.jackson.databind.DeserializationContext.unknownTypeException(DeserializationContext.java:862) ~[jackson-databind-2.4.6.jar:2.4.6]

    at com.fasterxml.jackson.databind.jsontype.impl.TypeDeserializerBase._findDeserializer(TypeDeserializerBase.java:167) ~[jackson-databind-2.4.6.jar:2.4.6]

    at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer._deserializeTypedForId(AsPropertyTypeDeserializer.java:99) ~[jackson-databind-2.4.6.jar:2.4.6]

    at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer.deserializeTypedFromObject(AsPropertyTypeDeserializer.java:84) ~[jackson-databind-2.4.6.jar:2.4.6]

    at com.fasterxml.jackson.databind.deser.AbstractDeserializer.deserializeWithType(AbstractDeserializer.java:132) ~[jackson-databind-2.4.6.jar:2.4.6]

    at com.fasterxml.jackson.databind.deser.impl.TypeWrappedDeserializer.deserialize(TypeWrappedDeserializer.java:41) ~[jackson-databind-2.4.6.jar:2.4.6]

    at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3066) ~[jackson-databind-2.4.6.jar:2.4.6]

    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2230) ~[jackson-databind-2.4.6.jar:2.4.6]

    at io.druid.indexing.worker.WorkerTaskMonitor$1.childEvent(WorkerTaskMonitor.java:121) ~[druid-indexing-service-0.9.1-SNAPSHOT.jar:0.9.1-SNAPSHOT]

    at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:518) [curator-recipes-2.9.1.jar:?]

    at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:512) [curator-recipes-2.9.1.jar:?]

    at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:92) [curator-framework-2.9.1.jar:?]

    at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [guava-16.0.1.jar:?]

    at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:83) [curator-framework-2.9.1.jar:?]

    at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:509) [curator-recipes-2.9.1.jar:?]

    at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-2.9.1.jar:?]

    at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:766) [curator-recipes-2.9.1.jar:?]

    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [?:1.7.0_101]

    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_101]

    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [?:1.7.0_101]

    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_101]

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_101]

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_101]

    at java.lang.Thread.run(Thread.java:745) [?:1.7.0_101]

–Ben

Hi Ben, that error points to the fact that your middle managers may not have loaded the correct list of extensions. Are you sure you have the same common config across all your nodes?
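
For reference, the 'index_kafka' task type comes from the Kafka indexing service extension, so the middle managers need it in their extension load list. A minimal sketch of the relevant line in common.runtime.properties (shared by every node, including middleManagers; keep whatever other extensions you already load in the list):

druid.extensions.loadList=["druid-kafka-indexing-service"]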