Tasks not distributing to the middlemanager, or possibly failing in the middlemanager

To be fair, I am unsure if that is the problem.

I have the druid components running in containers on a shared network segment. They all can communicate with each other and with zookeeper using their names (broker resolves to the broker IP, zookeeper to zookeeper, overlord to overlord, etc.).

For all nodes, I have druid.logLevel=debug

I am trying the simplest test possible: the curl from the quickstart guide.
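
For reference, the submission is essentially the quickstart command, with the hostname adjusted for my setup (overlord container name assumed):

```
curl -X 'POST' -H 'Content-Type:application/json' \
  -d @quickstart/wikiticker-index.json \
  http://overlord:8090/druid/indexer/v1/task
```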

When I submit it, I see the following overlord logs:

```
overlord_1 | 2017-01-31T23:08:25,022 WARN [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - No worker selections strategy set. Using default.
```

Here is the output from zookeeper and coordinator:

```
coordinator_1 | 2017-01-31T23:18:02,306 WARN [DatabaseSegmentManager-Exec--0] io.druid.metadata.SQLMetadataSegmentManager - No segments found in the database!

zookeeper_1 | 2017-01-31 23:18:06,203 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /192.168.32.16:51482

zookeeper_1 | 2017-01-31 23:18:06,212 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@928] - Client attempting to establish new session at /192.168.32.16:51482

zookeeper_1 | 2017-01-31 23:18:06,231 [myid:] - INFO [SyncThread:0:ZooKeeperServer@673] - Established session 0x159f63188ab0017 with negotiated timeout 30000 for client /192.168.32.16:51482

zookeeper_1 | 2017-01-31 23:18:07,917 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@487] - Processed session termination for sessionid: 0x159f63188ab0017

zookeeper_1 | 2017-01-31 23:18:07,930 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008] - Closed socket connection for client /192.168.32.16:51482 which had sessionid 0x159f63188ab0017

```

192.168.32.16 resolves to the middlemanager.

Nothing is logged to stdout from the middlemanager. However, if I watch it, I see a new java process appear and then disappear quickly.

The task does appear in the overlord console under “Running Tasks”, eventually moving to “Complete Tasks” with a status of FAILED. The payload is correct, but clicking the “Log” link gives:

```
No log was found for this task. The task may not exist, or it may not have begun running yet.
```

I am stumped. How do I figure out which stage is doing what, why it is failing, and where?

One more piece of info: I tried creating the task again while logged into the middlemanager, watching /tmp/druid/persistent/task/restore.json.
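
Roughly like this, from inside the middlemanager container (assuming watch is available in the image):

```
watch -n 1 cat /tmp/druid/persistent/task/restore.json
```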

Normally, the contents of that file are:

```
{"runningTasks":[]}
```

However, for a few seconds after submission (about as long as the second java process lasts), they are:

```
{"runningTasks":["index_hadoop_wikiticker_2017-01-31T23:34:48.189Z"]}
```

So it looks like the task is getting submitted to the middlemanager, which is creating a worker for it, but:

  1. Why does it fail?
  2. How do I find out why it failed?
  3. Why is neither the broker nor the middlemanager reporting these to the stdout logs?

Can you get the task logs?

Where the task logs end up on the file system depends on the value of druid.indexer.logs.type,

and the location can be set via druid.indexer.logs.directory.

Here is the doc about those properties.

Once you find the task logs, you will see why it is failing.
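
Once log persistence is configured, the overlord can also serve a task log over HTTP (I believe this is what the console “Log” link uses), something like:

```
curl http://overlord:8090/druid/indexer/v1/task/index_hadoop_wikiticker_2017-01-31T23:34:48.189Z/log
```

(Task id taken from the restore.json output above; overlord hostname and default port 8090 assumed.)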

Ah. I don’t see that defined anywhere. There is a version for S3 commented out in the basic config. That would explain a lot.

I will set

```
druid.indexer.logs.type=file
druid.indexer.logs.directory=var/druid/indexing-logs
```

Per the example. Is there a logs.type option for stdout?

At least I have data now. :)

```
2017-02-01T19:37:23,807 WARN [main] org.apache.curator.retry.ExponentialBackoffRetry - maxRetries too large (30). Pinning to 29

2017-02-01T19:37:25,480 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_hadoop_wikiticker_2017-02-01T19:37:16.536Z, type=index_hadoop, dataSource=wikiticker}]

com.metamx.common.ISE: Hadoop dependency [/hadoop-dependencies/hadoop-client/2.3.0] didn’t exist!?

at io.druid.initialization.Initialization.getHadoopDependencyFilesToLoad(Initialization.java:249) ~[druid-server-0.9.2.jar:0.9.2]

at io.druid.indexing.common.task.HadoopTask.buildClassLoader(HadoopTask.java:147) ~[druid-indexing-service-0.9.2.jar:0.9.2]

at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:166) ~[druid-indexing-service-0.9.2.jar:0.9.2]

at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.2.jar:0.9.2]

at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.2.jar:0.9.2]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_111-internal]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111-internal]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111-internal]

at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111-internal]

```

and in case you asked:

```
bash-4.3# ls /opt/druid/hadoop-dependencies/hadoop-client/2.3.0/

activation-1.1.jar hadoop-annotations-2.3.0.jar jackson-mapper-asl-1.8.8.jar

avro-1.7.4.jar hadoop-auth-2.3.0.jar jaxb-api-2.2.2.jar

commons-beanutils-1.7.0.jar hadoop-client-2.3.0.jar jersey-core-1.9.jar

commons-beanutils-core-1.8.0.jar hadoop-common-2.3.0.jar jetty-util-6.1.26.jar

commons-cli-1.2.jar hadoop-hdfs-2.3.0.jar jsr305-1.3.9.jar

commons-codec-1.4.jar hadoop-mapreduce-client-app-2.3.0.jar log4j-1.2.17.jar

commons-collections-3.2.1.jar hadoop-mapreduce-client-common-2.3.0.jar paranamer-2.3.jar

commons-compress-1.4.1.jar hadoop-mapreduce-client-core-2.3.0.jar protobuf-java-2.5.0.jar

commons-configuration-1.6.jar hadoop-mapreduce-client-jobclient-2.3.0.jar servlet-api-2.5.jar

commons-digester-1.8.jar hadoop-mapreduce-client-shuffle-2.3.0.jar slf4j-api-1.7.5.jar

commons-httpclient-3.1.jar hadoop-yarn-api-2.3.0.jar slf4j-log4j12-1.7.5.jar

commons-io-2.4.jar hadoop-yarn-client-2.3.0.jar snappy-java-1.0.4.1.jar

commons-lang-2.6.jar hadoop-yarn-common-2.3.0.jar stax-api-1.0-2.jar

commons-logging-1.1.3.jar hadoop-yarn-server-common-2.3.0.jar xmlenc-0.52.jar

commons-math3-3.1.1.jar httpclient-4.2.5.jar xz-1.0.jar

commons-net-3.1.jar httpcore-4.2.5.jar zookeeper-3.4.5.jar

guava-11.0.2.jar jackson-core-asl-1.8.8.jar

```

Is it looking in the wrong place?

Can you make sure to set druid.extensions.hadoopDependenciesDir?

http://druid.io/docs/latest/configuration/index.html
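
For example, in common.runtime.properties (a sketch; the path comes from the ls output above, where the dependencies actually live under /opt/druid):

```
druid.extensions.hadoopDependenciesDir=/opt/druid/hadoop-dependencies
```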

Definitely getting more interesting:

```
bash-4.3# cat /var/druid/indexing-logs/index_hadoop_wikiticker_2017-02-01T21:16:19.621Z.log

2017-02-01T21:16:25,205 WARN [main] org.apache.curator.retry.ExponentialBackoffRetry - maxRetries too large (30). Pinning to 29

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/opt/druid-0.9.2/lib/log4j-slf4j-impl-2.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/opt/druid/hadoop-dependencies/hadoop-client/2.3.0/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

2017-02-01 21:16:27,880 task-runner-0-priority-0 WARN JNDI lookup class is not available because this JRE does not support JNDI. JNDI string lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JndiLookup to org.apache.logging.log4j.core.lookup.StrLookup

at java.lang.Class.cast(Class.java:3369)

at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)

2017-02-01 21:16:27,883 task-runner-0-priority-0 WARN JMX runtime input lookup class is not available because this JRE does not support JMX. JMX lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JmxRuntimeInputArgumentsLookup to org.apache.logging.log4j.core.lookup.StrLookup

at java.lang.Class.cast(Class.java:3369)

at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)

2017-02-01 21:16:27,895 task-runner-0-priority-0 WARN JNDI lookup class is not available because this JRE does not support JNDI. JNDI string lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JndiLookup to org.apache.logging.log4j.core.lookup.StrLookup

at java.lang.Class.cast(Class.java:3369)

at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)


2017-02-01 21:16:27,897 task-runner-0-priority-0 WARN JMX runtime input lookup class is not available because this JRE does not support JMX. JMX lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JmxRuntimeInputArgumentsLookup to org.apache.logging.log4j.core.lookup.StrLookup

at java.lang.Class.cast(Class.java:3369)

at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)


2017-02-01 21:16:27,921 task-runner-0-priority-0 WARN JNDI lookup class is not available because this JRE does not support JNDI. JNDI string lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JndiLookup to org.apache.logging.log4j.core.lookup.StrLookup

at java.lang.Class.cast(Class.java:3369)

at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)


2017-02-01 21:16:27,922 task-runner-0-priority-0 WARN JMX runtime input lookup class is not available because this JRE does not support JMX. JMX lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JmxRuntimeInputArgumentsLookup to org.apache.logging.log4j.core.lookup.StrLookup

at java.lang.Class.cast(Class.java:3369)

at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)


2017-02-01T21:16:31,920 WARN [task-runner-0-priority-0] org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform… using builtin-java classes where applicable

2017-02-01T21:16:32,439 WARN [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobSubmitter - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

2017-02-01T21:16:32,444 WARN [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set. User classes may not be found. See Job or Job#setJar(String).

2017-02-01T21:16:32,467 WARN [task-runner-0-priority-0] org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/quickstart/wikiticker-2015-09-12-sampled.json

2017-02-01T21:16:32,471 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_hadoop_wikiticker_2017-02-01T21:16:19.621Z, type=index_hadoop, dataSource=wikiticker}]

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException

at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]


at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:204) ~[druid-indexing-service-0.9.2.jar:0.9.2]

```

Getting there!

Try tip 2 as indicated on the Druid page:

http://druid.io/docs/latest/operations/other-hadoop.html
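
If tip 2 is the MapReduce classloader workaround (an assumption on my part), it amounts to adding a jobProperties entry to the tuningConfig of the Hadoop index spec, roughly:

```
"tuningConfig" : {
  "type" : "hadoop",
  "jobProperties" : {
    "mapreduce.job.classloader" : "true"
  }
}
```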

Will do, thank you.

Of course, all of this was just the basic test from the quickstart page… just to see if the basics work. I still cannot figure out my Kafka stream, which was why I did this in the first place.

It looks like the message runs through Kafka and is picked up by Tranquility, but then gets lost again. The middlemanager does appear to ask something of zookeeper, though I am unsure what.

Combined logs:

```
tranquility_1 | 2017-02-01 22:32:51,324 [KafkaConsumer-0] INFO c.m.t.kafka.writer.WriterController - Creating EventWriter for topic [csmessage] using dataSource [cs-conversation]

tranquility_1 | 2017-02-01 22:32:51,919 [KafkaConsumer-0] INFO o.a.c.f.imps.CuratorFrameworkImpl - Starting

tranquility_1 | 2017-02-01 22:32:51,937 [KafkaConsumer-0] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@3655f270

tranquility_1 | 2017-02-01 22:32:51,951 [KafkaConsumer-0-SendThread(analyticslauncher_zookeeper_1.analyticslauncher_default:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server analyticslauncher_zookeeper_1.analyticslauncher_default/192.168.32.9:2181. Will not attempt to authenticate using SASL (unknown error)

tranquility_1 | 2017-02-01 22:32:51,955 [KafkaConsumer-0-SendThread(analyticslauncher_zookeeper_1.analyticslauncher_default:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to analyticslauncher_zookeeper_1.analyticslauncher_default/192.168.32.9:2181, initiating session

zookeeper_1 | 2017-02-01 22:32:51,968 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /192.168.32.15:36872

zookeeper_1 | 2017-02-01 22:32:52,000 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@928] - Client attempting to establish new session at /192.168.32.15:36872

tranquility_1 | 2017-02-01 22:32:52,094 [KafkaConsumer-0-SendThread(analyticslauncher_zookeeper_1.analyticslauncher_default:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server analyticslauncher_zookeeper_1.analyticslauncher_default/192.168.32.9:2181, sessionid = 0x159fbca1496000a, negotiated timeout = 40000

zookeeper_1 | 2017-02-01 22:32:52,095 [myid:] - INFO [SyncThread:0:ZooKeeperServer@673] - Established session 0x159fbca1496000a with negotiated timeout 40000 for client /192.168.32.15:36872

tranquility_1 | 2017-02-01 22:32:52,118 [KafkaConsumer-0-EventThread] INFO o.a.c.f.state.ConnectionStateManager - State change: CONNECTED

tranquility_1 | 2017-02-01 22:32:52,668 [KafkaConsumer-0] INFO c.m.t.finagle.FinagleRegistry - Adding resolver for scheme[disco].

coordinator_1 | 2017-02-01T22:32:54,472 WARN [DatabaseSegmentManager-Exec–0] io.druid.metadata.SQLMetadataSegmentManager - No segments found in the database!

tranquility_1 | 2017-02-01 22:32:55,506 [KafkaConsumer-0] INFO o.h.validator.internal.util.Version - HV000001: Hibernate Validator 5.1.3.Final

tranquility_1 | 2017-02-01 22:32:56,001 [KafkaConsumer-0] INFO io.druid.guice.JsonConfigurator - Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, directory=‘extensions’, hadoopDependenciesDir=‘hadoop-dependencies’, hadoopContainerDruidClasspath=‘null’, loadList=null}]

tranquility_1 | 2017-02-01 22:32:56,469 [KafkaConsumer-0] INFO c.metamx.emitter.core.LoggingEmitter - Start: started [true]

tranquility_1 | 2017-02-01 22:32:57,488 [KafkaConsumer-0] INFO c.m.t.finagle.FinagleRegistry - Adding resolver for scheme[druidTask!druid:overlord].

zookeeper_1 | 2017-02-01 22:32:57,696 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x159fbca1496000a type:create cxid:0x2 zxid:0x12a txntype:-1 reqpath:n/a Error Path:/tranquility/beams/druid:overlord/cs-conversation Error:KeeperErrorCode = NoNode for /tranquility/beams/druid:overlord/cs-conversation

zookeeper_1 | 2017-02-01 22:32:57,985 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x159fbca1496000a type:create cxid:0xc zxid:0x130 txntype:-1 reqpath:n/a Error Path:/tranquility/beams/druid:overlord/cs-conversation/mutex/locks Error:KeeperErrorCode = NoNode for /tranquility/beams/druid:overlord/cs-conversation/mutex/locks

tranquility_1 | 2017-02-01 22:32:58,003 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] WARN org.apache.curator.utils.ZKPaths - The version of ZooKeeper being used doesn’t support Container nodes. CreateMode.PERSISTENT will be used instead.

zookeeper_1 | 2017-02-01 22:32:58,083 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x159fbca1496000a type:create cxid:0x17 zxid:0x134 txntype:-1 reqpath:n/a Error Path:/tranquility/beams/druid:overlord/cs-conversation/mutex/leases Error:KeeperErrorCode = NoNode for /tranquility/beams/druid:overlord/cs-conversation/mutex/leases

tranquility_1 | 2017-02-01 22:32:58,265 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.m.tranquility.beam.ClusteredBeam - Creating new merged beam for identifier[druid:overlord/cs-conversation] timestamp[2017-02-01T22:00:00.000Z] (target = 1, actual = 0)

tranquility_1 | 2017-02-01 22:32:58,512 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO com.metamx.common.scala.control$ - Creating druid indexing task (service = druid:overlord): {

tranquility_1 | “type” : “index_realtime”,

tranquility_1 | “id” : “index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0”,

tranquility_1 | “resource” : {

tranquility_1 | “availabilityGroup” : “cs-conversation-2017-02-01T22:00:00.000Z-0000”,

tranquility_1 | “requiredCapacity” : 1

tranquility_1 | },

tranquility_1 | “spec” : {

tranquility_1 | “dataSchema” : {

tranquility_1 | “dataSource” : “cs-conversation”,

tranquility_1 | “parser” : {

tranquility_1 | “type” : “map”,

tranquility_1 | “parseSpec” : {

tranquility_1 | “format” : “json”,

tranquility_1 | “timestampSpec” : {

tranquility_1 | “column” : “createTstamp”,

tranquility_1 | “format” : “millis”,

tranquility_1 | “missingValue” : null

tranquility_1 | },

tranquility_1 | “dimensionsSpec” : {

tranquility_1 | “dimensionExclusions” : [ “count”, “createTstamp” ],

tranquility_1 | “spatialDimensions” :

tranquility_1 | }

tranquility_1 | }

tranquility_1 | },

tranquility_1 | “metricsSpec” : [ {

tranquility_1 | “type” : “count”,

tranquility_1 | “name” : “count”

tranquility_1 | } ],

tranquility_1 | “granularitySpec” : {

tranquility_1 | “type” : “uniform”,

tranquility_1 | “segmentGranularity” : “HOUR”,

tranquility_1 | “queryGranularity” : {

tranquility_1 | “type” : “none”

tranquility_1 | }

tranquility_1 | }

tranquility_1 | },

tranquility_1 | “ioConfig” : {

tranquility_1 | “type” : “realtime”,

tranquility_1 | “plumber” : null,

tranquility_1 | “firehose” : {

tranquility_1 | “type” : “clipped”,

tranquility_1 | “interval” : “2017-02-01T22:00:00.000Z/2017-02-01T23:00:00.000Z”,

tranquility_1 | “delegate” : {

tranquility_1 | “type” : “timed”,

tranquility_1 | “shutoffTime” : “2017-02-01T23:15:00.000Z”,

tranquility_1 | “delegate” : {

tranquility_1 | “type” : “receiver”,

tranquility_1 | “serviceName” : “firehose:druid:overlord:cs-conversation-022-0000-0000”,

tranquility_1 | “bufferSize” : 100000

tranquility_1 | }

tranquility_1 | }

tranquility_1 | }

tranquility_1 | },

tranquility_1 | “tuningConfig” : {

tranquility_1 | “shardSpec” : {

tranquility_1 | “type” : “linear”,

tranquility_1 | “partitionNum” : 0

tranquility_1 | },

tranquility_1 | “rejectionPolicy” : {

tranquility_1 | “type” : “none”

tranquility_1 | },

tranquility_1 | “buildV9Directly” : false,

tranquility_1 | “maxPendingPersists” : 0,

tranquility_1 | “intermediatePersistPeriod” : “PT10M”,

tranquility_1 | “windowPeriod” : “PT10M”,

tranquility_1 | “type” : “realtime”,

tranquility_1 | “maxRowsInMemory” : “100000”

tranquility_1 | }

tranquility_1 | }

tranquility_1 | }

tranquility_1 | Feb 01, 2017 10:32:58 PM com.twitter.finagle.Init$$anonfun$1 apply$mcV$sp

tranquility_1 | INFO: Finagle version 6.31.0 (rev=50d3bb0eea5ad3ed332111d707184c80fed6a506) built at 20151203-164135

tranquility_1 | 2017-02-01 22:32:59,285 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.m.c.s.net.finagle.DiscoResolver - Updating instances for service[druid:overlord] to Set(ServiceInstance{name=‘druid:overlord’, id=‘6048e6c9-e931-4b12-87c4-e21f92bb482b’, address=‘overlord’, port=8090, sslPort=null, payload=null, registrationTimeUTC=1485988149060, serviceType=DYNAMIC, uriSpec=null})

tranquility_1 | 2017-02-01 22:32:59,438 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.m.t.finagle.FinagleRegistry - Created client for service: disco!druid:overlord

overlord_1 | 2017-02-01T22:33:03,712 WARN [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - No worker selections strategy set. Using default.

tranquility_1 | 2017-02-01 22:33:03,959 [finagle/netty3-1] INFO com.metamx.common.scala.control$ - Created druid indexing task with id: index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0 (service = druid:overlord)

tranquility_1 | 2017-02-01 22:33:04,005 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.m.t.finagle.DruidTaskResolver - Started monitoring task[index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0] (#1).

tranquility_1 | 2017-02-01 22:33:04,007 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.m.t.finagle.DruidTaskResolver - Starting poll thread for indexService[com.metamx.tranquility.druid.IndexService@18de8800], pollPeriod[PT20S].

tranquility_1 | 2017-02-01 22:33:04,016 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.m.t.finagle.FinagleRegistry - Created client for service: druidTask!druid:overlord!index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0

tranquility_1 | 2017-02-01 22:33:04,084 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.m.tranquility.beam.ClusteredBeam - Created beam: {“interval”:“2017-02-01T22:00:00.000Z/2017-02-01T23:00:00.000Z”,“partition”:0,“tasks”:[{“id”:“index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0”,“firehoseId”:“cs-conversation-022-0000-0000”}],“timestamp”:“2017-02-01T22:00:00.000Z”}

tranquility_1 | 2017-02-01 22:33:04,090 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.metamx.tranquility.druid.DruidBeam - Closing Druid beam for datasource[cs-conversation] interval[2017-02-01T22:00:00.000Z/2017-02-01T23:00:00.000Z] (tasks = index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0)

tranquility_1 | 2017-02-01 22:33:04,093 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.m.t.finagle.FinagleRegistry - Closing client for service: druidTask!druid:overlord!index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0

tranquility_1 | 2017-02-01 22:33:04,110 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.m.t.finagle.DruidTaskResolver - Stopped monitoring task[index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0] (#1).

tranquility_1 | 2017-02-01 22:33:04,128 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.m.tranquility.beam.ClusteredBeam - Writing new beam data to[/tranquility/beams/druid:overlord/cs-conversation/data]: {“latestTime”:“2017-02-01T22:00:00.000Z”,“latestCloseTime”:“1970-01-01T00:00:00.000Z”,“beams”:{“2017-02-01T22:00:00.000Z”:[{“interval”:“2017-02-01T22:00:00.000Z/2017-02-01T23:00:00.000Z”,“partition”:0,“tasks”:[{“id”:“index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0”,“firehoseId”:“cs-conversation-022-0000-0000”}],“timestamp”:“2017-02-01T22:00:00.000Z”}]}}

tranquility_1 | 2017-02-01 22:33:04,204 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.m.tranquility.beam.ClusteredBeam - Adding beams for identifier[druid:overlord/cs-conversation] timestamp[2017-02-01T22:00:00.000Z]: List(Map(interval -> 2017-02-01T22:00:00.000Z/2017-02-01T23:00:00.000Z, partition -> 0, tasks -> ArraySeq(Map(id -> index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0, firehoseId -> cs-conversation-022-0000-0000)), timestamp -> 2017-02-01T22:00:00.000Z))

tranquility_1 | 2017-02-01 22:33:04,210 [DruidTaskResolver[com.metamx.tranquility.druid.IndexService@18de8800]] INFO c.m.t.finagle.DruidTaskResolver - Updating location for task[index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0] (#1) to[Neg].

tranquility_1 | 2017-02-01 22:33:04,265 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.m.t.finagle.DruidTaskResolver - Started monitoring task[index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0] (#2).

tranquility_1 | 2017-02-01 22:33:04,285 [DruidTaskResolver[com.metamx.tranquility.druid.IndexService@18de8800]] INFO c.m.t.finagle.DruidTaskResolver - Updating location for task[index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0] (#2) to[Neg].

tranquility_1 | 2017-02-01 22:33:04,285 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO c.m.t.finagle.FinagleRegistry - Created client for service: druidTask!druid:overlord!index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0

tranquility_1 | Feb 01, 2017 10:33:04 PM com.twitter.finagle.loadbalancer.LoadBalancerFactory$StackModule$$anonfun$5 apply

tranquility_1 | INFO: druidTask!druid:overlord!index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0: name resolution is negative (local dtab: Dtab())

zookeeper_1 | 2017-02-01 22:33:14,253 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /192.168.32.14:53186

zookeeper_1 | 2017-02-01 22:33:14,259 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@928] - Client attempting to establish new session at /192.168.32.14:53186

zookeeper_1 | 2017-02-01 22:33:14,320 [myid:] - INFO [SyncThread:0:ZooKeeperServer@673] - Established session 0x159fbca1496000b with negotiated timeout 30000 for client /192.168.32.14:53186

zookeeper_1 | 2017-02-01 22:33:15,025 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x159fbca1496000b type:create cxid:0x13 zxid:0x13f txntype:-1 reqpath:n/a Error Path:/druid/discovery/firehose:druid:overlord:cs-conversation-022-0000-0000 Error:KeeperErrorCode = NoNode for /druid/discovery/firehose:druid:overlord:cs-conversation-022-0000-0000

zookeeper_1 | 2017-02-01 22:33:15,087 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x159fbca1496000b type:create cxid:0x19 zxid:0x142 txntype:-1 reqpath:n/a Error Path:/druid/discovery/cs-conversation-022-0000-0000 Error:KeeperErrorCode = NoNode for /druid/discovery/cs-conversation-022-0000-0000

tranquility_1 | 2017-02-01 22:33:25,441 [DruidTaskResolver[com.metamx.tranquility.druid.IndexService@18de8800]] INFO c.m.t.finagle.DruidTaskResolver - Updating location for task[index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0] (#2) to[Bound(Set(middlemanager/192.168.32.14:8100),Map())].

tranquility_1 | 2017-02-01 22:33:28,760 [KafkaConsumer-CommitThread] INFO c.m.tranquility.kafka.KafkaConsumer - Flushed {csmessage={receivedCount=1, sentCount=1, droppedCount=0, unparseableCount=0}} pending messages in 29720ms and committed offsets in 13ms.
broker_1 | 370.475: [GC (Allocation Failure) 370.479: [ParNew: 222360K->17825K(230400K), 2.8099877 secs] 222360K->24416K(1022976K), 2.8152290 secs] [Times: user=0.31 sys=0.55, real=2.81 secs]

broker_1 | 373.292: [GC (CMS Initial Mark) [1 CMS-initial-mark: 6590K(792576K)] 25544K(1022976K), 0.0099279 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]

broker_1 | 373.302: [CMS-concurrent-mark-start]

broker_1 | 373.961: [CMS-concurrent-mark: 0.623/0.659 secs] [Times: user=0.08 sys=0.11, real=0.66 secs]

broker_1 | 373.963: [CMS-concurrent-preclean-start]

broker_1 | 373.972: [CMS-concurrent-preclean: 0.008/0.008 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]

broker_1 | 373.972: [CMS-concurrent-abortable-preclean-start]

broker_1 | CMS: abort preclean due to time 379.085: [CMS-concurrent-abortable-preclean: 1.874/5.112 secs] [Times: user=1.89 sys=0.04, real=5.11 secs]

broker_1 | 379.085: [GC (CMS Final Remark) [YG occupancy: 24487 K (230400 K)]379.085: [Rescan (parallel) , 0.0144542 secs]379.100: [weak refs processing, 0.0001810 secs]379.100: [class unloading, 0.8384265 secs]379.939: [scrub symbol table, 0.2407246 secs]380.180: [scrub string table, 0.0010639 secs][1 CMS-remark: 6590K(792576K)] 31077K(1022976K), 1.0976879 secs] [Times: user=0.11 sys=0.12, real=1.10 secs]

broker_1 | 380.188: [CMS-concurrent-sweep-start]

broker_1 | 380.215: [CMS-concurrent-sweep: 0.027/0.027 secs] [Times: user=0.01 sys=0.00, real=0.03 secs]

broker_1 | 380.215: [CMS-concurrent-reset-start]

broker_1 | 380.226: [CMS-concurrent-reset: 0.010/0.010 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]

```

FYI, 192.168.32.14, which zookeeper reports receiving a connection from, is the middlemanager.

I am unsure who talks to zk and when, so it is hard for me to debug.
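
One way to at least see who has registered with zk (a sketch, assuming the default Druid zk paths under /druid; run from the zookeeper container):

```
bin/zkCli.sh -server localhost:2181
ls /druid/announcements
ls /druid/discovery
```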

Also, every node - broker, overlord, middlemanager, coordinator, historical - has the following in /opt/druid/conf/druid/_common/common.runtime.properties:

```
druid.emitter=logging
druid.emitter.logging.logLevel=debug
```

You know what would have made this process far easier? A page describing the flow. E.g.:

When a task is received:

  1. It is received by node XX. The log message confirming receipt is at /some/path with a filename of zzzz.
  2. That node passes the task to a middlemanager by finding one in zookeeper, looking at path YY for properties ZZ. The communications with zookeeper generate 3 messages in path ././…
  3. XX communicates directly with the middlemanager over protocol ___ to pass it the message. XX records the following message at /././
  4. The middlemanager receives the job and indicates successful receipt by recording message … at /./…
  5. The middlemanager creates an indexer peon by doing AAAA and records message BBBB.

You get the idea. For this entire process I have been flying mostly blind, so I have to turn back here. I would love to have a flow of “this is what is to be expected when streaming a message in, and here is how it indicates what it is doing.”

Then, I could follow that through and see where it is falling down… and save you lots of work too! :)

Slowly getting there. It turns out that I had a missing directory, /tmp/druid/persistent/task/. Once that existed, the middlemanager created the task and it seems to be running.
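
In case it helps anyone else, a minimal sketch of that fix, assuming the peon base task directory is /tmp/druid/persistent/task as it is in my setup:

```
# On the middlemanager container: make sure the peon working directory exists and is writable
mkdir -p /tmp/druid/persistent/task

# Or set it explicitly in the middleManager runtime.properties
# druid.indexer.task.baseTaskDir=/tmp/druid/persistent/task
```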

However, now it is stuck in RUNNING mode. It shows that both on the coordinator and in the logs. All of /var/druid/indexing-logs, /var/druid/segments/, etc. are empty. The logs for the task itself are in /tmp/druid/persistent/task/TASK_ID, and they are pretty short:

```
bash-4.3# cat /tmp/druid/persistent/task/index_realtime_cs-conversation_2017-02-06T10:00:00.000Z_0_0/log

2017-02-06T10:08:03,976 WARN [main] org.apache.curator.retry.ExponentialBackoffRetry - maxRetries too large (30). Pinning to 29

2017-02-06T10:08:06,411 WARN [task-runner-0-priority-0] org.apache.curator.utils.ZKPaths - The version of ZooKeeper being used doesn’t support Container nodes. CreateMode.PERSISTENT will be used instead.

```

That is it, nothing more. Zookeeper logs for the connection from the middlemanager node show:

```
zookeeper_1 | 2017-02-06 10:08:05,474 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /192.168.32.11:46002

zookeeper_1 | 2017-02-06 10:08:05,482 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@928] - Client attempting to establish new session at /192.168.32.11:46002

zookeeper_1 | 2017-02-06 10:08:05,498 [myid:] - INFO [SyncThread:0:ZooKeeperServer@673] - Established session 0x15a12b275c6000e with negotiated timeout 30000 for client /192.168.32.11:46002

zookeeper_1 | 2017-02-06 10:08:06,383 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x15a12b275c6000e type:create cxid:0x13 zxid:0x15f txntype:-1 reqpath:n/a Error Path:/druid/discovery/firehose:druid:overlord:cs-conversation-010-0000-0000 Error:KeeperErrorCode = NoNode for /druid/discovery/firehose:druid:overlord:cs-conversation-010-0000-0000

zookeeper_1 | 2017-02-06 10:08:06,435 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x15a12b275c6000e type:create cxid:0x19 zxid:0x162 txntype:-1 reqpath:n/a Error Path:/druid/discovery/cs-conversation-010-0000-0000 Error:KeeperErrorCode = NoNode for /druid/discovery/cs-conversation-010-0000-0000

```

Looking in zk itself:

```
[zk: localhost:2181(CONNECTED) 13] ls /druid/discovery/cs-conversation-010-0000-0000

[65361c2d-6aa0-4a85-8534-cf8f17e1654d]

[zk: localhost:2181(CONNECTED) 14] ls /druid/discovery/cs-conversation-010-0000-0000/65361c2d-6aa0-4a85-8534-cf8f17e1654d

[zk: localhost:2181(CONNECTED) 15] get /druid/discovery/cs-conversation-010-0000-0000/65361c2d-6aa0-4a85-8534-cf8f17e1654d

{“name”:“cs-conversation-010-0000-0000”,“id”:“65361c2d-6aa0-4a85-8534-cf8f17e1654d”,“address”:“middlemanager”,“port”:8101,“sslPort”:null,“payload”:null,“registrationTimeUTC”:1486375686434,“serviceType”:“DYNAMIC”,“uriSpec”:null}

cZxid = 0x164

ctime = Mon Feb 06 10:08:06 GMT 2017

mZxid = 0x164

mtime = Mon Feb 06 10:08:06 GMT 2017

pZxid = 0x164

cversion = 0

dataVersion = 0

aclVersion = 0

ephemeralOwner = 0x15a12b275c6000e

dataLength = 227

numChildren = 0

```

And… at a dead end again.

And eventually one task does index? It took an extraordinarily long time though (the reported duration of 3549070 ms is just under an hour).

```
{"task":"index_realtime_cs-conversation_2017-02-06T09:00:00.000Z_0_0","status":{"id":"index_realtime_cs-conversation_2017-02-06T09:00:00.000Z_0_0","status":"SUCCESS","duration":3549070}}
```