Tasks keep failing but there are no error logs in the worker task's log file

Hi guys
I have set up druid locally for testing an ingestion spec file. The aim is to ingest avro format messages into druid. I have installed schema registry and kafka locally with 1 partition and replication factor 1. I have attached the ingestion spec file and the log files of overlord, middle manager and worker task.

Druid version : 0.14.2-incubating

There is a crash log in overlord.log. It’s there because the task failed and the connection to it gets refused, but I couldn’t find any logs explaining why the task failed.

I’m producing messages to the topic and I could verify that using a kafka-avro-console-consumer. I suspect it is something wrong with the setup or configuration. Any help is greatly appreciated.

Thanks
Abraham

overlord.log (170 KB)

middlemanager.log (131 KB)

avro_data_supervisor_templatate.json (1.23 KB)

index_kafka_vision-conformed_5db6b6f8757a0ab_cfcibngo.log (24.5 KB)

Hi all
Adding a few other details
I’m starting druid using the command bin/supervise -c quickstart/tutorial/conf/tutorial-cluster.conf
I have also added extensions druid-kafka-indexing-service and druid-avro-extensions in quickstart/tutorial/conf/druid/_common/common.runtime.properties
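
For anyone reproducing this setup, one quick sanity check is that both extensions really are on an active (uncommented) druid.extensions.loadList line. A minimal sketch, assuming the tutorial file layout named above; the helper name has_extension is made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: confirm that an extension is named in druid.extensions.loadList.
# has_extension is a hypothetical helper, not part of Druid.

# has_extension FILE NAME: succeeds if "NAME" appears on an uncommented loadList line.
has_extension() {
  local file="$1" name="$2"
  grep -E '^[[:space:]]*druid\.extensions\.loadList[[:space:]]*=' "$file" \
    | grep -Fq "\"$name\""
}

# Example (run from the Druid package root):
#   props=quickstart/tutorial/conf/druid/_common/common.runtime.properties
#   for ext in druid-kafka-indexing-service druid-avro-extensions; do
#     has_extension "$props" "$ext" && echo "$ext: listed" || echo "$ext: MISSING"
#   done
```

This only checks the properties file. It does not prove that the extension was actually loaded at startup.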

Are you using the console to run the task or the command line? Are there any messages that arrive when you submit the task? Does the tutorial data work?

https://druid.apache.org/docs//0.14.2-incubating/tutorials/tutorial-kafka.html

Also, any reason to use the .14 Quickstart rather than .19 or .20?

Are you using the console to run the task or the command line?

I’m using the command curl -XPOST -H'Content-Type: application/json' -d @avro_data_supervisor_template.json http://localhost:7090/druid/indexer/v1/supervisor on the terminal to post supervisor to druid
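
Once the supervisor is posted, the Overlord API can also be asked what it thinks of the supervisor and its tasks. A rough sketch, assuming the Overlord address from the command above and that the supervisor id is the datasource name (vision-conformed is inferred from the task log file name); the helper names are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: build Overlord API URLs for inspecting a supervisor and its tasks.
# OVERLORD defaults to the address used in the curl command above (assumption).
OVERLORD="${OVERLORD:-http://localhost:7090}"

# URL of the status report for a given supervisor id.
supervisor_status_url() { printf '%s/druid/indexer/v1/supervisor/%s/status\n' "$OVERLORD" "$1"; }

# URL of the status of a single task.
task_status_url() { printf '%s/druid/indexer/v1/task/%s/status\n' "$OVERLORD" "$1"; }

# URL of the task log as served by the Overlord.
task_log_url() { printf '%s/druid/indexer/v1/task/%s/log\n' "$OVERLORD" "$1"; }

# Example:
#   curl -s "$(supervisor_status_url vision-conformed)"
#   curl -s "$(task_status_url "$TASK_ID")"    # TASK_ID: a task id taken from the Overlord log
#   curl -s "$(task_log_url "$TASK_ID")"
```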

Are there any messages that arrive when you submit the task?

Mostly no. Messages arrive roughly 5-10 seconds after I submit the supervisor.

Does the tutorial data work?

It worked initially when I installed druid locally. I’m yet to try now. Will try that and update here.

any reason to use the .14 Quickstart rather than .19 or .20?

The existing prod deployment that reads json data from kafka uses this version(0.14.2)

Hi Rachel
The tutorial for loading a file works. But the tutorial for Kafka ingestion (https://druid.apache.org/docs/0.14.2-incubating/tutorials/tutorial-kafka.html) fails. The behaviour is the same: the tasks fail without any error messages in their logs. In the overlord logs, there is a crash message after the overlord tries to connect to a failed task.

Attaching the overlord.log file and a few corresponding task files.

There are repeated crash logs at the beginning of the overlord.log file: "Error connecting to server localhost on port 1,527 with message Connection refused (Connection refused)." I’m not sure whether that’s expected or whether it is the real cause of this problem.

overlord.txt (77.6 KB)

index_kafka_wikipedia_59dcea86b30dcdf_bebmfiod.log (17.5 KB)

index_kafka_wikipedia_59dcea86b30dcdf_koghokai.log (17.5 KB)

index_kafka_wikipedia_59dcea86b30dcdf_cfcibngo.log (17.5 KB)

Hi,

The tasks are being killed, for example index_kafka_vision-conformed_5db6b6f8757a0ab_bebmfiod:

2020-10-09T07:12:37,622 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_vision-conformed_5db6b6f8757a0ab_bebmfiod] location changed to [TaskLocation{host='localhost', port=8100, tlsPort=-1}].

2020-10-09T07:13:46,415 WARN [IndexTaskClient-vision-conformed-0] org.apache.druid.indexing.common.IndexTaskClient - Retries exhausted for [http://localhost:8100/druid/worker/v1/chat/index_kafka_vision-conformed_5db6b6f8757a0ab_bebmfiod/status], last exception:
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_202]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_202]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_202]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_202]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_202]
at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_202]

2020-10-09T07:13:46,416 INFO [KafkaSupervisor-vision-conformed] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vision-conformed_5db6b6f8757a0ab_bebmfiod] because: [Task [index_kafka_vision-conformed_5db6b6f8757a0ab_bebmfiod] failed to return status, killing task]
2020-10-09T07:13:46,585 INFO [KafkaSupervisor-vision-conformed] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: localhost:8091, status 200 OK, response: {"task":"index_kafka_vision-conformed_5db6b6f8757a0ab_bebmfiod"}
2020-10-09T07:13:46,585 INFO [KafkaSupervisor-vision-conformed] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vision-conformed_5db6b6f8757a0ab_bebmfiod] from activeTasks

I see you have set "taskDuration": "PT1M" and "completionTimeout": "PT2M" in your supervisor spec, which is far too short. Since you have already run the standard tutorial, I am assuming you made no changes to its spec (other than the Kafka broker details) and that it still failed with the same error. Could you verify the following and see how it goes:

  1. Check that port 8100 on your local machine is free and reachable.
  2. Increase the taskCount to 2 and run the standard Kafka indexing tutorial, and see whether it still fails with no error details.
  3. Turn on DEBUG logging so that the logs carry debug-level detail.
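
For the port check in step 1, something like the following could be used. It is only a sketch: it relies on bash's /dev/tcp redirection (so it needs bash, not plain sh), and port_open is a made-up helper name. Note that "Connection refused" in the Overlord log means nothing was listening on 8100 at that moment, so the interesting question is whether the port accepts connections while a task is supposed to be running.

```shell
#!/usr/bin/env bash
# Sketch: test whether anything accepts TCP connections on a host/port.
# port_open is a hypothetical helper; it uses bash's built-in /dev/tcp.

port_open() {
  local host="$1" port="$2"
  # The subshell opens the connection and closes it again on exit; errors are silenced.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Example: 8100 is the task port and 8091 the worker port seen in the Overlord log.
#   port_open localhost 8100 && echo "8100: accepting connections" || echo "8100: refused"
#   port_open localhost 8091 && echo "8091: accepting connections" || echo "8091: refused"
```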

Thanks and Regards,
Vaibhav

Hi Vaibhav

Please find the results given below

  1. I checked port 8100. It is not used by any other service on my machine, so it’s available.
  2. I increased the taskCount to 2. It still didn’t make a difference. The same behaviour was observed.
  3. I have set the log level to debug and tried. Attaching those logs. There is a crash in the task logs which states java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. Not sure if that’s the cause of the problem.

The task duration for the tutorial spec file is PT10M.

Regards
Abraham

index_kafka_wikipedia_066842f90ab7852_bebmfiod.log (77.8 KB)

index_kafka_wikipedia_066842f90ab7852_cfcibngo.log (77.8 KB)

overlord.txt (1.13 MB)

index_kafka_wikipedia_066842f90ab7852_koghokai.log (77.8 KB)

I set the HADOOP_HOME environment variable and tried again. This time that crash didn’t happen, but the overall behaviour remains the same.

Are there any errors in the middle manager logs?

In the past, when I have seen a “connection refused” error, it typically means something else went wrong.

Have you verified that the druid cluster can connect to the kafka cluster, etc? If the tutorial isn’t working, it seems like some type of configuration problem… I would check network connectivity, etc as a first step.

I’m attaching all the log files under var/sv and a task log file. I could see in overlord.log, connection is established with kafka and metadata information could be retrieved.

In the zk.log file, there are repeated messages about connections being refused because the client has seen a higher zxid than the server it is trying to connect to. While attempting to reproduce this issue to generate log files, I always clear the var/ directory in Druid. Could that be what is causing the problem?

The zookeeper error is discussed here https://stackoverflow.com/questions/45804955/zookeeper-refuses-kafka-connection-from-an-old-client and I tried restarting kafka and zookeeper as suggested there. It still didn’t work

The error is also discussed in detail here: https://issues.apache.org/jira/browse/ZOOKEEPER-832 . It says that 3.4.11 is one of the affected versions, and that’s the one I use on my machine. Would you suggest changing it to some other version? Or are there any other directories that I could delete so that the zxid can be reset?
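
To compare against the zxid quoted in the refusal messages, the server's own last zxid can be read with ZooKeeper's "srvr" four-letter command. A small sketch, assuming nc is installed and ZooKeeper listens on the default localhost:2181 (neither is confirmed in this thread); parse_zxid is a made-up helper name:

```shell
#!/usr/bin/env bash
# Sketch: read the last zxid a ZooKeeper server reports via the "srvr" command.
# parse_zxid is a hypothetical helper; host and port below are assumptions.

# Extract the value of the "Zxid:" line from srvr output on stdin.
parse_zxid() { awk -F': ' '/^Zxid:/ { print $2 }'; }

# Example:
#   echo srvr | nc localhost 2181 | parse_zxid
```

If the server reports a much lower zxid than the one the client claims to have seen, that would be consistent with the server's data directory having been wiped while a client stayed connected. In that case restarting the client after the wipe is the usual remedy.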

zk.log (321 KB)

router.log (905 KB)

index_kafka_wikipedia_380c1a0193c4383_bebmfiod.log (74.8 KB)

middleManager.log (832 KB)

coordinator.log (2.19 MB)

overlord.log (887 KB)

historical.log (316 KB)

broker.log (775 KB)

I don’t think it would hurt to try to upgrade your ZK version. Also, are you running ZK separate from your master nodes?

I tried with version 3.5.8 and still the same error is getting written in zk.log. The zookeeper is running from {Druid_path}/zk.

zk.log (60.1 KB)

Is it an existing ZooKeeper, or did it come with your Druid?

In the tutorial given here (https://druid.apache.org/docs/0.14.2-incubating/tutorials/index.html), it says to download it using the command curl https://archive.apache.org/dist/zookeeper/zookeeper-3.4.11/zookeeper-3.4.11.tar.gz -o zookeeper-3.4.11.tar.gz, so I started with version 3.4.11.

Version 3.5.8 was downloaded from here( https://www.apache.org/dyn/closer.lua/zookeeper/zookeeper-3.5.8/apache-zookeeper-3.5.8-bin.tar.gz )

what about the kafka instance? Did you use the one that came with the tutorial? Or your own?

And this went off without a hitch?

In the package root, run the following commands:

curl https://archive.apache.org/dist/zookeeper/zookeeper-3.4.11/zookeeper-3.4.11.tar.gz -o zookeeper-3.4.11.tar.gz
tar -xzf zookeeper-3.4.11.tar.gz
mv zookeeper-3.4.11 zk

The startup scripts for the tutorial will expect the contents of the Zookeeper tarball to be located at zk under the apache-druid-0.14.2-incubating package root.

Yes I did not face any problem while downloading and placing zookeeper at zk/
When I tried zk version 3.5.8, I followed the same steps as given in the tutorial and it got started.

The kafka instance is one that I had already installed. The version that I’m using is 1.0.1

I also tried Kafka version 0.10.2.2 and got the same error in zk.log.

Could you check that the Middle Manager runtime properties (actually, all the configs I guess…!) have the right Zookeeper endpoint? Just checking that the tasks are announcing to the right zookeeper and that the overlord is looking at the right place as well…
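
In Druid the ZooKeeper connection string is set with druid.zk.service.host (and the base znode path with druid.zk.paths.base), usually in the common runtime properties. A quick way to see what every service is configured with might look like this; it is a sketch only, the paths assume the tutorial layout, and zk_settings is a made-up helper name:

```shell
#!/usr/bin/env bash
# Sketch: list every uncommented druid.zk.* property under a config directory.
# zk_settings is a hypothetical helper, not part of Druid.

# zk_settings DIR: print file:line:setting for each match, searching recursively.
zk_settings() {
  grep -rnE '^[[:space:]]*druid\.zk\.' "$1"
}

# Example (run from the Druid package root):
#   zk_settings quickstart/tutorial/conf/druid
#   grep -n clientPort quickstart/tutorial/conf/zk/zoo.cfg   # port the ZooKeeper server listens on
```

The host and port in druid.zk.service.host should match the clientPort of the ZooKeeper instance that is actually running.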

Hi Peter
I checked the configuration files of all the components under 'quickstart/tutorial/conf/druid/'. I could not find any configuration related to ZooKeeper. Would you suggest checking any particular configuration parameter?
I’m attaching all the property files along with zoo.cfg from 'quickstart/tutorial/conf/zk'.

zoo.cfg (155 Bytes)