One or two Druid tasks failing once or twice an hour for one data source

Hello folks:

I created this demo lab on a single physical machine with plenty of RAM and CPU. I am running two ingestion tasks, and for some reason one of them fails once or twice an hour, continuously. The task duration and timeout values are set to 5 minutes in the spec files for these tasks. The data volume is very limited; we are not trying to stress test the platform.

The task that generates relatively more data keeps failing between the 50th and 59th minute of the hour. The other task has no issues. I looked at the indexing logs for the failed tasks and cannot figure out what went wrong.

I am also seeing some additional task folders under the ./druid/task folder besides the running ones. I thought these were supposed to be moved somewhere else once a task completes; then I noticed that these additional ones are actually the failed ones.

I am attaching two of the failed task logs. Can someone give me a clue as to what is happening? I have been looking at these logs for the last two days with nothing to show for it so far.

Thanks

log-1.txt (3.2 MB)

log-2.txt (3.24 MB)

Folks, this keeps happening and I still don’t have a clue. My tasks are failing at random times with very little data. I suspect this is related to the “unknown task” message in the Overlord logs, but I am not sure yet. Can someone help, please?

Attached is another failed task and overlord log.

Thanks

overlord-service.zip (1.84 MB)

Where are you ingesting from? Batch or streaming?

I am on my phone right now and your logs are too big for me to view at the moment.

I am ingesting from a stream. I can tell you this, though: when my config is as follows, I experience the issue quite frequently:

"tuningConfig": {
    "type": "kafka",
    "reportParseExceptions": false,
    "maxRowsInMemory": 1000000,
    "maxRowsPerSegment": 5000000,
    "maxTotalRows": 5000000
},
"ioConfig": {
    "topic": "topic",
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT5M",
    "completionTimeout": "PT5M",
    "consumerProperties": {
        "bootstrap.servers": "192.168.1.50:9092"
    }
}
}

However, when I set the timers as follows, the issue seems to go away:

"tuningConfig": {
    "type": "kafka",
    "reportParseExceptions": false,
    "maxRowsInMemory": 1000000,
    "maxRowsPerSegment": 5000000,
    "maxTotalRows": 5000000
},
"ioConfig": {
    "topic": "topic",
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT15M",
    "completionTimeout": "PT5M",
    "consumerProperties": {
        "bootstrap.servers": "192.168.1.50:9092"
    }
}
}

I think I am failing to understand the relationship between all the timers involved in ingestion. The message rate is low since this is a proof-of-concept system, and I have plenty of memory (250 GB). There is only one other topic that I ingest on this machine, with an even lower message rate: maybe 4 or 5 messages per minute. The problematic topic generates about 50 messages per minute on average. Also, I am using Druid v0.13.

Thanks

Let me ask about your middleManager configuration and resources.

What is your total worker count?

Also, how many topics are you consuming simultaneously with supervisors?

Following is my middleManager config:

druid.host=[my_hostname]
druid.service=druid/middleManager
druid.plaintextPort=8091

# Number of tasks per middleManager
druid.worker.capacity=40

# Task launch parameters
druid.indexer.runner.javaOpts=-server -Xms1g -Xmx1g -XX:+UseG1GC -XX:MaxDirectMemorySize=4g -Daws.region=us-east-1 -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Dlog4j.configurationFile=/usr/local/share/druid/conf/druid/_common/log4j2-task.xml

druid.indexer.task.baseTaskDir=/part1/druid/task

# HTTP server threads
druid.server.http.numThreads=60

# Processing threads and buffers on Peons
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912
druid.indexer.fork.property.druid.processing.numThreads=4
druid.indexer.fork.property.druid.processing.numMergeBuffers=2

# Hadoop indexing
druid.indexer.task.hadoopWorkingPath=/part1/druid/hadoop-tmp

I have two data sources, and each one has one task assigned (each ingesting a single Kafka partition). The max worker usage at any given time in this setup is 4. I am only having issues with one of the data sources. As I mentioned earlier, setting the ingestion spec for that task as shown above helped, but I can still see a couple of failing tasks here and there. I attached a couple of failed indexing jobs that I noticed this morning.
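For reference, my rough arithmetic for that 4, assuming (and I may be wrong about this) that a task that is publishing can overlap with the next reading task until the completion timeout expires:

    2 data sources x 1 task each                  = 2 reading tasks
    2 reading tasks + 2 overlapping publish tasks = 4 workers max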

Thanks,

Desktop.zip (687 KB)

You would have to tell me how many CPUs you have allocated for the middleManager on this system.

Regardless, my recommendation for now would be to:

Reduce the worker capacity (let’s go to 10).
Set druid.indexer.fork.property.druid.processing.numThreads=2.
Give the peons more heap memory: raise it to 2 GB or 4 GB, e.g. -Xms2g -Xmx2g.

Restart middleManager and let me know.

Thank you; I will apply these changes and check the result. Following is my JVM config for the middleManager:

-server
-Xms1g
-Xmx1g
-XX:+ExitOnOutOfMemoryError
-XX:+UseG1GC
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=/part1/druid/tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Daws.region=us-east-1
-Dservice=middleManager

I have a question for you, though. The resources available on this hardware are larger than what Druid can consume or needs based on the settings I have. Even though I have 40 workers defined in the config, I only ever use 4 at most. So why am I reducing the system specs in the config files?

The middleManager process itself only needs:

-Xmx128m
-Xms128m

The peons need more.

Back to your question about system specs: I don’t know how many CPUs and how much RAM you actually have reserved for the middleManager. However, 40 workers x numThreads=4 means you would need 160 CPUs at max capacity just for the middleManager and peon processes. I realize you have not hit that yet, but regardless, I feel that configuration should be modified.

Let’s get things to a more manageable number and scale up later if needed.

In any case, give the peons at least 2 GB of RAM for heap.
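Rough arithmetic behind those numbers, treating them as a worst-case ceiling rather than predicted usage (the 2 GB heap is just my suggested floor):

    Current:   40 workers x 4 processing threads = 160 cores at full capacity
    Suggested: 10 workers x 2 processing threads =  20 cores at full capacity
    Peon heap: 10 workers x 2 GB heap            =  20 GB heap at full capacity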

Hello Chris:

I reduced the worker capacity from 40 to 20. Following is my middleManager config:

druid.host=mtllab1.demo.local
druid.service=druid/middleManager
druid.plaintextPort=8091

# Number of tasks per middleManager
druid.worker.capacity=20

# Task launch parameters
druid.indexer.runner.javaOpts=-server -Xms2g -Xmx2g -XX:+UseG1GC -XX:MaxDirectMemorySize=4g -Daws.region=us-east-1 -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Dlog4j.configurationFile=/usr/local/share/druid/conf/druid/_common/log4j2-task.xml

druid.indexer.task.baseTaskDir=/part1/druid/task

# HTTP server threads
druid.server.http.numThreads=60

# Processing threads and buffers on Peons
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912
druid.indexer.fork.property.druid.processing.numThreads=2
druid.indexer.fork.property.druid.processing.numMergeBuffers=2

# Hadoop indexing
druid.indexer.task.hadoopWorkingPath=/part1/druid/hadoop-tmp

My ioConfig in the indexing spec file is still the same for the netflow data source:

"ioConfig": {

"topic": “topic”,

"taskCount": 1,

"replicas": 1,

"taskDuration": “PT15M”,

"completionTimeout": “PT5M”,

"consumerProperties": {

"bootstrap.servers": "mtllab1.demo.local:9092"

It was going well for a while, but then I noticed another failed task for the netflow data source. I was checking the traffic volume, and it looks like my max message count per minute is 60 for this Kafka topic. Anyway, I attached the task log as well as the Overlord and middleManager logs. I appreciate your help on this. I noticed that the Overlord logs still complain about this task being “unknown”; I am not sure why.

Thanks

middleManager-service.zip (3.97 MB)

I am also attaching the htop output from the server, just in case you are interested.

Hi,

Looking at your logs, the errors I’m seeing are timeout exceptions: java.util.concurrent.TimeoutException: Idle timeout expired: 300000/300000 ms

It’s a 5-minute timeout, which matches your config value for completionTimeout. There are lots of idle timeout checks before the log reaches the TimeoutException.

From the docs (https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html), the definition of completionTimeout is “The length of time to wait before declaring a publishing task as failed and terminating it.”, so my thinking is that it might be normal for the task to be marked failed if it reaches the completionTimeout specified in the config?

completionTimeout (ISO8601 Period; optional, default PT30M): The length of time to wait before declaring a publishing task as failed and terminating it. If this is set too low, your tasks may never publish. The publishing clock for a task begins roughly after taskDuration elapses.
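If that is what is going on, a minimal sketch of what I would try: simply give the publishing phase more room to finish. Everything except completionTimeout is copied from your spec, and PT30M is just the documented default rather than a tuned value:

    "ioConfig": {
        "topic": "topic",
        "taskCount": 1,
        "replicas": 1,
        "taskDuration": "PT15M",
        "completionTimeout": "PT30M",
        "consumerProperties": {
            "bootstrap.servers": "mtllab1.demo.local:9092"
        }
    }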

I increased the timeout value and it helped, but now I am looking at my Coordinator and wondering why I have so many partitions when my segment granularity is HOUR, the task duration is 5 minutes, and completionTimeout is 10 minutes.

I thought that with a 5-minute task duration I would have 12 segments per hour, but I have 51. (Please see attached.)

"granularitySpec": {

"type": “uniform”,

"segmentGranularity": “hour”,

"queryGranularity": “NONE”,

"rollup": false

}

},

"tuningConfig": {

"type": “kafka”,

"reportParseExceptions": false,

"maxRowsInMemory": 1000000,

"maxRowsPerSegment": 5000000,

"maxTotalRows": 5000000

},

"ioConfig": {

"topic": “topic”,

"taskCount": 4,

"replicas": 1,

"taskDuration": “PT5M”,

"completionTimeout": “PT10M”,

"consumerProperties": {

"bootstrap.servers": "XXXXXXX:9092"

}

}

}

My average message rate in Kafka is about 1 million messages per minute. Is this normal?
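For what it’s worth, my own rough math, assuming (and I am not sure this is right) that every task run publishes at least one segment for each hour interval it reads:

    60 min / 5 min taskDuration            = 12 task runs per hour
    12 task runs x 4 tasks (taskCount = 4) = 48 segments per hour, minimum

That is already close to the 51 I am seeing, so maybe it is the taskCount of 4 rather than the timers that drives the segment count?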