Middlemanagers keep disappearing

Relates to Apache Druid 24.0.0

Setup

Hi. I’m currently evaluating running Druid on a Kubernetes Cluster instead of bare metal. To that end I use the druid-io Kubernetes operator to manage druid containers and druid-kubernetes-extensions instead of Zookeeper.

I currently have 1 coordinator and router, and 2 brokers, historicals and middlemanagers set up. I’m pretty sure I’m going to scale this up later – right now my objective is to load my data into this cluster and see where it needs more performance.

What I Did

I ingest data using Kafka. To stress the server and to find out where bottlenecks are, I dumped a few gigs of our data into Kafka to see it percolate through the cluster.

The Problem

At first this worked fine for about 2-3 gigabytes of data while the middleManagers were configured to use druid.worker.capacity=1. However, when I bumped that to 2, the cluster has reached a point where all ingestion tasks keep failing with a Completion Timeout Error. When investigating that, I found out that both middleManagers are just completely gone from the UI’s “services” tab.

However, the middleManager pods in Kubernetes are still running! Not only that, I can log into them and see that the druid process is running fine. The logs do not contain any error messages as far as I can see.

The Question

How do I start debugging this? I didn’t find any error messages to google, and the fact that there are no errors in the middlemanagers makes it hard to find a specific clue.

MiddleManager Configuration

    middlemanagers:
      druid.port: 8088
      kind: Deployment
      nodeType: "middleManager"
      nodeConfigMountPath: /opt/druid/conf/druid/cluster/data/middleManager
      podDisruptionBudgetSpec:
        maxUnavailable: 1
      replicas: 1
      resources:
        limits:
          cpu: "2"
          memory: 15Gi
        requests:
          cpu: "2"
          memory: 15Gi
      livenessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8088
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8088
      runtime.properties: |
        druid.service=druid/middleManager
        druid.worker.capacity=4
        druid.indexer.runner.javaOpts=-server -XX:MaxDirectMemorySize=10240g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/druid/data/tmp -Dlog4j.debug -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=50 -XX:GCLogFileSize=10m -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:+UseG1GC -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -XX:HeapDumpPath=/druid/data/logs/peon.%t.%p.hprof -Xms10G -Xmx10G
        druid.indexer.task.baseTaskDir=var/druid/task
        # HTTP server threads
        druid.server.http.numThreads=10
        druid.indexer.fork.property.druid.processing.buffer.sizeBytes=1
        druid.indexer.fork.property.druid.processing.numMergeBuffers=1
        druid.indexer.fork.property.druid.processing.numThreads=1
        # Processing threads and buffers on Peons
        druid.indexer.fork.property.druid.processing.numMergeBuffers=2
        druid.indexer.fork.property.druid.processing.buffer.sizeBytes=100000000
        druid.indexer.fork.property.druid.processing.numThreads=1
      extra.jvm.options: |-
        -Xmx4G
        -Xms4G

MiddleManager Logs

...
2022-10-31T07:13:50,561 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Running command: java -cp /tmp/conf/druid/cluster/_common:/tmp/conf/druid/cluster/data/middleManager:lib/aws-java-sdk-sts-1.12.264.jar:lib/commons-codec-1.13.jar:lib/shims-0.9.0.jar:lib/stax-ex-1.8.jar:lib/javax.activation-api-1.2.0.jar:lib/guava-16.0.1.jar:lib/guice-4.1.0.jar:lib/vavr-0.10.2.jar:lib/airline-2.8.4.jar:lib/asm-tree-9.3.jar:lib/maven-aether-provider-3.1.1.jar:lib/commons-text-1.3.jar:lib/maven-model-builder-3.1.1.jar:lib/netty-codec-dns-4.1.68.Final.jar:lib/javax.activation-1.2.0.jar:lib/druid-core-24.0.0.jar:lib/jetty-server-9.4.48.v20220622.jar:lib/istack-commons-runtime-3.0.7.jar:lib/caffeine-2.8.0.jar:lib/jcl-over-slf4j-1.7.36.jar:lib/protobuf-java-3.11.0.jar:lib/spymemcached-2.12.3.jar:lib/commons-compiler-3.0.11.jar:lib/aether-api-0.9.0.M2.jar:lib/jetty-client-9.4.48.v20220622.jar:lib/aopalliance-1.0.jar:lib/jetty-continuation-9.4.48.v20220622.jar:lib/antlr4-runtime-4.5.1.jar:lib/commons-pool2-2.2.jar:lib/plexus-utils-3.0.24.jar:lib/reactive-streams-1.0.2.jar:lib/netty-handler-4.1.68.Final.jar:lib/jersey-guice-1.19.4.jar:lib/fastutil-8.5.4.jar:lib/ipaddress-5.3.4.jar:lib/derbynet-10.14.2.0.jar:lib/netty-handler-proxy-4.1.68.Final.jar:lib/resilience4j-core-1.3.1.jar:lib/aws-java-sdk-core-1.12.264.jar:lib/httpcore-4.4.11.jar:lib/netty-resolver-4.1.68.Final.jar:lib/curator-client-4.3.0.jar:lib/rhino-1.7.11.jar:lib/RoaringBitmap-0.9.0.jar:lib/druid-console-24.0.0.jar:lib/google-api-client-1.26.0.jar:lib/validation-api-1.1.0.Final.jar:lib/calcite-linq4j-1.21.0.jar:lib/commons-io-2.11.0.jar:lib/FastInfoset-1.2.15.jar:lib/aggdesigner-algorithm-6.0.jar:lib/netty-buffer-4.1.68.Final.jar:lib/netty-transport-4.1.68.Final.jar:lib/aws-java-sdk-s3-1.12.264.jar:lib/google-http-client-jackson2-1.26.0.jar:lib/tesla-aether-0.0.5.jar:lib/jsr305-2.0.1.jar:lib/netty-common-4.1.68.Final.jar:lib/async-http-client-2.5.3.jar:lib/commons-beanutils-1.9.4.jar:lib/json-path-2.3.0.jar:lib/lz4-java-1.8.0.jar:lib/jetty-http-9.4.48.v20220622.jar:lib/avatica-metrics-1.17.0.jar:lib/druid-services-24.0.0.jar:lib/ion-java-1.0.2.jar:lib/jackson-jaxrs-base-2.10.5.jar:lib/jackson-module-guice-2.10.5.jar:lib/checker-qual-2.5.7.jar:lib/druid-hll-24.0.0.jar:lib/javax.servlet-api-3.1.0.jar:lib/druid-sql-24.0.0.jar:lib/joda-time-2.10.5.jar:lib/slf4j-api-1.7.36.jar:lib/commons-math3-3.6.1.jar:lib/netty-codec-4.1.68.Final.jar:lib/netty-codec-socks-4.1.68.Final.jar:lib/jboss-logging-3.2.1.Final.jar:lib/compress-lzf-1.0.4.jar:lib/jackson-jaxrs-smile-provider-2.10.5.jar:lib/fastutil-extra-8.5.4.jar:lib/log4j-1.2-api-2.18.0.jar:lib/jsr311-api-1.1.1.jar:lib/datasketches-memory-2.0.0.jar:lib/fastutil-core-8.5.4.jar:lib/hibernate-validator-5.2.5.Final.jar:lib/opencsv-4.6.jar:lib/error_prone_annotations-2.11.0.jar:lib/maven-settings-builder-3.1.1.jar:lib/jackson-databind-2.10.5.1.jar:lib/jackson-dataformat-cbor-2.10.5.jar:lib/guice-assistedinject-4.1.0.jar:lib/druid-indexing-service-24.0.0.jar:lib/log4j-jul-2.18.0.jar:lib/classmate-1.1.0.jar:lib/commons-dbcp2-2.0.1.jar:lib/maven-artifact-3.6.0.jar:lib/aws-java-sdk-kms-1.12.264.jar:lib/aether-spi-0.9.0.M2.jar:lib/wagon-provider-api-2.4.jar:lib/asm-analysis-9.3.jar:lib/asm-commons-9.3.jar:lib/config-magic-0.9.jar:lib/maven-repository-metadata-3.1.1.jar:lib/guice-servlet-4.1.0.jar:lib/jaxb-api-2.3.1.jar:lib/google-oauth-client-1.26.0.jar:lib/aether-util-0.9.0.M2.jar:lib/avatica-server-1.17.0.jar:lib/httpclient-4.5.13.jar:lib/jackson-jq-0.0.10.jar:lib/jersey-core-1.19.4.jar:lib/druid-processing-24.0.0.jar:lib/jackson-dataformat-smile-2.10.5.jar:lib/druid-gcp-common-24.0.0.jar:lib/j2objc-annotations-1.1.jar:lib/jetty-util-ajax-9.4.48.v20220622.jar:lib/calcite-core-1.21.0.jar:lib/commons-lang-2.6.jar:lib/log4j-api-2.18.0.jar:lib/json-smart-2.3.jar:lib/txw2-2.3.1.jar:lib/zookeeper-jute-3.5.9.jar:lib/aether-connector-okhttp-0.0.9.jar:lib/netty-transport-native-epoll-4.1.68.Final-linux-x86_64.jar:lib/jakarta.activation-api-1.2.1.jar:lib/datasketches-java-3.2.0.jar:lib/jackson-annotations-2.10.5.jar:lib/jetty-rewrite-9.4.48.v20220622.jar:lib/jetty-util-9.4.48.v20220622.jar:lib/google-http-client-1.26.0.jar:lib/extendedset-24.0.0.jar:lib/netty-3.10.6.Final.jar:lib/jackson-jaxrs-json-provider-2.10.5.jar:lib/plexus-interpolation-1.19.jar:lib/jaxb-runtime-2.3.1.jar:lib/commons-collections-3.2.2.jar:lib/resilience4j-bulkhead-1.3.1.jar:lib/maven-model-3.1.1.jar:lib/jakarta.inject-api-1.0.3.jar:lib/curator-framework-4.3.0.jar:lib/jcodings-1.0.43.jar:lib/aws-java-sdk-ec2-1.12.264.jar:lib/cron-scheduler-0.1.jar:lib/jvm-attach-api-1.5.jar:lib/derby-10.14.2.0.jar:lib/accessors-smart-1.2.jar:lib/async-http-client-netty-utils-2.5.3.jar:lib/jetty-proxy-9.4.48.v20220622.jar:lib/xz-1.8.jar:lib/icu4j-55.1.jar:lib/maven-settings-3.1.1.jar:lib/jdbi-2.63.1.jar:lib/disruptor-3.3.6.jar:lib/druid-server-24.0.0.jar:lib/vavr-match-0.10.2.jar:lib/netty-resolver-dns-4.1.68.Final.jar:lib/jetty-security-9.4.48.v20220622.jar:lib/javax.el-api-3.0.0.jar:lib/zstd-jni-1.5.2-3.jar:lib/audience-annotations-0.5.0.jar:lib/javax.el-3.0.0.jar:lib/jackson-core-2.10.5.jar:lib/metrics-core-4.0.0.jar:lib/commons-lang3-3.8.1.jar:lib/jetty-io-9.4.48.v20220622.jar:lib/curator-recipes-4.3.0.jar:lib/esri-geometry-api-2.2.0.jar:lib/javax.inject-1.jar:lib/aether-impl-0.9.0.M2.jar:lib/jackson-datatype-guava-2.10.5.jar:lib/jna-4.5.1.jar:lib/jmespath-java-1.12.264.jar:lib/commons-collections4-4.2.jar:lib/jersey-server-1.19.4.jar:lib/jackson-module-jaxb-annotations-2.10.5.jar:lib/commons-compress-1.21.jar:lib/log4j-slf4j-impl-2.18.0.jar:lib/commons-logging-1.1.1.jar:lib/avatica-core-1.17.0.jar:lib/curator-x-discovery-4.3.0.jar:lib/zookeeper-3.5.9.jar:lib/jetty-servlets-9.4.48.v20220622.jar:lib/netty-reactive-streams-2.0.0.jar:lib/guice-multibindings-4.1.0.jar:lib/jetty-servlet-9.4.48.v20220622.jar:lib/netty-codec-http-4.1.68.Final.jar:lib/druid-aws-common-24.0.0.jar:lib/jakarta.xml.bind-api-2.3.2.jar:lib/okhttp-1.0.2.jar:lib/jackson-datatype-joda-2.10.5.jar:lib/log4j-core-2.18.0.jar:lib/sigar-1.6.5.132.jar:lib/netty-transport-native-unix-common-4.1.68.Final.jar:lib/jersey-servlet-1.19.4.jar:lib/druid-indexing-hadoop-24.0.0.jar:lib/asm-9.3.jar:lib/janino-3.0.11.jar:lib/joni-2.1.27.jar:lib/derbyclient-10.14.2.0.jar:lib/airline-io-2.8.4.jar:lib/aether-connector-file-0.9.0.M2.jar: -XX:ActiveProcessorCount=4 -server -XX:MaxDirectMemorySize=10240g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/druid/data/tmp -Dlog4j.debug -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=50 -XX:GCLogFileSize=10m -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:+UseG1GC -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -XX:HeapDumpPath=/druid/data/logs/peon.%t.%p.hprof -Xms1G -Xmx1G -Ddruid.indexer.task.baseTaskDir=var/druid/task -Ddruid.host=10.244.23.36 -Ddruid.serverview.type=http -Ddruid.metadata.storage.connector.password=<masked> -Ddruid.indexer.fork.property.druid.processing.numThreads=1 -Ddruid.azure.key=YKzeQlpW76zXIgBmLTMO/iFWsplQTvf+MYff4s4UxpPGYswH8msv2fU56VNkwCtVjXmTpfipjknf+AStLuA5Ng== -Ddruid.indexer.fork.property.druid.processing.buffer.sizeBytes=32000000 -Ddruid.discovery.k8s.clusterIdentifier=td-cube-cluster -Duser.timezone=UTC -Dfile.encoding.pkg=sun.io -Ddruid.selectors.coordinator.serviceName=druid/coordinator -Ddruid.selectors.indexing.serviceName=druid/overlord -Ddruid.lookup.enableLookupSyncOnStartup=false -Ddruid.port=8088 -Ddruid.server.http.numThreads=1 -Ddruid.worker.capacity=2 -Ddruid.azure.container=druid -Ddruid.azure.account=telemetrydeckcoldstorage -Ddruid.zk.service.enabled=false -Ddruid.service=druid/middleManager -Ddruid.metadata.storage.connector.user=telemetryadmin -Ddruid.indexer.runner.type=httpRemote -Ddruid.metadata.storage.type=postgresql -Ddruid.metadata.storage.connector.connectURI=jdbc:postgresql://playground-postgres-metadata-public.postgres.database.azure.com:5432/druid -Ddruid.coordinator.loadqueuepeon.type=http -Djava.io.tmpdir=/druid/data -Ddruid.extensions.loadList=["druid-kubernetes-extensions", "druid-kafka-indexing-service", "druid-azure-extensions", "postgresql-metadata-storage", "druid-datasketches"] -Ddruid.discovery.type=k8s -Ddruid.indexer.logs.directory=/druid/data/indexing-logs -Dfile.encoding=UTF-8 -Ddruid.storage.type=azure -Ddruid.indexer.fork.property.druid.processing.numMergeBuffers=2 -Ddruid.indexer.logs.type=file -Ddruid.processing.numThreads=1 -Ddruid.processing.buffer.sizeBytes=32000000 -Ddruid.processing.numMergeBuffers=2 -Ddruid.metrics.emitter.dimension.dataSource=telemetry-signals -Ddruid.metrics.emitter.dimension.taskId=index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd -Ddruid.metrics.emitter.dimension.taskType=index_kafka -Ddruid.host=10.244.23.36 -Ddruid.plaintextPort=8100 -Ddruid.tlsPort=-1 -Ddruid.task.executor.service=druid/middleManager -Ddruid.task.executor.host=10.244.23.36 -Ddruid.task.executor.plaintextPort=8088 -Ddruid.task.executor.enablePlaintextPort=true -Ddruid.task.executor.tlsPort=-1 -Ddruid.task.executor.enableTlsPort=false -Dlog4j2.configurationFactory=org.apache.druid.indexing.common.tasklogs.ConsoleLoggingEnforcementConfigurationFactory org.apache.druid.cli.Main internal peon var/druid/task/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd/task.json var/druid/task/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd/93fb2af6-0838-4249-afa5-b27e1fd46818/status.json var/druid/task/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd/93fb2af6-0838-4249-afa5-b27e1fd46818/report.json --loadBroadcastSegments true
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.druid.java.util.common.logger.Logger
2022-10-31T07:13:50,567 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Logging task index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd output to: var/druid/task/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd/log
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.druid.java.util.common.logger.Logger
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.druid.java.util.common.logger.Logger
2022-10-31T07:13:50,637 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Task[coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z] started.
2022-10-31T07:13:50,640 INFO [forking-task-runner-1] org.apache.druid.indexing.overlord.ForkingTaskRunner - Running command: java -cp /tmp/conf/druid/cluster/_common:/tmp/conf/druid/cluster/data/middleManager:lib/aws-java-sdk-sts-1.12.264.jar:lib/commons-codec-1.13.jar:lib/shims-0.9.0.jar:lib/stax-ex-1.8.jar:lib/javax.activation-api-1.2.0.jar:lib/guava-16.0.1.jar:lib/guice-4.1.0.jar:lib/vavr-0.10.2.jar:lib/airline-2.8.4.jar:lib/asm-tree-9.3.jar:lib/maven-aether-provider-3.1.1.jar:lib/commons-text-1.3.jar:lib/maven-model-builder-3.1.1.jar:lib/netty-codec-dns-4.1.68.Final.jar:lib/javax.activation-1.2.0.jar:lib/druid-core-24.0.0.jar:lib/jetty-server-9.4.48.v20220622.jar:lib/istack-commons-runtime-3.0.7.jar:lib/caffeine-2.8.0.jar:lib/jcl-over-slf4j-1.7.36.jar:lib/protobuf-java-3.11.0.jar:lib/spymemcached-2.12.3.jar:lib/commons-compiler-3.0.11.jar:lib/aether-api-0.9.0.M2.jar:lib/jetty-client-9.4.48.v20220622.jar:lib/aopalliance-1.0.jar:lib/jetty-continuation-9.4.48.v20220622.jar:lib/antlr4-runtime-4.5.1.jar:lib/commons-pool2-2.2.jar:lib/plexus-utils-3.0.24.jar:lib/reactive-streams-1.0.2.jar:lib/netty-handler-4.1.68.Final.jar:lib/jersey-guice-1.19.4.jar:lib/fastutil-8.5.4.jar:lib/ipaddress-5.3.4.jar:lib/derbynet-10.14.2.0.jar:lib/netty-handler-proxy-4.1.68.Final.jar:lib/resilience4j-core-1.3.1.jar:lib/aws-java-sdk-core-1.12.264.jar:lib/httpcore-4.4.11.jar:lib/netty-resolver-4.1.68.Final.jar:lib/curator-client-4.3.0.jar:lib/rhino-1.7.11.jar:lib/RoaringBitmap-0.9.0.jar:lib/druid-console-24.0.0.jar:lib/google-api-client-1.26.0.jar:lib/validation-api-1.1.0.Final.jar:lib/calcite-linq4j-1.21.0.jar:lib/commons-io-2.11.0.jar:lib/FastInfoset-1.2.15.jar:lib/aggdesigner-algorithm-6.0.jar:lib/netty-buffer-4.1.68.Final.jar:lib/netty-transport-4.1.68.Final.jar:lib/aws-java-sdk-s3-1.12.264.jar:lib/google-http-client-jackson2-1.26.0.jar:lib/tesla-aether-0.0.5.jar:lib/jsr305-2.0.1.jar:lib/netty-common-4.1.68.Final.jar:lib/async-http-client-2.5.3.jar:lib/commons-beanutils-1.9.4.jar:lib/json-path-2.3.0.jar:lib/lz4-java-1.8.0.jar:lib/jetty-http-9.4.48.v20220622.jar:lib/avatica-metrics-1.17.0.jar:lib/druid-services-24.0.0.jar:lib/ion-java-1.0.2.jar:lib/jackson-jaxrs-base-2.10.5.jar:lib/jackson-module-guice-2.10.5.jar:lib/checker-qual-2.5.7.jar:lib/druid-hll-24.0.0.jar:lib/javax.servlet-api-3.1.0.jar:lib/druid-sql-24.0.0.jar:lib/joda-time-2.10.5.jar:lib/slf4j-api-1.7.36.jar:lib/commons-math3-3.6.1.jar:lib/netty-codec-4.1.68.Final.jar:lib/netty-codec-socks-4.1.68.Final.jar:lib/jboss-logging-3.2.1.Final.jar:lib/compress-lzf-1.0.4.jar:lib/jackson-jaxrs-smile-provider-2.10.5.jar:lib/fastutil-extra-8.5.4.jar:lib/log4j-1.2-api-2.18.0.jar:lib/jsr311-api-1.1.1.jar:lib/datasketches-memory-2.0.0.jar:lib/fastutil-core-8.5.4.jar:lib/hibernate-validator-5.2.5.Final.jar:lib/opencsv-4.6.jar:lib/error_prone_annotations-2.11.0.jar:lib/maven-settings-builder-3.1.1.jar:lib/jackson-databind-2.10.5.1.jar:lib/jackson-dataformat-cbor-2.10.5.jar:lib/guice-assistedinject-4.1.0.jar:lib/druid-indexing-service-24.0.0.jar:lib/log4j-jul-2.18.0.jar:lib/classmate-1.1.0.jar:lib/commons-dbcp2-2.0.1.jar:lib/maven-artifact-3.6.0.jar:lib/aws-java-sdk-kms-1.12.264.jar:lib/aether-spi-0.9.0.M2.jar:lib/wagon-provider-api-2.4.jar:lib/asm-analysis-9.3.jar:lib/asm-commons-9.3.jar:lib/config-magic-0.9.jar:lib/maven-repository-metadata-3.1.1.jar:lib/guice-servlet-4.1.0.jar:lib/jaxb-api-2.3.1.jar:lib/google-oauth-client-1.26.0.jar:lib/aether-util-0.9.0.M2.jar:lib/avatica-server-1.17.0.jar:lib/httpclient-4.5.13.jar:lib/jackson-jq-0.0.10.jar:lib/jersey-core-1.19.4.jar:lib/druid-processing-24.0.0.jar:lib/jackson-dataformat-smile-2.10.5.jar:lib/druid-gcp-common-24.0.0.jar:lib/j2objc-annotations-1.1.jar:lib/jetty-util-ajax-9.4.48.v20220622.jar:lib/calcite-core-1.21.0.jar:lib/commons-lang-2.6.jar:lib/log4j-api-2.18.0.jar:lib/json-smart-2.3.jar:lib/txw2-2.3.1.jar:lib/zookeeper-jute-3.5.9.jar:lib/aether-connector-okhttp-0.0.9.jar:lib/netty-transport-native-epoll-4.1.68.Final-linux-x86_64.jar:lib/jakarta.activation-api-1.2.1.jar:lib/datasketches-java-3.2.0.jar:lib/jackson-annotations-2.10.5.jar:lib/jetty-rewrite-9.4.48.v20220622.jar:lib/jetty-util-9.4.48.v20220622.jar:lib/google-http-client-1.26.0.jar:lib/extendedset-24.0.0.jar:lib/netty-3.10.6.Final.jar:lib/jackson-jaxrs-json-provider-2.10.5.jar:lib/plexus-interpolation-1.19.jar:lib/jaxb-runtime-2.3.1.jar:lib/commons-collections-3.2.2.jar:lib/resilience4j-bulkhead-1.3.1.jar:lib/maven-model-3.1.1.jar:lib/jakarta.inject-api-1.0.3.jar:lib/curator-framework-4.3.0.jar:lib/jcodings-1.0.43.jar:lib/aws-java-sdk-ec2-1.12.264.jar:lib/cron-scheduler-0.1.jar:lib/jvm-attach-api-1.5.jar:lib/derby-10.14.2.0.jar:lib/accessors-smart-1.2.jar:lib/async-http-client-netty-utils-2.5.3.jar:lib/jetty-proxy-9.4.48.v20220622.jar:lib/xz-1.8.jar:lib/icu4j-55.1.jar:lib/maven-settings-3.1.1.jar:lib/jdbi-2.63.1.jar:lib/disruptor-3.3.6.jar:lib/druid-server-24.0.0.jar:lib/vavr-match-0.10.2.jar:lib/netty-resolver-dns-4.1.68.Final.jar:lib/jetty-security-9.4.48.v20220622.jar:lib/javax.el-api-3.0.0.jar:lib/zstd-jni-1.5.2-3.jar:lib/audience-annotations-0.5.0.jar:lib/javax.el-3.0.0.jar:lib/jackson-core-2.10.5.jar:lib/metrics-core-4.0.0.jar:lib/commons-lang3-3.8.1.jar:lib/jetty-io-9.4.48.v20220622.jar:lib/curator-recipes-4.3.0.jar:lib/esri-geometry-api-2.2.0.jar:lib/javax.inject-1.jar:lib/aether-impl-0.9.0.M2.jar:lib/jackson-datatype-guava-2.10.5.jar:lib/jna-4.5.1.jar:lib/jmespath-java-1.12.264.jar:lib/commons-collections4-4.2.jar:lib/jersey-server-1.19.4.jar:lib/jackson-module-jaxb-annotations-2.10.5.jar:lib/commons-compress-1.21.jar:lib/log4j-slf4j-impl-2.18.0.jar:lib/commons-logging-1.1.1.jar:lib/avatica-core-1.17.0.jar:lib/curator-x-discovery-4.3.0.jar:lib/zookeeper-3.5.9.jar:lib/jetty-servlets-9.4.48.v20220622.jar:lib/netty-reactive-streams-2.0.0.jar:lib/guice-multibindings-4.1.0.jar:lib/jetty-servlet-9.4.48.v20220622.jar:lib/netty-codec-http-4.1.68.Final.jar:lib/druid-aws-common-24.0.0.jar:lib/jakarta.xml.bind-api-2.3.2.jar:lib/okhttp-1.0.2.jar:lib/jackson-datatype-joda-2.10.5.jar:lib/log4j-core-2.18.0.jar:lib/sigar-1.6.5.132.jar:lib/netty-transport-native-unix-common-4.1.68.Final.jar:lib/jersey-servlet-1.19.4.jar:lib/druid-indexing-hadoop-24.0.0.jar:lib/asm-9.3.jar:lib/janino-3.0.11.jar:lib/joni-2.1.27.jar:lib/derbyclient-10.14.2.0.jar:lib/airline-io-2.8.4.jar:lib/aether-connector-file-0.9.0.M2.jar: -XX:ActiveProcessorCount=4 -server -XX:MaxDirectMemorySize=10240g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/druid/data/tmp -Dlog4j.debug -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=50 -XX:GCLogFileSize=10m -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:+UseG1GC -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -XX:HeapDumpPath=/druid/data/logs/peon.%t.%p.hprof -Xms1G -Xmx1G -Ddruid.indexer.task.baseTaskDir=var/druid/task -Ddruid.host=10.244.23.36 -Ddruid.serverview.type=http -Ddruid.metadata.storage.connector.password=<masked> -Ddruid.indexer.fork.property.druid.processing.numThreads=1 -Ddruid.azure.key=YKzeQlpW76zXIgBmLTMO/iFWsplQTvf+MYff4s4UxpPGYswH8msv2fU56VNkwCtVjXmTpfipjknf+AStLuA5Ng== -Ddruid.indexer.fork.property.druid.processing.buffer.sizeBytes=32000000 -Ddruid.discovery.k8s.clusterIdentifier=td-cube-cluster -Duser.timezone=UTC -Dfile.encoding.pkg=sun.io -Ddruid.selectors.coordinator.serviceName=druid/coordinator -Ddruid.selectors.indexing.serviceName=druid/overlord -Ddruid.lookup.enableLookupSyncOnStartup=false -Ddruid.port=8088 -Ddruid.server.http.numThreads=1 -Ddruid.worker.capacity=2 -Ddruid.azure.container=druid -Ddruid.azure.account=telemetrydeckcoldstorage -Ddruid.zk.service.enabled=false -Ddruid.service=druid/middleManager -Ddruid.metadata.storage.connector.user=telemetryadmin -Ddruid.indexer.runner.type=httpRemote -Ddruid.metadata.storage.type=postgresql -Ddruid.metadata.storage.connector.connectURI=jdbc:postgresql://playground-postgres-metadata-public.postgres.database.azure.com:5432/druid -Ddruid.coordinator.loadqueuepeon.type=http -Djava.io.tmpdir=/druid/data -Ddruid.extensions.loadList=["druid-kubernetes-extensions", "druid-kafka-indexing-service", "druid-azure-extensions", "postgresql-metadata-storage", "druid-datasketches"] -Ddruid.discovery.type=k8s -Ddruid.indexer.logs.directory=/druid/data/indexing-logs -Dfile.encoding=UTF-8 -Ddruid.storage.type=azure -Ddruid.indexer.fork.property.druid.processing.numMergeBuffers=2 -Ddruid.indexer.logs.type=file -Ddruid.processing.numThreads=1 -Ddruid.processing.buffer.sizeBytes=32000000 -Ddruid.processing.numMergeBuffers=2 -Ddruid.metrics.emitter.dimension.dataSource=telemetry-signals -Ddruid.metrics.emitter.dimension.taskId=coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z -Ddruid.metrics.emitter.dimension.taskType=compact -Ddruid.host=10.244.23.36 -Ddruid.plaintextPort=8101 -Ddruid.tlsPort=-1 -Ddruid.task.executor.service=druid/middleManager -Ddruid.task.executor.host=10.244.23.36 -Ddruid.task.executor.plaintextPort=8088 -Ddruid.task.executor.enablePlaintextPort=true -Ddruid.task.executor.tlsPort=-1 -Ddruid.task.executor.enableTlsPort=false -Dlog4j2.configurationFactory=org.apache.druid.indexing.common.tasklogs.ConsoleLoggingEnforcementConfigurationFactory org.apache.druid.cli.Main internal peon var/druid/task/coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z/task.json var/druid/task/coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z/308003fb-ebdc-497c-968e-6f5a04fa9cdd/status.json var/druid/task/coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z/308003fb-ebdc-497c-968e-6f5a04fa9cdd/report.json
2022-10-31T07:13:50,647 INFO [forking-task-runner-1] org.apache.druid.indexing.overlord.ForkingTaskRunner - Logging task coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z output to: var/druid/task/coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z/log
2022-10-31T07:26:46,303 INFO [forking-task-runner-0] org.apache.druid.indexing.common.tasklogs.FileTaskLogs - Wrote task log to: /druid/data/indexing-logs/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd.log
2022-10-31T07:26:46,303 INFO [forking-task-runner-0] org.apache.druid.indexing.common.tasklogs.FileTaskLogs - Wrote task report to: /druid/data/indexing-logs/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd.report.json
2022-10-31T07:26:46,303 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Process exited successfully for task: index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd
2022-10-31T07:26:46,316 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Removing task directory: var/druid/task/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd
2022-10-31T07:26:46,360 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Task [index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd] completed with status [SUCCESS].
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.jboss.netty.logging.Slf4JLoggerFactory
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.jboss.netty.logging.Slf4JLoggerFactory
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.jboss.netty.logging.Slf4JLoggerFactory
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.jboss.netty.logging.Slf4JLoggerFactory
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.jboss.netty.logging.Slf4JLoggerFactory

Update: I found an error message, third-last-line:

2022-10-31T07:52:02,721 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Logging task index_kafka_telemetry-signals_3a9d1b590bc699e_jafiikld output to: var/druid/task/index_kafka_telemetry-signals_3a9d1b590bc699e_jafiikld/log
2022-10-31T07:52:08,866 INFO [org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcheroverlord] org.apache.druid.discovery.BaseNodeRoleWatcher - Node [http://10.244.23.40:8088] of role [overlord] detected.
2022-10-31T07:53:09,003 INFO [org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcheroverlord] org.apache.druid.discovery.BaseNodeRoleWatcher - Node [http://10.244.22.31:8088] of role [overlord] went offline.
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.druid.java.util.common.logger.Logger
2022-10-31T07:54:13,258 INFO [qtp619621481-18] org.apache.druid.indexing.overlord.ForkingTaskRunner - Shutdown [index_kafka_telemetry-signals_3a9d1b590bc699e_jafiikld] because: [shut down request via HTTP endpoint]
2022-10-31T07:54:13,258 INFO [qtp619621481-18] org.apache.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_kafka_telemetry-signals_3a9d1b590bc699e_jafiikld].
2022-10-31T07:55:08,345 INFO [forking-task-runner-0] org.apache.druid.indexing.common.tasklogs.FileTaskLogs - Wrote task log to: /druid/data/indexing-logs/index_kafka_telemetry-signals_3a9d1b590bc699e_jafiikld.log
2022-10-31T07:55:08,345 INFO [forking-task-runner-0] org.apache.druid.indexing.common.tasklogs.FileTaskLogs - Wrote task report to: /druid/data/indexing-logs/index_kafka_telemetry-signals_3a9d1b590bc699e_jafiikld.report.json
2022-10-31T07:55:08,346 ERROR [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Process exited with code[2] for task: index_kafka_telemetry-signals_3a9d1b590bc699e_jafiikld
2022-10-31T07:55:08,353 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Removing task directory: var/druid/task/index_kafka_telemetry-signals_3a9d1b590bc699e_jafiikld
2022-10-31T07:55:09,553 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Task [index_kafka_telemetry-signals_3a9d1b590bc699e_jafiikld] completed with status [FAILED].

Here’s the task log, slightly formatted for better readability:

{
  "ingestionStatsAndErrors": {
    "type": "ingestionStatsAndErrors",
    "taskId": "index_kafka_telemetry-signals_3a9d1b590bc699e_jafiikld",
    "payload": {
      "ingestionState": "BUILD_SEGMENTS",
      "unparseableEvents": {},
      "rowStats": {
        "buildSegments": {
          "processed": 12431,
          "processedWithError": 0,
          "thrownAway": 0,
          "unparseable": 0
        }
      },
      "errorMsg": "org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:520)
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:281)
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
        at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1306)
        at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1242)
        at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1215)
        at org.apache.druid.indexing.kafka.KafkaRecordSupplier.poll(KafkaRecordSupplier.java:129)
        at org.apache.druid.indexing.kafka.IncrementalPublishingKafkaIndexTaskRunner.getRecords(IncrementalPublishingKafkaIndexTaskRunner.java:95)
        at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:614)
        at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:266)
        at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:151)
        at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:477)
        at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:449)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
        Suppressed: java.lang.InterruptedException
                at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
                at java.util.concurrent.FutureTask.get(FutureTask.java:191)
                at org.apache.druid.segment.realtime.appenderator.StreamAppenderatorDriver.persist(StreamAppenderatorDriver.java:233)
                at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:772)
                ... 8 more
Caused by: java.lang.InterruptedException
        ... 17 more
",
      "segmentAvailabilityConfirmed": false,
      "segmentAvailabilityWaitTimeMs": 0
    }
  }
}

I think you’ll need one more cpu per MM in the pod resource limits. The worker capacity that is recommended to avoid this saturation is cpus - 1 and with your current setup you have 2 cpus and 2 workers.

Also noticed that you have 10GB of heap in javaOptsArray, with 10GB each 2 workers = 20GB + memory used by the MM JVM which together are way over the 15GB you are assigning the pod. Set the resource limit on memory high enough to accommodate MM heap + worker heap * worker capacity.

Hi Sergio, thanks for your answer!

I tried the following to test your theory that this is a memory or CPU issue:

  • I increased the CPU capacity per middleManager to 3
  • I set druid.worker.capacity to 1
  • I changed druid.indexer.runner.javaOpts so that it requests -Xms4G -Xmx4G instead of 10 gigs

With 15 gigs of RAM and 3 CPUs for just one worker task, this should easily run, but it doesn’t. The issue is exactly the same: the whole middleManager disappears from the Services list and never comes back.

I think I’m going to try switching from druid-kubernetes-extensions to Zookeeper next and see if that will change anything. I’ll report back.

Update: I’ve switched the cluster over to using Zookeeper and it has just successfully finished a task without the MiddleManager disappearing. So it seems to be an issue with druid-kubernetes-extension.

I know there’s been some recent work on the kubernetes extension.

You could try that fix to see if that resolves your issue or wait until it is integrated into an upcoming release.

1 Like

Good to hear, these people are doing an excellent job! I’ll keep a watchful eye on them and try again in upcoming releases!

1 Like

We seem to have the same problem, but we have issues with either option.
Using Kubernetes leader election causes this issue, and using ZooKeeper leader election breaks because of this issue:

Does anyone have other ideas on how to work around these issues?

There are no error logs in our case, not on the Coordinator, the Overlord, the Middle Manager, or the tasks.

The tasks even complete successfully when they get assigned to a middle manager. Otherwise, they timeout in the pending state while waiting to be assigned.

I opened an issue about it. Hope someone can shed some light on this.

For me, the issue seems to also have been related to memory issues during compaction. Giving the middlemanagers container a lot more memory fixed the issue.