Impact on segment size and count when using the Kafka indexing service (with different taskCount values) to read from the same Kafka topic

Hi,

I have a Kafka topic with 20 partitions, and data has been published to it for the past 10 days. I created 3 Druid supervisors to read from the same Kafka topic (each one creates a new datasource) with the configurations below, i.e. a different taskCount per supervisor:

Druid supervisor 1 - useEarliestOffset = true, ioConfig.replicas = 2, ioConfig.taskCount = 1

Druid supervisor 2 - useEarliestOffset = true, ioConfig.replicas = 2, ioConfig.taskCount = 3

Druid supervisor 3 - useEarliestOffset = true, ioConfig.replicas = 2, ioConfig.taskCount = 5

Below is my supervisor template:

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "datasource_name",
    "parser": {
      "type": "avro_stream",
      "avroBytesDecoder": {
        "type": "schema_registry",
        "url": "SCHEMA-REGISTRY-ADDRESS"
      },
      "parseSpec": {
        "format": "avro",
        "timestampSpec": {
          "column": "date_time_utc",
          "format": "yyyy-MM-dd HH:mm:ss"
        },
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": [
            {
              "name": "suite_id",
              "type": "path",
              "expr": "$.suite.id"
            },
            {
              "name": "suite_name",
              "type": "path",
              "expr": "$.suite.name"
            }
          ]
        },
        "dimensionsSpec": {
          "dimensions": [
            "suite_name",
            "suite_source"
          ],
          "dimensionExclusions": [],
          "spatialDimensions": []
        }
      }
    },
    "metricsSpec": [
      {
        "type": "hyperUnique",
        "name": "unique_suite_ids",
        "fieldName": "suite_id"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "HOUR",
      "rollup": true
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxSavedParseExceptions": 100,
    "resetOffsetAutomatically": true
  },
  "ioConfig": {
    "topic": "kafka_topic",
    "useEarliestOffset": true,
    "replicas": "2",
    "taskCount": "{varying-values}",
    "taskDuration": "PT15M",
    "consumerProperties": {
      "bootstrap.servers": "KAFKA-BOOTSTRAP-SERVER-ADDRESS"
    }
  }
}

```

**I am seeing a different segment count and size for each datasource created, although all supervisors are consuming from the same Kafka topic. Is this expected?**

**How are the segment size and count impacted by the `ioConfig.taskCount` value?**

Please help me out here.

Regards,

Vinay Patil

Hi Vinay,

As far as I’ve observed, each task creates its own segments. So you would see at least 1, 3, and 5 segments per hour in each datasource respectively (or more if there is a large amount of data, for example).
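
For illustration, you can compare the published segments per datasource with `GET /druid/coordinator/v1/datasources/<datasource>/segments` on the Coordinator. With taskCount = 3, a single hour would typically look roughly like this (the datasource name, timestamps, and identifiers below are made up), with one partition number per task:

```json
[
  "datasource_2_2024-05-01T06:00:00.000Z_2024-05-01T07:00:00.000Z_2024-05-01T06:00:05.123Z",
  "datasource_2_2024-05-01T06:00:00.000Z_2024-05-01T07:00:00.000Z_2024-05-01T06:00:05.123Z_1",
  "datasource_2_2024-05-01T06:00:00.000Z_2024-05-01T07:00:00.000Z_2024-05-01T06:00:05.123Z_2"
]
```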

If you set up auto-compaction (which you should), these segments will get merged into one later on (by default only segments older than one day get compacted), which will speed up queries on your datasources. So it's not really a problem.
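
For example, you can enable auto-compaction per datasource by POSTing a spec like the one below to the Coordinator at `/druid/coordinator/v1/config/compaction`. This is just a minimal sketch (the datasource name and row threshold are placeholders, and some field names vary a bit across Druid versions), so check the compaction docs for your release:

```json
{
  "dataSource": "datasource_2",
  "skipOffsetFromLatest": "P1D",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000
    }
  }
}
```

The skipOffsetFromLatest value keeps compaction away from recent intervals that streaming tasks may still be writing to, which is where the "older than one day" behaviour mentioned above comes from.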

Regards,

Michael