Understanding the Kafka indexing service in Druid

Hi all,

I am new to Kafka and to Druid. I am exploring Druid’s realtime ingestion capabilities and ran a couple of tests on my single-node Amazon EC2 M5d12x machine.

A few questions came up during these tests, and I would be really glad if any of you could share your thoughts on them:

1. How do I determine the number of segments and tasks created by the Kafka indexing service?

I have a Kafka broker where my data rate is ~10k events/minute, and I have submitted a Kafka indexing service spec that looks like this:

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "DS_name",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "utcTimestamp",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [
            "column1",
            "column2",
            "column3",
            "column4"
          ]
        }
      }
    },
    "metricsSpec": [],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "MINUTE",
      "queryGranularity": "NONE",
      "rollup": true
    }
  },
  "ioConfig": {
    "topic": "kafka_topic",
    "consumerProperties": {
      "bootstrap.servers": "kafka_broker_host:9092"
    }
  }
}

In the indexing logs I see that multiple task logs were created. When I checked one of them, it looks like my data has 24 partitions (0-23), and as you can see from the ingestion spec above, my segment granularity is “MINUTE”.

But in the Overlord I see more than 24 segments (see the screenshot below).

So my questions are: how is the number of segments calculated, and why am I seeing multiple index task logs for the single Kafka ingestion spec I submitted?

2. What is “#” in the above screenshot? Also, why is the size 0B here?

My Kafka broker has data for the 7th hour of 2019-05-23. I assume 0B means the segments are not yet published for querying. Is my understanding right?

**3. Are there any suggestions on how to make realtime ingestion faster in Druid?**

As I said above, I submitted my Kafka indexing spec around the 7th hour of 2019-05-23. It took a long time for the 7th-hour segments to get published in Druid, whereas my Kafka broker already had data stacked up through the 11th hour!

4. In one of the Kafka index task logs, I observed that the segments are not created/persisted in time order: the segment for the 07:57 to 07:58 interval was created first, whereas the segment for the 07:32 to 07:33 interval was created later.

Is there a reason for this?

**5. Where/how can I find the exact time taken to create a queryable segment at, let’s say, minute-level granularity?**

The entire job is split into different tasks, and each of these tasks creates segments in seemingly random order (according to the logs).

Is there a standard way to measure the total time taken to ingest the events arriving in each minute in this case?

Thanks,

Anoosha

Anoosha,

Can you attach an indexing log from one of the above tasks? It would also be helpful to see your common.runtime.properties file.

As far as I can tell, you are not ingesting any data at all.

Best,

Kiefer

To answer a few of your questions, though, Druid will continuously run an ingestion job for the configured time (“taskDuration”) and publish a segment to deep storage each minute (in this case, as your “segmentGranularity” is set to “MINUTE”).
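To put rough numbers on the reply above, here is a minimal sketch of the arithmetic, not a description of Druid's internals. It assumes that each reading task writes at least one segment per "segmentGranularity" interval it sees events for, and that the supervisor starts a fresh set of tasks every "taskDuration". The spec in the question sets neither "taskCount", "replicas", nor "taskDuration", so the values used below (one task, one replica, one hour) are assumptions for illustration; check the supervisor documentation for the defaults in your version. Note that neither estimate depends on the number of Kafka partitions, which may be why the segment count does not match the 24 partitions.

```python
import math
from datetime import timedelta


def estimate_tasks(wall_clock: timedelta,
                   task_duration: timedelta,
                   task_count: int = 1,
                   replicas: int = 1) -> int:
    """Rough number of indexing tasks started over a wall-clock period.

    Assumes a new set of tasks is started every `task_duration`.
    """
    generations = math.ceil(wall_clock / task_duration)
    return generations * task_count * replicas


def estimate_min_segments(event_time_span: timedelta,
                          segment_granularity: timedelta,
                          task_count: int = 1) -> int:
    """Lower bound on segments for a span of event time.

    Assumes every reading task produces at least one segment for each
    granularity interval; late data or row limits can add more.
    """
    intervals = math.ceil(event_time_span / segment_granularity)
    return intervals * task_count


if __name__ == "__main__":
    # Four hours of wall-clock time with an assumed one-hour taskDuration.
    print(estimate_tasks(timedelta(hours=4), timedelta(hours=1)))
    # One hour of event time at MINUTE segment granularity.
    print(estimate_min_segments(timedelta(hours=1), timedelta(minutes=1)))
```

Under these assumptions, one hour of events at MINUTE granularity already yields at least 60 segments, and several hours of running yields several task logs for the one submitted spec.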

Hi Kiefer,

I am able to ingest the data. The 0B we see under size lasts only for a few minutes; after some time, the size increases.

I wonder why that is, and whether I can speed up that process.

Thanks,

Anoosha

Hi Kiefer,

Do you mean that with minute-level granularity my data is partitioned by minute and then published to deep storage? How do I find out what is taking so long to publish in my case? I am expecting the data to be available for querying in real time.

Thanks,

Anoosha

Are you running Druid locally? It could be completely dependent on your Internet speed.

How long does it usually take for your data to become available?
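One way to put a number on "how long" is to compare the newest event timestamp that Druid can already return with the current time. The sketch below assumes Druid SQL is enabled and that a Broker is reachable at the address in `BROKER_SQL_URL`; that address, the helper names, and the timestamp format expected from the response are assumptions for illustration, not something taken from this thread. It measures query-time freshness only, and says nothing about when a segment is published to deep storage.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Assumed Broker address; adjust to your deployment.
BROKER_SQL_URL = "http://localhost:8082/druid/v2/sql"


def parse_druid_time(value: str) -> datetime:
    """Parse a timestamp such as '2019-05-23T07:58:00.000Z' as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def lag_seconds(latest_event: datetime, now: datetime) -> float:
    """Seconds between the newest queryable event and `now`."""
    return (now - latest_event).total_seconds()


def latest_queryable_time(datasource: str,
                          url: str = BROKER_SQL_URL) -> datetime:
    """Ask the Broker for MAX(__time) of a datasource (network call)."""
    query = f'SELECT MAX(__time) AS latest FROM "{datasource}"'
    request = urllib.request.Request(
        url,
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        rows = json.loads(response.read())
    return parse_druid_time(rows[0]["latest"])


# Example usage against a running cluster:
#   latest = latest_queryable_time("DS_name")
#   print(lag_seconds(latest, datetime.now(timezone.utc)))
```

Running this in a loop while events are flowing gives a rough picture of how far queryable data trails the incoming stream.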

Hi Kiefer,

I am running Druid on an Amazon EC2 "M5d12x" machine.

This is the first time I am doing this, so I don’t have any prior data points on the load.

Thanks,

Anoosha

Have you had a look at your Overlord logs or task ingest logs to see if there is anything strange there?

Those may be of some help.

Kiefer
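For anyone following the suggestion above without the web console, here is a minimal sketch that pulls the supervisor status and a task log over the Overlord HTTP API. The Overlord address, the helper names, and the example task ID are assumptions for illustration; for a Kafka supervisor the supervisor ID is normally the dataSource name ("DS_name" in the spec from this thread).

```python
import urllib.request
from urllib.parse import quote

# Assumed Overlord address; adjust to your deployment.
OVERLORD_URL = "http://localhost:8090"


def supervisor_status_url(supervisor_id: str,
                          overlord: str = OVERLORD_URL) -> str:
    """URL of the status report for one supervisor."""
    encoded = quote(supervisor_id, safe="")
    return f"{overlord}/druid/indexer/v1/supervisor/{encoded}/status"


def task_log_url(task_id: str, overlord: str = OVERLORD_URL) -> str:
    """URL of the log for one indexing task."""
    encoded = quote(task_id, safe="")
    return f"{overlord}/druid/indexer/v1/task/{encoded}/log"


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return the body as text (network call)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example usage against a running cluster:
#   print(fetch_text(supervisor_status_url("DS_name")))
#   print(fetch_text(task_log_url("<task id copied from the console>")))
```

The status report is a reasonable place to look for the active tasks and their offsets, and the task log shows when each segment was persisted and handed off.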