Hi all,
I am new to Kafka and to Druid. I am exploring Druid's realtime ingestion capabilities and ran a couple of tests on a single-node Amazon EC2 m5d.12xlarge instance.
A few questions came up during these tests, and I would be really glad if any of you could share your thoughts:
**1. How is the number of segments and tasks created by the Kafka indexing service determined?**
I have a Kafka broker where my data rate is ~10k events/minute, and I submitted a Kafka indexing service supervisor spec that looks like this:
```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "DS_name",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "utcTimestamp",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [
            "column1",
            "column2",
            "column3",
            "column4"
          ]
        }
      }
    },
    "metricsSpec": [],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "MINUTE",
      "queryGranularity": "NONE",
      "rollup": true
    }
  },
  "ioConfig": {
    "topic": "kafka_topic",
    "consumerProperties": {
      "bootstrap.servers": "kafka_broker_host:9092"
    }
  }
}
```
Under the indexing logs, I see multiple task logs created. When I checked one of them, it looked like there are 24 partitions (0-23) in my data, and from the ingestion spec above you can see my segment granularity is at the "minute" level.
But in the Overlord console I see more than 24 segments (see the screenshot below).
So my doubts are: how is the number of segments calculated, and why am I seeing multiple indexing task logs for the single Kafka ingestion spec I submitted?
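From the Kafka indexing service docs, my understanding is that the number of concurrently running tasks is controlled by the taskCount and replicas fields in the ioConfig, and that a fresh set of tasks is started every taskDuration. I did not set any of these, so I assume the defaults apply. Is this the right place to control it? A sketch of what I think I would change (values are just placeholders):

```json
"ioConfig": {
  "topic": "kafka_topic",
  "consumerProperties": {
    "bootstrap.servers": "kafka_broker_host:9092"
  },
  "taskCount": 1,
  "replicas": 1,
  "taskDuration": "PT1H"
}
```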
**2. What is "#" in the above screenshot? Also, why is the size shown as 0B?**
My Kafka broker has data for the 7th hour of day 2019-05-23. I am thinking this 0B means the segments are not yet published for querying. Is my understanding right?
**3. Are there any suggestions on how to make realtime ingestion faster in Druid?**
As I said above, I submitted my Kafka indexing task around the 7th hour of 2019-05-23, and it took a long time for the 7th-hour segments to get published in Druid, whereas data had already stacked up in my Kafka broker till the 11th hour!
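The only tuning knobs I have found so far are in the tuningConfig, such as maxRowsInMemory and intermediatePersistPeriod, which as I understand control how often in-memory rows get persisted to disk. I have not set any of them yet; this is just a sketch of what I am considering trying, with placeholder values:

```json
"tuningConfig": {
  "type": "kafka",
  "maxRowsInMemory": 100000,
  "intermediatePersistPeriod": "PT10M",
  "maxPendingPersists": 0
}
```

Would tuning these (or something else entirely) help the segments get published sooner?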
**4. In one of the Kafka indexing task logs, I observed that segments are not created/persisted in chronological order.**
The segment for the 7th hour 57th minute - 58th minute interval was created first, whereas the segment for the 7th hour 32nd minute - 33rd minute interval was created later. Is there a reason for this?
**5. Where/how can I find the exact time taken to create a queryable segment, say at minute-level granularity?**
The entire job is split into different tasks, and according to the logs each of these tasks creates a seemingly random set of segments.
Is there a standard way I can measure the total time taken to ingest the events coming in each minute, in this case?
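The closest thing I have found is polling the sys.segments system table through the SQL API to watch when a given interval's segments flip from realtime to published/available. Something like the following, POSTed to http://broker_host:8082/druid/v2/sql (broker_host is a placeholder, and I am assuming I have the column names right):

```json
{
  "query": "SELECT segment_id, \"start\", \"end\", size, is_realtime, is_published, is_available FROM sys.segments WHERE datasource = 'DS_name' ORDER BY \"start\" DESC"
}
```

But this only tells me the state at poll time, not the exact handoff timestamp, so any pointers to a proper metric for this would help.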
Thanks,
Anoosha