[druid-user] segments

My ingestion spec includes the following granularitySpec and transformSpec.

"granularitySpec": {
  "segmentGranularity": "day",
  "queryGranularity": "fifteen_minute",
  "rollup": true
},
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "__time",
      "expression": "timestamp_ceil(__time, 'PT15M') + __time - timestamp_floor(__time, 'PT15M')"
    }
  ]
}
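For clarity, here is a minimal Python sketch (not the Druid expression engine itself, just the same arithmetic) of what that transform does to `__time`, treated as epoch milliseconds: for a timestamp strictly inside a 15-minute bucket it adds `ceil - floor` = 15 minutes, and on an exact bucket boundary it is a no-op.

```python
# Sketch of the transform expression's arithmetic:
#   timestamp_ceil(__time, 'PT15M') + __time - timestamp_floor(__time, 'PT15M')
# __time is modeled as epoch milliseconds, as in Druid.

PT15M = 15 * 60 * 1000  # 15 minutes in milliseconds

def transform(t_ms: int) -> int:
    floor = (t_ms // PT15M) * PT15M
    ceil = floor if t_ms == floor else floor + PT15M
    return ceil + t_ms - floor

# A timestamp 7 minutes past the quarter hour is shifted 15 minutes forward:
t = 7 * 60 * 1000
print(transform(t) - t)  # 900000 ms = 15 minutes

# A timestamp exactly on a 15-minute boundary is unchanged:
print(transform(0))  # 0
```

Note that for almost all rows this shifts `__time` forward by a full 15 minutes rather than snapping it to a bucket edge, which may or may not be the intent given the `fifteen_minute` queryGranularity.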

Although I have many days of data in my Kafka topic, I noticed that when I run this
on a single nano-quickstart server or a single small server, it only creates two segments.

I am expecting one segment per day.

Am I missing something?

I’ll try to reproduce this. For my own clarification: are the Kafka tasks running but not publishing segments?

Also, it may be worth checking the settings you have for useEarliestOffset and the *messageRejection* periods in your supervisor spec.
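For reference, a sketch of where those settings live in a Kafka supervisor spec; in current Druid versions the message-rejection periods sit in the ioConfig alongside useEarliestOffset (the topic name and period values below are illustrative, not recommendations):

```json
"ioConfig": {
  "topic": "your-topic",
  "useEarliestOffset": true,
  "lateMessageRejectionPeriod": "P1D",
  "earlyMessageRejectionPeriod": "PT1H"
}
```

If either rejection period is set, rows whose timestamps fall outside the allowed window are dropped rather than ingested, which can silently shrink the number of segments produced.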

Hi Mark, yes, that was the behavior that I observed: the segments were not published. I was running this on both the nano-quickstart and single-server small configurations, without deep storage.

Is there a limit to how much data it can ingest? Most of my settings are defaults. I only changed the ports around and enabled the Kafka indexing service and Avro extensions.

Hi Peter,

Thank you for replying. I did not set any of the messageRejection configurations.

There shouldn't be a limit on how much data can be stored, up to your available disk space. If records aren't being read, check the task logs, MiddleManager logs, and Coordinator logs. If they are being read but you only see two segments, I'd also check the timestamp config in the spec. You said you have many days of data, but I'd verify that you're pulling the right field for the timestamp.

Well, this only happens when "useEarliestOffset": true. If I use "useEarliestOffset": false, then I can see many different segments load.

This Kafka topic constantly receives time-series data.

Is the data in the topic published in time order?

It seems strange that useEarliestOffset = false would produce many segments; I would expect the opposite. false is supposed to mean that you start reading the topic from the end, at the most recent messages, and continue only with new messages as they arrive. useEarliestOffset = true reads from the beginning of the topic, starting at the earliest message still available, all the way up to now.

One other item that might come into play: if you previously ran a streaming ingestion spec that feeds a given datasource, and then submit another spec (or an updated one) that targets the same datasource/topic, ingestion will resume from the last offsets it committed, disregarding the useEarliestOffset setting. So this might be interfering with your expected results.

In order to reset this behavior you will need to do a Hard Reset on the supervisor, either through the web console or using the API:

  • POST /druid/indexer/v1/supervisor/{supervisorId}/reset
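A sketch of that hard-reset call from the command line; the hostname and port assume the Router on localhost:8888 (the quickstart default), and <supervisorId> is a placeholder for your supervisor's ID:

```bash
# Hard reset clears the stored Kafka offsets for this supervisor,
# so useEarliestOffset takes effect again on the next task cycle.
# Assumes the Druid Router is listening on localhost:8888.
curl -X POST "http://localhost:8888/druid/indexer/v1/supervisor/<supervisorId>/reset"
```

Be aware that a hard reset can cause data to be skipped or re-read, depending on the offset direction, so it is best done deliberately rather than as routine maintenance.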