dataSource Autocompaction 'Awaiting first run'

I’m running Druid 0.22.1 and trying to use auto-compaction to update a datasource’s segmentGranularity from ‘HOUR’ to ‘WEEK’. I’m testing with a small datasource (15 MB / ~400,000 rows) which has many segments (5,000+).

  • I submit a compaction config like:
{
  "dataSource": "my_datasource",
  "taskPriority": 25,
  "inputSegmentSizeBytes": 419430400,
  "skipOffsetFromLatest": "P1M",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000
    },
    "type": "index_parallel",
    "forceGuaranteedRollup": false
  },
  "granularitySpec": {
    "segmentGranularity": "WEEK"
  }
}
  • Both the web console and a curl GET against the API (e.g. /druid/coordinator/v1/config/compaction) show that my compaction config has been set/accepted (see the sketch after this list).
  • This is the only datasource with Auto compaction enabled.
  • There are segments going back to 2021-05, so there are plenty of candidate segments to compact based on skipOffsetFromLatest.
  • There are 3 worker slots available, and the relevant compaction task-slot settings are "compactionTaskSlotRatio": 0.7 and "maxCompactionTaskSlots": 2147483647.
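
For completeness, here is roughly how I’m submitting and verifying the config. This is just a sketch against my setup: the coordinator is assumed to be at localhost:8081 and the JSON above is assumed to be saved as compaction-config.json.

# Submit the per-datasource auto-compaction config (the JSON shown above):
curl -X POST -H 'Content-Type: application/json' \
  -d @compaction-config.json \
  http://localhost:8081/druid/coordinator/v1/config/compaction

# Read it back; this response also includes the global compactionTaskSlotRatio
# and maxCompactionTaskSlots values quoted above:
curl http://localhost:8081/druid/coordinator/v1/config/compaction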

I’m expecting autocompaction to start compacting based on this config.

Instead, I find that in the unified console the datasource’s Compaction column shows “Awaiting first run” and has been stuck like this for 24+ hours. Reviewing coordinator logs, I don’t see any indication that compaction is even being attempted, and there are no errors. If I grep “compact” in coordinator logs, nothing is returned except for a single line (see Logs below). If I grep for “my_datasource” I see only indexing tasks.
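
In case it is useful, these are the checks I’m running. The status endpoint is my assumption about what feeds the console’s Compaction column (it has existed since 0.21 as far as I know), and the log path is a placeholder for wherever your coordinator writes its logs.

# Ask the coordinator for its auto-compaction status snapshot for the datasource:
curl "http://localhost:8081/druid/coordinator/v1/compaction/status?dataSource=my_datasource"

# Grep the coordinator log for any sign of the compaction duty or the datasource:
grep -i "compact" /path/to/coordinator.log
grep "my_datasource" /path/to/coordinator.log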

Note - There is a log line indicating that a coordinator scheduled compaction at startup (see Logs below). This line appears on the coordinator which is not presently the leader; the current coordinator leader does not have a similar log line. Could this be related? I can’t see any way to ‘turn on’ auto-compaction aside from submitting the compaction config as above. I also wonder whether I should be seeing this log line every coordinator period.
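
For the leader question, this is how I’m checking leadership. My understanding (which may be off) is that only the leader runs coordinator duties such as CompactSegments, and that the compaction duty runs once per druid.coordinator.period.indexingPeriod (default PT1800S, i.e. every 30 minutes).

# Ask any coordinator which node is currently the leader:
curl http://localhost:8081/druid/coordinator/v1/leader

# Or ask a specific coordinator whether it believes it is the leader:
curl http://localhost:8081/druid/coordinator/v1/isLeader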

Are there other logs I should be looking at? I have mostly focused on Coordinator and overlord logs.

Thanks in advance!


Things I've tried
  • I’ve tried tweaking the compaction config with different settings, e.g. setting skipOffsetFromLatest to ‘P1W’ or segmentGranularity to ‘DAY’
  • Updating compactionTaskSlotRatio higher
  • Updating taskPriority higher
References
  • https://support.imply.io/hc/en-us/articles/360046597493-How-to-determine-the-number-of-slots-required-for-auto-compaction-
  • Druid docs
Logs
I can't find anything that seems relevant. The only log line which matches a grep of "compact" is this:
[main] org.apache.druid.server.coordinator.duty.CompactSegments - Scheduling compaction with skipLockedIntervals [true]

Relates to Apache Druid 0.22.1

I’m wondering if you’re running into this note from the segment search policy docs: “This policy currently cannot handle the situation when there are a lot of small segments which have the same interval, and their total size exceeds inputSegmentSizeBytes. If it finds such segments, it simply skips them.” I found that when I looked at the Compacting Segments and segment search policy docs.

I’ll review those docs again. However, since the datasource in total is < 20 MB and inputSegmentSizeBytes is approximately 420 MB, I don’t think this is the case. Each segment is < 4 KB, so going from hourly to weekly segments should give roughly 4 KB/hour * 24 hours * 7 days = 672 KB of input per weekly interval, much smaller than ~420 MB.
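
As a sanity check on those numbers, this is roughly the query I’m using against the SQL API. The router address (localhost:8888) is a placeholder for my setup; sys.segments reports sizes in bytes.

cat > segment-size-query.json <<'EOF'
{
  "query": "SELECT datasource, COUNT(*) AS num_segments, SUM(\"size\") AS total_bytes FROM sys.segments WHERE datasource = 'my_datasource' AND is_published = 1 GROUP BY 1"
}
EOF

# Total published segment count and bytes for the datasource:
curl -X POST -H 'Content-Type: application/json' \
  -d @segment-size-query.json \
  http://localhost:8888/druid/v2/sql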

This might be an imperfect solution, but could you try forcing a compaction task from the Druid console?
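
For anyone following along, a sketch of what that manual compaction task might look like when submitted over the API instead of the console. The interval and addresses are placeholders, so double-check against the compaction task docs before using it.

cat > manual-compact.json <<'EOF'
{
  "type": "compact",
  "dataSource": "my_datasource",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2021-05-01/2021-06-01"
    }
  },
  "granularitySpec": {
    "segmentGranularity": "WEEK"
  }
}
EOF

# Submit the task to the Overlord (here proxied through the router on :8888):
curl -X POST -H 'Content-Type: application/json' \
  -d @manual-compact.json \
  http://localhost:8888/druid/indexer/v1/task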

That will be our plan B. We upgraded to 0.22+ specifically to take advantage of updating segmentGranularity via auto-compaction, though. I will see if I can reproduce my scenario in a sandbox and try to figure out whether it’s a bug or a config issue.

I’ve just blocked off some time today to do the same. I’ll try to reproduce what you’re reporting and see what I can see.

In the quickstart docker environment, I was able to get auto-compaction to run against a sample of data from this datasource, using the same compaction spec. It must be something unique to our cluster or cluster state. I’ll update here with what I find.

Sounds good. I’ll still try to reproduce what you reported. A colleague and I were just discussing it, and he might do the same. Looking forward to continuing the exploration and discussion.

I took this a bit further this morning, modifying the quickstart docker-compose to include two coordinator/overlords and testing fail-over. It all worked as expected (compaction started & succeeded).
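
Roughly the fail-over exercise I ran, for reference. Container names and host ports match the two-coordinator compose I share further down, and the 60-second sleep is just an arbitrary wait for leader re-election.

# See which coordinator is currently leading:
curl http://localhost:8081/druid/coordinator/v1/leader

# Stop the leading coordinator and give the other one time to take over:
docker stop coordinator
sleep 60
curl http://localhost:8881/druid/coordinator/v1/leader

# Then watch the surviving coordinator's logs for the compaction duty:
docker logs coordinator_b 2>&1 | grep -i "CompactSegments"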

But the ‘smoking gun’ to me is that on the bad cluster’s lead coordinator, there are no log lines matching a grep for NewestSegmentFirstIterator.

And reviewing more closely, I noticed in the good cluster, both coordinators logged :
INFO [main] org.apache.druid.server.coordinator.duty.CompactSegments - Scheduling compaction with skipLockedIntervals [true]
during startup (within a few seconds of each other). In my bad cluster, the current leader never issued this log line during startup or failover, so I guess coordinator.duty.CompactSegments has never been running on this coordinator, even from before the compaction specs were issued, hence the stuck ‘Awaiting first run’ status.

Are there any configs / conditions which would cause a coordinator to start up without this duty being scheduled?
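
In case anyone wants to compare setups, these are the checks I’m planning to run on each coordinator. The runtime.properties path is a placeholder for your deployment layout, and druid.coordinator.dutyGroups is the custom-duty-group feature that I believe arrived in 0.22.

# Look for any duty- or period-related overrides on each coordinator:
grep -E "druid\.coordinator\.(period|dutyGroups)" \
  /opt/druid/conf/druid/cluster/master/coordinator-overlord/runtime.properties

# Confirm which coordinator considers itself the leader:
curl http://coordinator-1:8081/druid/coordinator/v1/isLeader
curl http://coordinator-2:8081/druid/coordinator/v1/isLeader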

I think we will just try restarting coordinators and checking they both emit this log-line.

Blast! I also just noticed I have been testing on 0.22.0 while the bad cluster is on 0.22.1. I’ll try to quickly repeat my findings.

Hi!

Can you send me your docker-compose after these changes? I’ve been testing 0.22 via docker-compose and have had the same issue with “awaiting first run” for a few weeks now.

@Renato_Cron Here was my docker-compose with two coordinators. As mentioned, I couldn’t seem to break it locally the same way I’m experiencing on our deployed cluster. Compaction proceeded each time despite failovers and other hiccups I tried to throw at it.

Yesterday we restarted most of our cluster nodes including coordinators, and still compaction won’t begin.

@Mark_Herrera any other thoughts or tests I might run?

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#
version: "2.2"

volumes:
  metadata_data: {}
  middle_var: {}
  historical_var: {}
  broker_var: {}
  coordinator_var: {}
  router_var: {}
  druid_shared: {}


services:
  postgres:
    container_name: postgres
    image: postgres:latest
    volumes:
      - metadata_data:/var/lib/postgresql/data
    environment:
      - POSTGRES_PASSWORD=FoolishPassword
      - POSTGRES_USER=druid
      - POSTGRES_DB=druid

  # Need 3.5 or later for container nodes
  zookeeper:
    container_name: zookeeper
    image: zookeeper:3.5
    ports:
      - "2181:2181"
    environment:
      - ZOO_MY_ID=1

  coordinator:
    image: apache/druid:0.22.1
    container_name: coordinator
    volumes:
      - druid_shared:/opt/shared
      - coordinator_var:/opt/druid/var
      - ./data/:/opt/ingest
    depends_on: 
      - zookeeper
      - postgres
    ports:
      - "8081:8081"
    command:
      - coordinator
    env_file:
      - environment

  coordinator_b:
    image: apache/druid:0.22.1
    container_name: coordinator_b
    volumes:
      - druid_shared:/opt/shared
      - coordinator_var:/opt/druid/var
      - ./data/:/opt/ingest
    depends_on: 
      - zookeeper
      - postgres
    ports:
      - "8881:8081"
    command:
      - coordinator
    env_file:
      - environment

  broker:
    image: apache/druid:0.22.1
    container_name: broker
    volumes:
      - broker_var:/opt/druid/var
    depends_on: 
      - zookeeper
      - postgres
      - coordinator
    ports:
      - "8082:8082"
    command:
      - broker
    env_file:
      - environment

  historical:
    image: apache/druid:0.22.1
    container_name: historical
    volumes:
      - druid_shared:/opt/shared
      - historical_var:/opt/druid/var
    depends_on: 
      - zookeeper
      - postgres
      - coordinator
    ports:
      - "8083:8083"
    command:
      - historical
    env_file:
      - environment

  middlemanager:
    image: apache/druid:0.22.1
    container_name: middlemanager
    volumes:
      - druid_shared:/opt/shared
      - middle_var:/opt/druid/var
      - ./data/:/opt/ingest
    depends_on: 
      - zookeeper
      - postgres
      - coordinator
    ports:
      - "8091:8091"
      - "8100-8105:8100-8105"
    command:
      - middleManager
    env_file:
      - environment

  router:
    image: apache/druid:0.22.1
    container_name: router
    volumes:
      - router_var:/opt/druid/var
    depends_on:
      - zookeeper
      - postgres
      - coordinator
    ports:
      - "8888:8888"
    command:
      - router
    env_file:
      - environment

@gitmstoute I might have managed to reproduce the behavior by following our Quickstart tutorial and selecting hour as the segment granularity. Auto-compaction wouldn’t kick off. Let me spin it up again and take a look at the logs.

@Mark_Herrera This seemed to reproduce it for me too! The compaction did not start overnight. The logs look similar to my bad cluster: no compaction tasks are being launched, and NewestSegmentFirstIterator never seems to run.

Hey! Out of interest, is there a logged issue in GitHub for this? (Just for the record…)

No issue logged yet. Should I create one?

@gitmstoute feel free to create one!

Thank you, everyone, for the support here. @petermarshallio @Mark_Herrera FYI, I opened this issue on GitHub:
