I’m running Druid 0.22.1 and trying to use auto-compaction to update a datasource’s segmentGranularity from ‘HOUR’ to ‘WEEK’. I’m testing with a small datasource (15 MB / ~400,000 rows) that has many segments (5,000+).
Both the web console and a curl GET against the API (e.g. /druid/coordinator/v1/config/compaction) show that my compaction config has been set/accepted.
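For completeness, this is roughly the check I’m running (COORDINATOR_HOST is a placeholder for our coordinator address):

curl -s "http://COORDINATOR_HOST:8081/druid/coordinator/v1/config/compaction" | python3 -m json.tool
# pretty-prints all auto-compaction configs; my datasource's entry comes back as expected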
This is the only datasource with auto-compaction enabled.
There are segments going back to 2021-05, so there are plenty of candidate segments to compact based on skipOffsetFromLatest.
There are 3 worker slots available, and the relevant compaction task-slot settings are: "compactionTaskSlotRatio": 0.7, "maxCompactionTaskSlots": 2147483647.
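If I’m reading the task-slot docs correctly, the slots usable by compaction should work out to min(floor(compactionTaskSlotRatio × total slots), maxCompactionTaskSlots) = min(floor(0.7 × 3), 2147483647) = 2, so slot availability shouldn’t be the blocker.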
I’m expecting auto-compaction to start compacting based on this config.
Instead, I find that in the unified console the datasource’s Compaction column shows “Awaiting first run”, and it has been stuck like this for 24h+. Reviewing coordinator logs, I don’t see any indication that compaction is even being attempted, and there are no errors. If I grep for “compact” in the coordinator logs, nothing is returned except a single line (see Logs below). If I grep for “my_datasource”, I see only indexing tasks.
Note: there is a log line indicating the coordinator has started scheduling compaction (see Logs below). This line appears on the coordinator which is not presently the leader; the current leader has no similar log line. Could this be related? I can’t see any way to ‘turn on’ auto-compaction aside from submitting the compaction config as above. I also wonder whether I should be seeing this log line every coordinator period.
Are there other logs I should be looking at? I have mostly focused on coordinator and overlord logs.
Thanks in advance!
Things I've tried
- Tweaking the compaction config with different settings, e.g. setting skipOffsetFromLatest to ‘P1W’ or segmentGranularity to ‘DAY’ (a sketch of the spec I’m submitting follows this list)
- Increasing compactionTaskSlotRatio
- Increasing taskPriority
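For reference, this is the shape of what I’ve been submitting (datasource name, host, and the exact values are illustrative; the granularitySpec form is what the 0.22 docs describe):

curl -s -X POST -H 'Content-Type: application/json' \
  "http://COORDINATOR_HOST:8081/druid/coordinator/v1/config/compaction" \
  -d '{
        "dataSource": "my_datasource",
        "skipOffsetFromLatest": "P1W",
        "granularitySpec": { "segmentGranularity": "DAY" },
        "taskPriority": 50
      }'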
References
- https://support.imply.io/hc/en-us/articles/360046597493-How-to-determine-the-number-of-slots-required-for-auto-compaction-
- Druid docs (Compacting Segments / segment search policy)
Logs
I can’t find anything that seems relevant. The only log line matching a grep of “compact” is this:
[main] org.apache.druid.server.coordinator.duty.CompactSegments - Scheduling compaction with skipLockedIntervals [true]
I’m wondering if you’re running into this, from the Compacting Segments / segment search policy docs: “This policy currently cannot handle the situation when there are a lot of small segments which have the same interval, and their total size exceeds inputSegmentSizeBytes. If it finds such segments, it simply skips them.”
I’ll review those docs again. However, since the datasource in total is < 20 MB and inputSegmentSizeBytes is approximately 420 MB, I don’t think this is the case. Each segment is < 4 KB, and going from hourly to weekly segments should give roughly 4 KB/hour × 24 hours × 7 days = 672 KB of input, much smaller than ~420 MB.
That will be our plan B. We had upgraded to 0.22+ specifically to take advantage of updating segmentGranularity via auto-compaction, though. I will see if I can reproduce my scenario in a sandbox and try to figure out whether it’s a bug or a config issue.
In the quickstart Docker environment, I was able to get auto-compaction to run against a sample of data from this datasource, using the same compaction spec. It must be something unique to our cluster or its state. I’ll update here with what I find.
Sounds good. I’ll still try to reproduce what you reported. A colleague and I were just discussing it, and he might do the same. Looking forward to continuing the exploration and discussion.
I took this a bit further this morning, modifying the quickstart docker-compose to include two coordinator/overlords and testing fail-over. It all worked as expected (compaction started & succeeded).
But the ‘smoking gun’ to me is that on the bad cluster’s lead coordinator there are no log lines matching a grep for NewestSegmentFirstIterator.
And reviewing more closely, I noticed that in the good cluster both coordinators logged
INFO [main] org.apache.druid.server.coordinator.duty.CompactSegments - Scheduling compaction with skipLockedIntervals [true]
during startup (within a few seconds of each other). In my bad cluster, the current leader never issued this log line during startup or failover, so I suspect the coordinator.duty.CompactSegments duty has never run on this coordinator since before the compaction specs were issued, hence it is stuck in the “Awaiting first run” status.
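In case it helps anyone following along, these are the checks I’ve been running against each coordinator (host is a placeholder; the compaction status endpoint should exist in 0.22 if I’m reading the docs right):

# which coordinator currently holds leadership
curl -s "http://COORDINATOR_HOST:8081/druid/coordinator/v1/leader"
# auto-compaction snapshot for the datasource; the console's "Awaiting first run"
# suggests this has never been populated for us
curl -s "http://COORDINATOR_HOST:8081/druid/coordinator/v1/compaction/status?dataSource=my_datasource"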
Are there any configs / conditions which would cause a coordinator to start up without this duty being scheduled?
I think we will just try restarting the coordinators and checking that they both emit this log line.
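Something like this on each coordinator, with the log path adjusted for our deployment (the path below is a placeholder):

grep "Scheduling compaction" /var/log/druid/coordinator.log
# expect one hit per coordinator shortly after startup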
Can you send me your docker-compose after these changes? I’m testing 0.22 via docker-compose and have had the same “Awaiting first run” issue for a few weeks now.
@Renato_Cron Here is my docker-compose with two coordinators. As mentioned, I couldn’t break it locally the way it breaks on our deployed cluster: compaction proceeded each time despite failovers and the other hiccups I threw at it.
Yesterday we restarted most of our cluster nodes, including the coordinators, and compaction still won’t begin.
@Mark_Herrera any other thoughts or tests I might run?
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
version: "2.2"
volumes:
  metadata_data: {}
  middle_var: {}
  historical_var: {}
  broker_var: {}
  coordinator_var: {}
  # second var volume so the two coordinators don't write to the same logs/state
  coordinator_b_var: {}
  router_var: {}
  druid_shared: {}

services:
  postgres:
    container_name: postgres
    image: postgres:latest
    volumes:
      - metadata_data:/var/lib/postgresql/data
    environment:
      - POSTGRES_PASSWORD=FoolishPassword
      - POSTGRES_USER=druid
      - POSTGRES_DB=druid

  # Need 3.5 or later for container nodes
  zookeeper:
    container_name: zookeeper
    image: zookeeper:3.5
    ports:
      - "2181:2181"
    environment:
      - ZOO_MY_ID=1

  coordinator:
    image: apache/druid:0.22.1
    container_name: coordinator
    volumes:
      - druid_shared:/opt/shared
      - coordinator_var:/opt/druid/var
      - ./data/:/opt/ingest
    depends_on:
      - zookeeper
      - postgres
    ports:
      - "8081:8081"
    command:
      - coordinator
    env_file:
      - environment

  # second coordinator, added to test leader failover
  coordinator_b:
    image: apache/druid:0.22.1
    container_name: coordinator_b
    volumes:
      - druid_shared:/opt/shared
      # originally this reused coordinator_var; a dedicated volume avoids the two
      # containers sharing /opt/druid/var
      - coordinator_b_var:/opt/druid/var
      - ./data/:/opt/ingest
    depends_on:
      - zookeeper
      - postgres
    ports:
      # host port 8881 so it doesn't clash with the first coordinator
      - "8881:8081"
    command:
      - coordinator
    env_file:
      - environment

  broker:
    image: apache/druid:0.22.1
    container_name: broker
    volumes:
      - broker_var:/opt/druid/var
    depends_on:
      - zookeeper
      - postgres
      - coordinator
    ports:
      - "8082:8082"
    command:
      - broker
    env_file:
      - environment

  historical:
    image: apache/druid:0.22.1
    container_name: historical
    volumes:
      - druid_shared:/opt/shared
      - historical_var:/opt/druid/var
    depends_on:
      - zookeeper
      - postgres
      - coordinator
    ports:
      - "8083:8083"
    command:
      - historical
    env_file:
      - environment

  middlemanager:
    image: apache/druid:0.22.1
    container_name: middlemanager
    volumes:
      - druid_shared:/opt/shared
      - middle_var:/opt/druid/var
      - ./data/:/opt/ingest
    depends_on:
      - zookeeper
      - postgres
      - coordinator
    ports:
      - "8091:8091"
      - "8100-8105:8100-8105"
    command:
      - middleManager
    env_file:
      - environment

  router:
    image: apache/druid:0.22.1
    container_name: router
    volumes:
      - router_var:/opt/druid/var
    depends_on:
      - zookeeper
      - postgres
      - coordinator
    ports:
      - "8888:8888"
    command:
      - router
    env_file:
      - environment
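To bring it up I run the usual commands from the quickstart directory (the environment file is the stock quickstart one), then tail both coordinators for the CompactSegments line:

docker-compose up -d
docker-compose logs -f coordinator coordinator_b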
@gitmstoute I might have managed to reproduce the behavior by following our Quickstart tutorial and selecting hour as the segment granularity. Auto-compaction wouldn’t kick off. Let me spin it up again and take a look at the logs.
@Mark_Herrera This reproduced it for me too! Compaction did not start overnight, and the logs look similar to my bad cluster: no compaction tasks are being launched, and NewestSegmentFirstIterator never seems to run.