I’ve deployed a Druid cluster to an EKS cluster. I did this by copying Druid’s sample Helm chart into my own project and making some edits.
I’ve managed to get the cluster up and running, and am enjoying playing around with the values.yaml file to get some basic query optimizations in place. But when I uncomment the druid_segmentCache_locations
param, my historicals get stuck in a crash loop.
I’m still getting the hang of Kubernetes and I’m not too sure how to debug this, especially since the container crash deletes all of its log files. How do I identify what’s causing this issue?
Edit: I’m using the commented-out-by-default value for druid_segmentCache_locations
:
druid_segmentCache_locations: '[{"path":"/var/druid/segment-cache","maxSize":300000000000}]'
The kubectl describe
output for the failing pod:
Name: druid-historical-0
Namespace: dev
Priority: 0
Node: ip-172-31-2-122.ec2.internal/172.31.2.122
Start Time: Fri, 20 May 2022 14:16:45 -0400
Labels: app=druid
component=historical
controller-revision-hash=druid-historical-58bc895cd9
release=druid
statefulset.kubernetes.io/pod-name=druid-historical-0
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 172.31.2.244
IPs:
IP: 172.31.2.244
Controlled By: StatefulSet/druid-historical
Containers:
druid:
Container ID: docker://fb6887b03ad67f1e5833994b1842b40d08770292942ab8fa468f51f6606e9f71
Image: apache/druid:0.22.0
Image ID: docker-pullable://apache/druid@sha256:626fd96a997361dce8452c68b28e935a2453153f0d743cf208a0b4355a4fc2c3
Port: 8083/TCP
Host Port: 0/TCP
Args:
historical
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 20 May 2022 14:19:32 -0400
Finished: Fri, 20 May 2022 14:19:42 -0400
Ready: False
Restart Count: 4
Liveness: http-get http://:8083/status/health delay=60s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:8083/status/health delay=60s timeout=1s period=10s #success=1 #failure=3
Environment Variables from:
druid ConfigMap Optional: false
Environment:
DRUID_MAXDIRECTMEMORYSIZE: 15G
DRUID_XMS: 8G
DRUID_XMX: 10G
druid_processing_buffer_sizeBytes: 500000000
druid_processing_numMergeBuffers: 4
druid_processing_numThreads: 15
druid_segmentCache_locations: [{"maxSize":300000000000}]
AWS_DEFAULT_REGION: us-east-1
AWS_REGION: us-east-1
AWS_ROLE_ARN: arn:aws:iam::552593679126:role/DruidEksStack-druidClusterdruidServiceAccountRole3-1H3W6KL1OZ6K5
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/opt/druid/var/druid/ from data (rw)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5mhmf (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
aws-iam-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 86400
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-druid-historical-0
ReadOnly: false
kube-api-access-5mhmf:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: role=data-server
Tolerations: eks.amazonaws.com/compute-type=fargate:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 13m default-scheduler Successfully assigned dev/druid-historical-0 to ip-172-31-2-122.ec2.internal
Normal SuccessfulAttachVolume 13m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-e32e6f48-55cd-4414-b503-c623c53026de"
Normal Pulling 12m kubelet Pulling image "apache/druid:0.22.0"
Normal Pulled 12m kubelet Successfully pulled image "apache/druid:0.22.0" in 12.084762823s
Normal Created 10m (x5 over 12m) kubelet Created container druid
Normal Started 10m (x5 over 12m) kubelet Started container druid
Normal Pulled 10m (x4 over 12m) kubelet Container image "apache/druid:0.22.0" already present on machine
Warning BackOff 2m38s (x45 over 12m) kubelet Back-off restarting failed container