Hello, my Druid EKS cluster is having trouble persisting segments. I’ve tried both local and S3 deep storage; each fails in a different way.
For local storage, the ingestion task completes with the “SUCCESS” status. However, when I click on the Datasources tab, it says “0.0% available” under the Availability column. When I look at the Historical’s logs, I see this error message:
Caused by: java.lang.IllegalArgumentException: [/opt/druid/var/druid/segments/wikipedia/2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z/2022-04-18T20:31:26.139Z/0/index.zip] does not exist
When I exec into my Historical, I see that /opt/druid/var/druid/segments is empty, so for some reason this index.zip is not being written. When I click on the Services tab, it says my Historical’s current size is 0 B and its max size is 300 GB, so there should be enough room for the Wikipedia sample dataset.
For S3 storage, I have even less useful information to go on. The ingestion fails with this error message:
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: jav...
None of my pods’ logs provide additional useful information. I know the cluster is able to read from S3, since when I start an S3 ingestion I can see data in the preview window. And I am positive that I’ve granted read/write permissions on the bucket I want to write to, and I’ve configured the corresponding deep storage values in my values.yaml.
I’ve tested both the Wikipedia sample data and data from one of our S3 buckets; the failures are the same in both cases.
Hi James, welcome to the Druid forum.
I would start by confirming the values.yaml settings; I would expect the S3 bucket/path to be specified somewhere in there.
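For reference, S3 deep storage is normally configured through Druid’s common runtime properties. A minimal sketch of the relevant keys (the bucket name and base key below are placeholders, and exactly where these land in values.yaml depends on which Helm chart you are using):

```yaml
# Sketch of common.runtime.properties keys for S3 deep storage.
# "my-druid-segments" and the prefixes are placeholder values.
druid.extensions.loadList: '["druid-s3-extensions"]'
druid.storage.type: s3
druid.storage.bucket: my-druid-segments
druid.storage.baseKey: druid/segments
# Task logs can also go to S3:
druid.indexer.logs.type: s3
druid.indexer.logs.s3Bucket: my-druid-segments
druid.indexer.logs.s3Prefix: druid/indexing-logs
```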
Also, looking more closely at the logs:
In case 1, I would check the task/subtask that is actually writing the segment and look at its logs to confirm the deep storage location where the segment is being written.
In case 2, again, I would look more closely at the right logs. The issue could be at the MM/Overlord, the supervisor task, or the sub-tasks. That might help narrow down the issue with the S3 settings mentioned above.
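If it helps, the Overlord’s task API is an easy way to get at the right task logs without hunting through pods. A rough sketch (the base URL is an assumption: it presumes you have port-forwarded the Router or Overlord; the task ID shown in the comment is a made-up example):

```python
import urllib.request

# Assumption: the Druid Router (or Overlord) is reachable here,
# e.g. via `kubectl port-forward`.
BASE = "http://localhost:8888"

def task_status_url(base: str, task_id: str) -> str:
    # Overlord API: status of a single task (SUCCESS/FAILED/RUNNING).
    return f"{base}/druid/indexer/v1/task/{task_id}/status"

def task_log_url(base: str, task_id: str) -> str:
    # Overlord API: the task's full log, including sub-task failures.
    return f"{base}/druid/indexer/v1/task/{task_id}/log"

def fetch(url: str) -> str:
    # Plain HTTP GET; assumes no auth on the cluster.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

# Example (requires a running cluster; task ID is hypothetical):
# print(fetch(task_log_url(BASE, "index_parallel_wikipedia_abc123")))
```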
Hi Vijeth, thank you for the welcome!
Figured out case 2! It was a permissions issue. I’m deploying my stack through CDK, and even though I had run grantReadWrite on my service/node IAM roles, I still needed grantPutAcl as well. I’ve tested it on the Wikipedia sample data and a small dataset from our S3, and both were successful. I am now testing it on one of our more massive datasets; hopefully in a few hours that will be complete. Since we are going to use S3 for deep storage in the long run, I am OK not knowing what was going wrong in case 1.
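For anyone hitting the same thing: in CDK, the Bucket’s grantReadWrite does not include s3:PutObjectAcl, which is why the separate grantPutAcl call is needed. A sketch of what the fix looks like (the construct and role names are placeholders, not from my actual stack):

```typescript
import { Stack } from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as iam from 'aws-cdk-lib/aws-iam';

// Placeholder declarations; substitute your actual stack and
// node/service IAM role.
declare const stack: Stack;
declare const druidNodeRole: iam.IRole;

const deepStorageBucket = new s3.Bucket(stack, 'DruidDeepStorage');

// Covers Get/Put/Delete on objects...
deepStorageBucket.grantReadWrite(druidNodeRole);
// ...but s3:PutObjectAcl must be granted separately.
deepStorageBucket.grantPutAcl(druidNodeRole);
```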
Thank you again for your help!
In case 1, the local case, the problem is that “local” deep storage assumes the MiddleManager and Historical are running on the same box with access to the same storage. In k8s deployments, the MM and Historicals normally run in separate pods, each with its own local storage. The task reports “SUCCESS” because, as far as the MM knows, the segment has been published to deep storage (you’ll find the segment in the MM pod’s local storage). But when the Historical tries to fetch it from deep storage, there is nothing at that path in its own “local” storage.
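Concretely, with “local” deep storage the segment path is just a directory on whatever filesystem the process sees, along these lines:

```properties
# "local" deep storage: the path below is resolved inside each pod's
# own filesystem, so the MM writes the segment into its pod while the
# Historical looks for it in a different pod's (empty) filesystem.
druid.storage.type=local
druid.storage.storageDirectory=/opt/druid/var/druid/segments
```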
Ahh I see. If I try a local storage deployment in the future I’ll make sure that MM and historical are on the same node, thanks Sergio!
Not just the same node: they would need to be together in the same pod, which is not what the Helm chart does. The only way to make it work otherwise is to share storage, i.e. mount the same volume on both, but that is probably not worth the trouble. If a local dev setup is what you are after, MinIO on minikube can provide a “local” S3, though you would still configure it as “s3”. Here’s a blog post that describes setting up such a dev cluster on minikube: Clustered Apache Druid® on your Laptop - Easy! - Imply
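For the MinIO route, Druid is still configured as “s3”, just pointed at the MinIO endpoint. Roughly like this (the endpoint, bucket, and credentials are placeholders for a dev setup):

```properties
# S3-compatible deep storage backed by MinIO.
# Endpoint, bucket, and credentials below are placeholder dev values.
druid.storage.type=s3
druid.storage.bucket=druid-segments
druid.storage.baseKey=segments
druid.s3.endpoint.url=http://minio.minio.svc.cluster.local:9000
druid.s3.enablePathStyleAccess=true
druid.s3.accessKey=minioadmin
druid.s3.secretKey=minioadmin
```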