"error in opening zip file" in partial segment merge task during native batch index parallel

I am able to run a native batch index_parallel task from S3 when I set

"maxNumConcurrentSubTasks": 1

But if I increase it (so the task actually runs in parallel), it fails during the merge phase because the zip files cannot be opened.
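For context, the relevant part of the tuningConfig looks roughly like this (a sketch; the shard count is illustrative, and the hashed partitionsSpec / forceGuaranteedRollup are inferred from the PartialHashSegmentMergeTask in the stack trace below):

```json
"tuningConfig": {
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "partitionsSpec": { "type": "hashed", "numShards": 2 },
  "maxNumConcurrentSubTasks": 2
}
```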

Error log (datasource name edited out):

2020-05-29T18:32:50,061 INFO [task-runner-0-priority-0] org.apache.druid.utils.CompressionUtils - Unzipping file[var/druid/task/partial_index_merge_datasource_1_bkoigfmc_2020-05-29T18:32:45.725Z/work/indexing-tmp/2020-05-15T04:00:00.000Z/2020-05-16T04:00:00.000Z/3/temp_partial_index_generate_datasource_1_eablpinf_2020-05-29T18:32:25.648Z] to [var/druid/task/partial_index_merge_datasource_1_bkoigfmc_2020-05-29T18:32:45.725Z/work/indexing-tmp/2020-05-15T04:00:00.000Z/2020-05-16T04:00:00.000Z/3/unzipped_partial_index_generate_datasource_1_eablpinf_2020-05-29T18:32:25.648Z]
2020-05-29T18:32:50,062 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Exception while running task[AbstractTask{id='partial_index_merge_datasource_1_bkoigfmc_2020-05-29T18:32:45.725Z', groupId='index_parallel_datasource_1_cgdapoll_2020-05-29T18:32:21.039Z', taskResource=TaskResource{availabilityGroup='partial_index_merge_datasource_1_bkoigfmc_2020-05-29T18:32:45.725Z', requiredCapacity=1}, dataSource='datasource_1', context={forceTimeChunkLock=true}}]
java.util.zip.ZipException: error in opening zip file
	at java.util.zip.ZipFile.open(Native Method) ~[?:1.8.0_252]
	at java.util.zip.ZipFile.<init>(ZipFile.java:225) ~[?:1.8.0_252]
	at java.util.zip.ZipFile.<init>(ZipFile.java:155) ~[?:1.8.0_252]
	at java.util.zip.ZipFile.<init>(ZipFile.java:169) ~[?:1.8.0_252]
	at org.apache.druid.utils.CompressionUtils.unzip(CompressionUtils.java:250) ~[druid-core-0.18.1.jar:0.18.1]
	at org.apache.druid.indexing.common.task.batch.parallel.PartialSegmentMergeTask.fetchSegmentFiles(PartialSegmentMergeTask.java:231) ~[druid-indexing-service-0.18.1.jar:0.18.1]
	at org.apache.druid.indexing.common.task.batch.parallel.PartialSegmentMergeTask.runTask(PartialSegmentMergeTask.java:169) ~[druid-indexing-service-0.18.1.jar:0.18.1]
	at org.apache.druid.indexing.common.task.batch.parallel.PartialHashSegmentMergeTask.runTask(PartialHashSegmentMergeTask.java:44) ~[druid-indexing-service-0.18.1.jar:0.18.1]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:123) ~[druid-indexing-service-0.18.1.jar:0.18.1]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:421) [druid-indexing-service-0.18.1.jar:0.18.1]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:393) [druid-indexing-service-0.18.1.jar:0.18.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_252]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
2020-05-29T18:32:50,086 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "partial_index_merge_datasource_1_bkoigfmc_2020-05-29T18:32:45.725Z",
  "status" : "FAILED",
  "duration" : 195,
  "errorMsg" : "java.util.zip.ZipException: error in opening zip file",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}

I'm seeing the error in Druid 0.18.0 and 0.18.1.

Has anyone experienced a similar issue?

I was looking at the partial_index_generate task and found this in the logs:


2020-06-02T00:22:47,226 INFO [[partial_index_generate_datasource_1_pjdlhfhb_2020-06-02T00:22:33.630Z]-appenderator-merge] org.apache.druid.indexing.worker.IntermediaryDataManager - Wrote intermediary segment for segment[datasource_1_2020-05-02T00:00:00.000-04:00_2020-05-03T00:00:00.000-04:00_2020-06-02T00:22:29.285Z_1] of subtask[partial_index_generate_datasource_1_pjdlhfhb_2020-06-02T00:22:33.630Z] at [var/tmp/intermediary-segments/index_parallel_datasource_1_kfncdlim_2020-06-02T00:22:29.265Z/2020-05-02T00:00:00.000-04:00/2020-05-03T00:00:00.000-04:00/1/partial_index_generate_datasource_1_pjdlhfhb_2020-06-02T00:22:33.630Z]
2020-06-02T00:22:47,234 INFO [[partial_index_generate_datasource_1_pjdlhfhb_2020-06-02T00:22:33.630Z]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Segment[datasource_1_2020-05-02T00:00:00.000-04:00_2020-05-03T00:00:00.000-04:00_2020-06-02T00:22:29.285Z_1] of 18,090 bytes built from 1 incremental persist(s) in 57ms; pushed to deep storage in 16ms. Load spec is: {}
2020-06-02T00:22:47,257 WARN [[partial_index_generate_datasource_1_pjdlhfhb_2020-06-02T00:22:33.630Z]-appenderator-merge] org.apache.druid.utils.CompressionUtils - No .zip suffix[var/druid/task/partial_index_generate_datasource_1_pjdlhfhb_2020-06-02T00:22:33.630Z/temp/datasource_1_2020-05-02T00:00:00.000-04:00_2020-05-03T00:00:00.000-04:00_2020-06-02T00:22:29.285Z], putting files from [var/druid/task/partial_index_generate_datasource_1_pjdlhfhb_2020-06-02T00:22:33.630Z/work/persist/datasource_1_2020-05-02T00:00:00.000-04:00_2020-05-03T00:00:00.000-04:00_2020-06-02T00:22:29.285Z/merged] into it anyway.

It looks like the generate task is not saving the file as .zip, and that is causing the ZipException later. I’m not sure why it is not storing the file correctly though.

I was able to narrow down the error: it only triggers when segmentGranularity is set with a timezone:

"granularitySpec": {
  "segmentGranularity": { "type": "period", "period": "P1D", "timeZone": "America/New_York" },
  "queryGranularity": { "type": "period", "period": "P1D", "timeZone": "America/New_York" },
  "rollup": true,
  "intervals": [ "2020-04-24T00:00:00-04:00/2020-04-25T00:00:00-04:00" ]
},

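For comparison, a sketch of the variant without a timezone (plain string granularities, which Druid treats as UTC) that runs fine in parallel:

```json
"granularitySpec": {
  "segmentGranularity": "DAY",
  "queryGranularity": "DAY",
  "rollup": true,
  "intervals": [ "2020-04-24/2020-04-25" ]
}
```
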
I tried unzipping the file. It’s not a zip file, it’s just a text file that says:

```
Can’t find the partition for supervisorTask[index_parallel_datasource_1_blfjhgaa_2020-06-04T20:59:52.813Z], subTask[partial_index_generate_datasource_1_bhdbbldd_2020-06-04T20:59:57.597Z], interval[2020-04-24T04:00:00.000Z/2020-04-25T04:00:00.000Z], and partitionId[1]
```

The merge task is sending a request to fetch the partition, but the partition is not found under the requested interval, so the error message (which is obviously not a zip file) is saved to disk in place of the segment zip.

I think the reason it is not found might be that the interval for the partition in the generate task is stored as:
2020-04-24T00:00:00-04:00/2020-04-25T00:00:00-04:00
But when the merge task requests the partition, it asks for the UTC form of the same interval:

```
2020-04-24T04:00:00.000Z/2020-04-25T04:00:00.000Z
```

The two describe the same instants on the timeline (midnight Eastern is 04:00 UTC), but they do not match as lookup keys.
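To make the mismatch concrete, here is a minimal Joda-Time sketch (Druid represents these intervals with Joda's Interval; the class and values are illustrative, not Druid's actual lookup code):

```java
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.Interval;

public class IntervalZoneMismatch
{
  public static void main(String[] args)
  {
    DateTimeZone ny = DateTimeZone.forID("America/New_York");

    // The interval as the generate task stores it (spec time zone).
    DateTime start = new DateTime(2020, 4, 24, 0, 0, ny);
    Interval stored = new Interval(start, start.plusDays(1));

    // The same instants viewed in UTC, as the merge task requests them.
    Interval requested = new Interval(
        start.withZone(DateTimeZone.UTC),
        start.plusDays(1).withZone(DateTimeZone.UTC)
    );

    System.out.println(stored);    // 2020-04-24T00:00:00.000-04:00/2020-04-25T00:00:00.000-04:00
    System.out.println(requested); // 2020-04-24T04:00:00.000Z/2020-04-25T04:00:00.000Z

    // Identical on the timeline...
    System.out.println(stored.getStartMillis() == requested.getStartMillis()); // true

    // ...but the string forms differ, and equals() is false because Joda
    // compares the chronology (time zone) along with the start/end instants.
    System.out.println(stored.equals(requested)); // false
  }
}
```

So anything keyed on the interval's string form (like the intermediary-segment directory names in the generate task log above) or on Interval equality will miss when the two sides use different zones.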

I’ll try to look through the code to see if there is a solution, but it looks like it might be a bug.

To reproduce the error:

  1. Run an index_parallel task with segmentGranularity containing a timezone, and maxNumConcurrentSubTasks > 1 (a minimal spec sketch follows below)
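For reference, a minimal index_parallel spec skeleton along those lines (a sketch; the input source, column names, and shard count are placeholders, not my actual job):

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "s3", "uris": ["s3://example-bucket/data.json"] },
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "datasource_1",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["dim1"] },
      "granularitySpec": {
        "segmentGranularity": { "type": "period", "period": "P1D", "timeZone": "America/New_York" },
        "queryGranularity": { "type": "period", "period": "P1D", "timeZone": "America/New_York" },
        "rollup": true,
        "intervals": ["2020-04-24T00:00:00-04:00/2020-04-25T00:00:00-04:00"]
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "forceGuaranteedRollup": true,
      "partitionsSpec": { "type": "hashed", "numShards": 2 },
      "maxNumConcurrentSubTasks": 2
    }
  }
}
```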

@tarpd,

We are currently facing the same issue.
Were you able to find the root cause and a solution, by any chance?

Thanks
Jay R

I created an issue on GitHub to track it: https://github.com/apache/druid/issues/9993
Unfortunately, the only workaround right now is to run tasks with "maxNumConcurrentSubTasks": 1.
I wasn't able to figure out a quick fix when I looked into it a few months ago.

Thanks @tarpd.
"maxNumConcurrentSubTasks": 1 will restrict the concurrent peon (sub) tasks to 1. I am loading 10 billion rows with 10 64-CPU machines. There are 1031 subtasks spawned in total, and "maxNumConcurrentSubTasks" is currently set to 600. Restricting it to 1 is not feasible for my use case.

@tarpd I saw that you discussed this on GitHub with Jihoon. Do you have additional Druid questions/issues that I could help you with?
@jay r, has your issue been resolved?
Thanks,
Matt

Hi Matt,

The issue was occurring when I used 10 i3.16xlarge MiddleManager nodes. When I changed to 40 i3.4xlarge MiddleManager nodes, the issue no longer occurred.