Druid job fails while running the single_phase_sub_task

I'm getting the error below while running my load job from HDFS to Druid. It ran successfully for a while and then started failing with this error message. Can anyone help or advise on the issue?

2022-08-17T10:56:49,922 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Exception while running task[AbstractTask{id=‘single_phase_sub_task_cp_prod_pochemom_2022-08-17T10:54:30.392Z’, groupId=‘index_parallel_cp_prod_dphhpibg_2022-08-16T07:32:33.164Z’, taskResource=TaskResource{availabilityGroup=‘single_phase_sub_task_cp_prod_pochemom_2022-08-17T10:54:30.392Z’, requiredCapacity=1}, dataSource=‘cp_prod’, context={forceTimeChunkLock=true}}]
java.lang.RuntimeException: file:/folder/druid/task/single_phase_sub_task_cp_prod_pochemom_2022-08-17T10:54:30.392Z/work/indexing-tmp/druid-input-entity4625300376445898780.tmp is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [109, 28, 97, 61]
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:531) ~[?:?]
at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:712) ~[?:?]
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:609) ~[?:?]
at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:152) ~[?:?]
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135) ~[?:?]
at org.apache.druid.data.input.parquet.ParquetReader$1.hasNext(ParquetReader.java:106) ~[?:?]
at org.apache.druid.data.input.IntermediateRowParsingReader$1.hasNext(IntermediateRowParsingReader.java:60) ~[druid-core-0.21.0.jar:0.21.0]
at org.apache.druid.java.util.common.parsers.CloseableIterator$2.findNextIteratorIfNecessary(CloseableIterator.java:85) ~[druid-core-0.21.0.jar:0.21.0]

Did anything change with the Parquet files you are trying to read?

No change. I'm loading historic data, so the files have not changed at all.

Hi vishalth,

The error is in this part of your log:
“file:/folder/druid/task/single_phase_sub_task_cp_prod_pochemom_2022-08-17T10:54:30.392Z/work/indexing-tmp/druid-input-entity4625300376445898780.tmp is not a Parquet file.”

To me this implies that you are reading a set of files and one of them isn’t Parquet. Are you loading from a path that contains files other than Parquet?

Hi Sergio_Ferragut,

The path you see above is not my HDFS path; it's an intermediate data path on the MiddleManager that Druid writes to. Also, I checked my HDFS path and I don't see any files that are not Parquet.

Yes. The path is to the file that has been downloaded into the MM’s local storage.
Have you retried this load and obtained the same result?
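I’m wondering about a couple of things. It is complaining that the magic number at the end of the file doesn’t match: “…is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [109, 28, 97, 61]”. The expected bytes are the ASCII codes for “PAR1”, the marker that ends every Parquet file.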

Could one of the files be corrupt? Or the download of the file somehow got corrupted?
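
One way to look for a corrupt file is to check the magic bytes directly. Below is a minimal Python sketch, assuming the input files have first been copied from HDFS to a local directory. The function names and the ".parquet" suffix filter are hypothetical, not part of Druid or Parquet. It only tests that each file starts and ends with "PAR1"; it does not prove the rest of the file is readable.

```python
import os

# Parquet files begin and end with these 4 bytes: [80, 65, 82, 49].
PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    # Smallest possible layout: head magic (4) + footer length (4) + tail magic (4).
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last 4 bytes of the file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def find_suspect_files(directory, suffix=".parquet"):
    """List files in `directory` (hypothetical local copy) that fail the magic check."""
    suspects = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(suffix) and os.path.isfile(path):
            if not has_parquet_magic(path):
                suspects.append(path)
    return suspects
```

Calling find_suspect_files on the local copy returns the paths that fail the check. A file that ends with [109, 28, 97, 61], as in the log above, would be reported.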

Could there be a parquet version problem?

The version of parquet that is in the code for the current Druid release is:

Thanks, Sergio, for your reply. The Parquet version is not the issue, because I have run other loads with the same Parquet version and they ran fine. I'm not sure whether it is a file corruption issue. I re-ran the load and will have to see whether it runs fine. The load has a week’s worth of historic data, so we will have to wait until it reaches the same failure point.

@vishalth Did you resolve this? What was the solution?
Thanks for sharing.

I believe the issue was file corruption. I had to skip that particular load file, and after that everything ran fine.