Batch ingestion

Hello, I am trying to ingest data into Druid from a Parquet file using Hadoop.
The Parquet file was generated by a Spark 2.0 job.

The file contains 96 dimensions and a timestamp column (epoch). It is Snappy-compressed and contains about 7 million rows.

When I try to ingest the file with the "index_hadoop" method and the Parquet Hadoop parser extension, the MapReduce job seems to run fine, but I don't find any data in the Druid segments. The job ends after 2 minutes.

As a test, I created a new Parquet file with Hive. The data comes from the first Parquet file (the one generated by Spark 2.0), and I produced the same columns in the second file as in the first. The only difference is that the second file was produced with Hive 2.0 (Tez engine). With this new file, the ingestion works fine: the map phase is longer than with the first file, and the reduce phase takes around 10 minutes.

Do you think there is a difference between the Parquet implementations of Spark 2.0 and Hive 2.0 on EMR? Or does the problem come from the Druid extension not being compatible with Spark 2.0?



Can you share the Hadoop job logs?

Hi Julien,
I guess there might be parsing errors encountered in the first file.

A common cause is either a wrong ingestion spec or a mismatched row format, which leads to rows being silently dropped due to parsing errors.

Here is the log:

log.txt (258 KB)

We found our problem: it was a mapping issue between the dimension names in the Parquet file and the names in the JSON task file, which were different.
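For anyone hitting the same symptom (job "succeeds" but no segments contain data), the fix is to make sure the dimension names listed in the task's `parseSpec` match the Parquet schema exactly, including case. Below is a minimal sketch of the relevant portion of an `index_hadoop` task spec for the Parquet extension; the data source name, column names, and path are hypothetical placeholders, so adapt them to your own schema:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "my_events",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "millis"
          },
          "dimensionsSpec": {
            "dimensions": ["dim_1", "dim_2"]
          }
        }
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "hdfs:///path/to/file.parquet"
      }
    }
  }
}
```

If a name in `dimensions` (or in `timestampSpec.column`) does not exist in the Parquet schema, the parser finds nothing to extract for it and rows can end up empty or dropped without the job failing.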

Thanks for your help Nishant and Slim.