Hello, I am trying to ingest data into Druid from a Parquet file using Hadoop.
The Parquet file was generated by a Spark 2.0 job.
The file contains 96 dimension columns and a timestamp column (epoch). It is Snappy-compressed and contains about 7 million rows.
When I try to ingest the file with the "index_hadoop" method and the Parquet Hadoop parser extension, the MapReduce job seems to run fine, but I don't find any data in the Druid segments. The job ends after about 2 minutes.
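For context, my ingestion spec looks roughly like the sketch below (paths, dimension names, intervals, and granularities are placeholders; the real spec lists all 96 dimensions):

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3://my-bucket/path/to/file.parquet"
      }
    },
    "dataSchema": {
      "dataSource": "my_datasource",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "millis"
          },
          "dimensionsSpec": {
            "dimensions": ["dim1", "dim2"]
          }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-01-01/2016-12-31"]
      }
    },
    "tuningConfig": {
      "type": "hadoop"
    }
  }
}
```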
As a test, I created a new Parquet file with Hive; the data comes from the first Parquet file (the one produced by Spark 2.0), and I produced the same columns in the second file as in the first. The only difference is that the second file was written by Hive 2.0 (Tez engine). With this new file, the ingestion works fine: the map phase takes longer than with the first file, and the reduce phase takes around 10 minutes.
Do you think there is a difference between the Parquet implementations of Spark 2.0 and Hive 2.0 on EMR? Or does it come from the Druid extension not being compatible with Parquet files written by Spark 2.0?
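In case it helps, one way I thought of to compare the two writers is parquet-tools: its `meta` subcommand prints the file footer, whose creator field shows which Parquet writer (and parquet-mr version) produced the file (file names below are placeholders for my two files):

```shell
# Print footer metadata; the creator line shows which Parquet
# writer and parquet-mr version produced each file.
parquet-tools meta spark-output.parquet
parquet-tools meta hive-output.parquet

# The schemas can also be compared directly:
parquet-tools schema spark-output.parquet
parquet-tools schema hive-output.parquet
```

If the two files report different parquet-mr versions or different physical types for the timestamp column, that might explain why only one of them ingests correctly.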