Can't Batch Ingest Parquet File from Hadoop

Hi everyone,
I am using Imply 1.3.0.

I am trying to create an index with the task below.

I have already loaded the extensions:

```properties
druid.extensions.loadList=["druid-datasketches", "druid-avro-extensions", "druid-parquet-extensions", "postgresql-metadata-storage", "druid-hdfs-storage"]
```
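For reference, this property goes in the common runtime properties shared by all Druid services; the exact path depends on your layout (for example `conf/druid/_common/common.runtime.properties` in a stock distribution):

```bash
# Check which extensions are configured to load (path is an example; adjust to your setup).
grep loadList conf/druid/_common/common.runtime.properties
```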

Here is the task spec:

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "hdfs://master1:9000/AD_COOKIE_REPORT"
      }
    },
    "dataSchema": {
      "dataSource": "no_metrics",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "time",
            "format": "yyyy-mm-dd-HH"
          },
          "dimensionsSpec": {
            "dimensions": [
              "advertiser_id",
              "campaign_id",
              "payment_id",
              "creative_id",
              "website_id",
              "channel_id",
              "section_id",
              "zone_id",
              "ad_default",
              "topic_id",
              "interest_id",
              "inmarket_id",
              "audience_id",
              "os_id",
              "browser_id",
              "device_type",
              "device_id",
              "location_id",
              "age_id",
              "gender_id",
              "network_id",
              "merchant_cate",
              "userId"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        {
          "name": "count",
          "type": "count"
        },
        {
          "name": "impression",
          "type": "longSum",
          "fieldName": "impression"
        },
        {
          "name": "viewable",
          "type": "longSum",
          "fieldName": "viewable"
        },
        {
          "name": "revenue",
          "type": "longSum",
          "fieldName": "revenue"
        },
        {
          "name": "proceeds",
          "type": "longSum",
          "fieldName": "proceeds"
        },
        {
          "name": "spent",
          "type": "longSum",
          "fieldName": "spent"
        },
        {
          "name": "click_fraud",
          "type": "longSum",
          "fieldName": "click_fraud"
        },
        {
          "name": "click",
          "type": "longSum",
          "fieldName": "clickdelta"
        },
        {
          "name": "user_unique",
          "type": "hyperUnique",
          "fieldName": "userId"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": [
          "2016-08-09/2016-08-11"
        ]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "jobProperties": {}
    }
  }
}
```
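For reference, a spec like this can be submitted to the Overlord's task endpoint roughly as below (host and port are the quickstart defaults and may differ in your setup):

```bash
# Submit the Hadoop index task to the Overlord (assumes the spec above is saved as task.json).
curl -X POST -H 'Content-Type: application/json' \
  -d @task.json \
  http://localhost:8090/druid/indexer/v1/task
```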

But it failed without raising any error or producing a log file that I can investigate:

```json
{
  "task": "index_hadoop_no_metrics_2016-08-11T02:35:45.667Z",
  "status": {
    "id": "index_hadoop_no_metrics_2016-08-11T02:35:45.667Z",
    "status": "FAILED",
    "duration": 7385
  }
}
```

How do I find the logs for this task?

I already looked in `/data/imply-1.3.0/var/druid/task/`, but the task failed too quickly to leave a log file there.
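(For reference, task logs can usually be pulled from the Overlord API or found in the indexing-logs directory; the port and paths below assume the quickstart defaults and may differ in your setup.)

```bash
# Fetch the log for a specific task from the Overlord (default port 8090).
curl http://localhost:8090/druid/indexer/v1/task/index_hadoop_no_metrics_2016-08-11T02:35:45.667Z/log

# Completed task logs are also written to the directory configured by
# druid.indexer.logs.directory (var/druid/indexing-logs in the quickstart layout).
ls /data/imply-1.3.0/var/druid/indexing-logs/
```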

Thanks in advance.

I have included the Parquet extension jars:
```
➜  imply-1.3.0 ll dist/druid/extensions/druid-parquet-extensions
total 16368
-rw-r--r--@ 1 giaosudau  staff     8937 Aug 12 09:13 druid-parquet-extensions-0.9.1.1.jar
-rw-r--r--@ 1 giaosudau  staff   109569 Aug 12 09:45 parquet-avro-1.8.1.jar
-rw-r--r--@ 1 giaosudau  staff   945914 Aug 12 09:36 parquet-column-1.8.1.jar
-rw-r--r--@ 1 giaosudau  staff    38604 Aug 12 09:43 parquet-common-1.8.1.jar
-rw-r--r--@ 1 giaosudau  staff   285479 Aug 12 09:53 parquet-encoding-1.8.1.jar
-rw-r--r--@ 1 giaosudau  staff   390733 Aug 12 09:47 parquet-format-2.3.1.jar
-rw-r--r--@ 1 giaosudau  staff   218076 Aug 12 09:30 parquet-hadoop-1.8.1.jar
-rw-r--r--@ 1 giaosudau  staff  1048117 Aug 12 09:53 parquet-jackson-1.8.1.jar
-rw-r--r--@ 1 giaosudau  staff  5320231 Aug 12 09:53 parquet-tools-1.8.1.jar
```

It works.

But I have a problem with partitioned Parquet data:

```
FACT_AD_STATS_DAILY/time=2016-07-16/network_id=31713/part-r-00000-5e5c7291-e1e1-462d-9cc6-7ef2d5be892f.snappy.parquet
```

The timestamp field is in the folder name (Hive-style partitioning), not inside the file itself.
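One way to confirm this is to inspect the file's actual schema with the parquet-tools jar listed above; a `time` column that exists only in the Hive-style directory name will not show up in the schema (the path below is just the example file from above):

```bash
# Print the Parquet schema; a column that is only encoded in the partition
# directory (e.g. time=2016-07-16) will not appear here.
# (Namenode prefix assumed from the earlier paths; adjust as needed.)
hadoop jar parquet-tools-1.8.1.jar schema \
  hdfs://master1:9000/FACT_AD_STATS_DAILY/time=2016-07-16/network_id=31713/part-r-00000-5e5c7291-e1e1-462d-9cc6-7ef2d5be892f.snappy.parquet
```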

```
2016-08-12T03:17:33,059 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_hadoop_no_metrics_2016-08-12T03:17:23.355Z, type=index_hadoop, dataSource=no_metrics}]
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:204) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:208) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_77]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_77]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_77]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_77]
at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
… 7 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: No buckets?? seems there is no data to index.
at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:211) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
at io.druid.indexer.JobHelper.runJobs(JobHelper.java:323) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:261) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_77]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_77]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_77]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_77]
at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
… 7 more
Caused by: java.lang.RuntimeException: No buckets?? seems there is no data to index.
at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:172) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
at io.druid.indexer.JobHelper.runJobs(JobHelper.java:323) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:261) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_77]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_77]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_77]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_77]
at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
… 7 more
2016-08-12T03:17:33,071 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_hadoop_no_metrics_2016-08-12T03:17:23.355Z] status changed to [FAILED].
2016-08-12T03:17:33,077 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
"id" : "index_hadoop_no_metrics_2016-08-12T03:17:23.355Z",
"status" : "FAILED",
"duration" : 6102
}
```

I have uploaded the logs here.

How do I import this kind of file?

Hi everyone,
Any ideas on this would be appreciated.

Thanks.

Hi Chanh, Parquet is a community extension that is not officially supported by the Druid committers, but the original author is around to help.

In this particular case, the problem is that this interval:

```json
"intervals": [
  "2016-08-09/2016-08-11"
]
```

does not match your actual data. Make sure you have the correct timezone for your data, and that the data at the path you listed actually falls within the interval you provided.
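For example, since the partition path above shows `time=2016-07-16`, an interval that actually covers that data would look something like this (dates are purely illustrative):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "ALL",
  "intervals": ["2016-07-16/2016-07-17"]
}
```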

Hi Fangjin,
Thanks for the suggestion.

By the way, is there any way to ignore the intervals and just import everything in the file?

Sometimes we just have a bunch of data and want to import all of it.

Regards,

Chanh

This functionality should most definitely be added in the near future :)