How to ingest batch data to Druid

Hi,

I have CSV files stored in HDFS. I want to ingest them into Druid. How can I do this?

I am new to Druid, using it for the first time.

Please help me.

Thank you.

Heta

Please check if this would be useful:

http://druid.io/docs/latest/tutorials/tutorial-batch-hadoop.html

https://docs.imply.io/on-prem/manage-data/ingestion-files

You will need appropriate Hadoop extensions.
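For example, HDFS support comes from the `druid-hdfs-storage` extension, which is enabled in `conf/druid/_common/common.runtime.properties`. A rough sketch only; the exact extension list depends on your setup and Druid version:

```
# Sketch: enable the HDFS extension so Druid can read from and write to HDFS.
# Keep any other extensions your cluster already uses in this list.
druid.extensions.loadList=["druid-hdfs-storage"]
```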

I performed the steps given in the document you shared, but I am facing an error.

I am submitting the task using the command below:

```
curl -X 'POST' -H 'Content-Type:application/json' -d @flight-index.json http://localhost:8090/druid/indexer/v1/task
```

It throws the error below:

```
2019-01-05 10:47:25,619 pool-1-thread-1 ERROR Unable to register shutdown hook because JVM is shutting down. java.lang.IllegalStateException: Not started
 at io.druid.common.config.Log4jShutdown.addShutdownCallback(Log4jShutdown.java:46)
 at org.apache.logging.log4j.core.impl.Log4jContextFactory.addShutdownCallback(Log4jContextFactory.java:273)
 at org.apache.logging.log4j.core.LoggerContext.setUpShutdownHook(LoggerContext.java:256)
 at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:216)
 at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:145)
 at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:41)
 at org.apache.logging.log4j.LogManager.getContext(LogManager.java:182)
 at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getContext(AbstractLoggerAdapter.java:103)
 at org.apache.logging.slf4j.Log4jLoggerFactory.getContext(Log4jLoggerFactory.java:43)
 at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getLogger(AbstractLoggerAdapter.java:42)
 at org.apache.logging.slf4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:29)
 at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:253)
 at org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:155)
 at org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:132)
 at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:685)
 at org.apache.hadoop.hdfs.LeaseRenewer.<clinit>(LeaseRenewer.java:72)
 at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:836)
 at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:975)
 at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:1214)
 at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2887)
 at org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2904)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

```

Attached are the flight-index.json file and the runtime.properties file.

I have also placed the Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) into `conf/druid/_common/`.

Thank you

[flight-index.json|attachment](upload://unLyPoLhHtT1DsWbYwNuT3oesYY.json) (8.71 KB)



[runtime.properties|attachment](upload://qw2h2N5JLxeHzQwyloGQPfhaf99.properties) (4.38 KB)

Hi,
Can you attach the complete task logs, after masking any sensitive info?

There should be more information earlier in the log about what caused the failure.

Sure.
The attached document is the complete task log.

log.txt (166 KB)

From the input file name it seems you might be trying to ingest a CSV file.

But the parser being used is a JSON parser, which is not able to parse the input data (refer to the ParseException in the logs).

http://druid.io/docs/latest/ingestion/data-formats.html has details on the different supported formats and sample parsers.
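For CSV input, the `parseSpec` would look roughly like this. This is a sketch only: the column names are placeholders, not taken from your file, and the surrounding parser `type` is `hadoopyString` for Hadoop batch tasks or `string` for native index tasks.

```
"parseSpec" : {
  "format" : "csv",
  "timestampSpec" : {
    "column" : "timestamp",
    "format" : "auto"
  },
  "columns" : ["timestamp", "col1", "col2"],
  "dimensionsSpec" : {
    "dimensions" : ["col1", "col2"]
  }
}
```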

Hope it helps.

I am really new to Druid. I created the index.json file from another example, so I am not aware of all the parameters. I am trying to update it according to the link you have given.

Thank you.

Is a timestamp field compulsory in the input data (CSV files)?

That’s correct, a timestamp field is required for Druid input data.
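For example (illustrative values only, not from your data), each row needs a column whose value Druid can parse as a timestamp, such as an ISO 8601 string:

```
timestamp,col1,col2
2019-01-05T10:00:00Z,foo,bar
```

The `column` in your `timestampSpec` then has to match that header name.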

I am not able to ingest the CSV file into Druid even after adding a timestamp field to the input CSV files.

I am submitting the job using the command below:

```
curl -X 'POST' -H 'Content-Type:application/json' -d @flight-index.json http://localhost:8090/druid/indexer/v1/task
```

I am sharing the error log, the input CSV file, and the index.json file.

Can anyone help me solve this issue?

flight-index.json (7.13 KB)

log.txt (242 KB)

t1.csv (151 Bytes)

Are you using a remote Hadoop cluster?

Rommel Garcia

Director, Field Engineering

I am using the Hortonworks Sandbox 2.6.5 on VirtualBox.

I think you need to include `"hasHeaderRow" : true`:

```
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "Test",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "csv",
          "hasHeaderRow" : true,
          "dimensionsSpec" : {
            "dimensions" : [
```

It's failing on the first row.

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

Hi Eric,

It's still not working. I am sharing the error log.

Thank you.

flight-index.json (1.36 KB)

log.txt (160 KB)

Hi,

Your column header for the timestamp is `time`, not `timestamp`.

Please try this:

```
"columns" : ["time", "year", "deptime", "arrtime"],
"delimiter" : ",",
"timestampSpec" : {
  "format" : "auto",
  "column" : "time"
}
```

Eric

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

I already made these changes; still the same issue.

If I remove the header, then it successfully ingests the CSV file into Druid.
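Roughly, the parseSpec that works for me with the header removed looks like this (a sketch, using the columns from Eric's example; the dimensions are just the remaining columns):

```
"parseSpec" : {
  "format" : "csv",
  "columns" : ["time", "year", "deptime", "arrtime"],
  "timestampSpec" : { "format" : "auto", "column" : "time" },
  "dimensionsSpec" : { "dimensions" : ["year", "deptime", "arrtime"] }
}
```

If I want to keep the header row in the file, the data-formats page also lists a `skipHeaderRows` option for the CSV parseSpec; I have not tried combining it with an explicit `columns` list yet.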

Hi Jonathan,

May I know why a timestamp field is required for Druid input data? And what if the input data does not have a timestamp and I want to ingest it into Druid?

Thank you.