Hi,
I have CSV files stored in HDFS and I want to ingest them into Druid. How can I do this?
I am new to Druid and using it for the first time.
Please help me.
Thank you.
Heta
Please check whether these would be useful:
http://druid.io/docs/latest/tutorials/tutorial-batch-hadoop.html
https://docs.imply.io/on-prem/manage-data/ingestion-files
You will need appropriate Hadoop extensions.
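If it helps to see the overall shape, here is a rough sketch of a Hadoop batch ingestion task for CSV files in HDFS. The dataSource name, column names, interval, and HDFS path below are placeholders for illustration only and should be replaced with your own values:
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "flights",
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "csv",
          "hasHeaderRow" : true,
          "columns" : ["time", "col1", "col2"],
          "timestampSpec" : { "column" : "time", "format" : "auto" },
          "dimensionsSpec" : { "dimensions" : ["col1", "col2"] }
        }
      },
      "metricsSpec" : [ { "type" : "count", "name" : "count" } ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : ["2018-01-01/2019-01-01"]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "hdfs://your-namenode:8020/path/to/your/csv/files"
      }
    },
    "tuningConfig" : { "type" : "hadoop" }
  }
}
You can then POST the file to the Overlord at /druid/indexer/v1/task, as described in the tutorial.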
I performed the steps given in the document you shared, but I am facing an error.
I am submitting the task using the command below:
curl -X 'POST' -H 'Content-Type:application/json' -d @flight-index.json http://localhost:8090/druid/indexer/v1/task
It throws the error below:
2019-01-05 10:47:25,619 pool-1-thread-1 ERROR Unable to register shutdown hook because JVM is shutting down. java.lang.IllegalStateException: Not started
at io.druid.common.config.Log4jShutdown.addShutdownCallback(Log4jShutdown.java:46)
at org.apache.logging.log4j.core.impl.Log4jContextFactory.addShutdownCallback(Log4jContextFactory.java:273)
at org.apache.logging.log4j.core.LoggerContext.setUpShutdownHook(LoggerContext.java:256)
at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:216)
at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:145)
at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:41)
at org.apache.logging.log4j.LogManager.getContext(LogManager.java:182)
at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getContext(AbstractLoggerAdapter.java:103)
at org.apache.logging.slf4j.Log4jLoggerFactory.getContext(Log4jLoggerFactory.java:43)
at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getLogger(AbstractLoggerAdapter.java:42)
at org.apache.logging.slf4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:29)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:253)
at org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:155)
at org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:132)
at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:685)
at org.apache.hadoop.hdfs.LeaseRenewer.<clinit>(LeaseRenewer.java:72)
at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:836)
at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:975)
at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:1214)
at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2887)
at org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2904)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Attached are the flight-index.json file and the runtime.properties file.
I have also placed the Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) into `conf/druid/_common/`.
Thank you
[flight-index.json|attachment](upload://unLyPoLhHtT1DsWbYwNuT3oesYY.json) (8.71 KB)
[runtime.properties|attachment](upload://qw2h2N5JLxeHzQwyloGQPfhaf99.properties) (4.38 KB)
Hi,
Can you attach the complete task logs, after masking any sensitive info?
There will be more information before this one in the logs about what actually caused the failure.
From the input file name, it seems you are trying to ingest a CSV file.
But the parser being used is a JSON parser, which is not able to parse the input data (see the ParseException in the logs).
http://druid.io/docs/latest/ingestion/data-formats.html has details on the different supported formats and sample parsers.
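For example, a CSV parseSpec (instead of a JSON one) could look roughly like the sketch below; the column names here are placeholders, not taken from your file:
"parser" : {
  "type" : "string",
  "parseSpec" : {
    "format" : "csv",
    "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
    "columns" : ["timestamp", "col1", "col2"],
    "dimensionsSpec" : { "dimensions" : ["col1", "col2"] }
  }
}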
Hope it helps.
I am really new to Druid. I created the index.json file from another example, so I am not familiar with the parameters. I am trying to update it according to the link you gave.
Thank you.
Is a timestamp field compulsory in the input data (CSV files)?
That’s correct, a timestamp field is required for Druid input data.
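For example (with an illustrative column name), a CSV file whose first column holds an ISO timestamp such as 2019-01-01T00:00:00Z would be mapped with a timestampSpec like the sketch below; Druid uses it to populate its primary __time column for every row:
"timestampSpec" : {
  "column" : "time",
  "format" : "auto"
}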
I am not able to ingest the CSV file into Druid even after adding a timestamp field to the input CSV files.
I am submitting the job using the command below:
curl -X 'POST' -H 'Content-Type:application/json' -d @flight-index.json http://localhost:8090/druid/indexer/v1/task
I am sharing the error log, the input CSV file, and the index.json file.
Can anyone help me solve this issue?
flight-index.json (7.13 KB)
log.txt (242 KB)
t1.csv (151 Bytes)
Are you using a remote Hadoop cluster?
Rommel Garcia
Director, Field Engineering
I am using the Hortonworks Sandbox 2.6.5 on VirtualBox.
I think you need to include "hasHeaderRow" : true,
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "Test",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "csv",
          "hasHeaderRow" : true,
          "dimensionsSpec" : {
            "dimensions" : [
It's failing on the first row.
Eric Graham
Solutions Engineer - Imply
cell: 303-589-4581
email: eric.graham@imply.io
Hi Eric,
It is still not working. I am sharing the error log.
Thank you.
flight-index.json (1.36 KB)
log.txt (160 KB)
Hi,
Your column header for the timestamp is "time", not "timestamp".
Please try this -
"columns" : ["time", "year", "deptime", "arrtime"],
"delimiter" : ",",
"timestampSpec" : {
  "format" : "auto",
  "column" : "time"
Eric
Eric Graham
Solutions Engineer - Imply
cell: 303-589-4581
email: eric.graham@imply.io
I already made these changes, but the issue is still the same.
If I remove the header, then it successfully ingests the CSV file into Druid.
Hi Jonathan,
May I know why a timestamp field is required for Druid input data? And what if the input data does not have a timestamp but I still want to ingest it into Druid?
Thank you.