Real-time ingestion from CSV files

Hi. I would like to ingest data from CSV files into an RT node but can’t figure out the usage patterns.

  1. Who should delete the CSV files after ingestion is complete?
  2. How often does the RT node check for new CSV file arrivals?

It looks like neither 1 nor 2 is performed by the RT node…

RT nodes are used for real-time ingestion, and there is really no CSV file in the picture: either you push events to the RT node via Tranquility, or the RT nodes pull events from Kafka. Now, since you have your data stored in CSV files (possibly arriving in batches periodically), you really want to use “batch ingestion”.

  1. Druid itself will not delete input data (CSV files); that has to be managed outside of Druid.

  2. Druid will not check; you have to watch for new files outside of Druid and submit indexing tasks to Druid when new data arrives. Please see http://druid.io/docs/0.7.1.1/Batch-ingestion.html. A watcher along the lines sketched below would cover both points.
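A minimal sketch of such an external watcher (the paths, template filename, and polling interval are made-up placeholders; the URL is the overlord’s task-submission endpoint):

```python
import glob
import json
import os
import shutil
import time
import urllib.request

OVERLORD_URL = "http://localhost:8090/druid/indexer/v1/task"  # overlord task API
INBOX = "/data/csv/incoming"     # where new CSV files land (placeholder)
ARCHIVE = "/data/csv/ingested"   # where files go after submission (placeholder)

def build_index_task(csv_path):
    # Load an index-task template (in the format described in the Tasks docs)
    # and point its firehose at the newly arrived file. The template filename
    # is a placeholder.
    with open("index_task_template.json") as f:
        task = json.load(f)
    task["spec"]["ioConfig"]["firehose"] = {
        "type": "local",
        "baseDir": os.path.dirname(csv_path),
        "filter": os.path.basename(csv_path),
    }
    return task

def submit(csv_path):
    # POST the task json to the overlord.
    body = json.dumps(build_index_task(csv_path)).encode("utf-8")
    req = urllib.request.Request(OVERLORD_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

while True:
    for path in glob.glob(os.path.join(INBOX, "*.csv")):
        submit(path)                # point 2: detecting new files is on you, not Druid
        shutil.move(path, ARCHIVE)  # point 1: Druid never deletes its inputs; clean up yourself
    time.sleep(60)                  # poll however often suits you; Druid does not poll
```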

– Himanshu

Thanks. This confusion came from the docs: the descriptions of the RT and batch ingestion mechanisms converge on the same material once the reader reaches the DataSchema paragraph (for RT it just contains a link to http://druid.io/docs/latest/Ingestion.html, where CSV is described).
Anyway, I’ve tried batch ingestion and got an error that is well known in these groups:

javax.servlet.ServletException: com.fasterxml.jackson.databind.JsonMappingException: Instantiation of [simple type, class io.druid.indexing.common.task.IndexTask] value failed: null.

How hard would it be to make this error more user-friendly? I’ve checked the validity of the JSON file and the format of the CSV file; everything looks good. I don’t know what to do next, please suggest something.

Hi Robert,

Can you share your spec file after removing any confidential info?

Please find attached.

ix.spec (2.54 KB)

Yeah, agreed, that error is not the most helpful. But you should check the overlord logs to see if there is more information in the stack trace.

On a quick look, it seems “granularitySpec” does not have the correct fields. See the first index task example at http://druid.io/docs/0.7.1.1/Tasks.html.
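For the index task, a uniform “granularitySpec” looks roughly like this (the interval is just a placeholder):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "intervals": ["2015-01-01/2015-02-01"]
}
```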

– himanshu

Fixed, but still the same error. Attaching debug logs from the overlord:

Can you make your task as shown in http://druid.io/docs/0.7.1.1/Tasks.html#index-task? E.g., I don’t see any “spec” or “dataSchema”, etc., in the task JSON you attached.
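The top level of the task should look roughly like this (datasource, columns, and interval are made-up placeholders; the structure follows the index task example in the Tasks docs):

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "timestampSpec": {"column": "timestamp", "format": "auto"},
          "columns": ["timestamp", "page", "count"],
          "dimensionsSpec": {"dimensions": ["page"]}
        }
      },
      "metricsSpec": [
        {"type": "longSum", "name": "count", "fieldName": "count"}
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2015-01-01/2015-01-02"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/data/csv",
        "filter": "*.csv"
      }
    }
  }
}
```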

– Himanshu

Yes, that did it, thanks. But the overall feeling was like catching a black cat in a dark room…
These weren’t input mistakes: I built my spec file based on http://druid.io/blog/2014/03/12/batch-ingestion.html, which is just a year old.

I would suggest the following things (in order of importance):

  1. Display user-friendly error messages. Once done, this will save a huge amount of time for both of us (e.g., this thread would never have existed if, instead of “exception: null”, I had seen something like “parameter ‘dataSchema’ is not found in the spec file”).
  2. Improve the doc navigation. Right now it’s a set of separate articles that are very hard to follow. I would never have found the proper config for CSV ingestion if you hadn’t pointed me to it directly.
  3. Remove or update outdated articles. I googled for “Druid CSV” examples with no luck, then tried “Druid TSV”, and it gave me the link with the outdated config I mentioned above.

Other than that, the project looks very promising, thanks.

Hi Robert,

Most of us volunteer our time to help the Druid community while balancing our day jobs. In the future, as more resources are put behind Druid, it will be easier to set up and evaluate. We recently reworked our documentation, and it will go out with the next stable release.

– FJ