Help ingesting non-delimited JSON objects in file.


I have been using AWS Kinesis Firehose to write events in JSON format into S3 files. The events for the past few months are not delimited, i.e., the contents of a file are:


instead of the recommended:




Is there a way I can ingest these historical files in their current form (without delimiters), without having to pre-process them to add delimiters?

Thank you in advance for any help.

Hey Carlos,

Assuming you are doing batch ingestion through Hadoop/EMR, Druid’s Hadoop indexer uses TextInputFormat by default, which splits records on newlines. I’m not sure whether there is another InputFormat out of the box that can split back-to-back JSON objects, but if not, you could write an InputFormat that splits the objects properly and then use that in the Druid indexing job.
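
For what it's worth, here is a rough sketch of the splitting logic itself, written in Python rather than as a Hadoop InputFormat. It is not part of Druid or Hadoop; the function names `split_concatenated_json` and `to_newline_delimited` are made up for illustration, and it assumes a file's contents fit in memory as one string. It leans on the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value and reports where it ended, so braces inside string values do not confuse it.

```python
import json


def split_concatenated_json(text):
    """Yield each top-level JSON value from a string of back-to-back objects.

    Hypothetical helper for illustration; assumes `text` holds the whole file.
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Parse one JSON value starting at pos; new pos is where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def to_newline_delimited(text):
    """Re-serialize the objects one per line (hypothetical helper)."""
    return "\n".join(
        json.dumps(obj, separators=(",", ":"))
        for obj in split_concatenated_json(text)
    )
```

The same idea (parse one value, note the end offset, continue from there) is what a custom record reader would have to do in place of splitting on newlines. Run as-is, it amounts to the pre-processing route, converting each historical file to newline-delimited JSON before ingestion.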

Thank you Gian. I will look into this.