Handling/Discarding bad data in delimited text file hadoop indexer

I am attempting to index data directly from file(s) stored in S3 using the TSV spec:


"listDelimiter": "\u0002",

This works for most of the files I am using. However, there are a number of records with unescaped newlines, which causes Hadoop to error out with an exception like this:

Caused by: java.lang.NullPointerException: Null timestamp in input: {id=cupcake gabbz, uid=true, event=includes_video, wid=null, feed_id=undefined}

Is there any way to set the indexer to ignore bad records like this and continue to the next row? Out of billions of rows, only one or two look like this.



So, for posterity's sake, here are the steps I took to resolve this:
1.) Changed to TSV with a custom delimiter of \u0001, and changed the listDelimiter to \u0002 to stop it from interpreting the whole row as a list
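For anyone hitting the same thing, the relevant part of the parseSpec ends up looking something like this (column names here are illustrative, not my real schema):

```json
"parseSpec": {
  "format": "tsv",
  "delimiter": "\u0001",
  "listDelimiter": "\u0002",
  "columns": ["timestamp", "id", "uid", "event", "wid", "feed_id"],
  "timestampSpec": {
    "column": "timestamp",
    "format": "auto"
  }
}
```

The key point is that "delimiter" (the column separator) and "listDelimiter" (the separator for multi-value fields) must be different characters, otherwise every row gets parsed as one big list.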

2.) Added hadoop-aws.jar to Druid's lib/ folder (why is this not included by default?)


3.) Used the hidden 'missingValue' field in the timestampSpec to fix broken timestamps
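Concretely, the timestampSpec looks something like this (the column name and the fallback date are illustrative; any row without a parseable timestamp gets the missingValue instead of throwing the NullPointerException above):

```json
"timestampSpec": {
  "column": "timestamp",
  "format": "auto",
  "missingValue": "2000-01-01T00:00:00Z"
}
```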

4.) Had an issue with quote characters inside values; I need a way to set a custom quote character (or disable quoting entirely) so quoted values are not special-cased:


Is there any plan/way to set the quote character for OpenCSV? This seems to break my mappers completely. The only options I have currently are to reprocess 10 TB of compressed text into JSON serde format, or to strip all special characters from the fields and use the CSV serde. Are there any other options?