I have a dataset with multiple files for each hour. Unfortunately, it can contain null values ("NULL") and NaN values ("---" or "--") in multiple dimensions of different data types. Dealing with these values is what I would like help with.
During testing I just extracted the files and used 'sed' to quickly replace these with empty values, and the data was then indexed without a problem. However, in my eagerness to create a job that takes the data from a third party and stores it on S3, ready for Druid, I forgot to add this step to the process.
At my current level of understanding, I would alter this job to include the extra step and write something to go through the 100+ GB I already have. However, if there is a better way to deal with this situation, I would love to hear it. I assume that many people deal with datasets that have erroneous values or require some pruning before ingestion, and that there is therefore a more standard way to go about this.
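For concreteness, the cleanup step I have in mind is roughly the following Python sketch. It assumes the files are comma-delimited text, which is only an assumption for illustration, and the names `clean_rows` and `clean_stream` are made up for this example. A field is blanked only when the whole field is one of the placeholders, so a value that merely contains "--" is left alone.

```python
import csv
import io

# Placeholder tokens mentioned above; a field counts as missing only
# when the entire field (ignoring surrounding spaces) is one of these.
PLACEHOLDERS = {"NULL", "---", "--"}


def clean_rows(rows):
    """Yield each row with placeholder fields replaced by empty strings."""
    for row in rows:
        yield ["" if field.strip() in PLACEHOLDERS else field for field in row]


def clean_stream(src, dst, delimiter=","):
    """Copy delimited text from src to dst, blanking placeholder fields.

    The comma delimiter is an assumption; pass a different one if needed.
    """
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    writer.writerows(clean_rows(reader))


if __name__ == "__main__":
    # Made-up sample rows, purely for illustration.
    sample = "ts,site,temp\nh0,NULL,--\nh1,a--b,---\n"
    out = io.StringIO()
    clean_stream(io.StringIO(sample), out)
    print(out.getvalue(), end="")
```

The same function could in principle be run once over the existing backlog and then called from the job before each upload, since it works on any pair of text streams.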
Thanks very much. (If there is a better place for this question, please point me in the right direction.)