Hadoop Indexer fails on temporary file creation

Hey all, I’m a bit new to Druid and I’ve run into a problem when using the Hadoop indexer to ingest a bunch of JSON data (approximately 150GB). The failure happened in either the third or fourth MapReduce job that the indexer started (I’m not totally sure it should have run that many?), and toward the end I got an error like this:

Error: java.io.IOException: Unable to create temporary file, /mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1453993122873_0003/container_1453993122873_0003_01_000532/tmp/filePeon4308725064636988861 Blair O’Neal Host New Season of Sexiest Shots - Golf Digest Videos - The Scene#/watch/golfdigest/the-sexiest-shots-in-golf-sports-illustrated-s-kelly-rohrbach-blair-o-neal-host-new-season-of-sexiest-shots?mbid.header

Is there a setting in the spec file where I can choose which variable it uses to create temp file names? I believe this failed because the name has a ’ in it (though perhaps it was just the space? That actually just occurred to me).

In addition, can anyone tell me how many MR jobs I should expect the indexer to run before completing?

Thanks for all the help

Quick update on this: I figured out that the problem is that the file name has slashes in it. Now that I know that, I’m wondering if there is an option to control how this temp file gets named?

Hi John,

Druid runs 2 or 3 jobs depending on how the ingestion is configured. The first one, a determine intervals job, is skipped if you’ve properly set the intervals of your data. After that, it runs a determine partitions job to decide how segments should be made, and then it runs an index generator job that creates the segments.

For a full list of the configuration options you can set for Hadoop-based batch ingestion, check out:

http://druid.io/docs/latest/ingestion/batch-ingestion.html
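
For illustration, here is a rough sketch of what setting the intervals explicitly might look like. It assumes the "granularitySpec" layout described in the batch ingestion docs linked above; the date range and granularities are made-up placeholders, so check the field names against the docs for your Druid version before relying on this.

```python
import json

# Hypothetical interval (ISO-8601 start/end) that covers all of the input data.
# Replace it with the real time range of your own data.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",   # placeholder value
    "queryGranularity": "NONE",    # placeholder value
    "intervals": ["2016-01-01/2016-02-01"],
}

# Assumption: this fragment sits under "dataSchema" in the Hadoop ingestion spec.
spec_fragment = {"dataSchema": {"granularitySpec": granularity_spec}}

print(json.dumps(spec_fragment, indent=2))
```

The printed JSON is only the relevant fragment, not a complete spec; it would need to be merged into the rest of the ingestion spec.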

Hey Fangjin, thanks for the reply. I have two questions from that. First, how do I properly set the intervals of my data? Is it that I am trying to ingest data that’s not within the specified interval?

And second, I’m still unclear on how to set the temp file names. It seems like that is handled by Druid. I tried poking around in the source code to see if I could find anything, and from a combination of the error message and the code I think that this is failing in GenericIndexedWriter.java, probably around this line: headerOut = new CountingOutputStream(ioPeon.makeOutputStream(makeFilename("header")));

I think this because the end of the file name has .header in it, so it’s definitely making a call to makeFilename(). I’m not sure where it is getting the fileBasename, though. Do you know how that is getting set? I think this is what is causing the indexer to throw the IOException. Thanks again for the help

Hey John,

It looks like Druid’s trying to write out a column called “Blair O’Neal Host New Season of Sexiest Shots - Golf Digest Videos - The Scene#/watch/golfdigest/the-sexiest-shots-in-golf-sports-illustrated-s-kelly-rohrbach-blair-o-neal-host-new-season-of-sexiest-shots?mbid”, and it can’t do that because of https://github.com/druid-io/druid/issues/2370. Possibly some of the other characters in there are problematic too.

If I had to guess, you’re using schemaless dimensions (leaving “dimensions” empty in your parser) and one of your input records has a field with that text as a key. To get this data indexed you’ll need to do one of three things: specify an explicit list of dimensions, remove objects with problematic keys, or put the objects through a pre-processing step that changes those keys to something else.
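
As a minimal sketch of the pre-processing option, something like the following could rewrite the keys of newline-delimited JSON records before ingestion. The function names and the choice of which characters count as "safe" are assumptions made for illustration; they are not part of Druid.

```python
import json
import re

# Assumption: only letters, digits, underscore, dot and hyphen are kept in keys.
_UNSAFE = re.compile(r"[^A-Za-z0-9_.-]+")


def sanitize_key(key):
    """Replace each run of unsafe characters with a single underscore."""
    cleaned = _UNSAFE.sub("_", key).strip("_")
    # Fall back to a placeholder name if nothing usable is left.
    return cleaned or "unnamed"


def sanitize_record(value):
    """Recursively rename keys in dicts, including dicts nested inside lists."""
    if isinstance(value, dict):
        return {sanitize_key(k): sanitize_record(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_record(v) for v in value]
    return value


def sanitize_line(line):
    """Sanitize one line of newline-delimited JSON and return it as JSON text."""
    return json.dumps(sanitize_record(json.loads(line)))
```

One caveat with this approach: two different original keys can map to the same sanitized name, in which case the later one silently overwrites the earlier one, so it is worth checking for collisions on real data.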