Native batch ingestion and partial parallelization

I'm looking into native batch ingestion and trying to understand
exactly how the prefetching firehose works. Let's say you're trying
to ingest 100 S3 files.

As I understand it, you can do one of two things:

- Create an 'index' task specifying those 100 S3 files. The
StaticS3FirehoseFactory can be configured with maxFetchCapacityBytes,
prefetchTriggerBytes, etc. to start downloading those 100 files in the
background. It fetches upcoming files in parallel with running the
specified StringInputRowParser over already-downloaded files and publishing the
new data. This is presumably more efficient than not doing *any*
downloading while parsing and publishing. However, there's no
parallelization of the work, just a bit of pipelining.
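For concreteness, here is a sketch of the kind of ioConfig I mean (the prefetch parameters are the ones mentioned above; the bucket, file names, and byte values are made up):

```json
"ioConfig": {
  "type": "index",
  "firehose": {
    "type": "static-s3",
    "uris": [
      "s3://my-bucket/file-001.json",
      "s3://my-bucket/file-002.json"
    ],
    "prefetchTriggerBytes": 536870912,
    "maxFetchCapacityBytes": 1073741824
  }
}
```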

- Create an 'index_parallel' task specifying those 100 S3 files. This
will create a series of tasks (up to maxNumSubTasks at a time), each
of which fetches one file and then parses and publishes it. This is
fully parallelized, but there's no pipelining: any given task is
either fetching or parsing at a given time.
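Roughly, the spec I have in mind looks like this (dataSchema omitted; the file names and the maxNumSubTasks value are made up):

```json
{
  "type": "index_parallel",
  "ioConfig": {
    "type": "index_parallel",
    "firehose": {
      "type": "static-s3",
      "uris": [
        "s3://my-bucket/file-001.json",
        "s3://my-bucket/file-002.json"
      ]
    }
  },
  "tuningConfig": {
    "type": "index_parallel",
    "maxNumSubTasks": 20
  }
}
```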

It feels to me like the ideal "best of both worlds" is to be able to,
say, create 20 tasks, each of which has 5 files. That way each
individual task can pipeline (download subsequent files while parsing
current ones) but still achieve some parallelization. Am I missing
something, or is this not possible? (Or is this fancy enough that I
should just be using Hadoop batch ingestion?)
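To make the idea concrete, here's a toy sketch in Python (nothing Druid-specific; the fetch and parse steps are stand-ins) of pipelining within each task while running the tasks in parallel:

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def run_task(files, prefetch_depth=2):
    """One 'task': prefetch up to `prefetch_depth` files ahead while
    parsing/publishing the current one (pipelining)."""
    fetched = queue.Queue(maxsize=prefetch_depth)
    SENTINEL = object()

    def fetcher():
        for f in files:
            # Stand-in for downloading the file from S3.
            fetched.put(("contents of " + f, f))
        fetched.put(SENTINEL)

    threading.Thread(target=fetcher, daemon=True).start()

    results = []
    while True:
        item = fetched.get()
        if item is SENTINEL:
            break
        contents, name = item
        # Stand-in for parsing rows and publishing segments.
        results.append(name.upper())
    return results

def run_parallel(all_files, num_tasks=20):
    """Split the files across `num_tasks` pipelined tasks (parallelization)."""
    chunks = [all_files[i::num_tasks] for i in range(num_tasks)]
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        out = []
        for res in pool.map(run_task, chunks):
            out.extend(res)
    return sorted(out)

files = ["s3://bucket/file-%02d" % i for i in range(100)]
print(len(run_parallel(files)))  # prints 100
```

Here each of the 20 workers overlaps its next download with parsing its current file, which is the combination I'm describing.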


Hi Dave,

The native parallelIndexTask is supposed to prefetch input files, which gives you pipelining as well. Did you observe any behavior in parallelIndexTask that doesn't pipeline?


Heh, this is just code inspection, not actually running it :)
My interpretation of AbstractTextFilesFirehoseFactory.getSplits and its caller in SinglePhaseParallelIndexTaskRunner is that each sub-task ends up with a single one of the “abstract text files” allocated to it — is that wrong?


Hi Dave,

It's correct, but there's more to it if you read from S3. To read files from S3, you need to use the static-s3 firehose, which is implemented in StaticS3FirehoseFactory. That class is a subclass of PrefetchableTextFilesFirehoseFactory, which is in turn a subclass of AbstractTextFilesFirehoseFactory.


Right, but each task still just downloads one file?

Ah, yes, that's correct for now. It should be improved in the future.

Thanks for reminding me!