I'm looking into native batch ingestion and trying to understand
exactly how the prefetching firehose works. Let's say you're trying
to ingest 100 S3 files.
As I understand it, you can do one of two things:
- Create an 'index' task specifying those 100 S3 files. The
StaticS3FirehoseFactory can be configured with maxFetchCapacityBytes,
prefetchTriggerBytes, etc to start downloading those 100 files in the
background. It will fetch files in parallel with running the
specified StringInputRowParser over preceding files and publishing the
new data. This is presumably more efficient than not doing *any*
downloading while parsing and publishing. However, there's no
parallelization of the work, just a bit of pipelining.
- Create an 'index_parallel' task specifying those 100 S3 files. This
will create a series of tasks (up to maxNumSubTasks at a time), each
of which fetches one file and then parses and publishes it. This is
fully parallelized, but there's no pipelining: any given task is
either fetching or parsing at a given time.
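For concreteness, here's a rough sketch (as Python dicts) of the two spec shapes I'm describing. The field names like maxFetchCapacityBytes, prefetchTriggerBytes, and maxNumSubTasks are from my reading of the StaticS3FirehoseFactory and index_parallel docs; treat the exact layout and the byte values as approximate:

```python
# Sketch of the two task-spec shapes as plain dicts. Field names follow my
# reading of the docs; the exact layout may differ across Druid versions.

uris = ["s3://bucket/file-%03d.gz" % i for i in range(100)]

# Option 1: a single 'index' task. Prefetching gives pipelining (download
# the next files while parsing the current one) but no parallelism.
index_spec = {
    "type": "index",
    "ioConfig": {
        "firehose": {
            "type": "static-s3",
            "uris": uris,
            "maxFetchCapacityBytes": 1024 * 1024 * 1024,  # cap on locally cached fetched files
            "prefetchTriggerBytes": 512 * 1024 * 1024,    # start prefetching below this level
        }
    },
}

# Option 2: an 'index_parallel' task. Sub-tasks run in parallel, but each
# sub-task alternates between fetching and parsing -- no pipelining.
parallel_spec = {
    "type": "index_parallel",
    "ioConfig": {
        "firehose": {"type": "static-s3", "uris": uris},
    },
    "tuningConfig": {"maxNumSubTasks": 20},  # at most 20 sub-tasks in flight
}
```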
It feels to me like the ideal "best of both worlds" would be to
create, say, 20 tasks, each of which has 5 files. That way each
individual task can pipeline (download subsequent files while parsing
current ones) but still achieve some parallelization. Am I missing
something, or is this not possible? (Or is this fancy enough that I
should just be using Hadoop batch ingestion?)
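The workaround I can imagine is doing the split myself: chunk the 100 URIs into 20 groups of 5 and submit 20 separate 'index' tasks, each with its own prefetching firehose. A minimal sketch of that idea (the spec layout and the prefetchTriggerBytes value are assumptions, as above):

```python
# Chunk 100 S3 URIs into 20 groups of 5 and build one 'index' task spec
# per group, so each task can prefetch within its own file list.
uris = ["s3://bucket/file-%03d.gz" % i for i in range(100)]

def chunk(items, size):
    """Split items into consecutive groups of the given size."""
    return [items[i:i + size] for i in range(0, len(items), size)]

groups = chunk(uris, 5)  # 20 groups of 5 files each

task_specs = [
    {
        "type": "index",
        "ioConfig": {
            "firehose": {
                "type": "static-s3",
                "uris": group,
                "prefetchTriggerBytes": 256 * 1024 * 1024,  # illustrative value
            }
        },
    }
    for group in groups
]
# Each of the 20 specs would then be submitted as its own task.
```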