I am attempting to index data with Druid 0.10.0 (just the local quickstart tutorial at the moment, nothing in production) from an AWS Lambda job that processes data on specific events; the data comes from S3 and was written at an earlier point in time. The data is not formatted the way Druid expects, and some records may need to be discarded. Since the data falls outside the default time window, I can't ingest it in real time through Tranquility. The docs state that a batch ingest job with a static input type needs a path to S3 or HDFS. Is it really necessary to store my data in S3 again, or is there another way that doesn't involve extra storage?
You can also use a “local” firehose to read data from disk, but keep in mind it reads from the Druid node's local disk, not the local disk of whoever submits the index task. So you would need to copy your data file to the Druid machine(s) first. In general, Druid indexing is pull-based rather than push-based, so Druid needs somewhere to pull the data from.
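As a sketch of what that looks like, here is the shape of a native "index" task spec using the local firehose (the datasource name, baseDir, filter, schema, and interval below are placeholders you would replace with your own values):

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "my-datasource",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": [] }
        }
      },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": [ "2017-01-01/2017-02-01" ]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/tmp/druid-input",
        "filter": "*.json"
      }
    }
  }
}
```

The "baseDir" and "filter" are resolved on the machine where the task actually runs, which is why the file has to live on a Druid node rather than wherever you POST the task from.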
Also, in the case that I am able to stream within the 10-minute window, does the order of the timestamps within the records matter? In some cases the data I get won't arrive in order.
For realtime ingestion, the order of the records doesn’t matter, as long as they all arrive within any relevant deadlines.
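For reference, that deadline is controlled by the windowPeriod tuning parameter. A minimal sketch of the relevant fragment of a Tranquility server config, assuming a hypothetical datasource named "my-datasource" (events whose timestamps fall outside the current segment period plus/minus this window are rejected):

```json
{
  "dataSources": [
    {
      "spec": {
        "dataSchema": { "dataSource": "my-datasource" },
        "tuningConfig": {
          "type": "realtime",
          "windowPeriod": "PT10M"
        }
      },
      "properties": { "task.partitions": "1" }
    }
  ]
}
```

Within that window, records can arrive in any order and Druid will still place them in the correct segment based on their timestamps.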