Best way to ingest streaming data from AWS Kinesis

Hi everyone!

I am working on setting up a new Druid cluster that will be ingesting high-throughput (5000 events/second) advertising clickstream data. We’re designing our data to stream through AWS Kinesis, with processing in AWS Lambda.

Could anyone provide feedback on which of the following would be the best solution for ingesting? The main criteria are stability and reliability; we don’t want to lose any data.

A) Inside the AWS Lambda function, make PUT calls to Tranquility, which will submit the events to Druid for indexing. Tranquility would run in blocking mode (it blocks until the data has been submitted to Druid successfully), so that Tranquility/Druid ingestion doesn’t get overwhelmed. The drawback for me is running and monitoring Tranquility servers (at least 2 for reliability).

B) From AWS Kinesis, use AWS Kinesis Firehose to write the incoming data to S3, then have an extra AWS Lambda trigger a Native Batch Ingestion task on every S3 file creation (Firehose can be configured to write every X minutes). The pro is that there is no need to run Tranquility. The con is an X-minute delay before data gets in. I am also not sure what to make of the fact that native batch ingestion takes out a lock on the interval: if, say, the granularity is 1 hour but files appear every 5 minutes, at most one S3 file can be batch ingested at a time. What would happen if ingestion is not keeping up with the incoming data?
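For illustration, option B's Lambda could look roughly like the sketch below. It is only a sketch under assumptions: the Overlord host, the data source name "clicks", the "timestamp" column and the empty dimension list are hypothetical placeholders, and the spec uses the firehose-style "static-s3" input from the druid-s3-extensions of that era (newer Druid releases describe the input differently). It sets appendToExisting so that each file adds segments to the hour instead of overwriting it; it does not address the interval-lock question raised above.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical Overlord address; the task endpoint path is Druid's standard one.
OVERLORD_URL = "http://overlord.example.internal:8090/druid/indexer/v1/task"


def build_index_task(bucket, key, datasource="clicks"):
    """Build a native batch ('index') task spec for a single S3 object."""
    return {
        "type": "index",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,  # assumed data source name
                "parser": {
                    "type": "string",
                    "parseSpec": {
                        "format": "json",
                        # Assumed event layout: ISO timestamp in a "timestamp" field.
                        "timestampSpec": {"column": "timestamp", "format": "iso"},
                        "dimensionsSpec": {"dimensions": []},
                    },
                },
                "metricsSpec": [{"type": "count", "name": "count"}],
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                },
            },
            "ioConfig": {
                "type": "index",
                "firehose": {"type": "static-s3", "uris": [f"s3://{bucket}/{key}"]},
                # Append to the interval rather than replacing earlier files' segments.
                "appendToExisting": True,
            },
        },
    }


def s3_objects(event):
    """Yield (bucket, key) pairs from an S3 event; keys arrive URL-encoded."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        yield s3["bucket"]["name"], urllib.parse.unquote_plus(s3["object"]["key"])


def submit_task(task):
    """POST the task spec to the Overlord and return its JSON response."""
    request = urllib.request.Request(
        OVERLORD_URL,
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def handler(event, context=None, submit=None):
    """Lambda entry point: submit one index task per newly created S3 object."""
    submit = submit or submit_task
    return [submit(build_index_task(b, k)) for b, k in s3_objects(event)]
```

The submit parameter exists only so the handler can be exercised without a running Overlord, for example handler(event, submit=lambda task: task).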

C) Send the data to a Kafka stream, and use kafka-indexing-service to read from there. This seems the most reliable to me, but it requires an extra Kafka cluster in addition to the AWS Kinesis setup we already use.
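For option C, the Druid side is configured by posting a supervisor spec to the Overlord (at /druid/indexer/v1/supervisor). A rough sketch of building such a spec follows; the topic name, broker addresses, data source name, timestamp column and the task count, replica and duration values are all hypothetical placeholders, not recommendations.

```python
import json


def build_kafka_supervisor_spec(topic, bootstrap_servers, datasource="clicks"):
    """Build a kafka-indexing-service supervisor spec as a plain dict."""
    return {
        "type": "kafka",
        "dataSchema": {
            "dataSource": datasource,  # assumed data source name
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    # Assumed event layout: ISO timestamp in a "timestamp" field.
                    "timestampSpec": {"column": "timestamp", "format": "iso"},
                    "dimensionsSpec": {"dimensions": []},
                },
            },
            "metricsSpec": [{"type": "count", "name": "count"}],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "HOUR",
                "queryGranularity": "NONE",
            },
        },
        "tuningConfig": {"type": "kafka"},
        "ioConfig": {
            "topic": topic,
            "consumerProperties": {"bootstrap.servers": bootstrap_servers},
            "taskCount": 1,        # illustrative values only
            "replicas": 2,         # two replica tasks read the same partitions
            "taskDuration": "PT1H",
        },
    }


if __name__ == "__main__":
    # Hypothetical topic and brokers, printed as the JSON body to POST.
    spec = build_kafka_supervisor_spec("clicks", "kafka1:9092,kafka2:9092")
    print(json.dumps(spec, indent=2))
```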

Ideally there would be a Kinesis indexing service…

Thanks everyone, looking forward to your advice!

Best,

Michael

Hi Michael

I am open sourcing the Kinesis Indexing Service, which ingests data from Kinesis directly. It works similarly to the Kafka Indexing Service.

Check out the PR here: https://github.com/apache/incubator-druid/pull/6431

Josh

Hi Josh,

Thanks, that would be amazing! Good luck with the work and getting the PR through.

The more I try things out, the more I realize that the pull model from Druid is the way to go (vs. push through Tranquility). It feels wrong to have all my stream processing in Kinesis, only to then put the records in Kafka so that I can use the kafka-indexing-service to read them into Druid.

If the code does get merged in, I wonder when the next release containing it would actually be. I will probably still have to go with one of my approaches above, at least temporarily, until this new extension is out.

Thanks again,

Michael

Hey Michael,

I would guess the next major version after 0.13.0, probably sometime later this year.

Of course, if you like living on the edge, you could always pull the code from Josh’s branch, build it and try it out :)