I noticed that the demo code for the Tranquility library seems to assume that you have Samza or Storm in front of your Kafka feed. We don’t have either; we’re just feeding event data straight into Druid from Kafka. As I understand it, the Tranquility library just acts as an Overlord client, so it should be possible to do something like this:
Service<List<Map<String, Object>>, Integer> druidService = DruidBeams.buildJavaService();
List<Map<String,Object>> listOfEvents = getEventsFromKafka();
Future<Integer> numSentFuture = druidService.apply(listOfEvents);
Sounds simple enough, but I have a couple of concerns about this process. First, we can potentially ingest hundreds of thousands of events per minute from Kafka, so indexing might take a long time. I’ve indexed about 10 million records with a batch task, and it took hours unless I used Hadoop. How fast will realtime indexing be able to process data?
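To put that volume in perspective, here is a rough back-of-the-envelope sketch. The 300,000 events per minute figure is only an assumed stand-in for "hundreds of thousands", and the class and method names are hypothetical; the 10 million figure is the size of the batch test mentioned above.

```java
public class ThroughputEstimate {
    // Converts a per-minute event rate into events per second.
    public static double eventsPerSecond(long eventsPerMinute) {
        return eventsPerMinute / 60.0;
    }

    // Minutes of incoming stream needed to accumulate a given number of events.
    public static double minutesToAccumulate(long totalEvents, long eventsPerMinute) {
        return (double) totalEvents / eventsPerMinute;
    }

    public static void main(String[] args) {
        long assumedEventsPerMinute = 300_000L; // assumption, not a measured rate
        long batchTestSize = 10_000_000L;       // size of the earlier batch indexing test

        System.out.println("Events per second: " + eventsPerSecond(assumedEventsPerMinute));
        System.out.println("Minutes of stream to reach the batch test size: "
                + minutesToAccumulate(batchTestSize, assumedEventsPerMinute));
    }
}
```

At that assumed rate, realtime indexing would need to sustain about 5,000 events per second, and the stream would deliver as many records as my batch test in roughly 33 minutes.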
Second, the realtime indexing task example just seems to turn a middle manager into a realtime node. How is this any different from a normal realtime node? Can we use Hadoop index tasks to process realtime data in batches from Tranquility and avoid the need for a middle manager node? That seems like it would be somewhat more reliable than just adding another realtime node.