Is it possible to extract multiple InputRows from one event, such as one Kafka message?

Hi guys,

We’ve encountered this problem: as an ad exchange, we send bid requests to DSPs and receive bid responses from them. When logging, to save space and to avoid having to join bid requests with bid responses later, we pack one bid request and all relevant bid responses into a single log event. The schema is as follows.

event: {request_info_0: string, request_info_1: string, … , responses: [{response_info_0: string, response_info_1: string, …}]}

As you can see, we can’t feed the log event into Druid directly: the responses contain metrics, so we cannot use a multi-value dimension for them. For now we flatten the log events using a Samza task. We think it would be simpler if we could extract a List<InputRow> instead of a single InputRow from one event. Or is there a better way?
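For reference, the flattening step described above can be sketched roughly as follows. This is only an illustration in Python, not the actual Samza task: the function name `flatten_event` and the sample values are made up, and the field names follow the schema above.

```python
from typing import Any, Dict, Iterator


def flatten_event(event: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per bid response, each carrying the request fields."""
    # Everything except the nested "responses" list is treated as request-level data.
    request_fields = {k: v for k, v in event.items() if k != "responses"}
    for response in event.get("responses", []):
        row = dict(request_fields)  # copy so the rows do not share state
        row.update(response)        # response fields sit next to the request fields
        yield row


# Hypothetical sample event shaped like the schema above.
event = {
    "request_info_0": "req-a",
    "request_info_1": "req-b",
    "responses": [
        {"response_info_0": "dsp-1", "response_info_1": "0.5"},
        {"response_info_0": "dsp-2", "response_info_1": "0.7"},
    ],
}
rows = list(flatten_event(event))
```

Each output row repeats the request fields, which is where the extra space comes from once the rows are written back to Kafka. In this sketch an event with no responses produces no rows; whether to emit a request-only row in that case is a separate design choice.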

Hi Weinan,

I think there are ways to handle joins in the manner you’ve described without code changes to Druid.

Metamarkets actually deals with your use case every day.

We wrote up a blog post about how we handle joins in Samza for delivery to Druid:

Gian also covered this topic in a bit more detail at a recent Samza meetup:

Let us know if you have questions.

– FJ

Hi Fangjin,

Yes, it’s simpler to use a Samza task for the stream manipulation, but flattening multiplies the storage and network bandwidth needed on the Kafka brokers several times over, which is noticeable with very high-throughput streams. Couldn’t that be avoided if we could extract multiple rows from a single message?

BTW: the link to the Samza meetup talk is dead…

On Saturday, May 9, 2015 at 1:57:55 PM UTC+8, Fangjin Yang wrote:


If you write your own Parser or Firehose implementation and add it as
an extension, it is entirely possible to have one event turn into
multiple InputRow objects. The fact that one event is turning into
multiple InputRows is just a bit of state hidden inside the Firehose.
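A minimal sketch of that idea, written in Python rather than as a real Druid extension: the class `MultiRowFirehose`, its `has_more`/`next_row` methods, and the `parse` helper are all hypothetical names for illustration and are not Druid's API. The point is only that a small queue of pending rows is the hidden state.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


class MultiRowFirehose:
    """Hands out rows one at a time while buffering the rows of the current event."""

    def __init__(self, events: Iterable[Any], parse: Callable[[Any], List[Row]]):
        self._events: Iterator[Any] = iter(events)
        self._parse = parse                    # turns one event into zero or more rows
        self._pending: Deque[Row] = deque()    # the hidden state

    def has_more(self) -> bool:
        # Pull events until at least one row is buffered or the source is exhausted.
        while not self._pending:
            try:
                event = next(self._events)
            except StopIteration:
                return False
            self._pending.extend(self._parse(event))
        return True

    def next_row(self) -> Row:
        if not self.has_more():
            raise RuntimeError("no more rows")
        return self._pending.popleft()


def parse(event: Dict[str, Any]) -> List[Row]:
    # Hypothetical parser: one row per response, tagged with its request.
    return [{"request": event["request"], "response": r} for r in event["responses"]]


events = [
    {"request": "a", "responses": ["x", "y"]},
    {"request": "b", "responses": []},   # yields no rows and is skipped
    {"request": "c", "responses": ["z"]},
]
firehose = MultiRowFirehose(events, parse)
rows = []
while firehose.has_more():
    rows.append(firehose.next_row())
```

The caller still sees a plain one-row-at-a-time interface, so nothing downstream has to know that a single message expanded into several rows. A real implementation would also have to decide when it is safe to commit offsets, since a message is only fully consumed once all of its buffered rows have been handed out.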


Hi Eric,

Yes, for realtime ingestion I need a new multi-row firehose and parser, and for Hadoop ingestion I need a tweaked IndexMapper.

I’ll tackle it once I have some time. Thanks.

On Sunday, May 24, 2015 at 10:33:32 PM UTC+8, Eric Tschetter wrote: