Is there any way to filter messages?

Hi team,

We’ve been playing with Druid for quite a while now. It is GREAT.

Now we are playing with the realtime ingestion part of Druid. Unfortunately, some of our Kafka topics contain different types of messages with completely different schemas. So I’m wondering what the best practice is for Druid to (1) filter/split the Kafka JSON stream so that we read only the messages we want, and (2) flatten nested JSON?

Thanks!

Hey Qi,

Thanks for the kind words!

I think most people with that kind of problem are using a stream processor to create new topics with the right format of data. So, you could use Storm or Samza to read from your original topic and then write transformed messages to however many downstream topics you need.
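For illustration, here is roughly what that split/flatten step could look like using the plain Kafka Java clients and Jackson rather than Storm or Samza specifically. The topic names, the "type" field, and the dotted-key flattening scheme are all assumptions for the sketch, not anything Druid prescribes:

import java.time.Duration;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.Properties;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TopicSplitter
{
  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static void main(String[] args)
  {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "splitter");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
         KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      consumer.subscribe(Collections.singletonList("raw-events")); // mixed-schema source topic

      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
          JsonNode event = MAPPER.readTree(record.value());

          // (1) Split: keep only one message type; you could instead route each
          // type to its own downstream topic.
          String type = event.path("type").asText("unknown");
          if (!"pageview".equals(type)) {
            continue;
          }

          // (2) Flatten: collapse nested objects into dotted top-level keys.
          ObjectNode flat = MAPPER.createObjectNode();
          flatten("", event, flat);
          producer.send(new ProducerRecord<>("pageview-events", record.key(), MAPPER.writeValueAsString(flat)));
        }
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  // Recursively copies nested object fields into `out` using dotted key paths,
  // e.g. {"a":{"b":1}} becomes {"a.b":1}. Arrays and scalars are copied as-is.
  private static void flatten(String prefix, JsonNode node, ObjectNode out)
  {
    Iterator<Map.Entry<String, JsonNode>> fields = node.fields();
    while (fields.hasNext()) {
      Map.Entry<String, JsonNode> field = fields.next();
      String key = prefix.isEmpty() ? field.getKey() : prefix + "." + field.getKey();
      if (field.getValue().isObject()) {
        flatten(key, field.getValue(), out);
      } else {
        out.set(key, field.getValue());
      }
    }
  }
}

The downstream topics then contain single-schema, flat JSON that a plain Druid realtime spec can consume.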

Another option is to extend the Druid Firehose interface with something that wraps the built-in Kafka firehose and then does the transformations you need. That should work as long as your transformations are stateless. You could package that as a module and include it on your realtime nodes.
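As a rough sketch of what that wrapper could look like (assuming the Firehose interface from druid-api; the class name and the predicate are hypothetical):

import java.io.IOException;
import java.util.function.Predicate;

import io.druid.data.input.Firehose;
import io.druid.data.input.InputRow;

// Hypothetical wrapper: delegates to another Firehose (e.g. the Kafka one)
// and drops rows that fail a predicate. The filter is stateless, which is
// what makes this approach safe.
public class FilteringFirehose implements Firehose
{
  private final Firehose delegate;
  private final Predicate<InputRow> filter;
  private InputRow buffered; // next row that passed the filter, if any

  public FilteringFirehose(Firehose delegate, Predicate<InputRow> filter)
  {
    this.delegate = delegate;
    this.filter = filter;
  }

  @Override
  public boolean hasMore()
  {
    // Advance past unwanted rows until one passes the filter, or we run dry.
    while (buffered == null && delegate.hasMore()) {
      InputRow row = delegate.nextRow();
      if (row != null && filter.test(row)) {
        buffered = row;
      }
    }
    return buffered != null;
  }

  @Override
  public InputRow nextRow()
  {
    InputRow row = buffered;
    buffered = null;
    return row;
  }

  @Override
  public Runnable commit()
  {
    // Committing is delegated untouched; filtering doesn't change offsets.
    return delegate.commit();
  }

  @Override
  public void close() throws IOException
  {
    delegate.close();
  }
}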

Hi Gian, Thanks! I’ll look into those.

Hi Gian. I think the second option looks like the better fit for me right now. If it’s OK, can you tell me more about it? Like which class to look into, which function to override, etc.? I looked at the source code but don’t know where to start.

Thanks!

You’d be making a “FirehoseFactory”. It’s in druid-api: https://github.com/druid-io/druid-api/blob/master/src/main/java/io/druid/data/input/FirehoseFactory.java
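A hypothetical factory wrapping another factory (e.g. the built-in Kafka one) might look something like this, assuming the FirehoseFactory<T extends InputRowParser> shape in that file and reusing the FilteringFirehose sketch from above; the "type"/"pageview" filter is just a placeholder:

import java.io.IOException;

import io.druid.data.input.Firehose;
import io.druid.data.input.FirehoseFactory;
import io.druid.data.input.impl.InputRowParser;

// Hypothetical: connects the wrapped factory's firehose, then filters the
// rows it emits before they reach the realtime node.
public class FilteringFirehoseFactory implements FirehoseFactory<InputRowParser>
{
  private final FirehoseFactory<InputRowParser> delegate;

  public FilteringFirehoseFactory(FirehoseFactory<InputRowParser> delegate)
  {
    this.delegate = delegate;
  }

  @Override
  public Firehose connect(InputRowParser parser) throws IOException
  {
    // Keep only rows whose "type" dimension contains "pageview" (assumption).
    return new FilteringFirehose(
        delegate.connect(parser),
        row -> row.getDimension("type").contains("pageview")
    );
  }
}

To use it from a realtime spec you’d still need to register the new factory type through a Druid extension module so Jackson can deserialize it, which is the "package that as a module" step mentioned above.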