Tranquility and geo-distribution

Hi,

as recommended I am looking into replacing our realtime nodes with tranquility.

We have a geo-distributed event source, i.e. multiple kafkas around the world which need to end up in a single druid database. So far we used linear partitioning and used a cluster offset to do that.

So far, it looks like I can use cluster-specific overlords to ensure the local consumption of the events, however I don’t see anything how to offset the partitions.

Is this just missing in tranquility or is there a better way to do this?

Thanks!

Hagen

Hey Hagen,

Ha, that’s a neat trick :slight_smile:

Nothing exactly like that is supported with the indexing service + Tranquility combo. Does one of these approaches work for you?

  1. If you have a “main” datacenter (the one with your deep storage and historicals, if they are all in the same place) then you could use the Kafka MirrorMaker (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27846330) to gather all your data in a Kafka cluster in the main datacenter, then do ingestion into Druid from there.

  2. You could do a datasource per datacenter, and query them with union queries.

BTW, for reading from Kafka, you could also take a look at https://github.com/druid-io/druid/blob/master/docs/content/development/extensions-core/kafka-ingestion.md. This is new stuff that will be an experimental feature in 0.9.1. I think that eventually, for people using Kafka, we’ll recommend this approach and then for everyone not using Kafka we’ll continue recommending Tranquility. The same approaches above (1/2) would work with the new Kafka stuff too.

Hi Gian,

we use kafka mirroring in production, enough to know I don’t want to do it at scale. I.e. not an option.

I tried to read up on union queries, but I fail to create a valid one. Can I do timeseries and groupBy? What happens if one data source is offline? Interesting option, but still a lot of open questions.

As far as I can tell, tranquility already does linear shards, so a simple patch to allow for an offset would be sufficient to take the old idea into the new world. I guess I have to learn some Scala now unless you see it useful as well.

Same is true for the kafka ingestions coming. Sounds great, but experimental features in production… mmmh.

Cheers,

Hagen

Hey Hagen,

Union datasources work with all query types, you should be able to do something like:

“dataSource”: {

“type”: “union”,

“dataSources”: [“one”, “two”]

}

Instead of:

“dataSource”: “one”

The behavior you get is that all segments from all unioned datasources are queried together, as if they came from the same datasource. If one of the datasources is unavailable then it won’t be included in the query results, but the others still will (same as if some segments from a single datasource were unavailable). If any of this ever doesn’t work then it is a bug, so please report it.

If you’re using a datasource-per-datacenter, then you can also tell the overlord stuff like “all work for datasource X should be given to workers in datacenter Y” (using a worker assignment strategy).

Tranquility does do linear shards. If you end up writing a patch that adds an offset feature and finding value in that, I’m open to merging that. It shouldn’t be too invasive. I think we’d want to mark it as advanced-users-only though.

About the Kafka stuff, someone’s gotta run experimental features in production in order for them to stop being experimental :slight_smile: