Druid 0.9.1-rc1 available

We’re happy to announce our next release candidate, Druid 0.9.1-rc1!

Druid 0.9.1 will contain hundreds of performance improvements, stability improvements, and bug fixes from over 30 contributors. Major new features include an experimental Kafka Supervisor to support exactly-once consumption from Apache Kafka, support for cluster-wide query-time lookups (QTL), and an improved segment balancing algorithm.
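
For a taste of the Kafka supervisor: you POST a supervisor spec to the Overlord at /druid/indexer/v1/supervisor, and Druid then manages the Kafka-reading indexing tasks itself. A minimal sketch of a spec, with placeholder topic, datasource, and broker names (see the 0.9.1 docs for the full set of options):

    {
      "type": "kafka",
      "dataSchema": {
        "dataSource": "metrics-kafka",
        "parser": {
          "type": "string",
          "parseSpec": {
            "format": "json",
            "timestampSpec": { "column": "timestamp", "format": "auto" },
            "dimensionsSpec": { "dimensions": ["host", "service"] }
          }
        },
        "metricsSpec": [ { "type": "count", "name": "count" } ],
        "granularitySpec": {
          "type": "uniform",
          "segmentGranularity": "HOUR",
          "queryGranularity": "NONE"
        }
      },
      "tuningConfig": { "type": "kafka" },
      "ioConfig": {
        "topic": "metrics",
        "consumerProperties": { "bootstrap.servers": "kafka01:9092" },
        "taskCount": 1,
        "replicas": 1,
        "taskDuration": "PT1H"
      }
    }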

You can download the release candidate here:

http://static.druid.io/artifacts/releases/druid-0.9.1-rc1-bin.tar.gz

Draft release notes are at:

https://github.com/druid-io/druid/issues/2999

Please file GitHub issues here if you find any problems with the release candidate:

https://github.com/druid-io/druid/issues/new

Thanks to all of you who contributed issues, docs, and code!

One more thing! Documentation for the RC is here: http://druid.io/docs/0.9.1-rc1/tutorials/quickstart.html

Why doesn't http://static.druid.io/artifacts/releases/druid-0.9.1-rc1-bin.tar.gz include mysql-metadata-storage in its extensions?

Hi Gian, this is a great milestone.

A few things I would like to understand:

  1. Now that the window granularity concept is gone, do we need to run a batch ingestion task separately from Hadoop or Spark?

  2. Can we not publish the batch data through Kafka and have it ingested that way?

  3. If we still need to run batch ingestion through Hadoop or Spark, what would be the ideal use case for those?

Is there a latest release of the batch Hadoop ingestion dependency with respect to the 0.9.1-rc1 release?

Regards

-Sambit

Zhenyuan: MySQL is GPL licensed. We can't include it in the tarball without it infecting the license of the distribution.

Sambit, inline.

Hi Gian, this is a great milestone.

A few things I would like to understand:

  1. Now that the window granularity concept is gone, do we need to run a batch ingestion task separately from Hadoop or Spark?

You don't necessarily require batch ingestion, but please take note that Kafka doesn't yet support exactly-once event production, so if you need 100% accurate data, you should still have a batch pipeline.

  2. Can we not publish the batch data through Kafka and have it ingested that way?

I'm not sure what you are asking here, but you should be able to stream historical events from Kafka into Druid.

  3. If we still need to run batch ingestion through Hadoop or Spark, what would be the ideal use case for those?

The ideal case is for reprocessing, for example, changing the schema of older events.
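
For instance, a reindexing task can read existing segments back through the Hadoop indexer's "dataSource" inputSpec and write them out with a new schema. A rough sketch, with placeholder datasource name, dimensions, and intervals:

    {
      "type": "index_hadoop",
      "spec": {
        "dataSchema": {
          "dataSource": "events",
          "parser": {
            "type": "hadoopyString",
            "parseSpec": {
              "format": "json",
              "timestampSpec": { "column": "timestamp", "format": "auto" },
              "dimensionsSpec": { "dimensions": ["country", "page"] }
            }
          },
          "metricsSpec": [ { "type": "count", "name": "count" } ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "DAY",
            "queryGranularity": "NONE",
            "intervals": ["2016-01-01/2016-02-01"]
          }
        },
        "ioConfig": {
          "type": "hadoop",
          "inputSpec": {
            "type": "dataSource",
            "ingestionSpec": { "dataSource": "events", "intervals": ["2016-01-01/2016-02-01"] }
          }
        },
        "tuningConfig": { "type": "hadoop" }
      }
    }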

Is there a latest release of the batch Hadoop ingestion dependency with respect to the 0.9.1-rc1 release?

The batch ingestion mechanism in Druid did not change from 0.9.0.

Hi people,

I have a question about what Sambit said:
is it true that the window granularity concept has disappeared? Is it possible to ingest very old data through the RT interfaces?

Regards.

Yes, you should be able to ingest historical data too, but only through the new Kafka supervisor feature. The existing realtime interfaces (Tranquility / Realtime node) have not (yet) been modified.

Thanks Fangjin. Yes, I agree that if we need 100% accuracy we can set up batch ingestion. But for most realtime use cases, exact accuracy is not the priority and eventual consistency is acceptable. I follow the principle of velocity, exactness (accuracy), and volume: we can only choose 2 at a time. It's hard to achieve all 3 together. Right?

Regards

-Sambit

I think if Kafka can support exactly-once event production, we'd get all 3.

We can actually handle exactly-once in Kafka on the consumer side by using the subscribe API and adding a rebalance listener plus external storage to maintain the offset per topic partition (rough sketch below). But this change would be required on the Tranquility side, which is the consumer of Kafka. It will reduce throughput a bit, though, since we are guaranteeing the exactly-once behaviour.
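
Roughly this pattern, as a sketch only (the in-memory offsetStore map below stands in for whatever external transactional store would actually be used; it is not a real API):

    import java.util.Arrays;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetTrackingConsumer
    {
      // Stand-in for an external transactional store (e.g. a database); hypothetical.
      private static final Map<TopicPartition, Long> offsetStore = new HashMap<>();

      public static void main(String[] args)
      {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka01:9092");
        props.put("group.id", "my-consumer");
        props.put("enable.auto.commit", "false"); // we track offsets ourselves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        consumer.subscribe(Arrays.asList("events"), new ConsumerRebalanceListener()
        {
          @Override
          public void onPartitionsRevoked(Collection<TopicPartition> partitions)
          {
            // Save our position for each partition we are giving up.
            for (TopicPartition tp : partitions) {
              offsetStore.put(tp, consumer.position(tp));
            }
          }

          @Override
          public void onPartitionsAssigned(Collection<TopicPartition> partitions)
          {
            // Resume from the externally stored offsets, not Kafka's committed ones.
            for (TopicPartition tp : partitions) {
              Long offset = offsetStore.get(tp);
              if (offset != null) {
                consumer.seek(tp, offset);
              }
            }
          }
        });

        while (true) {
          ConsumerRecords<String, String> records = consumer.poll(1000);
          for (ConsumerRecord<String, String> record : records) {
            // In a real pipeline, processing the record and saving its offset
            // would happen atomically in the same external transaction.
            System.out.println(record.value());
            offsetStore.put(new TopicPartition(record.topic(), record.partition()), record.offset() + 1);
          }
        }
      }
    }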

Regards

-Sambit

Hi Gian,

From what I understood (and please correct me if I'm wrong), the concept of windowPeriod is being phased out, and therefore we will not need to do hybrid realtime/batch ingestion in the future.

We are having some serious trouble with Hadoop batch ingestion because we have a dependency on Microsoft Azure, so we would ideally like to be able to stream historical events (with old timestamps) to Tranquility. Right now we cannot do that, because events get rejected by the Tranquility Server based on their old timestamps. (P.S. we are not using Kafka at all, so the Kafka Supervisor feature would not work for us.)
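
For reference, the rejection window we keep hitting is the windowPeriod in our Tranquility Server config, roughly like this (trimmed to the relevant keys; the dataSchema is omitted and the values are just examples):

    {
      "dataSources": {
        "events": {
          "spec": {
            "tuningConfig": {
              "type": "realtime",
              "windowPeriod": "PT10M"
            }
          },
          "properties": { "task.partitions": "1" }
        }
      },
      "properties": { "zookeeper.connect": "localhost:2181" }
    }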

Is there any other way to achieve the same result? For example by setting some config on the Tranquility Server?

Furthermore, when do you think Tranquility will be able to support that? Is there a timeline for that?

Great work by the way with the new release! Thank you very much!

Petros

Hey Petros,

We hope to gradually make windowPeriod no longer required for any ingestion option. The Kafka supervisor feature in 0.9.1 is the first step towards that. There’s no solid timeline for adjusting Tranquility as well, but that would probably happen after the Kafka supervisor feature has matured a bit.