In the last 10 years, Gian Merlino, Eric Tschetter, and I have seen Druid usage grow beyond our wildest expectations to thousands of deployments all over the world. We have had the honor of hearing about so many cool things that organizations are doing with Druid, and of listening to what you, the community, are saying about what you are looking for in a database to build faster, more scalable data applications. We all know that Druid is best in class for sub-second, slice-and-dice queries, but as requirements for building new data pipelines have become more complex, we felt that Druid needed to evolve with the times. The community has asked for the ability to do more with their existing Druid clusters: to speed up and streamline those data pipelines, and to cover other use cases that they think Druid would be a great fit for. So today, I am super excited to announce that Imply and the PMC are working on a new and improved engine for Druid. (Check out the GitHub issue here.)
To summarize, we are proposing a new, multi-stage query engine. This new engine will open up the ability to run queries and ingestion using a single system. It will split queries into stages and will enable data to be exchanged in a shuffle mesh between those stages (in addition to the scatter/gather we have today). This new engine will also enable doing ETL directly in the database using SQL (making the ingestion spec a thing of the past) and accessing data in external systems directly (optionally separating compute and storage).
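To make the SQL-based ingestion idea concrete, here is a rough sketch of what such a statement could look like. Everything in it is illustrative: the table name, the external source URI, the column list, and the exact clause names are assumptions about a design that is still being worked out in the GitHub issue, so the syntax that ships may differ.

```sql
-- Illustrative sketch only: names, URIs, and clause syntax are assumptions,
-- not the finalized design.
-- One statement replaces an ingestion spec: read external data, transform it
-- with SQL, and write the result into a Druid datasource.
INSERT INTO wikipedia_edits
SELECT
  TIME_PARSE("timestamp") AS __time,  -- parse the primary timestamp
  channel,
  page,
  added
FROM TABLE(
  EXTERN(                              -- hypothetical external-table function
    '{"type": "http", "uris": ["https://example.com/wikipedia.json.gz"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"},
      {"name": "channel",   "type": "string"},
      {"name": "page",      "type": "string"},
      {"name": "added",     "type": "long"}]'
  )
)
PARTITIONED BY DAY
```

Under the multi-stage model described above, a statement like this would run as stages: one stage reads and parses the external data, data is shuffled between stages as needed, and a final stage writes time-partitioned segments, with no separate ingestion spec involved.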
What does this mean for you? The upshot is that your end-to-end data pipeline should be significantly faster and much simpler to manage. You will be able to do more with just Druid, like longer-running queries and data exports, and also reach external sources that may previously have required a different system or more complicated ETL. Running Druid will be less complex, less expensive, and a ton more flexible.
The full blog post from Gian is here with a lot more details: https://imply.io/blog/a-new-shape-for-apache-druid/