Apache Druid projects go much more smoothly when engineers know the key questions to ask - and are prepared to answer them. Herein, a hit list of important considerations when transitioning from your quickstart cluster into something a little more serious…
Don’t Walk Alone
Work with your end users to sketch out what your Druid-powered user interface is going to look like.
As soon as you feel the early tingling magic of Druid, speak to your Product Owners:
- What dimensions will you need?
- Will there be common filters?
- What level of roll-up could be used?
- Do performance requirements on one visual differ from another?
- Will there be interactive elements?
Why? To better inform your ingestion specification in all areas. That’s the dimensions you’ll need, the transformations and filters, any roll-up metrics you can apply, and whether you could use tiering.
And never ever forget to test the water with them about approximations: ask them directly whether an approximation based on data rolled up to a second or a minute is good enough for the analysis they want to do. It probably is!
Further back in the pipeline, it also helps your design: it opens up the question of whether you need to continue enriching data before it hits Druid or whether you can do it inside Druid - whether by lookup at ingestion time or at query time.
The message here is - work together on the ingestion specification with your end user. Avoid just chucking everything in and hoping for the best!
Druid apps can be super special. Don’t miss the opportunity to modernise your application and how your end users will interact with it. Come to a meetup and ask a question in the Druid Forum for some UX ideas!
- Real-time data analysis starts with time as a key dimension.
BI doesn’t often focus on time - but time is central to Druid. Think: how would I incorporate timelines, time filters, time sorting, and time buckets into each element?
- Comparisons make people think differently.
Given what we said about time - let people use their full brain - instinctive and considered thought - by providing not just current data but old data, too. What would your UI look like if it showed not just this week’s spend, but also last week’s spend?
- Filters make one visual cover multiple contexts.
You can encourage exploration by allowing people to filter interactively. Some BI tools just don’t do that very well on large datasets.
- Measures make one visual cover multiple indicators.
Druid allows you to quickly generate multiple measures on the same underlying data - free your users from past struggles to get the measures they want. If the filters, sorts, and GROUP BYs are the same, it's actually very efficient to ask Druid to give you multiple measures in just one query.
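As a sketch of that pattern - assuming a hypothetical `sales` datasource with `country`, `spend`, and `user_id` columns - one query can return several measures from a single pass over the data:

```sql
-- Hypothetical datasource and columns: one scan, many measures.
SELECT
  TIME_FLOOR(__time, 'P1D') AS "day",
  country,
  COUNT(*) AS orders,
  SUM(spend) AS total_spend,
  APPROX_COUNT_DISTINCT(user_id) AS unique_buyers
FROM sales
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY 1, 2
```

Because the filter and grouping are shared, adding another aggregation to the SELECT list costs far less than issuing a second query.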
- Create data sources for different query patterns.
It’s very common for Druid builders to have multiple data sources. Some are RED HOT LAVA: critical dimensions only, approximation, roll-up, and filtering applied at ingestion to create a data source that is blisteringly fast to query. Others are cooler: more dimensions, finer granularity, maybe even raw data - but with the accepted trade-off that queries will be slower.
Screenshots of the Game Analytics and Pollfish user interfaces, making full use of Druid’s capabilities
Know thy query
Take time to familiarise yourself with how you construct Druid queries - especially the full gamut of options available in Druid SQL.
If you’re going to use SQL, check out and monitor the functions that are available to you, from approximation to array processing.
Whichever query language you choose, know the functions you plan to use and integrate them into your testing - not just that they work, but that they perform well enough given your infrastructure and your schema.
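For example - against a hypothetical `requests` datasource - a single test query might exercise both an approximation function and multi-value column processing:

```sql
-- APPROX_QUANTILE_DS needs the druid-datasketches extension loaded.
-- Datasource and column names here are hypothetical.
SELECT
  APPROX_QUANTILE_DS(latency_ms, 0.95) AS p95_latency,
  COUNT(*) FILTER (WHERE MV_CONTAINS(tags, 'mobile')) AS mobile_requests
FROM requests
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
```

Run queries like this as part of your test suite, and time them: a function that works on the sample data may behave very differently at your real cardinalities.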
Druid != Island
Think about how and what you will deploy onto your infrastructure, especially Druid’s dependencies.
Druid has to exist in an ecosystem. It has service and infrastructure dependencies, and you should speak to architects about how that might be done - whether it can be done - in short order.
That could be network ports connecting to your HDFS cluster on GCP when Druid will be on AWS, or AWS Security Groups that allow you to connect to Kinesis from account X when Druid will be in account Y.
More and more, Kubernetes is used to deploy everything from Druid to potato salad. If you’re hot on K8s, remember that your network engineer may not be. At that point when you can’t figure out why Kafka won’t ingest, or why you can’t write segments to deep storage - if it’s a network issue, you could be on your own.
Druid itself has dependencies: a BLOB store for deep storage, a relational database for metadata, and Zookeeper for coordination. Take care to know their performance and capacity characteristics, but also their longevity and supportability inside your organisation.
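Those dependencies all meet in common.runtime.properties, which makes it a good artefact to put in front of your architects - a sketch with placeholder hostnames and bucket names:

```properties
# Placeholder hosts and buckets - each line is a service someone must support.
druid.zk.service.host=zk1.example.com,zk2.example.com,zk3.example.com

druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://db.example.com:5432/druid

# S3 deep storage requires druid-s3-extensions in druid.extensions.loadList.
druid.storage.type=s3
druid.storage.bucket=my-druid-deep-store
druid.storage.baseKey=segments
```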
Jump Right In? Er… no
Yes, quickstart is great. It’s lovely! Some work has been done for you! Especially the configuration of the JREs that all the Druid processes run in.
We see people having issues because they haven’t provided enough Heap, for example.
Or because they’ve specified loads of heap but run it on a server with three tiny mice for memory.
Heap sizing is particularly important if you’re going to use Lookups.
Take a look at the tuning guide - there are calculations that will help you determine the right heap sizes, as well as some recommended JRE options.
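To make that concrete, a jvm.config for a Historical might look like the sketch below - the sizes are purely illustrative, not recommendations; use the tuning guide's calculations against your own hardware:

```
-server
-Xms8g
-Xmx8g
-XX:MaxDirectMemorySize=13g
-XX:+ExitOnOutOfMemoryError
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
```

Note that heap (-Xms/-Xmx) and direct memory (MaxDirectMemorySize) are sized separately: direct memory must accommodate the processing buffers, which is why it is often larger than heap on Historicals.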
There are some critical configuration properties that you should read up on and - as you work more with Druid - keep an eye on, tuning them for your precise ingestion volume and velocity and the query execution pattern on your underlying data.
This need for on-going maintenance of the configuration files may prompt you to use something more formal to deploy your configs than just hand-editing them on each server.
At a minimum, make sure you have these set and know what they do:
- -Xms, -Xmx, and -XX:MaxDirectMemorySize
- druid.processing.numThreads
- druid.processing.buffer.sizeBytes
- druid.processing.numMergeBuffers
Sometimes, people miss that, when ingesting streaming data, Middle Manager tasks (aka Peons) also respond to queries. Just like the Historical process, then, they receive lookups (unless you say otherwise) that need to go into heap, and they have their own equivalents of the druid.processing.* settings you see above - all prefixed with druid.indexer.fork.property in the Middle Manager's runtime properties.
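A sketch of what that prefixing looks like in the Middle Manager's runtime.properties (the values are illustrative only):

```properties
# These are passed down to each Peon the Middle Manager forks;
# without the prefix they would apply to the Middle Manager itself.
druid.indexer.fork.property.druid.processing.numThreads=2
druid.indexer.fork.property.druid.processing.numMergeBuffers=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=104857600
```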
The HTTP settings are also very often missed - these limit the number of queries that the Broker can receive and the size of the fan-out to Middle Manager tasks and Historicals. There are a number of dials around druid.server.http.numThreads, and you might even delve into the Broker's connection-pool settings. Then there are the settings for the caches - whether each process should use and populate the query cache, and how much memory to give it.
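A sketch of those dials on a Broker - values illustrative, not recommendations:

```properties
# Jetty threads available to receive queries
druid.server.http.numThreads=60
# Connections the Broker keeps to each data server for the fan-out
druid.broker.http.numConnections=20
# Query result caching on the Broker; Historicals have equivalent settings
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
druid.cache.type=caffeine
druid.cache.sizeInBytes=268435456
```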
Druid is a highly distributed, loosely coupled system on purpose. Care for your interprocess communication systems and paths, especially Zookeeper and HTTP.
Query, assignment / balance, and ingestion are all distributed functions. They rely on good interprocess communication. Monitor your traffic and your network and try to keep the network and the processes healthy.
What you don’t want is to run into an issue where Zookeeper is running out of memory and crashes, but nobody is there to help you fix it - or even to tell you that it happened. Familiarise yourself with your Zookeeper configuration and why it matters.
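For instance, a minimal zoo.cfg is worth understanding line by line (paths illustrative) - the autopurge settings in particular stop old snapshots from quietly filling a disk:

```properties
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
# Keep 5 snapshots and purge the rest every hour - forgetting these
# is a classic way to fill the disk and lose the ensemble.
autopurge.snapRetainCount=5
autopurge.purgeInterval=1
```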
Love Your Log
Get to know the logs. For ingestion, particularly the Overlord, the Middle Manager, and its tasks. For everything else, particularly the Coordinator and the Historicals.
Anyone who spends time in the community channels will see that people look at their logs.
This isn’t because we are all Ren and Stimpy fans. It’s because the logging in Druid is extremely detailed: from the start to the finish of every ingestion job there is a detailed audit of process activity. Remember that wikipedia sample data in quickstart? Like me, you probably didn’t look at the ingestion logs or the process logs and use them to learn what actually happened!
Take a different road. When you go back to your Druid cluster tomorrow, or if you’re just starting out with Druid, open up that log folder. Go on! Take a look. Read through the logs and spend time understanding what Druid is doing for you. Then when you have an issue in the future, you’re much more likely to be able to understand what could have caused a blip.
Be agile: set up a lab, start simple and start small, working up to perfection
Druid is a sports car. If we each bought a Nissan GTR, it would be foolish to drive it to the shops at 9.5 bn miles an hour. And it would be such a disappointment to hop behind the wheel expecting to get the most out of it, only to crash because you didn’t go out on the track first and learn all you could about your car.
When working with Druid, KEEP IT SIMPLE: start small, with a discrete, well-known data set that’s nice and simple. Then you can watch the dance of the different parts of Druid, honing and tuning each one as needed. ITERATIVELY. Working up to a goal. We get EXCITED - and we want to use new toys. But Druid is unique as a database: it combines OLAP, time series, and search technologies, so wherever you’re starting from, there will be things you need to do differently than you have before.
Maybe don’t do transformations first, maybe don’t do roll-up first.
You WILL get there - and you will have a much less stressful journey along the way.
Digest the specifics
Learn ingestion specifications in detail through your own exploration and experimentation, from the docs, and from the community, and build each one as if it were code.
The docs tell you how the spec breaks down into three key parts - dataSchema, ioConfig, and tuningConfig. Each of those parts contains critical configuration, whether that’s basics like the data types of the dimensions you’ll ingest or something more complicated like secondary partitioning.
For the ingestion you are going to do, it’s worth spending time looking at each part in turn and uncovering all the possible options that you have. It really is worth spending time doing some exploratory testing on each of the options - not just whether it works, but how using an option affects the data that ends up inside Druid - and almost more importantly, how it affects your ability to query and the speed at which those queries execute.
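The three parts fit together like this - a minimal sketch of a Kafka supervisor spec, with hypothetical topic, datasource, and column names:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "clickstream",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": [ "country", "page", { "type": "long", "name": "userId" } ]
      },
      "granularitySpec": {
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ]
    },
    "ioConfig": {
      "topic": "clicks",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "kafka.example.com:9092" }
    },
    "tuningConfig": { "type": "kafka", "maxRowsPerSegment": 5000000 }
  }
}
```

Every option you experiment with - granularities, roll-up, partitioning - lands in one of these three blocks, which is why it pays to explore them one at a time.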
- Create a target query list
- Understand which source data columns you will need at ingestion time (filtering, transformation, lookup) and which are used at query time (filtering, sorting, calculations, grouping, bucketing, aggregations)
- Set up your dimension spec and execute queries, recording query performance
- Explore what other queries (Time Series, Group By, Top N) you could do with the data
- Add more subtasks and monitor the payloads
- Add more data and check the lag
- Use ingestion-time filter to eke out performance and storage efficiencies
- Use transforms to replace or create data closer to the queries that people will execute
- Use time granularity and roll-up to generate metrics and datasketches (set, quantile, and cardinality)
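That last step might look like this inside the dataSchema - a sketch of a metricsSpec generating set, quantile, and cardinality sketches (field names hypothetical; the sketch aggregators require the druid-datasketches extension):

```json
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "longSum", "name": "totalBytes", "fieldName": "bytes" },
  { "type": "HLLSketchBuild", "name": "uniqueUsers", "fieldName": "userId" },
  { "type": "quantilesDoublesSketch", "name": "latencySketch", "fieldName": "latencyMs" },
  { "type": "thetaSketch", "name": "userSetSketch", "fieldName": "userId" }
]
```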
Historicals serve all used segments, and deep storage holds a copy of every one. Query time relates directly to segment size: lots of small segments means lots of small query tasks. Segments are tracked by the master nodes and registered in the metadata database.
Ingestion is where data is sharded. It’s sharded by configuration in the specification. It really is worth checking that the ingestion specification you have written is creating segments of an optimal size and content, clearly understanding those levers that you have.
First, we need to take care of the cost associated with Historicals. We need to scale out the Historicals with the right balance of storage and compute power according to our performance needs for the data we have in Druid.
Second, we mustn’t let Druid’s deep storage grow and grow for no reason. We need to keep careful control of the infrastructure costs that come with this critical storage area in Druid.
Third, we must remember that each parallel query task that Druid runs reads ONE segment each.
If one segment is too big, or worse, if all of them are too big, queries will take too long to run.
But on the other end of the scale, if they’re too small, the processors and memory on the Historical servers become consumed with lots of tasks, each processing only very small amounts of data.
Finally, remember that the Druid cluster has a database of all the segments, with its own compute and storage overhead. It’s just good administrative practice to ensure this database doesn’t get overloaded - for example, by keeping records for thousands of segments whose data we don’t care about any more.
All in all, that means you need to control the COUNT of segments, as it affects the memory overhead and processing power we need to run queries. Don’t underestimate the importance of compaction!
We also need to control the SIZE of segments not just to control storage costs but because size also affects query performance.
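Compaction can be automated from the Coordinator; as a sketch (datasource name and sizes hypothetical), an auto-compaction config like this asks it to rewrite older intervals into right-sized segments:

```json
{
  "dataSource": "clickstream",
  "skipOffsetFromLatest": "P1D",
  "tuningConfig": {
    "partitionsSpec": { "type": "dynamic", "maxRowsPerSegment": 5000000 }
  }
}
```

The skipOffsetFromLatest window keeps compaction away from intervals that streaming ingestion is still actively writing.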
- Filter rows and think carefully about what dimensions you need
- Use different segment granularities and row maximums to control the number of segments generated
- Apply time bucketing with query granularity and roll-up
- Think about tiering your historicals using drop and load rules
- Consider not just initial ingestion but on-going re-indexing
- Never forget compaction!
- Check local persist (intermediatePersistPeriod) and deep storage settings
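The tiering and retention points in the list above might be sketched as a rule chain like this - tier names and periods are hypothetical: a month on a hot tier, a year on the default tier, everything older dropped:

```json
[
  { "type": "loadByPeriod", "period": "P1M", "includeFuture": true,
    "tieredReplicants": { "hot": 2, "_default_tier": 1 } },
  { "type": "loadByPeriod", "period": "P1Y",
    "tieredReplicants": { "_default_tier": 2 } },
  { "type": "dropForever" }
]
```

Rules are evaluated top to bottom, so order matters: the first rule that matches a segment's interval wins.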
Collect metrics and understand how your work affects them
Druid emits all kinds of event data in its metrics. It’s timeseries data - use it like timeseries data! Surface it and use it to make better decisions, faster decisions, about how you’re using Druid.
For ingestion some key metrics are Kafka lag and - of course - GC collections. But also think about metrics that are affected by your ingestion: segment scan times and the number of segments scanned during a query. When you’re iterating through that ingestion specification, use these metrics to measure yourself as you approach full stability.
Some people even go as far as having JMeter watch query performance as each ingestion specification changes, and others use Flame Graphs (via Swiss Java Knife) to find out exactly what is happening at the lowest of levels.
Important metrics include:
- 98th PERCENTILE QUERY TIME (most people collect this)
- QUERY ID - very useful for over-time comparisons of similar queries (set programmatically)
- GC Time and Total GCs → Heap issues
- KAFKA LAG
- SUBQUERY COUNT + SEGMENTS SCANNED + SEGMENT SCAN TIME → optimising segments
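None of these metrics appear unless emission is switched on; a sketch of the relevant properties, with a hypothetical collector endpoint:

```properties
# JvmMonitor gives you the GC time and count metrics listed above.
druid.monitoring.monitors=["org.apache.druid.java.util.metrics.JvmMonitor"]
druid.emitter=http
druid.emitter.http.recipientBaseUrl=http://metrics-collector.example.com:8080/druid
```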