An Apache Druid Skills Framework

“What do I need to know to run Druid effectively? What should I learn for a career with Apache Druid? How will our data team change?” The community give their view on the most essential technical and human skills you need to adopt and run Apache Druid®.

This post is a community work in progress, aiming to help people who are adopting technologies like Apache Druid. Reply to this post and make your suggestions!

Goals

By publishing and maintaining this framework, our ultimate mission is to increase the adoption of Apache Druid. This framework is for:

  1. Teams to identify people who will own (or need to be hired to own!) particular pieces of the puzzle
  2. Individuals to assess gaps in knowledge and experience, and create their own learning plan
  3. Project owners to define Epics / Stages of a successful implementation

Adoption Journey

Regardless of your goal, these stages are common to Druid’s adoption.

| Stage | Description |
| --- | --- |
| Design | Defining the data pipeline: noting all the building blocks required, how they will be realised, and how and which data objects will flow through that pipeline - with what size, shape, and regularity. |
| Deploy | Manually or with automation, assigning Apache Druid components and configurations to the infrastructure - including network services, routing and firewall configuration, encryption certificates - along with the three Druid dependencies: deep storage, Zookeeper, and a metadata database. |
| Build | Using the features of Apache Druid within the pipeline to achieve the desired design. |
| Stabilise | Hardening and the additional tasks you would associate with good service transition, from defining OLAs / SLAs to training and educating your target audience. |
| Operate | Monitoring, support, and maintenance of the transitioned production system to meet SLAs. |

These stages are iterative and fluid, rather than sequential. As the conversation between user and engineer develops in the Build stage, organizations may change their design or deployment.

The “Build” stage relies on three areas of functionality in Apache Druid®:

| Feature | Description |
| --- | --- |
| Ingestion | Defining ingestion tasks that will bring statistics-ready data into Druid from storage and delivery services (including schema, parsing rules, transformation, filtering, connectivity, and tuning options) and ensuring their distributed execution, led by the overlord, is performant and complete. |
| Database Maintenance | Led by the coordinator, replication and distribution of the ingested data according to rules, while allowing for defragmenting ("compaction"), reshaping, heating / cooling, and deleting that data. |
| Querying | Programming SQL / Druid native queries executed by the distributed processes that are led by the broker service (possibly via the router process), with security applied. |
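
To make those three areas concrete, here is a minimal sketch of how each one surfaces through Druid's HTTP API. It is hedged: it assumes a cluster whose router is reachable at localhost:8888 with the management proxy enabled, and a hypothetical datasource called web_requests; the ingestion spec itself is only a placeholder.

```python
# A minimal sketch of the three Build areas via Druid's HTTP API.
# Assumptions: a router at localhost:8888 with the management proxy enabled,
# and a hypothetical datasource called "web_requests".
import requests

DRUID = "http://localhost:8888"

# 1. Ingestion: submit a (placeholder) native batch ingestion spec to the overlord.
ingestion_spec = {"type": "index_parallel", "spec": {}}  # fill in a real spec before running
task = requests.post(f"{DRUID}/druid/indexer/v1/task", json=ingestion_spec).json()
print("Submitted task:", task.get("task"))

# 2. Database maintenance: set retention (load / drop) rules for the coordinator to enforce.
rules = [
    {"type": "loadByPeriod", "period": "P3M", "includeFuture": True,
     "tieredReplicants": {"_default_tier": 2}},
    {"type": "dropForever"},
]
requests.post(f"{DRUID}/druid/coordinator/v1/rules/web_requests", json=rules)

# 3. Querying: run SQL through the broker (via the router).
sql = {"query": "SELECT COUNT(*) AS row_count FROM web_requests "
                "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY"}
print(requests.post(f"{DRUID}/druid/v2/sql", json=sql).json())
```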

Skills

The skills list is taken from the Skills Framework for the Information Age - SFIA (pronounced “Sophia”). SFIA is a not-for-profit skills framework in use in 180 countries around the world and translated into 10 languages, adopted by many organisations including the IEEE.

Before You Begin…

By whatever route Druid became your focus - strategic Innovation, an IT department’s Emerging Technology Monitoring, a team’s Business Process Improvement assessments, a desire for better Knowledge Management, or failings identified through Service Level Management - there are some essential skill-holders you should try to find and work with throughout your process.

The Knowledge Management mission of “capturing, sharing, developing and exploiting the collective knowledge of the organisation to improve performance, support decision making and mitigate risks” is a common goal for Apache Druid deployments, whether that is opening up broad datasets to those who cannot currently access them, or providing self-service analytics to alleviate the pressure on your analytics teams. If you can find someone assigned this specific role in your organisation, they will become a pivotal member of your change coalition.

When developing apps on top of Druid, especially, take care to identify those with skills in Product Management, Software Design, Systems Design, and Systems Development Management.

Take advantage of User Research and those with User Experience Analysis skills both to bolster your case for change and to design a vision that everyone can buy into. They are also very likely to give you an objective view of how your product needs to change to make life better for your users.

And if you’re handling a streaming telemetry data set - especially if it’s from your own product - find that knowledgeable individual who is engaged in Real-time / Embedded Systems Development early, ensuring that you share their vision for Druid’s use.

As you move forward, the usual raft of IT Management, Project Management, and Requirements Definition and Management skills is going to be essential.

Remember to also look out for people who have Change Implementation Planning and Management skills: the more ambitious the changes you plan to drive with Apache Druid, the more this skill matters.

Finally, do not forget the people you’re trying to help. Polishing up on your team’s acuity in Relationship Management can never be a bad thing.

The Journey Begins…

Design

Whether your designs for improvement are large or small, it’s always a good idea to engage with someone who has skills in Solution Architecture. That person will hold a singular holistic view of how Apache Druid will be deployed, developed, operated, governed (strategies, policies, standards), secured, and put into production in a pipeline.

Often forgotten, however, is the need for skills in Network Design. In today’s cloud-connected world, it is important you find someone who can work out how data will be ingested into Druid, and how it will get back out again to your querying application - not in software, but along pieces of wire or some other networking media. They will have views on the use of network services to improve availability and ensure security, and are well positioned to inform you about any costs associated with data movement of the scale associated with Druid-powered pipelines. For any Druid pipeline, even a sketch of a Network Plan is an important communication tool for everyone.

Engage early with those people skilled in Analytics, Data Visualisation and User Experience Design to specify the end result of your work. Having mock-ups of the output you need from Druid will define not only your ingestion and query requirements, but the performance requirements of your underlying infrastructure.

It is always useful to have someone skilled in Information Governance at your side at this stage. Their concern will be for “how all types of information, structured and unstructured, whether produced internally or externally, are used to support decision-making, business processes and digital services” - and they will also warn you when some data you’re going to open up or make available could be subject to legislation or policy.

Deploy

Deployment is the top “non-Druid” challenge for new adopters and long-standing users alike. It is critical that you identify people that will not only create your infrastructure and network, but also deploy the runtime and actual code to spin up the cluster.

Start by identifying those skilled in IT Infrastructure. They need to determine the “physical or virtual hardware, software, network services and data storage” required for Apache Druid - both itself and its dependencies. They need to become familiar with how to scale each constituent part horizontally and vertically, and how to monitor the load and performance of those components.

Watch a video on the Druid Kubernetes Operator from Splunk

During your journey, you will of course build Specialist knowledge in Apache Druid. But do not forget the dependencies that Druid has. Many first-timers are caught off guard by the need to wisely configure and monitor the JRE for each process, and to effectively monitor and resolve issues with Zookeeper - especially internode communication and memory allocation.

Druid is a Java-based system, requiring a compatible JRE running on a compatible operating system. Beyond that, there are dependencies on Zookeeper, a compatible metadata database, a suitable location for log files, (optionally) a metrics sink for your chosen emitter (essential for monitoring), as well as a compatible deep storage system. All this is aside from your chosen infrastructure management system, like Kubernetes, Ansible, or Terraform, and, of course, the services provided by your infrastructure provider.
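
As a rough illustration of how those dependencies surface in configuration, the sketch below writes a minimal common.runtime.properties. The property names are real Druid settings, but the hosts, bucket names, and connection string are placeholders, and in practice your chosen infrastructure management tool would template this file rather than generate it like this.

```python
# A hedged sketch: the Druid dependencies from the paragraph above, expressed as
# common.runtime.properties entries. Hostnames, credentials, and bucket names are placeholders.
dependencies = {
    # Extensions needed for the chosen metadata store and deep storage.
    "druid.extensions.loadList": '["postgresql-metadata-storage", "druid-s3-extensions"]',
    # Zookeeper ensemble.
    "druid.zk.service.host": "zk1.example.com,zk2.example.com,zk3.example.com",
    # Metadata database.
    "druid.metadata.storage.type": "postgresql",
    "druid.metadata.storage.connector.connectURI": "jdbc:postgresql://metadata.example.com:5432/druid",
    "druid.metadata.storage.connector.user": "druid",
    # Deep storage.
    "druid.storage.type": "s3",
    "druid.storage.bucket": "my-druid-deep-storage",
    "druid.storage.baseKey": "segments",
    # Somewhere for task logs, and an emitter as the metrics sink.
    "druid.indexer.logs.type": "s3",
    "druid.indexer.logs.s3Bucket": "my-druid-deep-storage",
    "druid.indexer.logs.s3Prefix": "task-logs",
    "druid.emitter": "logging",
}

with open("common.runtime.properties", "w") as f:
    f.writelines(f"{key}={value}\n" for key, value in dependencies.items())
```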

Find someone skilled in Systems Installation & Decommissioning and System Software who can understand each moving part and carry out all aspects of the deployment, especially in larger deployments. And to that point, consider brushing up on your Release & Deployment skills to guarantee complete, consistent, and safe deployments.

Remember to find that person who will be providing Network Support to you as you go through your journey. They will quickly answer questions and resolve problems that you believe could be due to network communication issues into and out of the Druid cluster, as well as between processes. If you are using a containerised environment, make sure they know how to use, for example, Helm charts to set up communication channels between the containers and each other part of your overall design.
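
One small, practical aid for that network-support conversation is a reachability check against each process’s /status/health endpoint. The sketch below assumes the documented default ports and a single host, both of which you would adjust for your own topology.

```python
# A hedged sketch: check that each Druid process answers /status/health.
# Assumes default ports on a single host; adjust hosts and ports for your deployment.
import requests

PROCESSES = {
    "coordinator": 8081,
    "broker": 8082,
    "historical": 8083,
    "overlord": 8090,
    "middleManager": 8091,
    "router": 8888,
}

for name, port in PROCESSES.items():
    url = f"http://localhost:{port}/status/health"
    try:
        ok = requests.get(url, timeout=2).json() is True
    except requests.RequestException:
        ok = False
    print(f"{name:13s} {'OK' if ok else 'UNREACHABLE / UNHEALTHY'}")
```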

When it comes to the software itself, it’s important that someone knows the configuration options for Druid, perhaps applying version control, comparing configurations to understand different performance profiles, and allowing you to standardise your environments. In other words, good Configuration Management. Runtime properties for Apache Druid and the JREs are all configurable. And in a mission-critical deployment, you might even consider query, database maintenance, and ingestion artefacts - whether that’s SQL code, your load / drop rules, or your ingestion specifications - as items that need to be managed.
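
As one small example of treating those artefacts as managed configuration, the sketch below pulls the retention rules the coordinator is actually using and compares them with a copy kept under version control. The coordinator address and the rules.json path are assumptions for illustration.

```python
# A hedged sketch: compare live coordinator retention rules with a version-controlled copy.
# Assumes the coordinator API at localhost:8081 and a rules.json file in your repository.
import json
import requests

live_rules = requests.get("http://localhost:8081/druid/coordinator/v1/rules").json()

with open("rules.json") as f:
    expected_rules = json.load(f)

if live_rules != expected_rules:
    print("Drift detected between the cluster and version control:")
    print("live:    ", json.dumps(live_rules, indent=2))
    print("expected:", json.dumps(expected_rules, indent=2))
else:
    print("Retention rules match the version-controlled copy.")
```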

And last but not least, think about pointing those with skills and opinions on Availability Management to the Apache Druid documentation. Measuring the availability of each component against targets (and establishing what those targets are) should start early, informing what you will deploy and use to monitor health and performance. Your team needs to know how much of Druid can be “unavailable” before it’s really unavailable (e.g. redundancy and replication), where to collect state information about services, interfaces, and data, and what tools they need to test and carry out disaster recovery - and they need a way to plan for maintenance in what will quickly become a Zero Downtime service.
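
A useful first availability signal is the coordinator’s load status, which reports how much of each datasource’s published data is actually loaded and queryable. The endpoint below is real; the host, port, and the 99% alert threshold are assumptions.

```python
# A hedged sketch: flag datasources whose published segments are not fully loaded.
# Assumes the coordinator at localhost:8081; the 99% alert threshold is arbitrary.
import requests

load_status = requests.get("http://localhost:8081/druid/coordinator/v1/loadstatus").json()

for datasource, percent_loaded in load_status.items():
    if percent_loaded < 99.0:
        print(f"WARNING: {datasource} is only {percent_loaded:.1f}% available")
    else:
        print(f"OK: {datasource} is {percent_loaded:.1f}% available")
```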

Build

Avoid creating a superhero: don’t hire just one person to own everything that’s created with Druid. Instead, create a close-knit and skilled team who work through issues together and who offer up new and different ways of doing things. And encourage them to join the community from the start, of course!

Using Database Design and Data Modelling & Design skills will help you work out how data sources - which may be quite complex - will be ingested into, stored in, and queried from Druid. And, importantly, how that data holds value for the business. You can make informed decisions and recommendations about where your data in Druid will come from, how (and where) it needs to be prepared for statistical analysis (e.g. cleansing, filtering, transforming, enriching) and ultimately what it must look like when it’s made available for query. It is dangerous, especially if your goal is self-service analytics, to leave this up to experimentation, hoping your users will get what they want because you’ve given them direct access to a self-service cheesecake much faster and with more filling. Knowing the intrinsic value of the data you’re going to make available will show that you are using Apache Druid for everyone’s benefit, and not just because you like cheesecake.
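
Those modelling decisions end up expressed in the ingestion spec: which columns become dimensions, which are pre-aggregated metrics, whether rollup is enabled, and what is filtered or transformed on the way in. The sketch below is a hedged example of a native batch spec for a hypothetical web_requests datasource - the datasource name, columns, and file locations are all made up for illustration, not a recommendation for your data.

```python
# A hedged sketch of a native batch (index_parallel) ingestion spec that captures
# common data-modelling choices: dimensions vs. metrics, rollup, a transform, and a filter.
# The datasource name, columns, and file location are hypothetical.
ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "local", "baseDir": "/data/raw", "filter": "requests-*.json"},
            "inputFormat": {"type": "json"},
        },
        "dataSchema": {
            "dataSource": "web_requests",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "browser", "page", "is_mobile"]},
            "metricsSpec": [
                {"type": "count", "name": "requests"},
                {"type": "longSum", "name": "bytes_sent", "fieldName": "bytes"},
            ],
            "granularitySpec": {"segmentGranularity": "day", "queryGranularity": "hour", "rollup": True},
            "transformSpec": {
                # Enrich: derive a dimension on the way in.
                "transforms": [{"type": "expression", "name": "is_mobile",
                                "expression": "if(device == 'mobile', 1, 0)"}],
                # Cleanse: drop synthetic monitoring traffic before it is stored.
                "filter": {"type": "not",
                           "field": {"type": "selector", "dimension": "user_agent", "value": "healthcheck"}},
            },
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}
```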

If your process of bringing data into Druid is complex, or accuracy and transparency are essential, consider finding someone skilled in Information Assurance to give your users confidence that the data they’re seeing is reliable and trustworthy. This can be essential if you need to introduce approximate aggregations such as HyperLogLog, Theta sketches, and approximate quantiles.
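
For reference, this is roughly what those approximations look like in Druid SQL (the functions require the druid-datasketches extension to be loaded). The datasource and column names are hypothetical, and part of the Information Assurance job is explaining to users that these results are estimates with known error bounds rather than exact figures.

```python
# A hedged sketch: approximate aggregations in Druid SQL, issued through the SQL API.
# Assumes the router at localhost:8888, the druid-datasketches extension,
# and a hypothetical "web_requests" datasource.
import requests

sql = """
SELECT
  TIME_FLOOR(__time, 'P1D') AS "day",
  APPROX_COUNT_DISTINCT_DS_THETA(user_id)    AS approx_unique_users,
  APPROX_COUNT_DISTINCT_DS_HLL(session_id)   AS approx_unique_sessions,
  APPROX_QUANTILE_DS(response_time_ms, 0.95) AS p95_response_time_ms
FROM web_requests
GROUP BY 1
ORDER BY 1
"""

rows = requests.post("http://localhost:8888/druid/v2/sql", json={"query": sql}).json()
for row in rows:
    print(row)
```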

Much of the information about what Druid is doing is exposed in the main console. This will be of utmost interest to people skilled in Database Administration. After all, Druid is a database - and to run a database you need DBA skills. Knowing the moving parts of Apache Druid’s database, as well as the system schema tables, allows you to understand whether Druid is well-configured and operating well. As well as understanding functionality for data definition, manipulation, and control, it becomes important to know how to carry out the usual gamut of DBA operational jobs: interpreting the myriad of log files from the distributed processes (perhaps the #1 skill of all!), how to apply information management policies, what conditions can lead to database-level locking, how to take backups, and so on.
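
Much of that DBA view is also queryable from the sys schema through the same SQL API. The sketch below pulls a per-server summary and recent failed tasks; the broker / router address is an assumption.

```python
# A hedged sketch: a DBA-style look at the cluster via Druid's sys schema.
# Assumes the SQL API is reachable through the router at localhost:8888.
import requests

DRUID_SQL = "http://localhost:8888/druid/v2/sql"

def sql(query):
    return requests.post(DRUID_SQL, json={"query": query}).json()

# What is running, in which tier, and how full is each server?
servers = sql("""
    SELECT server, server_type, tier, curr_size, max_size
    FROM sys.servers
    ORDER BY server_type, server
""")

# Which ingestion tasks failed recently, and why?
failed_tasks = sql("""
    SELECT task_id, datasource, created_time, error_msg
    FROM sys.tasks
    WHERE status = 'FAILED'
    ORDER BY created_time DESC
    LIMIT 10
""")

for row in servers + failed_tasks:
    print(row)
```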

As soon as you ingest data into Druid, you will start to think about effective Storage Management. Druid’s datastores must be designed to account for legal and policy compliance, availability requirements, and for performance. It’s important to know how data can be isolated and protected, how tiering can be used, how disk traffic (initial load, rebalance, segment numbers, segment size) and disk types (DAS / NAS / SAN) affect throughput, what options there are for encryption at rest, how datastores are made resilient, and exactly what process will be followed to recover in the face of disaster. This doesn’t mean just Deep Storage, but also the local segment caches on Historical processes, any and all log4j log stores, and so on.
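
Two concrete places where those storage decisions show up are the historical’s segment cache configuration and tiered retention rules. The property names and rule types below are real; the paths, sizes, tier names, and periods are placeholder values for illustration only.

```python
# A hedged sketch of storage-related settings mentioned above.
# Paths, sizes, tier names, and periods are placeholders.

# 1. Historical runtime.properties: which local disks hold the segment cache,
#    and which tier this node serves.
historical_properties = {
    "druid.server.tier": "hot",
    "druid.segmentCache.locations": '[{"path": "/mnt/fast-ssd/druid/segment-cache", "maxSize": 300000000000}]',
}

# 2. Tiered retention rules: keep the last month on the "hot" tier with extra replicas,
#    keep a year on the default tier, and drop anything older.
retention_rules = [
    {"type": "loadByPeriod", "period": "P1M", "includeFuture": True,
     "tieredReplicants": {"hot": 2, "_default_tier": 1}},
    {"type": "loadByPeriod", "period": "P1Y", "includeFuture": True,
     "tieredReplicants": {"_default_tier": 1}},
    {"type": "dropForever"},
]

print(historical_properties)
print(retention_rules)
```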

There are four Programming & Software Development languages in Druid: the first three, SQL, JavaScript, and Java, will be no surprise, allowing querying, user-defined functions in a number of areas, and understanding / contributing to the underlying open source Java code. In particular, as regards Java, being able to look at - and contribute to - the code of Druid is always welcomed by the community.

But there is a fourth - the instructions constructed as JSON for both ingestion and native Druid queries. Prowess in this fourth programming language of Druid cannot be avoided. Treating it like code - whether that’s source control or A/B testing - will pay dividends throughout the project.
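
By way of example, here is a hedged sketch of that fourth language: a native timeseries query expressed as JSON and posted to the native query endpoint. The datasource, columns, filter, and interval are hypothetical, and keeping a query like this in source control alongside its SQL equivalent is exactly the kind of “treat it like code” discipline meant above.

```python
# A hedged sketch: a native (JSON) timeseries query posted to the native query endpoint.
# Assumes the router at localhost:8888 and a hypothetical "web_requests" datasource.
import requests

native_query = {
    "queryType": "timeseries",
    "dataSource": "web_requests",
    "intervals": ["2024-01-01/2024-02-01"],
    "granularity": "hour",
    "filter": {"type": "selector", "dimension": "country", "value": "NZ"},
    "aggregations": [
        {"type": "longSum", "name": "requests", "fieldName": "requests"},
        {"type": "longSum", "name": "bytes_sent", "fieldName": "bytes_sent"},
    ],
}

results = requests.post("http://localhost:8888/druid/v2", json=native_query).json()
for row in results:
    print(row["timestamp"], row["result"])
```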

Security Administration through Druid’s authentication, auditing, and authorisation capabilities will feed back into your design. Gain knowledge in these areas quickly and ensure you create a solution that meets all manner of compliance requirements.

Stabilise

It goes without saying that you need to be good at Testing in support of Service Acceptance activities. But have you worked out how you will do this testing in a Druid pipeline? What will the acceptance criteria be across all aspects of the design? What approach will you use? What tools? Herewith, then, a recommendation to have good, well-prepared testing and service acceptance resources on your team.

As Druid is so awesome (!) the data you make available for analysis is bound to grow in depth and breadth. Good Capacity Management ensures you can meet current and forecast needs in a cost-efficient manner, and can make evidence-based recommendations about future investment. To do this, your capacity managers need to understand the key measures of a cluster, including raw versus stored data sizes, query execution time, ingestion lag, data latency (especially for streaming ingestion), indexing throughput, real-time-to-historical query ratios, cache hit rates, and (of course) data quality indicators.
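
Some of those measures can be pulled straight from the sys schema. The sketch below summarises stored size and row counts per datasource, which is a useful starting point for the raw-versus-stored comparison; the broker / router address is an assumption.

```python
# A hedged sketch: per-datasource stored size and row counts from sys.segments,
# as a starting point for capacity tracking. Assumes the router at localhost:8888.
import requests

query = """
SELECT
  "datasource",
  COUNT(*)        AS segment_count,
  SUM("size")     AS stored_bytes,
  SUM("num_rows") AS stored_rows
FROM sys.segments
WHERE is_available = 1
GROUP BY 1
ORDER BY SUM("size") DESC
"""

rows = requests.post("http://localhost:8888/druid/v2/sql", json={"query": query}).json()
for row in rows:
    print(row)
```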

Operate

Finally, you are ready to launch! Make sure that you have covered all your bases for support: Application Support, Problem Management, and Incident Management - amongst others of course.

But don’t forget upgrades. Apache Druid releases are frequent, and each one brings with it performance and usability enhancements as well as fixes. Make sure you have a team around you that can identify whether that next release will bring value, and that they know exactly what they’re doing when they want to upgrade.

This is a community framework

Is anything missing? Would you change any of the content? Reply and make your suggestions for improvements known. Let’s upskill together!
