HIPAA-Compliant Druid on AWS

Druid’s an amazing engine, and we just wrapped up becoming HIPAA compliant (for handling sensitive health care data) while hosted on Amazon Web Services. I haven’t seen anyone talking about Druid + HIPAA, so I figured I’d share an outline of what it took.

The TL;DR on HIPAA can be summed up in three rules:

  1. If data hits disk, encrypt it
  2. If data goes over a network, encrypt it
  3. Your virtual machines may not co-exist with EC2 instances of another AWS customer

So how does that fit with Druid, which has no built-in security, running in a public cloud? Let’s start with the easy parts first.

== Encrypted Disk ==

For EC2 instances like the index service and historical nodes, on-disk storage is used for holding segments. Amazon has first-class support for encrypted EBS volumes. We use M4-class machines and mount encrypted EBS volumes for Druid to use.

== S3 Deep Storage ==

S3 fortunately has built-in server-side encryption. We placed an upload-rejection policy on our S3 bucket to prevent unencrypted uploads: S3 checks the headers on an upload request and rejects it if the upload isn’t encrypted. That keeps us honest. CloudFormation snippet:

```json
{
  "Sid": "DenyUnEncryptedObjectUploads",
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:PutObject",
  "Resource": "arn:aws:s3:::${BUCKET_NAME}/*",
  "Condition": {
    "StringNotEquals": {
      "s3:x-amz-server-side-encryption": "AES256"
    }
  }
}
```

Druid uses the JetS3t library for uploading files to S3. Make a jets3t.properties file available on the Druid classpath, where you can enable encryption. E.g.:
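A minimal jets3t.properties sketch (AES256 matches the bucket policy above; double-check the property names against the JetS3t documentation for your version):

```properties
# Request S3 server-side encryption (AES256) on every upload
s3service.server-side-encryption=AES256
# Keep all S3 traffic on HTTPS
s3service.https-only=true
```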


== Instance Tenancy ==

The authors of the HIPAA guidelines must not trust virtualization tech, because as a company you cannot share physical machines with another company for any machine handling sensitive data. Fortunately, Amazon allows you to provision Dedicated Instances. They cost more, but the machine isn’t shared with anyone else.

We’re currently using OpsWorks and specify instance tenancy as dedicated when adding nodes. So broker/historical/index service nodes run on dedicated instances, whereas the coordinator/overlord nodes run on normal machines. The coordinator/overlord handle metadata about segments but never touch the segments themselves, so we can save some dollars there. Alternatively, instance tenancy can be specified at the VPC level, setting it globally.
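If you go the VPC-level route, the CloudFormation property is InstanceTenancy. A sketch (the resource name and CIDR block are placeholders):

```json
"DruidVpc": {
  "Type": "AWS::EC2::VPC",
  "Properties": {
    "CidrBlock": "10.0.0.0/16",
    "InstanceTenancy": "dedicated"
  }
}
```

Every instance launched into the VPC then runs on dedicated hardware without per-node configuration.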

== Kafka ==

If you’re ingesting sensitive data over Kafka, then this is where things get fun. Unlike Druid, Kafka has built-in secure transport. However, that wasn’t added until Kafka 0.9, which is relatively new. The only prior art I’m aware of for Druid using the Kafka 0.9 client is the new and shiny KafkaIndexTask added in Druid 0.9.1-rc1, which runs on the indexer service. Tranquility wasn’t for us and realtime nodes were hard to scale, but the KafkaIndexTask bridges the gap and gives the best of both worlds. Plus it uses the Kafka 0.9 client, so we gave it a shot and we love it.

I might write a separate post about our migration to the KafkaIndexTask, but for now I’ll just say that once you’ve switched, enabling Kafka secure transport is just a matter of generating a truststore and adding some consumerProperties to your KafkaIndexTask spec:

```json
"consumerProperties": {
  "security.protocol": "SSL",
  "ssl.truststore.location": "/etc/trust/truststore.jks",
  "ssl.truststore.password": "password"
}
```
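Generating the truststore is standard JSSE tooling. A sketch, assuming your Kafka brokers’ CA certificate is in ca-cert.pem (alias and paths are placeholders):

```shell
# Import the CA cert that signed the Kafka brokers' certificates
keytool -importcert -alias kafka-ca -file ca-cert.pem \
  -keystore /etc/trust/truststore.jks -storepass password -noprompt
```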

== API -> Broker Communication ==

We run nginx on the broker nodes for SSL termination. Nginx accepts the HTTPS traffic on 443 and forwards it to the broker port.
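A minimal nginx server block sketch (the certificate paths, server name, and broker port 8082, Druid’s default, are assumptions; adjust to your deployment):

```nginx
server {
    listen 443 ssl;
    server_name druid-broker.example.com;

    ssl_certificate     /etc/nginx/ssl/broker.crt;
    ssl_certificate_key /etc/nginx/ssl/broker.key;

    location / {
        # Forward decrypted traffic to the local Druid broker
        proxy_pass http://127.0.0.1:8082;
    }
}
```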

== Broker Scatter-Gather Query Encryption ==

The broker node’s job is to take a query, scatter-gather across the historical and indexer service nodes, and return the results. (Yes, the KafkaIndexTask can serve realtime queries.) That network traffic is not encrypted, because Druid has no concept of secure transport. We solved this with Stunnel. If you imagined port tunneling over SSH, you’re not far off from what Stunnel is. Here’s how it works.

Imagine we add a new node, Historical3, to our OpsWorks stack. OpsWorks triggers the “configure” lifecycle phase on all nodes in the stack, which gives us a chance to run a script on every node in the cluster each time a node is added. So Broker1 realizes Historical3 is new, and we run historical nodes on port 8283. The script goes:

  1. On the historical node, add a Stunnel listener on 7183
  2. On the historical node, add an iptables rule routing from the Stunnel listener on 7183 to localhost 8283
  3. On the broker, establish a Stunnel connection from the broker’s 7183 to 7183 on the historical node
  4. On the broker, add an iptables rule to steal outgoing traffic destined for 8283 and re-route it to localhost 7183
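The steps above can be sketched as config (hostnames, cert paths, and the exact iptables match are assumptions; verify against your own routing before relying on it):

```shell
# --- Historical3: stunnel server, TLS in on 7183, plaintext out to Druid on 8283 ---
# /etc/stunnel/historical.conf
#   [druid-historical]
#   accept  = 7183
#   connect = 127.0.0.1:8283
#   cert    = /etc/stunnel/stunnel.pem

# --- Broker1: stunnel client, plaintext in on localhost:7183, TLS out to Historical3 ---
# /etc/stunnel/broker.conf
#   [to-historical3]
#   client  = yes
#   accept  = 127.0.0.1:7183
#   connect = historical3.internal:7183

# --- Broker1: redirect Druid's outgoing 8283 traffic into the local stunnel client ---
iptables -t nat -A OUTPUT -p tcp -d historical3.internal --dport 8283 \
  -j DNAT --to-destination 127.0.0.1:7183
```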

This Rube Goldberg machine steals outgoing packets on the broker node, pushes them over a Stunnel connection, and delivers them back to the original port Druid thought it was talking to. Druid has no idea its traffic was encrypted. It’s the same story when talking to the indexer service, but since there are multiple peons per node, we run steps 1-4 for each possible peon port.

Stunnel may be considered antiquated, but it works well enough for us. There are of course more modern equivalents, but we haven’t evaluated them because our benchmarks have yet to raise any issues.

== Replication ==

Rather than the node-to-node replication most NoSQL engines use, Druid uses deep storage (like S3) as an intermediary for passing data between nodes. As long as you’ve encrypted your deep storage, you’re set.

== Hadoop ==

We’re using Druid’s Hadoop job to rebuild our Druid segments every night, which means the EMR cluster needs to be HIPAA compliant as well. I’m afraid covering that would triple the length of this post, so I’m going to cut myself short. It’s doable, but it requires a good bit of work, and it can be avoided if the indexer service covers your needs.

== Conclusion ==

There are other HIPAA considerations around network security, audit trails, etc. However, if you were evaluating Druid but shy about its lack of a security model, I hope this gave you some ideas.

  • Drew

Whoa, Drew, thanks for posting this! That was an awesome read. If your employer is down with it, that’d make a neat blog post :slight_smile:

Really great post, Drew! We are currently evaluating something like this internally, and I’m sure many other organizations would also like a sound architectural sketch of how to do things end-to-end.

Would it be possible to share what you’ve learned to date about this and how you went about locking down EMR?

The entire Druid community could benefit greatly from this advice, and I’m sure it would boost Druid’s brand to show specific ways to make it really secure on AWS.


  • Robert

This is a great post. I would like to know more about the Hadoop security section.

Nicely done! What is the size of the cluster now? Just curious, if you don’t mind sharing…