How to improve the performance of Druid?

Hello all,

I have Druid set up on a single EC2 box - a c3.8xlarge (32 CPUs and 64GB RAM).

I started it with the guide here - http://druid.io/docs/latest/tutorials/quickstart.html

Added the data for August (21GB of JSON files) and September (134GB of JSON files) to Druid.

Installed Metabase to create dashboards.

Added a few questions and queries to the dashboard.

When we choose a date range of one to a few days, the results are fast.
When we choose a date range of one or two months, the results take around ten minutes.

We keep adding around 1GB of data to Druid daily. I think the queries will take more and more time once we have data for a few more months.

How can we improve the performance of the queries? Will adding one or two similar nodes help?
If we add more nodes, which Druid services should be moved to them, and how do we move them?

Share your thoughts.

Thanks.
Shrini

Hi Shrinivasan,

I’d recommend you start with the clustering docs here, which give some recommendations on services you can co-locate: http://druid.io/docs/0.9.1.1/tutorials/cluster.html

When you move to a clustered setup (which you really should when preparing for production / querying that much data) you’ll need to have a distributed storage system (HDFS, S3, NFS) and should also set up an external metadata storage instead of relying on the Derby DB used in the quickstart.
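
For example, here’s a rough sketch of what pointing Druid at a MySQL metadata store could look like in common.runtime.properties - the property names are standard, but the host, database, and credentials below are placeholders you’d replace with your own:

    # common.runtime.properties (sketch - placeholder connection details)
    # add "mysql-metadata-storage" to the extensions you already load
    druid.extensions.loadList=["mysql-metadata-storage"]

    # use MySQL instead of the quickstart's local Derby DB
    druid.metadata.storage.type=mysql
    druid.metadata.storage.connector.connectURI=jdbc:mysql://your-metadata-host:3306/druid
    druid.metadata.storage.connector.user=druid
    druid.metadata.storage.connector.password=your-password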

As for why the queries are taking so long, it’s very likely because the configuration included with the quickstart isn’t intended for querying 150GB of data and your historical node is scanning through a ton of segments sequentially using a handful of threads while the majority of your processing power is not being used. Once you’re comfortable with the process of loading data into Druid and querying data, I’d suggest understanding how to set up a cluster which should give you immediate performance benefits. Once that’s done, you can look through the configuration pages to get some idea of what parameters can be tuned to get even better performance.
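
To give a rough idea of the kind of tuning involved: on a 32-core historical, the processing settings in its runtime.properties could look something like the sketch below. The property names are real, but the values are only illustrative and would need to be sized against your actual memory and workload rather than copied as-is.

    # historical runtime.properties (illustrative values only)
    # use most of the cores for query processing (the quickstart defaults use far fewer)
    druid.processing.numThreads=31
    # per-thread processing buffer; make sure numThreads * buffer fits comfortably in memory
    druid.processing.buffer.sizeBytes=536870912
    # threads for answering HTTP requests from the broker
    druid.server.http.numThreads=25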

Thanks for the reply David.

I have set up a two-node Druid cluster to explore the clustering features.
Added coordinator, overlord, broker, ZooKeeper, and Metabase on nodeA (4
CPU, 15GB RAM, 100GB HDD).
Added historical and middleManager on nodeB (32 CPU, 64GB RAM, 500GB HDD).

The cluster is working fine. I can query data using curl and Metabase.

I suppose nodeA is not ideal. I have to add more RAM/CPU to it.

I'm thinking of giving one node to each component, similar to nodeB, so
that all components get better hardware.
I will do it tomorrow and share the results.

Please explain deep storage. I don't know about HDFS and don't
have a Hadoop cluster.
But I do have S3 and NFS.
Currently, the segments are stored on nodeB's local storage.

If we configure S3 for deep storage, will this improve the performance
of queries?
How can network storage perform better than local storage?

Thanks.

Hi Shrinivasan,

Glad to hear things are going well.

You should definitely explore setting up a distributed deep storage using either S3 or NFS since this is a necessity for a robust and scalable production setup. Using the local storage won’t allow you to scale beyond a single machine and puts you at risk of a disk crash / corruption causing the loss of all your data. Once your data is in a distributed storage, the segments can be generated and accessed by multiple nodes which will allow you to scale your ingestion and data serving processes and will allow Druid to shuffle segments between nodes to optimize queries. It will also make your data much more resilient - i.e. you could suffer a failure of all your Druid nodes and still rebuild the cluster. It would also be worthwhile looking at running an external metadata storage such as MySQL or PostgreSQL for data resiliency.
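
To make that concrete, a sketch of an S3 deep storage setup in common.runtime.properties might look like the following - the bucket names, prefixes, and credentials are placeholders:

    # common.runtime.properties (sketch - placeholder bucket and credentials)
    # add "druid-s3-extensions" to the extensions you already load
    druid.extensions.loadList=["druid-s3-extensions"]

    # store segments in S3 instead of local disk
    druid.storage.type=s3
    druid.storage.bucket=your-druid-bucket
    druid.storage.baseKey=druid/segments
    druid.s3.accessKey=YOUR_ACCESS_KEY
    druid.s3.secretKey=YOUR_SECRET_KEY

    # indexing task logs can go to S3 as well
    druid.indexer.logs.type=s3
    druid.indexer.logs.s3Bucket=your-druid-bucket
    druid.indexer.logs.s3Prefix=druid/indexing-logs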

As for query performance, the deep storage is not involved in the query path and should have no effect on query performance. Segments have to already be loaded by a historical to be queryable; in other words, Druid will not fetch a segment from deep storage in response to a query, so you need to make sure your historicals have enough capacity to serve the segments you’re interested in.
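
As a small sketch of what "enough capacity" means in config terms, these are the historical properties that control how much segment data a node can hold locally - the path and sizes below are placeholders you'd match to your actual disk:

    # historical runtime.properties (placeholder path and sizes)
    # local disk location for cached segments and the max bytes it may use (~300GB here)
    druid.segmentCache.locations=[{"path":"/mnt/druid/segment-cache","maxSize":300000000000}]
    # total segment bytes this historical announces it can serve; keep it <= the cache size
    druid.server.maxSize=300000000000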

Thanks for the suggestions David,

Here is my cluster setup.

Added coordinator, overlord, broker, ZooKeeper, and Metabase on nodeA (32 CPU, 64GB RAM, 500GB HDD).
Added historical and middleManager on nodeB (32 CPU, 64GB RAM, 500GB HDD), same as nodeA.

Now I can see the improvement in query response times.

We have built a web dashboard which runs 10 queries based on user-defined intervals.

The results are shown in 5-20 seconds on average.

How can we improve the performance even more, so that we always get results within 5 seconds?

Will adding more nodes help with this?
Which components should be moved to the new nodes?

Thanks.
Shrini

Hey Shrini,

Performance tuning is always a little tricky. A good place to start would be to look at some suggested production cluster configurations to make sure that none of your settings are badly misconfigured (you’ll have to adjust for the size of your machines compared to the recommendations): http://druid.io/docs/latest/configuration/production-cluster.html

Enabling metrics will help you understand where your query bottleneck is. The configuration settings for enabling metrics are described here: http://druid.io/docs/latest/configuration/index.html

Descriptions of the metrics can be found here: http://druid.io/docs/latest/operations/metrics.html
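
As a minimal starting point, the sketch below just logs JVM metrics alongside the per-query metrics Druid already emits; a production setup would typically send these to a real metrics system via an emitter extension instead:

    # common.runtime.properties (minimal metrics sketch)
    # emit metrics to the service logs (http and other emitters are also available)
    druid.emitter=logging
    druid.emitter.logging.logLevel=info
    # periodically report JVM stats in addition to the per-query metrics
    druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]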

In general, more historicals should give you better query performance, and most clusters have significantly more historical nodes than any other node type. Other things that in general improve performance:

  • more RAM for historical and broker nodes
  • using SSDs instead of magnetic disks
  • using segments that are sized somewhere between 500MB and 1GB (there's a sketch of how to influence this at ingestion time just after this list)
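
On that last point, segment size is mostly determined at ingestion time. As a hedged example, in a Hadoop batch ingestion spec the tuningConfig's partitionsSpec controls how many rows go into each segment, and a target of around 5 million rows per segment often lands in that 500MB-1GB range, although the right number depends heavily on your columns and rollup:

    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      }
    }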

Having said that, it's hard to know exactly where you'll get the most return for your effort without knowing more details about your data and queries and without cluster metrics to look at. Hope this at least gives you some ideas of where you can start.