Jason Taylor

Scaling Elasticsearch for Fun and Profit

I have been managing an Elasticsearch cluster at work for over 3 years. This time has been a constant learning experience, to say the least. When I started working on this cluster, we were using a cloud provider that gave us exactly one tool for scaling up: a slider to provision more nodes and larger nodes. The standard practice at the time was to slide the slider up whenever the cluster started acting up [1]. This would give us “more capacity” for a couple of months, and then the acting up would start all over again. This method of scaling also had the notable side effect of generating massive bills.

This post is a list of everything I tried that gave me positive results when the cluster was under load, along with the concepts that helped me understand more about how Elasticsearch runs. The list is in no particular order, but I tried to group it by topic to make it easier to read. The most important takeaway from this post is to make sure you are monitoring enough to see what effect each change has on performance as you apply it. This will help you determine what the cluster is struggling with the most and help you prioritize what to try next.
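As a minimal sketch of that kind of before-and-after monitoring, the snippet below reduces two node stats responses to a few numbers per node and diffs them. It assumes the JSON bodies come from `GET /_nodes/stats/jvm,thread_pool`, captured once before a change and once after. The function names and the choice of metrics (heap usage and thread pool rejections) are illustrative assumptions, not a record of what was actually tracked on this cluster.

```python
from typing import Any, Dict

Summary = Dict[str, Dict[str, float]]


def summarize_node_stats(stats: Dict[str, Any]) -> Summary:
    """Reduce a node stats response body to a few numbers per node.

    The metrics picked here are an assumption for illustration only.
    """
    summary: Summary = {}
    for node in stats.get("nodes", {}).values():
        pools = node.get("thread_pool", {})
        summary[node["name"]] = {
            "heap_used_percent": node["jvm"]["mem"]["heap_used_percent"],
            # Rejection counts are cumulative since the node started,
            # so the difference between two snapshots is what matters.
            "search_rejected": pools.get("search", {}).get("rejected", 0),
            "write_rejected": pools.get("write", {}).get("rejected", 0),
        }
    return summary


def diff_summaries(before: Summary, after: Summary) -> Summary:
    """Per-node change in each metric, for nodes present in both snapshots."""
    return {
        name: {key: after[name][key] - value for key, value in metrics.items()}
        for name, metrics in before.items()
        if name in after
    }
```

Taking one snapshot before a change and one after, then looking at the diff, gives a rough view of whether rejections or heap pressure moved. A real setup would more likely ship these numbers to a time-series dashboard than compare two snapshots by hand.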

At the time I applied the techniques described below, we were using Elasticsearch 6, but I think a lot of this advice applies to Elasticsearch 7 as well. After version 7.10, Elastic changed the license for Elasticsearch and it is no longer open source, so buyer beware. :)

General Advice

Index Management

Handling Writes Gracefully

Monitoring

Here is the list of metrics I’ve found helpful to monitor over the years.

ES Node Stats

ES Index Stats

Job Stats

Conclusions

If you’ve made it this far, good luck with your cluster! Elasticsearch can run quite well if you are thoughtful about how you structure your data and how your application interacts with the cluster. Our cluster has gone from being a consistent source of outages to being stable enough that it “Just Runs”, and reaching that point feels like real success. It can be done!

Footnotes:

  1. Acting up, in this case, is defined as the cluster becoming unresponsive because it was unable to process the volume of requests it was receiving. This would ultimately cause the site to go down because the web servers would be saturated with requests that were waiting on responses from the cluster.