Resizing your Elasticsearch Indexes in Production

Fred de Villamil
Published in Fred Thoughts
May 16, 2017

Size does matter

This article and much more are now part of my FREE EBOOK Running Elasticsearch for Fun and Profit, available on GitHub. Fork it, star it, open issues and send PRs!

One of the burdens of managing thousands of live indexes within the same Elasticsearch cluster is keeping your shards manageable.

When you first design your index, it's hard to predict how big it's going to be in 1, 3, or 9 months. Starting with too many shards puts lots of pressure on your master nodes. It's even counterproductive when you're using routing, as it will leave most shards unused.

On the other hand, large shards cause lots of problems too. They're slower to recover, might block cluster reallocation, and make optimizing impossible. I once ended up with 900GB shards on 1.2TB servers, which made my life a nightmare.

There's no silver bullet other than reindexing your indexes entirely, which is not always possible on a production cluster. That leaves you with two solutions:

  • Moving your indexes from one cluster to another.
  • Duplicating your indexes, using the Elasticsearch reindex API with aliases.

Get the sizing right

Experience taught me that 10GB shards offer the most competitive balance between allocation speed, node balancing, and overall cluster management.

With an average of 2GB per million documents, for example, a 10GB shard holds about 5 million documents, so I'll use the following:

  • From 0 to 4 million documents per index: 1 shard.
  • From 4 to 5 million documents per index: 2 shards, so the index can still grow without causing too many problems in the future.
  • With more than 5 million documents per index: (number of documents / 5 million) + 1 shards.
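
The rule above can be sketched as a small bash function. This is only an illustration under the same assumption of roughly 2GB per million documents; the function name shards_for_count and the DOCS_PER_SHARD variable are hypothetical and not part of Elasticsearch.

```shell
#!/bin/bash
# Sketch of the sizing rule, assuming ~2GB per million documents,
# so a 10GB shard holds roughly 5 million documents.
DOCS_PER_SHARD=5000000

# shards_for_count is a hypothetical helper name used for illustration.
# It prints the number of shards for a given document count.
shards_for_count() {
  local documents=$1
  if [ "$documents" -lt 4000000 ]; then
    echo 1
  elif [ "$documents" -lt "$DOCS_PER_SHARD" ]; then
    echo 2
  else
    # Integer division, plus one shard of headroom
    echo $(( documents / DOCS_PER_SHARD + 1 ))
  fi
}

# Print the shard count for a few example index sizes
for count in 1000000 4500000 12000000 300000000; do
  echo "${count} documents -> $(shards_for_count "$count") shard(s)"
done
```

With these numbers, a 300 million document index ends up with 61 shards. If your documents are heavier or lighter than 2GB per million, adjust DOCS_PER_SHARD so a shard still lands around 10GB.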

The more data nodes you have, the better it works when you need to work with thousands of huge indexes (up to 300 million documents) in the same cluster.

Here's a small script I'm using to resize and move things. Indexes are prefixed with a version number and aliases are not.

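#!/bin/bash

# "list of indexes" is a placeholder: replace it with a command that prints your index names
for index in $(list of indexes); do
  # Extract the document count from the _count response
  documents=$(curl -XGET http://cluster:9200/${index}/_count 2>/dev/null | cut -f 2 -d : | cut -f 1 -d ',')

  # Aim for roughly 5 million documents per shard
  if [ "$documents" -lt 4000000 ]; then
    shards=1
  elif [ "$documents" -lt 5000000 ]; then
    shards=2
  else
    shards=$(( documents / 5000000 + 1 ))
  fi

  # Index names look like <version>_<name>: bump the version, keep the name
  new_version=$(( $(echo ${index} | cut -f 1 -d _) + 1 ))
  index_name=$(echo ${index} | cut -f 2- -d _)

  # Create the new index with the computed number of shards
  curl -XPUT http://cluster:9200/${new_version}_${index_name} -H 'Content-Type: application/json' -d '{
    "settings": {
      "number_of_shards": '${shards}'
    }
  }'

  # Copy every document from the old index into the new one
  curl -XPOST http://cluster:9200/_reindex -H 'Content-Type: application/json' -d '{
    "source": {
      "index": "'${index}'"
    },
    "dest": {
      "index": "'${new_version}_${index_name}'"
    }
  }'
done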
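
Before touching the alias, it can be worth checking that the new index holds as many documents as the old one. The following is a minimal sketch, not part of the original script: parse_count and same_count are hypothetical helper names, the cluster host is the one used in the script above, and the comparison assumes the old index is no longer receiving writes and the new one has been refreshed.

```shell
#!/bin/bash
# parse_count is a hypothetical helper: it extracts the "count" field from
# a _count response using the same cut pipeline as the script above.
parse_count() {
  cut -f 2 -d : | cut -f 1 -d ','
}

# same_count (also hypothetical) succeeds only when both indexes report
# the same, non-empty number of documents.
same_count() {
  local old_index=$1 new_index=$2 old_count new_count
  old_count=$(curl -s "http://cluster:9200/${old_index}/_count" | parse_count)
  new_count=$(curl -s "http://cluster:9200/${new_index}/_count" | parse_count)
  [ -n "$old_count" ] && [ "$old_count" = "$new_count" ]
}

# Offline check of the parser against a sample response body
echo '{"count":4500000,"_shards":{"total":2,"successful":2,"failed":0}}' | parse_count
```

In the loop, something like `same_count "${index}" "${new_version}_${index_name}" || echo "count mismatch for ${index}"` would flag indexes that need a second look.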

Once you've reindexed, you're ready to move the alias to the right index and delete the old one.
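
Here is a minimal sketch of that last step, using the _aliases API to move the alias in a single atomic request before deleting the old index. The index and alias names (1_foo, 2_foo, foo) are made-up examples, and alias_swap_body and swap_and_delete are hypothetical helper names; only the cluster host comes from the script above.

```shell
#!/bin/bash
# The host matches the script above; index and alias names are examples only.
ES_URL="http://cluster:9200"

# Build the request body for the _aliases API. Both actions are applied
# atomically, so the alias never points at zero or two indexes.
alias_swap_body() {
  local alias_name=$1 old_index=$2 new_index=$3
  printf '{"actions":[{"remove":{"index":"%s","alias":"%s"}},{"add":{"index":"%s","alias":"%s"}}]}' \
    "$old_index" "$alias_name" "$new_index" "$alias_name"
}

# Move the alias, then delete the old index only if the swap succeeded.
# curl -f makes HTTP errors return a non-zero exit code.
swap_and_delete() {
  local alias_name=$1 old_index=$2 new_index=$3
  curl -sf -XPOST "${ES_URL}/_aliases" \
    -H 'Content-Type: application/json' \
    -d "$(alias_swap_body "$alias_name" "$old_index" "$new_index")" \
  && curl -sf -XDELETE "${ES_URL}/${old_index}"
}

# Print the request body without touching the cluster
alias_swap_body foo 1_foo 2_foo
echo
```

Calling `swap_and_delete foo 1_foo 2_foo` would then point the alias foo at 2_foo and drop 1_foo. Deleting an index is irreversible, so I'd run the swap alone first and keep the old index around until the new one has proven itself.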

Photo: Duncan C.

I can perform under pressure, but not Bohemian Rhapsody. CTO at Data Impact by NielsenIQ. Ex VP @Ledger & @Aircall.