Scaling a Web Service: Load Balancing

This post is going to look at one aspect of how sites like Facebook handle billions of requests and stay highly available: load balancing.

What is a load balancer?

A load balancer is a device that distributes work across many resources (usually computers). They are generally used to increase capacity and reliability.

In order to talk about load balancing in a general way, I make two assumptions about the service being scaled:

I can start as many instances of it as I want
Any request can go to any instance

The first assumption means that the service is stateless (or has shared state in something like a Redis cluster). The second assumption is not necessary in practice (because of things like sticky load balancing), but I assume it for the sake of simplicity in this post.

Here are the load balancing techniques I'm going to talk about:

Layer 7 load balancing (HTTP, HTTPS, WS)
Layer 4 load balancing (TCP, UDP)
Layer 3 load balancing
DNS load balancing
Manually load balancing with multiple subdomains
Anycast

and some miscellaneous topics at the end:

Latency and Throughput
Direct Server Return

These techniques are vaguely ordered as "steps" to take as a site gets more traffic. For example, Layer 7 load balancing would be the first thing to do (much earlier than Anycast). The first three techniques help with throughput and availability, but still have a single point of failure. The remaining three techniques remove this single point of failure while also increasing throughput.

To help us understand load balancing, we're going to look at a simple service that we want to scale.

Note: Each scaling technique is accompanied by a non-technical analogy to purchasing items at a store. These analogies are a simplified representation of the idea behind the technique and may not be entirely accurate.

The service

Let's pretend we're building a service that we want to scale. It might look something like this:

This system won't be able to handle a lot of traffic and if it goes down, the whole application goes down.

Analogy:

You go to the only checkout line in the store
You purchase your items
- If there isn't a cashier there, you can't make your purchase

Layer 7 load balancing

The first technique to start handling more traffic is to use Layer 7 load balancing. Layer 7 is the application layer. This includes HTTP, HTTPS, and WebSockets. A very popular and battle-tested Layer 7 load balancer is Nginx. Let's see how that'll help us scale up:

Note that we can actually load balance across tens or hundreds of service instances using this technique. The image above just shows two as an example.

Analogy:

An employee of the store directs you to a specific checkout line (with a cashier)
You purchase your items

Tools:

Nginx
HAProxy

Notes:

This is also where we'd terminate SSL

Layer 4 load balancing

The previous technique will help us handle a lot of traffic, but if we need to handle even more traffic, Layer 4 load balancing could be helpful. Layer 4 is the transport layer. This includes TCP and UDP. Some popular Layer 4 load balancers are HAProxy (which can do Layer 7 load balancing as well) and IPVS. Let's see how they'll help us scale up:

Layer 7 load balancing + Layer 4 load balancing should be able to handle more than enough traffic for most cases. However, we still need to worry about availability. The single point of failure here is the layer 4 load balancer. We'll fix this in the DNS load balancing section below.

Analogy:

There are separate checkout areas for people based on their membership number
- For example, if your membership number is divisible by two, go to the checkout near electronics, otherwise go to the one near food
Once you get to the right checkout area, an employee of the store directs you to a specific checkout line
You purchase your items

Tools:

HAProxy
IPVS

Layer 3 load balancing

If we need to scale up even more, we probably should add Layer 3 load balancing. This is more complicated than the first two techniques. Layer three is the network layer, which includes IPv4 and IPv6. Here's how Layer 3 load balancing could look:

To understand how this works, we first need a little background on ECMP (equal cost multi path routing). ECMP is generally used when there are multiple equal cost paths to the same destination. Very broadly, it allows the router or switch to send packets to the destination over different links (allowing for higher throughput).

We can exploit this to do L3 load balancing because, from our perspective, every L4 load balancer is the same. This means we can treat each link between the L3 load balancer and a L4 load balancer as a path to the same destination. If we give all of these load balancers the same IP address, we can use ECMP to split our traffic amongst all the L4 load balancers.

Analogy:

There are two separate identical stores across the street from each other. The one you go to depends on your dominant hand
Once you get to the right store, there are separate checkout areas for people based on their membership number
- For example, if your membership number is divisible by two, go to the checkout near electronics, otherwise go to the one near food
Once you get to the right checkout area, an employee of the store directs you to a specific checkout line
You purchase your items

Tools:

This is usually done in hardware with top-of-rack switches

TL;DR:

Unless you're running at a huge scale or have your own hardware, you don't need to do this

DNS load balancing

DNS is the system that translates names to IP addresses. For example, it may translate example.com to 93.184.216.34. It can also return multiple IP addresses as shown below:

If multiple IPs are returned, the client will generally use the first one that works (however, some implementations will only look at the first returned IP).

There are many DNS load balancing techniques including GeoDNS and round-robin. GeoDNS returns different responses based on who requests it. This lets us route clients to the server or datacenter that's closest to them. Round-robin returns different IPs for each request, cycling through all the available IP addresses. If multiple IPs are available, both of these techniques just change the ordering of the IPs in the response.

Here's how DNS load balancing would work:

In this example, different users are being routed to different clusters (either randomly or based on their location).

Now there isn't a single point of failure (assuming there are multiple DNS servers). To be even more reliable, we can run multiple clusters in different datacenters.

Analogy:

You check online for a list of shopping complexes the store operates in. The list puts the closest shopping complexes first. You look up directions to each one and go to the first open one in the list.
There are two separate identical stores across the street from each other. The one you go to depends on your dominant hand
Once you get to the right store, there are separate checkout areas for people based on their membership number
- For example, if your membership number is divisible by two, go to the checkout near electronics, otherwise go to the one near food
Once you get to the right checkout area, an employee of the store directs you to a specific checkout line
You purchase your items

Manual load balancing/routing

If our content is sharded across many servers or datacenters and we need to route to a specific one, this technique could be helpful. Let's say cat.jpg is stored in a cluster in London, but not any other clusters. Similarly, let's say dog.jpg is stored in NYC, but not in any other datacenters or clusters. This might happen when the content was just uploaded and hasn't been replicated across datacenters yet, for example.

However, users shouldn't have to wait for replication to complete in order to access the content. This means our application will need to temporarily direct all requests for cat.jpg to London and all requests for dog.jpg to NYC. So instead of https://cdn.example.net/cat.jpg we want to use https://lon-1e.static.example.net/cat.jpg. Similarly for dog.jpg.

To do this, we need to set up subdomains for each datacenter (and preferably each cluster and each machine). This can be done in addition to the DNS load balancing above.

Note: Our application will need to keep track of where the content is in order to do this rewriting.

Analogy:

You call the corporate office asking which locations carry cat food.
You look up directions to the locations and go to the first open one in the list
There are two separate identical stores across the street from each other. The one you go to depends on your dominant hand
Once you get to the right store, there are separate checkout areas for people based on their membership number
- For example, if your membership number is divisible by two, go to the checkout near electronics, otherwise go to the one near food
Once you get to the right checkout area, an employee of the store directs you to a specific checkout line
You purchase your items

Anycast

The final technique in this post is Anycast. First a little background:

Most of the internet uses Unicast. This essentially means each computer gets a unique IP address. There is another methodology called Anycast. With Anycast, multiple machines can have the same IP address and routers send requests to the closest one. We can combine this with the above techniques to have an extremely reliable and available system that can handle a lot of traffic.

Anycast basically allows the internet to handle part of the load balancing for us.

Analogy:

You tell people you're trying to go to the store and they direct you to the closest location
There are two separate identical stores across the street from each other. The one you go to depends on your dominant hand
Once you get to the right store, there are separate checkout areas for people based on their membership number
- For example, if your membership number is divisible by two, go to the checkout near electronics, otherwise go to the one near food
Once you get to the right checkout area, an employee of the store directs you to a specific checkout line
You purchase your items

Misc.

Latency and Throughput

As an aside, these techniques also work to increase the throughput of a low latency service. Instead of trying to make the service itself handle more traffic, add more of them. This way we'll have a low-latency, high-throughput system.

Direct server return

In a traditional load-balancing system, the request goes through all the layers of load balancing and the response returns through all of these layers as well. One optimization that can offload a lot of the traffic from the load balancers is direct server return. This means that the response from a server doesn't go back through the load balancers. If responses from the service are large, this is a very useful tool.

More information:

Facebook, Google, Netflix and most (if not all) large internet companies use these techniques at scale. Here are some great talks and more resources:

Patrick Shuff gave a talk explaining how Facebook uses these techniques to handle over 1 Billion users
CloudFlare has a brief primer on Anycast
Artur Bergman, the CEO of Fastly, gave a talk on how they scaled their CDN
Dave Temkin, the Director of Global Networks at Netflix, gave a talk on how they scaled the Netflix CDN

Please feel free to comment on Hacker News or email me if you have any questions!

If you want to be notified whenever I publish a new blog post, you can follow me on Twitter here.

Disclaimer: Although, I mention Facebook and Google in this post, none of the content included here is based on non-public information.

Random Trivia: Most of this post was written in February 2016, but I forgot to publish until now.

Header Image by Bob Mical