What have been Facebook’s greatest technical accomplishments?
Answer by Robert Johnson:
I ran the infrastructure software team at facebook for five years, and was involved in most of the projects listed in the other answers. I consider the greatest accomplishment while I was there to be the memcache/mysql cluster. When I left Facebook a year ago the cluster held over a Trillion (yes that’s a T ) objects with over a billion requests a second, usually taking less than a millisecond. It maintained consistency with a high rate of writes over many geographically distributed datacenters, and had extremely little downtime.
The real accomplishments don’t actually have as much to do with memcache or mysql as you might think - over time these will probably be replaced with newer “technologies”, but the real technology is all the things you have to do to get this massive number of machines to work together in a way that’s fast and reliable. This is not what people usually mean when they ask “what technology do you use?” but it’s where the interesting innovation happens. This ranges from algorithmic things like schemes for sharding, partitioning, and caching data, and keeping distributed data consistent, to mundane sounding things like deployment and monitoring, which are not mundane at all at this scale.
Here are a few of the specific challenges we overcame:
Consistency across datacenters -Facebook is a real-time application so changes that happen in one part of the world must be immediately visible in all other parts of the world. It also has a surprisingly high bar for consistency. I hear a lot of people outside Facebook say “oh, it’s just a fun site, it doesn’t matter if it’s consistent”, but if things start showing up out of order or disappearing, users get very unhappy very fast. Here’s an old blog post from when we built our first geographically distributed datacenter in 2007:
Looking back, this scheme might sound a bit hacky, but it worked and it kept us scaling. The setup today is considerably more sophisticated.
Network flow - pages on Facebook require many small pieces of data that are not easily clustered, so the pattern we usually see is one server requesting a large number of small objects from a large number of other servers. The problem is that if all the servers reply at the same time, and you get a large burst of packets through the requesting server’s rack switch and NIC, and a packet gets dropped. This is called “TCP incast” in academic literature (although you get the same basic problem with udp) and the way we solved it was with throttling on the machine sending the requests.
The network problems get even worse when there are failures. The way most software deals with not getting a reply from another server is to send another packet. Unfortunately a common reason for not getting a reply is that the other server is overloaded. So when a server gets too overloaded to reply in time, all of the sudden the traffic to it doubles because of the retries. We spent a lot of time on algorithms that would seamlessly deal with small failures where retries work, but not spiral out of control during large failures, when retries just make things worse.
Cache Layout - There are a lot of things to balance here - if you have big objects you want to spread them across machines so you can read them in parallel, but with small objects you want them co-located so one RPC call gets you multiple objects. Most of Facebook is in the small object end of things, so we played a lot of tricks to improve our rate of objects per RPC. A lot of this had to do with separating objects with different workloads so we could tune them differently. We also spent a lot of time figuring out what were the most cost-effective things to keep in memory, and when it made sense to denormalize things. (most of the time in practice it turned out that denormalizing didn’t help)
Handling Failures - As I mentioned in the network section, there’s an interesting pattern where things that are great at covering up small problems tend to make big problems worse. For example if I have an algorithm that sends a request to a random server, and if it doesn’t get a reply it sends it to a different random server, until it gets an answer. This works great when you lose one or two machines, but it’s a disaster if you lose half the machines. All the sudden the load doubles on the remaining machines, and there’s a pretty good chance the reason you lost half the machines in the first place was that the load was too high. Instead what you have to do is detect overload conditions and shed load until you’re running whatever is still working near capacity. It’s important to remember that it’s a real-time system in the computer science sense of the term: a late answer is a wrong answer. People never feel good about dropping a request, but it’s often the best way to maximize the number of right answers when there’s trouble. Another common pattern is when something gets slow it builds up a large queue and slows everything else down, and again the answer is to shed load. It can be a tricky algorithm because you might need a deep queue in normal operation to smooth out momentary bursts of traffic.
Deployment and Monitoring - Another subject that’s been written about extensively in other places, so I won’t write much here. Suffice it to say that if machines disagree about who’s doing what, things get really ugly really fast. Also, the single best opportunity to bring down every single machine in your cluster is when you’re changing every machine in your cluster with your shiny new software. So strategies here are all about doing things in stages, monitoring them well, and keeping them contained.
Improving Memcache and MySql
This is what most people think of when we talk about the database/cache cluster. We did a ton of work in memcache to improve throughput - lots of profiling and fixing issues one at a time. Most of what it does is in the network stack, so a lot of this work actually happened in the linux kernel:
In MySql it’s all about getting the data laid out on disk in a reasonable way, and getting it to cache the most useful stuff in memory. Mark Callaghan’s blog has a ton of great information:
I wrote this about the principles we followed while building this: