Tech Thoughts

What have been Facebook’s greatest technical accomplishments?

Answer by Robert Johnson:

I ran the infrastructure software team at Facebook for five years, and was involved in most of the projects listed in the other answers. I consider the greatest accomplishment while I was there to be the memcache/MySQL cluster. When I left Facebook a year ago the cluster held over a trillion (yes, that’s a T) objects and served over a billion requests a second, usually taking less than a millisecond each. It maintained consistency with a high rate of writes over many geographically distributed datacenters, and had extremely little downtime.

The real accomplishments don’t actually have as much to do with memcache or MySQL as you might think. Over time these will probably be replaced with newer “technologies”, but the real technology is all the things you have to do to get this massive number of machines to work together in a way that’s fast and reliable. This is not what people usually mean when they ask “what technology do you use?” but it’s where the interesting innovation happens. This ranges from algorithmic things like schemes for sharding, partitioning, and caching data, and keeping distributed data consistent, to mundane-sounding things like deployment and monitoring, which are not mundane at all at this scale.

Here are a few of the specific challenges we overcame:

Consistency across datacenters - Facebook is a real-time application, so changes that happen in one part of the world must be immediately visible in all other parts of the world. It also has a surprisingly high bar for consistency. I hear a lot of people outside Facebook say “oh, it’s just a fun site, it doesn’t matter if it’s consistent”, but if things start showing up out of order or disappearing, users get very unhappy very fast. Here’s an old blog post from when we built our first geographically distributed datacenter in 2007:
Scaling Out | Facebook
Looking back, this scheme might sound a bit hacky, but it worked and it kept us scaling. The setup today is considerably more sophisticated.

Network flow - Pages on Facebook require many small pieces of data that are not easily clustered, so the pattern we usually see is one server requesting a large number of small objects from a large number of other servers. The problem is that if all the servers reply at the same time, you get a large burst of packets through the requesting server’s rack switch and NIC, and packets get dropped. This is called “TCP incast” in academic literature (although you get the same basic problem with UDP), and the way we solved it was with throttling on the machine sending the requests.
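
To make the idea of requester-side throttling concrete, here is a minimal sketch. It is not Facebook's code: the names fetch_all, fetch_one, and max_in_flight are hypothetical, and a plain asyncio semaphore stands in for whatever windowing scheme was really used. The point is only that the requesting machine caps how many replies can be in flight toward it at once.

```python
import asyncio


async def fetch_all(keys, fetch_one, max_in_flight=32):
    """Fetch every key, keeping at most max_in_flight requests outstanding.

    fetch_one is a caller-supplied coroutine function (hypothetical here)
    that fetches a single key from whichever server holds it.
    """
    window = asyncio.Semaphore(max_in_flight)

    async def guarded(key):
        # Wait for a free slot before sending, so replies cannot all
        # arrive at the requester's NIC in one burst.
        async with window:
            return key, await fetch_one(key)

    pairs = await asyncio.gather(*(guarded(k) for k in keys))
    return dict(pairs)
```

A smaller window trades some latency for fewer dropped packets; the right value depends on switch buffers and object sizes, so treat 32 as a placeholder.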

The network problems get even worse when there are failures. The way most software deals with not getting a reply from another server is to send another packet. Unfortunately, a common reason for not getting a reply is that the other server is overloaded. So when a server gets too overloaded to reply in time, all of a sudden the traffic to it doubles because of the retries. We spent a lot of time on algorithms that would seamlessly deal with small failures where retries work, but not spiral out of control during large failures, when retries just make things worse.
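
One well-known way to get this behavior is a retry budget, sketched below. This is an assumption chosen for illustration, not a description of the algorithms Facebook actually used; RetryBudget, call_with_budget, and the default numbers are all hypothetical. Each request earns a fraction of a retry token, and each retry spends a whole one, so isolated failures are retried freely while a large failure quickly exhausts the budget.

```python
class RetryBudget:
    """Token bucket that limits retries to a fraction of request volume."""

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio            # tokens earned per request
        self.max_tokens = max_tokens  # cap on saved-up retries
        self.tokens = max_tokens

    def record_request(self):
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        """Spend one token for a retry; return False if the budget is empty."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_budget(send, budget, max_attempts=3):
    """Call send(), retrying on TimeoutError only while the budget allows."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return send()
        except TimeoutError:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise  # give up instead of piling more load on the server
```

With a ratio of 0.1, sustained retry traffic cannot exceed roughly 10% of normal traffic, no matter how many servers stop answering.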

Cache Layout - There are a lot of things to balance here. If you have big objects you want to spread them across machines so you can read them in parallel, but with small objects you want them co-located so one RPC call gets you multiple objects. Most of Facebook is at the small-object end of things, so we played a lot of tricks to improve our rate of objects per RPC. A lot of this had to do with separating objects with different workloads so we could tune them differently. We also spent a lot of time figuring out what were the most cost-effective things to keep in memory, and when it made sense to denormalize things. (Most of the time, in practice, it turned out that denormalizing didn’t help.)
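
As a rough illustration of co-locating small objects, the sketch below routes keys by an owner prefix so that everything belonging to one owner lands on one cache server and can be fetched in a single multi-get. The "owner:field" key format, the function names, and the CRC-based placement are assumptions made for the example, not Facebook's layout.

```python
import zlib
from collections import defaultdict


def server_for(key, servers):
    """Pick a server from the owner prefix, so an owner's objects co-locate.

    Keys are assumed to look like "owner:field" (a hypothetical format).
    """
    owner = key.split(":", 1)[0]
    return servers[zlib.crc32(owner.encode("utf-8")) % len(servers)]


def plan_multiget(keys, servers):
    """Group keys into one batch per server: one RPC per batch."""
    batches = defaultdict(list)
    for key in keys:
        batches[server_for(key, servers)].append(key)
    return dict(batches)
```

For example, plan_multiget(["42:name", "42:photo", "42:status"], servers) produces a single batch, so three objects cost one RPC instead of three.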

Handling Failures - As I mentioned in the network section, there’s an interesting pattern where things that are great at covering up small problems tend to make big problems worse. For example, say I have an algorithm that sends a request to a random server, and if it doesn’t get a reply, sends it to a different random server, until it gets an answer. This works great when you lose one or two machines, but it’s a disaster if you lose half the machines. All of a sudden the load doubles on the remaining machines, and there’s a pretty good chance the reason you lost half the machines in the first place was that the load was too high. Instead, what you have to do is detect overload conditions and shed load until you’re running whatever is still working near capacity. It’s important to remember that it’s a real-time system in the computer science sense of the term: a late answer is a wrong answer. People never feel good about dropping a request, but it’s often the best way to maximize the number of right answers when there’s trouble. Another common pattern is that when something gets slow it builds up a large queue and slows everything else down, and again the answer is to shed load. It can be a tricky algorithm because you might need a deep queue in normal operation to smooth out momentary bursts of traffic.
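
A minimal sketch of the load-shedding idea follows, assuming a simple in-process queue. The class name SheddingQueue and its parameters are hypothetical, and real systems use far more careful policies. It keeps a deep queue for bursts but drops any request that has already waited longer than its deadline, on the principle that a late answer is a wrong answer.

```python
import time
from collections import deque


class SheddingQueue:
    """FIFO queue that sheds requests which are too old or do not fit."""

    def __init__(self, max_wait_s=0.05, max_depth=1000, clock=time.monotonic):
        self.max_wait_s = max_wait_s  # deadline: older requests are dropped
        self.max_depth = max_depth    # deep enough to absorb short bursts
        self.dropped = 0
        self._clock = clock
        self._items = deque()

    def offer(self, request):
        """Enqueue a request, or reject it immediately if the queue is full."""
        if len(self._items) >= self.max_depth:
            self.dropped += 1
            return False
        self._items.append((self._clock(), request))
        return True

    def take(self):
        """Return the oldest request still within its deadline, else None."""
        while self._items:
            enqueued_at, request = self._items.popleft()
            if self._clock() - enqueued_at <= self.max_wait_s:
                return request
            self.dropped += 1  # too late to be useful; shed it
        return None
```

The dropped counter is worth exporting to monitoring, since a rising value is a direct signal that the service is running past capacity.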

Deployment and Monitoring - Another subject that’s been written about extensively in other places, so I won’t write much here. Suffice it to say that if machines disagree about who’s doing what, things get really ugly really fast. Also, the single best opportunity to bring down every single machine in your cluster is when you’re changing every machine in your cluster with your shiny new software. So strategies here are all about doing things in stages, monitoring them well, and keeping them contained.
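
The staged approach can be sketched roughly as below. This is only an illustration under assumptions: staged_rollout, deploy, healthy, and the stage fractions are hypothetical names and values, not Facebook's deployment tooling. The rollout touches a small slice of machines first, checks health, and stops before the next stage if anything looks wrong.

```python
import math


def staged_rollout(hosts, deploy, healthy, stages=(0.01, 0.1, 0.5, 1.0)):
    """Deploy to growing fractions of hosts, halting on a failed health check.

    deploy(host) and healthy(host) are caller-supplied callables.
    Returns (hosts_touched, succeeded).
    """
    done = 0
    for fraction in stages:
        target = max(1, math.ceil(len(hosts) * fraction))
        for host in hosts[done:target]:
            deploy(host)
        done = max(done, target)
        # Check every host changed so far before widening the blast radius.
        if not all(healthy(h) for h in hosts[:done]):
            return hosts[:done], False
    return hosts[:done], True
```

A failed check leaves most of the cluster untouched, which keeps the problem contained.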

Improving Memcache and MySQL
This is what most people think of when we talk about the database/cache cluster. We did a ton of work in memcache to improve throughput: lots of profiling and fixing issues one at a time. Most of what it does is in the network stack, so a lot of this work actually happened in the Linux kernel:
https://www.facebook.com/note.ph…

In MySQL it’s all about getting the data laid out on disk in a reasonable way, and getting it to cache the most useful stuff in memory. Mark Callaghan’s blog has a ton of great information:
High Availability MySQL

Meta
I wrote this about the principles we followed while building this:
Scaling Facebook to 500 Million Users and Beyond
nisargam:

Hahaha! You started a startup, so you’ll have to see it through!! Absolutely!

What Powers Instagram: Hundreds of Instances, Dozens of Technologies

(via instagram-engineering)

One of the questions we always get asked at meet-ups and conversations with other engineers is, “what’s your stack?” We thought it would be fun to give a sense of all the systems that power Instagram, at a high-level; you can look forward to more in-depth descriptions of some of these systems in the future. This is how our system has evolved in the just-over-1-year that we’ve been live, and while there are parts we’re always re-working, this is a glimpse of how a startup with a small engineering team can scale to our 14 million+ users in a little over a year. Our core principles when choosing a system are:

  • Keep it very simple
  • Don’t re-invent the wheel
  • Go with proven and solid technologies when you can

We’ll go from top to bottom:

OS / Hosting

We run Ubuntu Linux 11.04 (“Natty Narwhal”) on Amazon EC2. We’ve found previous versions of Ubuntu had all sorts of unpredictable freezing episodes on EC2 under high traffic, but Natty has been solid. We’ve only got 3 engineers, and our needs are still evolving, so self-hosting isn’t an option we’ve explored too deeply yet, though it is something we may revisit in the future given the unparalleled growth in usage.

Load Balancing

Every request to Instagram servers goes through load balancing machines; we used to run 2 nginx machines and DNS Round-Robin between them. The downside of this approach is the time it takes for DNS to update in case one of the machines needs to get decommissioned. Recently, we moved to using Amazon’s Elastic Load Balancer, with 3 nginx instances behind it that can be swapped in and out (and are automatically taken out of rotation if they fail a health check). We also terminate our SSL at the ELB level, which lessens the CPU load on nginx. We use Amazon’s Route53 for DNS, for which they’ve recently added a pretty good GUI tool in the AWS console.

Application Servers

Next up are the application servers that handle our requests. We run Django on Amazon High-CPU Extra-Large machines, and as our usage grows we’ve gone from just a few of these machines to over 25 of them (luckily, this is one area that’s easy to horizontally scale as they are stateless). We’ve found that our particular workload is very CPU-bound rather than memory-bound, so the High-CPU Extra-Large instance type provides the right balance of memory and CPU.

We use Gunicorn (http://gunicorn.org/) as our WSGI server; we used to use mod_wsgi and Apache, but found Gunicorn was much easier to configure, and less CPU-intensive. To run commands on many instances at once (like deploying code), we use Fabric, which recently added a useful parallel mode so that deploys take a matter of seconds.

Data storage

Most of our data (users, photo metadata, tags, etc.) lives in PostgreSQL; we’ve previously written about how we shard across our different Postgres instances. Our main shard cluster involves 12 Quadruple Extra-Large memory instances (and twelve replicas in a different zone).
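
The sharding details live in the earlier post and are not repeated here, so the following is only a generic sketch of one common approach, not a description of the actual scheme: map each user to one of many logical shards, then map contiguous ranges of logical shards onto the physical servers. The shard count, server names, and function names are all assumptions made for illustration.

```python
import math

LOGICAL_SHARDS = 4096  # assumed count, chosen only for this example


def logical_shard(user_id):
    """Map a user ID to a logical shard number."""
    return user_id % LOGICAL_SHARDS


def physical_server(user_id, servers):
    """Map a user's logical shard to one of the physical servers.

    Each server owns a contiguous range of logical shards.
    """
    per_server = math.ceil(LOGICAL_SHARDS / len(servers))
    return servers[logical_shard(user_id) // per_server]
```

Keeping many more logical shards than machines means data can later be moved to new servers one logical shard at a time, without re-bucketing every row.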

We’ve found that Amazon’s network disk system (EBS) doesn’t support enough disk seeks per second, so having all of our working set in memory is extremely important. To get reasonable IO performance, we set up our EBS drives in a software RAID using mdadm.

As a quick tip, we’ve found that vmtouch is a fantastic tool for managing what data is in memory, especially when failing over from one machine to another where there is no active memory profile already. Here is the script we use to parse the output of a vmtouch run on one machine and print out the corresponding vmtouch command to run on another system to match its current memory status.

All of our PostgreSQL instances run in a master-replica setup using Streaming Replication, and we use EBS snapshotting to take frequent backups of our systems. We use XFS as our file system, which lets us freeze & unfreeze the RAID arrays when snapshotting, in order to guarantee a consistent snapshot (our original inspiration came from ec2-consistent-snapshot). To get streaming replication started, our favorite tool is repmgr by the folks at 2ndQuadrant.

To connect to our databases from our app servers, one choice we made early on that had a huge impact on performance was using Pgbouncer to pool our connections to PostgreSQL. We found Christophe Pettus’s blog to be a great resource for Django, PostgreSQL and Pgbouncer tips.

The photos themselves go straight to Amazon S3, which currently stores several terabytes of photo data for us. We use Amazon CloudFront as our CDN, which helps with image load times from users around the world (like in Japan, our second most-popular country).

We also use Redis extensively; it powers our main feed, our activity feed, our sessions system (here’s our Django session backend), and other related systems. All of Redis’ data needs to fit in memory, so we end up running several Quadruple Extra-Large Memory instances for Redis, too, and occasionally shard across a few Redis instances for any given subsystem. We run Redis in a master-replica setup, and have the replicas constantly saving the DB out to disk, and finally use EBS snapshots to back up those DB dumps (we found that dumping the DB on the master was too taxing). Since Redis allows writes to its replicas, it makes for very easy online failover to a new Redis machine, without requiring any downtime.

For our geo-search API, we used PostgreSQL for many months, but once our Media entries were sharded, we moved over to using Apache Solr. It has a simple JSON interface, so as far as our application is concerned, it’s just another API to consume.

Finally, like any modern Web service, we use Memcached for caching, and currently have 6 Memcached instances, which we connect to using pylibmc & libmemcached. Amazon has recently launched its ElastiCache service, but it’s not any cheaper than running our own instances, so we haven’t pushed ourselves to switch quite yet.

Task Queue & Push Notifications

When a user decides to share out an Instagram photo to Twitter or Facebook, or when we need to notify one of our Real-time subscribers of a new photo posted, we push that task into Gearman, a task queue system originally written at Danga. Doing it asynchronously through the task queue means that media uploads can finish quickly, while the ‘heavy lifting’ can run in the background. We have about 200 workers (all written in Python) consuming the task queue at any given time, split between the services we share to. We also do our feed fan-out in Gearman, so posting is as responsive for a new user as it is for a user with many followers.
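
To show the shape of asynchronous fan-out without depending on a Gearman client, here is a small stand-in that uses Python's standard queue and threads. It is a sketch under assumptions, not our worker code: run_fanout, the followers mapping, and the in-memory feeds are hypothetical. The upload path only enqueues a task, and background workers copy the new photo ID into each follower's feed.

```python
import queue
import threading
from collections import defaultdict


def run_fanout(posts, followers, num_workers=4):
    """Fan out (author, photo_id) posts to follower feeds via a task queue."""
    tasks = queue.Queue()
    feeds = defaultdict(list)
    lock = threading.Lock()

    def worker():
        while True:
            task = tasks.get()
            if task is None:  # sentinel: no more work for this worker
                return
            author, photo_id = task
            for follower in followers.get(author, ()):
                with lock:
                    feeds[follower].append(photo_id)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for post in posts:
        tasks.put(post)  # the upload request would return right after this
    for _ in threads:
        tasks.put(None)
    for thread in threads:
        thread.join()
    return dict(feeds)
```

Because enqueueing costs the same regardless of follower count, posting stays equally fast for a brand-new user and for one with many followers.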

For doing push notifications, the most cost-effective solution we found was https://github.com/samuraisam/pyapns, an open-source Twisted service that has handled over a billion push notifications for us, and has been rock-solid.

Monitoring

With 100+ instances, it’s important to keep on top of what’s going on across the board. We use Munin to graph metrics across all of our systems, and also to alert us if anything is outside of its normal range. We write a lot of custom Munin plugins, building on top of Python-Munin, to graph metrics that aren’t system-level (for example, signups per minute, photos posted per second, etc.). We use Pingdom for external monitoring of the service, and PagerDuty for handling notifications and incidents.
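
For readers who have not written one, a custom Munin plugin is just an executable that prints graph settings when called with the argument "config" and prints current values otherwise. The sketch below uses only the standard library rather than Python-Munin, and signups_last_minute is a hypothetical placeholder for a real query, so treat it as an outline of the protocol rather than one of our plugins.

```python
import sys


def signups_last_minute():
    """Hypothetical placeholder; a real plugin would query the database."""
    return 0


def render(argv, read_value=signups_last_minute):
    """Return what the plugin should print for the given argument list."""
    if len(argv) > 1 and argv[1] == "config":
        # Munin calls the plugin with "config" to learn how to draw the graph.
        return "\n".join([
            "graph_title Signups per minute",
            "graph_vlabel signups",
            "graph_category app",
            "signups.label signups",
        ])
    # With no argument, Munin expects one "<field>.value <number>" line.
    return "signups.value %d" % read_value()


if __name__ == "__main__":
    print(render(sys.argv))
```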

For Python error reporting, we use Sentry, an awesome open-source Django app written by the folks at Disqus. At any given time, we can sign-on and see what errors are happening across our system, in real time.

You?

If this description of our systems interests you, or if you’re hopping up and down ready to tell us all the things you’d change in the system, we’d love to hear from you. We’re looking for a DevOps person to join us and help us tame our EC2 instance herd.

Thoughts on Apple 5C & 5S

  • Last year’s technology, basically an iPhone 5.
  • Unlocked prices are still too high.
  • The only new thing is the colours. But we have seen this somewhere before. Nokia & HTC have been trying to pull this off since last year.

    Actually, adding colours made this ad meaningless. :P https://www.youtube.com/watch?v=lcaGD-n5KZk
  • And basically it’s all about iOS 7.
  • It sounds like a product created in response to losing market share & to compete on price.

Actually, the iPhone 5S is the main deal. It’s seriously forward-thinking: 64-bit, new sensors & a fingerprint scanner. It’s a precursor of things to come.

By introducing two new iPhones, I sense a good strategy here. The iPhone 5C is a bet on what they already have & on building upon it, and the iPhone 5S is a bet on what they will have. Or maybe it’s just a way to recycle last year’s tech while introducing new tech for the future.

Next up iPad mini 2C. 

The difference between a vision and a hallucination is that other people can see the vision.
Marc Andreessen , creator of first widely used Web-Browser. (via nisargam)
Brilliant!

I was once balancing a cleaning mop so that it could stand on its own, and while doing so I observed something about it. When I tried to balance it, it would stand there for a second or two before falling down. A thought crossed my mind about inertia and product development, or maybe organizations in general. It went like this.

There are 3 types of motion in this context. 

  1.  Accelerating or Disrupting 
  2.  Inertial 
  3.  Stable. 

I put it in the context of product development.

1. Accelerating or Disrupting - Product development that is accompanied by growth, or is moving towards growth, is the best scenario. It keeps room open for innovation and further development. A lot of things that one can feel genuinely excited about happen in this stage.

2. Inertial - This is the point in product development where a lot has already been done and acceleration is about to end. The product is getting boring or stagnant, or worse, it is only getting iterative updates. It’s like a yellow signal. The product is moving forward on previously gained momentum, or one could say it is ‘basking in old glory’. It is adding little or no momentum to what it has already achieved. This state can be really deceptive. From here the product can go anywhere: it can grow, provided it gets back into the accelerating phase, or it can slide into the dangerous third phase, ‘Stable’. Unless it is worked on and changed, it is just like the cleaning mop I was trying to balance, ready to fall. So it’s a caution stage. Everything might be just starting to fall apart.

3. ‘Stable’ - The most dangerous stage. The product is a sitting duck, ready to be shot. Most products at this stage are already becoming obsolete.

Protect your computer, anonymity & maintain your privacy while browsing the web.

If you found this post and kept reading past the title, I don’t really need to stress why these things are important these days. So you can skip directly to my list of tools and utilities below.

Note: This post may be long. You don’t need to read everything; skip ahead to whichever section you want.

Still, to sum it up: all the tracking technologies that keep a sharp eye on our browsing habits and maintain an inventory of that data concern me when the data is gathered without our knowledge or consent.

Continue reading…

Google+ now lets you set an animated GIF as your profile pic. Google+ seriously got cool! If only more of my friends were there.

A goal of leading is to amplify your skills and experience while also growing new leaders. If you’re not giving people room to uncover their own way and ultimately solutions, then you’re creating a staff organization for you, not the next generation of leaders.
Steven Sinofsky (Read whole article here)

Merrill Tech Insider: Disk Defragment Explained

A good article about disk defragmentation. Check it out at the link above.

To the user, the interface is the product.
- Aza Raskin