Yesterday, Facebook went down for 2.5 hours. Normally this kind of news wouldn't be covered here as it isn't that interesting. However, as it's Facebook, with a huge number of servers and a whole infrastructure handling millions upon millions of requests per day, we thought we'd mention what actually happened.
When a small website like gadgetvenue goes down, you simply restart a service, quickly find out what happened, fix the offending script and off it goes again.
When Facebook went down, it was because of a problem with a script. Facebook described the problem as "an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed."
As Facebook has millions of users accessing profiles, sending messages, uploading pictures and a lot more, fixing it isn't as simple as restarting a server (there are lots of them). Instead, each time a client hit the error it interpreted the cached configuration value as invalid, deleted the cache key and queried the databases for a fresh copy. With so many clients doing this at once, the stream of queries kept pounding the databases, effectively putting Facebook in a feedback loop it couldn't recover from. This continued even after the original bad value was fixed, because the constant query load never gave the caches a chance to repopulate, as the sketch below illustrates.
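To picture the failure mode, here's a tiny, self-contained simulation of that feedback loop. This is purely illustrative and not Facebook's actual code; the class and function names (Cache, Database, get_config) and the "invalid" marker value are all made up for the example.

```python
# Minimal sketch of the cache feedback loop: every client that sees a value
# it considers invalid deletes the cache key and goes straight to the database.

class Database:
    def __init__(self):
        self.value = "invalid"     # the bad configuration value
        self.queries = 0           # count how hard the database gets hit

    def query(self, key):
        self.queries += 1
        return self.value          # while overloaded, it keeps serving the bad value


class Cache:
    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def delete(self, key):
        self.store.pop(key, None)

    def set(self, key, value):
        self.store[key] = value


def get_config(key, cache, db):
    value = cache.get(key)
    if value is None or value == "invalid":   # client treats the value as an error
        cache.delete(key)                     # drops the cache key...
        value = db.query(key)                 # ...and queries the database directly
        cache.set(key, value)                 # the bad value goes right back in
    return value


cache, db = Cache(), Database()
for client in range(10_000):                  # every client repeats the same dance
    get_config("site_config", cache, db)

print(db.queries)   # 10,000 database queries for a single key: the query storm
```

Because the database keeps handing back the same bad value, the cache never "heals", and the load only grows as more clients join in.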
To actually stop all the errors, Facebook had to cut all traffic to the site and its database cluster to let things stabilise, something along the lines of the sketch below. In all, it took around 2.5 hours for things to right themselves.
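Here's a hypothetical sketch of that kind of recovery sequence: cut traffic, let the databases drain their backlog, then re-admit users in stages so the caches can refill without triggering another query storm. None of these names (LoadBalancer, DatabaseCluster, recover) come from Facebook, and the staged re-admission is our assumption about how such a recovery is typically done.

```python
import time

class LoadBalancer:
    def __init__(self):
        self.admission_rate = 1.0              # fraction of requests let through

    def set_admission_rate(self, rate):
        self.admission_rate = rate
        print(f"admitting {rate:.0%} of traffic")


class DatabaseCluster:
    def __init__(self, backlog=5):
        self.backlog = backlog                 # pretend queue of pending queries

    def is_healthy(self):
        self.backlog = max(0, self.backlog - 1)  # backlog drains while traffic is off
        return self.backlog == 0


def recover(lb, cluster, step=0.25, pause=1):
    lb.set_admission_rate(0.0)                 # cut all traffic to the site
    while not cluster.is_healthy():            # wait for the databases to stabilise
        time.sleep(pause)
    rate = 0.0
    while rate < 1.0:                          # bring users back in gradually
        rate = min(1.0, rate + step)
        lb.set_admission_rate(rate)
        time.sleep(pause)


recover(LoadBalancer(), DatabaseCluster())
```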
Looking forward, Facebook says it is now putting systems in place to prevent this from happening again.
Via: Geeky Gadgets and Facebook