Wednesday, May 1, 2013

Outages - my observation



Amazon.com was down for a brief period last Monday. A few hours, give or take. Hacker News was the first to report it. Or rather, HN is where I first got to know about the outage.

The news item read "Was Amazon down?" and pointed to the Amazon home page. Chaos ensued. It triggered a debate. People raised questions about the infrastructure. Some of them made sense, but the rest were just moot points about the revenue lost per minute, per hour and so on. Outages are nothing new to companies like Amazon and eBay, and when they occur there is heavy revenue loss. Agreed. But it's not as if these companies don't care about them.

If you think about it, major global banks have their own maintenance window [moratorium period] during which they suspend activity and run tests to ensure that things are running as expected. Many offer a limited range of services during such periods. To be fair, e-commerce companies don't have that luxury. You cannot display a "Sorry boss. We have exceeded our daily limit of 10,000 users. Do log in tomorrow to make a purchase" message to the 10,001st user who logged in hoping to cash in on the discounts.

When I was with eBay, I had a chance to observe how much the teams, in general, cared about outages. Keeping servers up and running 24x7 is of the utmost priority for these sites. The same goes for any e-commerce site with a global user base, as these sites depend largely on the number of visitors. For anyone to make a purchase, the site has to be up and running. Fewer outages translate to being able to serve more customers, which in turn translates to more revenue (at least in theory). That's why these companies emphasize having Site Reliability and SWAT teams on their toes 24x7 to handle outages of any kind.

That said, I vividly remember reading this article. It analyzes the downtime and performance of sites during the 2011 US holiday season. If you look at it, both eBay and Amazon had a staggering uptime of 100%. Mind-blowing, isn't it?

So, say I have a site which caters to a reasonably large audience across the globe. How do I make sure that it's up and running all the time, or at least with minimal downtime?

Companies like eBay and Amazon can afford to have the necessary equipment in place to begin with, and teams across geographies to monitor its health. Also, at their scale and with their number of servers, all it takes is to remove the faulty machine from traffic; for the rest of the machines, it's business as usual. That gives the support teams time to figure out the issue and fix it. At the other end, setting up a team to monitor one or two servers is overkill. My friend was working on an internal service which was deployed in a Tomcat instance accessible only to a specified group. He wrote a simple Java utility which would ping the machine at periodic intervals to check whether it was up and running, and he ran it as a Windows service. The problem lies with the mid-sized teams with, say, about 10-20 servers. How can they go about monitoring their system health without manual intervention?
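A periodic ping utility of the kind described above could look roughly like this. It is only a minimal sketch, not the original utility: the class name `HealthPinger`, the default URL `http://localhost:8080/`, the 3-second timeout and the 60-second interval are all assumptions made for illustration.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical utility: polls one URL at a fixed interval and prints UP or DOWN.
class HealthPinger {

    // Assumption: any 2xx or 3xx response counts as "up"; everything else is "down".
    static boolean isHealthyStatus(int statusCode) {
        return statusCode >= 200 && statusCode < 400;
    }

    // Sends a single GET request and reports whether the server answered in time.
    static boolean isUp(String url, int timeoutMillis) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            try {
                return isHealthyStatus(conn.getResponseCode());
            } finally {
                conn.disconnect();
            }
        } catch (Exception e) {
            // Refused connection, timeout, unknown host or a malformed URL:
            // all of them are treated as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical default; pass the real service URL as the first argument.
        String url = args.length > 0 ? args[0] : "http://localhost:8080/";
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Check immediately, then once every 60 seconds, until the process is stopped.
        scheduler.scheduleAtFixedRate(() -> {
            boolean up = isUp(url, 3000);
            System.out.println(Instant.now() + " " + url + " " + (up ? "UP" : "DOWN"));
        }, 0, 60, TimeUnit.SECONDS);
    }
}
```

On Java 11 or later this can be started with `java HealthPinger.java <url>`. The `println` is a stand-in; in practice that is where a mail or pager notification would go.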

Maybe they can build a dashboard like this. But it requires someone to hit the page to know the status of the system. Another way would be to periodically scan the logs for exceptions and notify a list of concerned people. Anything else?
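The log-scanning idea can be sketched in a few lines of Java. Everything named here is hypothetical: the class `LogExceptionScanner`, the default file `app.log`, and the rule that any line containing a word ending in `Exception` is worth reporting. Printing stands in for the actual notification.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: finds log lines that mention a Java exception class.
class LogExceptionScanner {

    // Assumption: a word ending in "Exception" (e.g. NullPointerException) marks a problem line.
    private static final Pattern EXCEPTION = Pattern.compile("\\w*Exception\\b");

    // Returns, in order, every line that contains an exception name.
    static List<String> findExceptionLines(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (EXCEPTION.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    // Builds the text that would be sent to the notification list.
    static String buildAlert(String logName, List<String> hits) {
        return hits.size() + " exception line(s) in " + logName + ":\n"
                + String.join("\n", hits);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default; pass the real log file as the first argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "app.log");
        List<String> hits = findExceptionLines(Files.readAllLines(log));
        if (!hits.isEmpty()) {
            // Placeholder for mailing the concerned list.
            System.out.println(buildAlert(log.getFileName().toString(), hits));
        }
    }
}
```

This does one pass over the whole file, so it would have to be run on a schedule (cron, Task Scheduler or a `ScheduledExecutorService`), and it would need to remember how far it had read to avoid reporting the same lines twice.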

The larger picture: how do we ensure that the services are available 24x7?

Please pitch in with your ideas.

PS: I have used the term site and company interchangeably in this article.

Happy coding :)
~ cheers.!

2 comments:

  1. Most of these large companies use statistics to their advantage. They ride the MTTF wave all the time but don't succumb, primarily because of sheer numbers. One cluster of servers may fail, but all 100,000 servers failing simultaneously is an astronomically small probability. And they engineer their infrastructure to make sure that probability is as small as economically and humanly possible.

    In fact, they've gotten so good at this that they sell it as a service, à la AWS. I was told by an architect at Amazon that they run the entire site on AWS using exactly the same services that are available to the general public.

  2. Point. Agreed.
    My point is that these companies do take serious note of such one-off outages. A considerable amount of effort goes into maintaining infrastructure of that scale.
    I am more concerned about the mid-sized companies and the approach they adopt to ensure availability.
