Cloud computing and Amazon’s S3 downtime – an analysis

Amazon S3 Availability Event: July 20, 2008

We wanted to provide some additional detail about the problem we experienced on Sunday, July 20th.

At 8:40am PDT, error rates in all Amazon S3 data centers began to climb quickly and our alarms went off. By 8:50am PDT, error rates were significantly elevated and very few requests were completing successfully. By 8:55am PDT, we had multiple engineers engaged and investigating the issue. Our alarms pointed at problems processing customer requests in multiple places within the system and across multiple data centers. While we began investigating several possible causes, we tried to restore system health by taking several actions to reduce system load. We reduced system load in several stages, but this had no impact on restoring system health.

At 9:41am PDT, we determined that servers within Amazon S3 were having problems communicating with each other. As background information, Amazon S3 uses a gossip protocol to quickly spread server state information throughout the system. This allows Amazon S3 to quickly route around failed or unreachable servers, among other things. When one server connects to another as part of processing a customer’s request, it starts by gossiping about the system state. Only after gossip is completed will the server send along the information related to the customer request. On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate number of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn’t able to successfully process many customer requests.
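Amazon hasn’t published the internals of S3’s gossip implementation, but the basic idea of a gossip exchange is simple enough to sketch. The Python snippet below is a minimal, hypothetical illustration (the GossipNode class, its fields, and the merge rule are my assumptions, not S3’s actual code): each server keeps a timestamped view of every peer’s status and, whenever two servers talk, each side keeps whichever entry is fresher.

    import time

    # Minimal, hypothetical sketch of a gossip-style state exchange.
    # Each node keeps a map of peer -> (status, timestamp); on contact,
    # both nodes keep whichever entry is newer for each peer.
    class GossipNode:
        def __init__(self, name):
            self.name = name
            self.state = {name: ("up", time.time())}

        def local_update(self, status):
            # Record a change observed locally (e.g. this server is degraded).
            self.state[self.name] = (status, time.time())

        def gossip_with(self, other):
            # Merge views: for every peer, keep the most recently updated entry.
            merged = dict(self.state)
            for peer, (status, ts) in other.state.items():
                if peer not in merged or ts > merged[peer][1]:
                    merged[peer] = (status, ts)
            self.state = dict(merged)
            other.state = dict(merged)

    a, b = GossipNode("server-a"), GossipNode("server-b")
    a.local_update("degraded")
    b.gossip_with(a)
    print(b.state["server-a"][0])   # server-b now knows server-a is "degraded"

Because, per the post-mortem, a request is only forwarded after this exchange completes, anything that makes gossip expensive or failure-prone starves customer traffic first.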

At 10:32am PDT, after exploring several options, we determined that we needed to shut down all communication between Amazon S3 servers, shut down all components used for request processing, clear the system’s state, and then reactivate the request processing components. By 11:05am PDT, all server-to-server communication was stopped, request processing components shut down, and the system’s state cleared. By 2:20pm PDT, we’d restored internal communication between all Amazon S3 servers and began reactivating request processing components concurrently in both the US and EU.

At 2:57pm PDT, Amazon S3’s EU location began successfully completing customer requests. The EU location came back online before the US because there are fewer servers in the EU. By 3:10pm PDT, request rates and error rates in the EU had returned to normal. At 4:02pm PDT, Amazon S3’s US location began successfully completing customer requests, and request rates and error rates had returned to normal by 4:58pm PDT.

We’ve now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers’ objects. However, we didn’t have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn’t detect it and it spread throughout the system causing the symptoms described above. We hadn’t encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
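The fix Amazon describes in item (d) below is straightforward to illustrate. The following is a hypothetical sketch, not Amazon’s code: frame each internal state message as an MD5 digest followed by the payload, verify the digest on receipt, and reject (and log) anything that doesn’t match, so a single flipped bit can no longer be merged into system state.

    import hashlib

    # Hypothetical sketch: prepend an MD5 digest to each internal state message
    # and verify it on receipt, so a single flipped bit is rejected rather than
    # gossiped onward as valid system state.
    def frame(payload: bytes) -> bytes:
        return hashlib.md5(payload).digest() + payload

    def unframe(message: bytes) -> bytes:
        digest, payload = message[:16], message[16:]
        if hashlib.md5(payload).digest() != digest:
            raise ValueError("corrupt state message rejected")
        return payload

    msg = frame(b"server-42 status=up epoch=17")
    corrupted = bytearray(msg)
    corrupted[20] ^= 0x01                     # flip one bit inside the payload
    print(unframe(bytes(msg)))                # intact message passes
    try:
        unframe(bytes(corrupted))
    except ValueError as err:
        print(err)                            # corrupted message is rejected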

During our post-mortem analysis we’ve spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we’re taking: (a) we’ve deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we’ve deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we’ve added additional monitoring and alarming of gossip rates and failures; and, (d) we’re adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.

Finally, we want you to know that we are passionate about providing the best storage service at the best price so that you can spend more time thinking about your business rather than having to focus on building scalable, reliable infrastructure. Though we’re proud of our operational record over the almost 2.5 years we’ve been operating Amazon S3, we know that any downtime is unacceptable and we won’t be satisfied until performance is statistically indistinguishable from perfect.

Sincerely,

The Amazon S3 Team


The Carambola tree….

This tree, currently about 4ft. in height (but it grows to around 33′!), is now ensconced on the front terrace. I decided the terrace looked plain, so I hied off to the nearest Home Depot and brought this back, unaware that it produces the star fruit as well!

Anyway, let’s see how it does over the next few months….

Carambola

From Wikipedia, the free encyclopedia


Averrhoa carambola

Carambola fruits

Scientific classification
Kingdom: Plantae
Division: Magnoliophyta
Class: Magnoliopsida
Order: Oxalidales
Family: Oxalidaceae
Genus: Averrhoa
Species: A. carambola
Binomial name: Averrhoa carambola L.

Carambolas still on the tree

The carambola is a species of tree native to Indonesia, India and Sri Lanka and is popular throughout Southeast Asia, Trinidad, Malaysia and parts of East Asia. It is also grown throughout the tropics. Carambola is commercially grown in the United States in south Florida and Hawaii, for its fruit, known as the starfruit. It is closely related to the bilimbi.


Health risks

Individuals with kidney trouble should avoid consuming the fruit, because of the presence of oxalic acid. Juice made from carambola can be even more dangerous owing to its concentration of the acid. It can cause hiccups, vomiting, nausea, and mental confusion. Fatal outcomes after ingestion of star fruits have been described in uraemic patients.[1][2]

Drug interactions

Like grapefruit, star fruit is considered to be a potent inhibitor of seven cytochrome P450 isoforms.[3][4] These enzymes are significant in the first pass elimination of many medicines, and thus the consumption of star fruit or its juice in combination with certain medications can significantly increase their effective dosage within the body. Research into grapefruit juice has identified a number of common medications affected, including statins which are commonly used to treat cardiovascular illness, benzodiazepines (a tranquilizer family including diazepam) as well as other medicines.[5] These interactions can be fatal if an unfortunate confluence of genetic, pharmacological, and lifestyle factors results in, for instance, heart failure, as could occur from the co-ingestion of star fruit or star fruit juice with atorvastatin (Lipitor).

History

The star fruit originally came from Sri Lanka and the Moluccas. For the past several hundred years, it has been cultivated in Malaysia.[6]


References

  1. Chang JM, et al. Am J Kidney Dis. 2000;35:189.
  2. http://www.nutritionatc.hawaii.edu/HO/2003/202.htm
  3. Abstracts: Metabolism and metabolic enzymes studies for the 8th National Congress on Drug and Xenobiotic Metabolism in China.
  4. Potential Drug-Food Interactions with Pomegranate Juice.
  5. P450 Table.
  6. Star Fruit, Carambola – star fruit facts – Food Reference.
