10 Lessons from 10 Years of Amazon Web Services

Table of Contents

http://www.allthingsdistributed.com/2016/03/10-lessons-from-10-years-of-aws.html

The epoch of AWS is the launch of Amazon S3 on March 14, 2006, now almost 10 years ago.

1. Build evolvable systems

系统需要不断迭代和重新设计

Almost from day one, we knew that the software we were building would not be the software that would be running a year later. The expectation was that with each order or two of magnitude, we would need to revisit and revise the architecture to make sure we could address the issues of scale.

But we couldn't adopt the old style approach of upgrading systems through a maintenance outage, as many businesses around the world are relying on our platform for 24/7 availability. We needed to build such an architecture that we could introduce new software components without taking the service down. Marvin Theimer, Amazon Distinguished Engineer, once jokingly said that the evolution of Amazon S3 could best be described as starting off as a single engine Cessna plane, but over time the plane was upgraded to a 737, then a group of 747s, all the way to the large fleet of Airbus 380s that it is now. All the while, we were refueling in midair and moving customers from plane to plane without them even realizing it.

2. Expect the unexpected

Fault-tolerant应该放在系统设计的第一位.

Failures are a given and everything will eventually fail over time: from routers to hard disks, from operating systems to memory units corrupting TCP packets, from transient errors to permanent failures. This is a given, whether you are using the highest quality hardware or lowest cost components.

This becomes an even more important lesson at scale: for example, as S3 processes trillions and trillions of storage transactions, anything that has even the slightest probability of error will become realistic. Many of those failure scenarios can be anticipated beforehand, but many more are unknown at design and build time.

We needed to build systems that embrace failure as a natural occurrence even if we did not know what the failure might be. Systems need to keep running even if the "house is on fire." It is important to be able to manage pieces that are impacted without the need to take the overall system down. We've developed the fundamental skill of managing the "blast radius" of a failure occurrence such that the overall health of the system can be maintained.

3. Primitives not frameworks

设计SaaS要注意的. 用户的想象力是无穷的, 不要通过frameowork去限定他们. 但是好像AWS现在也开始推出一些framework服务了比如Elastic Beanstalk, 但是这些肯定都是以primitive service为基础的.

Pretty quickly, we started to realize that the way customers would like to use our services was a work in progress. When customers left the constraining, old world of IT hardware and datacenters behind, they started to develop systems with new and interesting usage patterns that no one had ever seen before. As such, we needed to be ultra-agile to make sure we were catering to our customers' needs.

One of the most important mechanisms we provided was to offer customers a collection of primitives and tools, where they could pick and choose their preferred way to engage with the AWS cloud, instead of only providing one framework that they are forced to use, which includes everything and the kitchen sink. This approach has enabled our customers to become so successful, that even later generations of AWS services make use of exactly the same primitive services our customers have become accustomed to.

It is also important to realize that it is hard to predict what certain priorities are for your customers until they have the service in their hands and actually start building with it. This is why we deliver new services often with a minimal feature set and allow our customers to help drive the roadmap for extending the service with new features.

4. Automation is key

SaaS平台都必须实现高度自动化. 而高度自动化的基础就是服务API化, 所有的服务都要能找到对应的API.

Developing software services that need to be operated is radically different from building software that needs to be shipped to customers. Managing systems at scale requires a very different mindset to ensure that we meet the reliability, performance, and scalability expectations of our customers.

A key mechanism to achieve this is to automate the management as much as possible, removing error prone, manual operations. To do this, we needed to build management APIs that control the key functionality of our operations. AWS helps its customers do this too. By decomposing your applications into essential building blocks, each with a management API, you can apply automation rules to maintain reliable and predictable performance at scale. A good litmus test has been that if you need to SSH into a server or an instance, you still have more to automate.

5. APIs are forever

This was a lesson we had already learned from our experiences with Amazon retail, but it became even more important for AWS's API-centric business. Once customers started building their applications and systems using our APIs, changing those APIs becomes impossible, as we would be impacting our customer's business operations if we would do so. We knew that designing APIs was a very important task as we'd only have one chance to get it right.

6. Know your resource usage

建立收费模型的时候, 一定要弄清楚自己的资源使用量, 尤其是在服务通过薄利多销(high volume - low margin)的方式赚钱的时候.

When building a financial model for a service to identify the appropriate charging model, be sure to have good data about the cost of the service and its operations, especially for running a high volume – low margin business. AWS needed to be very conscious as a service provider about our costs so that we could afford to offer our services to customers and identify areas where we could drive operational efficiencies to cut costs further, and then offer those savings back to our customers in the form of lower prices.

An example in the early days where we did not know the resources required to serve certain usage patterns was with S3: We had assumed that the storage and bandwidth were the resources we should charge for; after running for a while, we realized that the number of requests was an equally important resource. If customers have many tiny files, then storage and bandwidth don't amount to much even if they are making millions of requests. We had to adjust our model to account for the all the dimensions of resource usage so that AWS could be a sustainable business.

7. Build security in from the ground up

服务安全性从第一天就要开始考虑, 并且始终要纳入系统设计的考虑当中.

Protecting your customers should always be your number one priority, and it certainly has been for AWS… from both an operational perspective as well as tools and mechanisms; it will forever be our number one investment area.

One approach that we learned quickly is that to build secure services, it is necessary to integrate security at the very beginning of service design. The security team is not a group that does validation after something has been built. They must be partners on day one to make sure that security is fundamentally rock solid from the ground up. There is no compromise when it comes to security

8. Encryption is a first-class citizen

存储系统要允许数据加密. 加密的方式可以是存储系统自己提供, 让客户包来保管加密的key.

Encryption is a key mechanism for customers to ensure that they are in full control over who has access to their data. Ten years ago, tools and services for encryption were hard to use and it wasn’t until a few years into our operations that we learned how to best integrate encryption into our services.

It started by providing server-side encryption in S3 for compliance use cases. If you would inspect any disks in our datacenters, none of the data would be accessible. But with the launch of Amazon CloudHSM (for hardware security models) and later Amazon Key Management Service, customers could use their own keys for encryption, which removed the need for AWS to manage their keys.

For some time now, support for encryption has been integrated at the design phase of each new service. For example, in Amazon Redshift, each of the data blocks is encrypted by default with a random key and the collection of these random keys is again encrypted with a master key. The master key can be provided by customers, ensuring that they are the only ones who can decrypt and have access to their critical business data or personal identifiable information.

Encryption continues to be a high priority for our business. We will continue to make it even easier for our customers to make use of encryption so they can better protect themselves and their customers.

9. The importance of the network

网络的重要性不言而喻, 体现在安全性和性能两个方面. aws自己设计的网络硬件和软件改善了网络虚拟化方面的表现和性能.

AWS has come to support many different workloads; from high-volume transaction processing to video transcoding at scale, from high-performance parallel computing to massive web site traffic. Each of those workloads have unique requirements when it comes to the network.

AWS has developed a unique skill to innovate datacenter layout and operations, such that we can have flexible network infrastructure that can be adapted to meet the needs of our customers’ workloads, whatever they may be. We have learned over time that we should not be afraid to develop our own hardware solutions to ensure our customers can achieve their goals. This enables us to meet our very specific requirements, such as the ability to isolate AWS customers from each other on the network to achieve the highest levels of security.

Another successful example of how AWS-designed networking hardware and software enabled us to further improve performance for our customers was in addressing the virtualization tax on network access from virtual machines. Because network access is a shared resource, customers previously could experience significant jitter on the network at times. Developing a NIC that supported single root IO virtualization allowed us to give each VM its own hardware virtualized NIC. This lowered latency more than 2x and delivered more than 10x improvement in latency variability on the network.

10. No gatekeepers

服务的开放性允许更多的使用场景出现在aws上, 反过来aws通过跟踪这些使用场景可以得到反馈来改进系统.

The AWS team has delivered many services and features over time to create a very broad and deep platform for our customers. But AWS is so much more than the services that we built in-house: a very rich ecosystem exists of services delivered by our partners that extends the platform into many new directions.

For example, we have partners like Stripe offering payment services to Twilio making telephony programmable on AWS. Many of our customers are also building platforms themselves on top of AWS to serve specific vertical needs: Philips is building their Healthsuite Digital Platform for managing healthcare data, Ohpen has built a platform for retail banking on AWS, Eagle Genomics has built a genomics processing platform, and many more. What’s essential is that there are no gatekeepers on the AWS platform that tell our partners what they can and cannot do. “No gatekeepers” liberates the innovative processes and opens the door for many unexpected inventions, which are sure to follow.