5 Lessons We’ve Learned Using AWS – Netflix TechBlog

https://medium.com/netflix-techblog/5-lessons-weve-learned-using-aws-1f2a28588e4c

Dorothy, you’re not in Kansas anymore.

AWS和自建的数据中心在环境上有很大差异,许多假设不再成立。

Many examples come to mind, such as hardware reliability. In our own data centers, session-based memory management was a fine approach, because any single hardware instance failure was rare. Managing state in volatile memory was reasonable, because it was rare that we would have to migrate from one instance to another. I knew to expect higher rates of individual instance failure in AWS, but I hadn’t thought through some of these sorts of implications.

Another example: in the Netflix data centers, we have a high capacity, super fast, highly reliable network. This has afforded us the luxury of designing around chatty APIs to remote systems. AWS networking has more variable latency. We’ve had to be much more structured about “over the wire” interactions, even as we’ve transitioned to a more highly distributed architecture.

Co-tenancy is hard.

AWS上面的资源是共享出来的,所以许多组件的延迟和吞吐波动性会比较大。

When designing customer-facing software for a cloud environment, it is all about managing down expected overall latency of response. AWS is built around a model of sharing resources: hardware, network, storage, etc. Co-tenancy can introduce variance in throughput at any level of the stack. You’ve got to either be willing to abandon any specific subtask, or manage your resources within AWS to avoid co-tenancy where you must.

The best way to avoid failure is to fail constantly.

服务设计上要支持降级;经常使用Chaos Monkey来做容灾演习。

Netflix: Continually Test By Failing Servers With Chaos Monkey

In 5 Lessons We’ve Learned Using AWS, Netflix's John Ciancutti says the best way to avoid failure is to fail constantly. In the cloud it's expected instances can fail at any time, so you always have to be prepared. In the real world we prepare by running drills. Remember all those exciting fire drills? It's not just fire drills of course. The military, football teams, fire fighters, beach rescue, virtually any entity that must react quickly and efficiently to disaster hones their responsiveness by running drills.

Learn with real scale, not toy models.

通过回放线上流量,不仅有助于评估AWS,还可以发现切换到AWS可能出现的瓶颈。

Before we committed ourselves to AWS, we spent time researching the platform and building test systems within it. We tried hard to simulate realistic traffic patterns against these research projects.

This was critical in helping us select AWS, but not as helpful as we expected in thinking through our architecture. Early in our production build out, we built a simple repeater and started copying full customer request traffic to our AWS systems. That is what really taught us where our bottlenecks were, and some design choices that had seemed wise on the whiteboard turned out foolish at big scale.

Commit yourself.

从自建数据中心切换到AWS是件很大的工程,需要上下一心,全身心投入。

As you run into the hurdles, have the grit and the conviction to fight through them. Our CEO, Reed Hastings, has not only been fully on board with this migration, he is the person who motivated it! His commitment, the commitment of the technology leaders across the company, helped us push through to success when we could have chosen to retreat instead.