Reinventing virtualization with the AWS Nitro System

https://www.allthingsdistributed.com/2020/09/reinventing-virtualization-with-aws-nitro.html

最开始在ec2上面做虚拟化,使用的是Xen技术,有30%的开销在虚拟机,资源管理比如网络磁盘,以及安全监控上。

In the early days of EC2, we used the Xen hypervisor, which is purely software-based, to protect the physical hardware and system firmware; virtualize the CPU, storage, and networking; and provide a rich set of management capabilities. But with this architecture, as much as 30% of the resources in an instance were allocated to the hypervisor and operational management for network, storage, and monitoring.

nitro其实要做的事情就是将资源管理以及安全监控放在专有的硬件上面,而虚拟机则针对这种定制化的硬件进行优化。

The Nitro System is comprised of three main parts: the Nitro Cards, the Nitro Security Chip, and the Nitro Hypervisor. The Nitro Cards are a family of cards that offloads and accelerates IO for functions, including Virtual Private Cloud (VPC), Elastic Block Store (EBS), and Instance Storage, thereby increasing overall system performance.

好处是: 1. 使用专有硬件开销会更小 2. 网络和磁盘延迟的抖动会更小 3. 更好的安全性。


至于为什么将资源管理放在专有硬件上开销会更小呢? 评论下面有人这么说的。原先这些开销都是放在intel/amd的x86芯片上做的,如果我们把这些工作迁移到价格低廉的arm芯片上,那么价钱就会便宜很多。注意这些工作并不是客户的应用程序,所以他们不关心是跑在x86还是arm上。

There's a couple of things going on here. One is that most EC2 instances are Intel, or AMD, whereas the Annapurna Labs chips on the Nitro cards are Arm based, and importantly for Amazon: much cheaper. This Hacker News thread has estimates of $30 for the Arm chip in Apple's iPhone. That compares to hundreds, or thousands of dollars for an Intel/AMD server chip. So just by offloading some of their workload from Intel/AMD to Arm, Amazon can save a lot of money. They spend billions of dollars on CAPEX per year, so even tens to hundreds of millions of dollars for "new hardware technology" can be worthwhile.

Secondly, since the Nitro Cards are basically Arm chips, they're not so much "new hardware technology". They are an interesting and novel solution to a problem using commodity parts. Arm chips are used everywhere from phones to cars to random household gadgets, they can be produced very cheaply. There were 6 billion of them shipped in Q4 2019.

关于网络和磁盘抖动,评论里面也有很提到,虚拟机对IO延迟影响很大。所以用专门的芯片来做,会是更好的解决方案。

Traditional hypervisors don't cut it for demanding workloads. I learned this first hand in 2013 when I was building a large-scale message queueing system on VMware vSphere. The IO from VM to NFS mounts had so much jitter that my spill-to-disk processes kept crashing. One must optimize down to the lowest levels of the stack to get the price-to-performance outcome.


评论区里面还提到了google/ms和amazon的文化对比,我摘抄几段过来

关于Customer Obsession的

This is an interesting point. I have a bunch of senior ex-Googlers on my team these days and one of the topics we discuss is "why did AWS beat Google Cloud even though they started at similar times and Google had more and better engineers with more experience of doing cloud-y things at global scale". This is especially personal for me because when I joined AWS in 2008, I did so expecting Google and Microsoft to crush us for exactly this reason (well, and their massive margins from other businesses allowing them to invest more, which I had to read "Innovators Dilemma" to understand that was a curse as much as a blessing).

My point of view, is when Amazonian's talk so much about their value of "Customer Obsession", outsiders really don't get it, as most company's have an equivalent value, and its just something they take as meaning "don't screw the people who are paying us so they don't want to pay us any more". Whereas, what it means in practice at Amazon is much stronger: "while we have think big visions about what our ideal system/business would be for us to innovate on and operate, we aren't going to force our customers into it at our pace; instead we are going to work backwards from what they want to end up with a compromise. And we will hold to that, even if it makes our employees jobs less satisfying in the meantime in terms of how much they get to innovate".

关于Engineering Culture的

This is a couple of good points too. I know people who worked pretty closely with Diane and Urs, and I got to observe Jassy pretty closely my last year on Nitro. Jassy breathes customer obsession. And I think (again through second hand conversations) Urs' biggest problem in this space is very much his lack of it - like a lot of ex-engineering leaders he wanted to move the world to the ideal (which Google had internally in many ways), but no strategy other than a massive "pretend theres no computer" sandbox (early AppEngine) then just trying to expose the inside ideal with an outside API, then finally just copying AWS.

Diane had to start from a culture built around that, and it sounded like she struggled with typical "change the culture" stuff, which sounds trite but as any manager would tell you, changing engineering cultures is incredibly hard, let alone trying to do it in a division of a company as successful in other parts as Google.

关于Software Engineer应该是什么样的

The "more and better engineers" is also interesting. Joy Ganguly, who was the tech lead of Nitro in 2013 and a massive heavy lifter in making the thing succeed, came to us from pure software development on Microsoft's Hyper-V. He once said something off the cuff comparing the companies engineers: "you guys are much more plumbers than software developers, and I like it". This comes to my "employees jobs less satisfying" comment. Any Amazon engineer/manager knows they hire a bunch of software developers, and then they all spend less time writing new and substantial code on new systems, and more time plumbing doing fixes and devops style scaling of existing systems. And I (like you it seems) believe that is the way you agilely build world class global systems. Whereas at Google of course, they protect a lot of their software developers from that through SRE, with both sides combining to do things more right in the long term (e.g. GCE is way cleaner as an API than EC2), but a lot slower than their customers would like in the mean time.