How Yelp Runs Millions of Tests Every Day

Seagull is a system that lets many test suites finish quickly in parallel, which ultimately speeds up development iteration and code deployment.

Seagull is built using the following:


There are around 300 Seagull runs every day, with 30-40 per hour at peak time. They launch more than 2 million Docker containers in a day. To handle this, we need around 10,000 CPU cores in our Seagull cluster during peak hours. (At this scale, the EC2 bill can be expected to be very high.)
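To put those figures in proportion, here is a rough back-of-the-envelope calculation. It assumes containers are spread evenly across runs, which the text does not state, and it treats "more than 2 million" as exactly 2 million, so the results are lower bounds. The function names are made up for this illustration.

```python
# Approximate figures quoted in the text above.
RUNS_PER_DAY = 300
CONTAINERS_PER_DAY = 2_000_000   # "more than 2 million", so a lower bound
PEAK_RUNS_PER_HOUR = (30, 40)    # low and high end of the peak range


def containers_per_run(containers_per_day: int, runs_per_day: int) -> float:
    """Average containers launched by one run (assumes an even spread)."""
    return containers_per_day / runs_per_day


def peak_containers_per_hour(per_run: float, peak_runs: tuple) -> tuple:
    """Containers launched per hour at the low and high end of peak load."""
    low_runs, high_runs = peak_runs
    return per_run * low_runs, per_run * high_runs


per_run = containers_per_run(CONTAINERS_PER_DAY, RUNS_PER_DAY)
low, high = peak_containers_per_hour(per_run, PEAK_RUNS_PER_HOUR)

print(f"containers per run (average): {per_run:,.0f}")
print(f"containers per hour at peak:  {low:,.0f} to {high:,.0f}")
```

Under these assumptions a single run launches several thousand containers, which is why peak hours dominate the capacity requirement.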

To maintain the timeliness of our test suite, especially at peak hours, we need hundreds of instances always available in the Seagull cluster. For a while we were using AWS ASGs with AWS On-Demand Instances, but fulfilling this capacity was very expensive for us. (Even with ASGs and On-Demand Instances, the cost remained high.)

To reduce costs, we started using an internal tool called FleetMiser to maintain the Seagull cluster. FleetMiser is an auto-scaling engine that we built to scale a cluster based on different signals, such as current cluster utilization and the number of runs in the pipeline. It has 2 main components: (The in-house FleetMiser system triggers auto scaling from multiple signals, and it uses Spot Instances.)

FleetMiser saved us ~80% in cluster cost. Before FleetMiser, the cluster ran entirely on AWS On-Demand Instances with no auto scaling.
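The idea of scaling on signals such as cluster utilization and runs waiting in the pipeline can be illustrated with a minimal sketch. This is not FleetMiser's actual logic, which the text does not describe. The class name, the function name, and every parameter value (cores per run, target utilization, minimum size) are hypothetical. Only the 10,000-core ceiling echoes the peak figure quoted earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSignals:
    """Hypothetical snapshot of the signals an autoscaler might read."""
    total_cores: int        # cores currently provisioned
    used_cores: int         # cores currently busy
    runs_in_pipeline: int   # runs queued but not yet started


def desired_cores(
    signals: ClusterSignals,
    cores_per_run: int = 250,          # assumed average demand of one run
    target_utilization_pct: int = 80,  # assumed headroom target
    min_cores: int = 500,              # assumed floor
    max_cores: int = 10_000,           # peak size mentioned in the text
) -> int:
    """Return a target cluster size from current load plus queued work."""
    # Expected demand: what is busy now plus what queued runs will need.
    demand = signals.used_cores + signals.runs_in_pipeline * cores_per_run
    # Size the cluster so that demand sits at the target utilization.
    # Integer ceiling division avoids floating-point rounding surprises.
    wanted = (demand * 100 + target_utilization_pct - 1) // target_utilization_pct
    # Clamp to the allowed range.
    return max(min_cores, min(max_cores, wanted))


busy = ClusterSignals(total_cores=4_000, used_cores=3_600, runs_in_pipeline=8)
idle = ClusterSignals(total_cores=4_000, used_cores=100, runs_in_pipeline=0)

print("busy cluster target:", desired_cores(busy))
print("idle cluster target:", desired_cores(idle))
```

A real engine would also need to handle Spot Instance interruptions and avoid rapid scale-up and scale-down oscillation. Both are left out here to keep the sketch short.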