Why Google Stores Billions of Lines of Code in a Single Repository

Table of Contents

http://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext

Custom-built monolithic source repository, Google's "trunk-based development" strategy

1. Google-Scale

The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google's entire 18-year existence. The repository contains 86TBa of data, including approximately two billion lines of code in nine million unique source files. The total number of files also includes source files copied into release branches, files that are deleted at the latest revision, configuration files, documentation, and supporting data files; see the table here for a summary of Google's repository statistics from January 2015. (google repo的统计数据)

Pasted-Image-20231225104037.png

In 2014, approximately 15 million lines of code were changed in approximately 250,000 files in the Google repository on a weekly basis. The Linux kernel is a prominent example of a large open source software repository containing approximately 15 million lines of code in 40,000 files. (和linux内核的commits数量对比)

Google's codebase is shared by more than 25,000 Google software developers from dozens of offices in countries around the world. On a typical workday, they commit 16,000 changes to the codebase, and another 24,000 changes are committed by automated systems. Each day the repository serves billions of file read requests, with approximately 800,000 queries per second during peak traffic and an average of approximately 500,000 queries per second each workday. Most of this traffic originates from Google's distributed build-and-test systems. (从read requests数量上看,google repo的规模级别确实很惊人)

2. Background

Piper and CitC.

  • piper最开始存储在bigtable上,后来搬到了spanner。(10 data centers, paxos)
  • 最开始使用的是perforce(on a single machine), 10年后切换到piper
  • piper支持文件级别的权限控制。除了一些机密算法相关外,代码对所有人都是可见的。
  • piper典型的workflow和git很相似, 通过citc客户端来进行访问。
  • citc(clients in the cloud)使用云端存储系统,然后用fuse包装成为本地文件系统。
  • citc使用cow方式修改文件, 这样developer不用把整个项目同步到本地。
  • 另外使用citc可以方便地把许多工具集成进来, 比如code search, code review.

Trunk-based development

  • 基于主线开发避免merge冲突以及依赖冲突等问题.
  • 定期地从trunk上拉出release. 重大bug的修改会cherry-pick到上面.
  • 通过条件变量来控制是否要使用new feature以及做A/B Test.

Google workflow

  • 构建工具(blaze)在代码提交阶段就发现错误
  • code owner(人工)确保代码质量
  • 静态分析工具(tricorder)来检查代码
  • 重构工具(rosie)来提出代码修改建议

3. Analysis

Advantages

  • Unified versioning, one source of truth;
  • Extensive code sharing and reuse;
  • Simplified dependency management;
  • Atomic changes;
    • A developer can make a major change touching hun- dreds or thousands of files across the repository in a single consistent operation.
    • For instance, a developer can rename a class or function in a single commit and yet not break any builds or tests.
  • Large-scale refactoring;
  • Collaboration across teams;
  • Flexible team boundaries and code ownership; and
  • Code visibility and clear tree structure providing implicit team namespacing.

Costs and trade-offs

  • Tooling investments for both development and execution;
  • Codebase complexity, including unnecessary dependencies and difficulties with code discovery; and
  • Effort invested in code health.

所有这些cost和trade-off都要求google必须在工具上有所跟进, 比如大规模的重构工具以及分布式编译环境等。

4. Alternatives

5. Conclusion

The monolithic model of source code management is not for everyone. It is best suited to organizations like Google, with an open and collaborative culture. It would not work well for organizations where large parts of the codebase are private or hidden between groups. (总之这玩意并不适合所有人,而且肯定也只适合google/facebook/ms这种愿意在工具上投入大量精力的公司)