The Anatomy Of The Google Architecture

Table of Contents @ 2009.12

1 The Google Philosophy

  • Jedis build their own lightsabres (the MS Eat your own Dog Food)
  • Parallelize Everything
  • Distribute Everything (to atomic level if possible)
  • Compress Everything (CPU cheaper than bandwidth) 优化带宽
  • Secure Everything (you can never be too paranoid)
  • Cache (almost) Everything
  • Redundantize Everything (in triplicate usually)
  • Latency is VERY evil

2 The Basic Glue


  1. Exterior Network (Perimeter Architecture) (外部接入层)
  2. Data Centre(数据中心)
  3. Rack Characteristics(机架设计)
  4. Core Server Hardware(硬件设计)
  5. Operating System Implementation(操作系统)
  6. Interior Network Architecture(内部网络架构)

2.1 Exterior Network


  • DNS Load Balanced splits traffic (country, .com multiple DNS, other X1) to FW
  • Firewall filters traffic (http/s, smtp,pop etc)
  • Netscalar Load Balancers take Request from FW blocks DOS attacks, ping floods (DOS) - blocks non IPv4/6 and none 80/443 ports and http multiplexes (limited caching capability)
  • User Request forwarded to Squid (Reverse Proxy) probably HUGE cache (Petabytes?)
    • 反向代理,似乎是穿透型的cache
    • 缓存命中率30-60%
    • All Image Thumbnails caches, much Multimedia cached, Expensive common queries cached 缩略图片,多媒体以及开销比较大的搜索
  • If not in Cache forwarded to GWS (Custom C++ Web Server) - now not using Custom apache?
  • GWS sends the Request to appropriate internal (Cell) servers

2.2 Data Centre

  • Last estimated were 36 Data Centers, 300+ GFSII Clusters and upwards of 800K machines.(36个数据中心,300+ GFS2集群, 80万机器
  • US (#1) - Europe (#2) - Asia (#3) - South America/Russia (#4)
  • Australia - on Hold
  • Future: Taiwan, Malaysia, Lithuania, and Blythewood, South Carolina.
  • Standard Google Modular DC (Cell) holds 1160 Servers / 250KW Power Consumption in 30 racks (40U).(cell有30个rack,支持40U one side.)
  • A Data Centre would consist of 100s of Modular Cells.(每个数据中心最多100左右个cell)
  • MDCs can also be deployed autonomously at the Perimeter (stand alone). MDC可以独立部署

2.3 Rack


  • Mini Server Size
    • Old Servers are Custom 1U
    • New Servers are 2U
    • seem 1/3 width of a normal 2U Server 宽度为普通2U服务器的1/3宽
  • 40U/80U Custom Racks (50% each side)
    • Huge Heating and Power Issues(冷却系统)
    • Optimized Motherboards(主板优化)
    • Have their own HW builds(定制硬件)
  • Motherboard directly mounted into Rack
    • servers have no casing - just bare boards(没有盖子)
    • assist with heat dispersal issues

2.4 Hardware

#note: 配置都非常普通

  • 2U Low-Cost (but not slow) Commodity Servers
    • 2009 Currently 2-Way, Dual Core/16GB/1-2TB +- Standard
    • Both Intel/AMD Chipsets - 1 NIC - 2 USB
    • Looks like they RAID1/mirror the disks for better I/O - read performance
    • SATA 7.2K/10K/15K drives? 8 x 2GB DDR3 ECC
  • Standard HW Build (Several HW Build Versions at any one time)
    • Currently at 7Gen Build (1G 2005 was probably Dual Core/SMP)
    • Each Server 12V Battery Backup and can run autonomously without external power (lasts 20-30s?)
YEAR Average Server Specification
1999/2000 PII/PIII 128MB+
2003/2004 Celeron 533, PIII 1.4 SMP, 2-4GB DRAM, Dual XEON 2.0/1-4GB/40-160GB IDE - SATA Disks via Silicon Images SATA 3114/SATA 3124
2006 Dual Opteron/Working Set DRAM(4GB+)/2x400GB IDE (RAID0?)
2009 2-Way/Dual Core/16GB/1-2TB SATA

2.5 Operating System

  • 100% Redhat Linux Based since 1998 inception
    • RHEL (Why not CentOS?)
    • 2.6.X Kernel
    • PAE(Physical Address Extension) 物理地址扩展,32位下面支持64GB内存
    • Custom glibc.. rpc… ipvs…
    • Custom FS (GFS II)
    • Custom Kerberos
    • Custom NFS
    • Custom CUPS
    • Custom gPXE bootloader #note: open-source network booting software
    • Custom EVERYTHING…..
  • Kernel/Subsystem Modifications
    • tcmalloc - replaces glibc 2.3 malloc - much faster! works very well with threads…
    • rpc - the rpc layer extensively modified to provide > perf increase < latency (52%/40%) #todo: ???
    • Significantly modified Kernel and Subsystems - all IPv6 enabled
    • Developed and maintained systems to automate installation, updates, and upgrades of Linux systems.
    • Served as technical lead of team responsible for customizing and deploying Linux to internal systems and workstations.
  • Use Python as the primary scripting language
  • Deploy Ubuntu internally (likely for the Desktop) - also Chrome OS base

2.6 Interior Network

Routing Protocol:

  • Internal network is IPv6 (exterior machines can be reached using IPv6)
  • Heavily Modified Version of OSPF as the IRP
  • Intra-rack network is 100baseT
  • Inter-rack network is 1000baseT
  • Inter-DC network pipes unknown but very fast


  • Juniper, Cisco, Foundry, HP, routers and switches


  • ipvs (ip virtual server)

3 The Major Glue


  • Google File System Architecture - GFS II
  • Google Database - Bigtable
  • Google Computation - Mapreduce
  • Google Scheduling - GWQ


  • GFS II “Colossus“ Version 2 improves in many ways (is a complete rewrite)
  • Elegant Master Failover (no more 2s delays…) master 2s内可以恢复
  • Chunk Size is now 1MB - likely to improve latency for serving data other than Indexing 偏向实时处理,chunksize=1MB
  • Master can store more Chunk Metadata (therefore more chunks addressable up to 100 million) = also more Chunk Servers 支持亿级别chunk


  • Increased Scalability (across Namespace/Datacenters)
    • Tablets spread over DC s for a table but expensive (both computationally and financially!) #note: 对于tablet跨数据中心的话代价非常大
  • Multiple Bigtable Clusters replicated throughout DC 数据中心之间的bigtable集群相互同步。
  • Current Status
    • Many Hundreds may be thousands of Bigtable Cells. Late 2009 stated 500 Bigtable clusters(2009年500个多个bigtable cluster)
    • At minimum scaled to many thousands of machine per cell in production 每个集群上面有上千台机器。
    • Cells manage Managing 3-figure TB data (0.X PB) 每个集群管理PB级别数据。


    • In September 2009 Google ran 3,467,000 MR Jobs with an average 475 sec completion time averaging 488 machines per MR and utilising 25.5K Machine years
    • Technique extensively used by Yahoo with Hadoop (similar architecture to Google) and Facebook (since 06 multiple Hadoop clusters, one being 2500CPU/1PB with HBase).


  • Batch Submission/Scheduler System 批量提交和调度系统
  • Arbitrates (process priorities) Schedules, Allocates Resources, process failover, Reports status, collects results 优先级分配资源,处理failover,汇报状态
  • Workqueue can manage many tens of thousands of machines 管理上万机器
  • Launched via API or command line (sawzall example shown)
saw --program code.szl --workqueue testing
--input_files /gfs/cluster1/2005-02-0[1-7]/submits.* \
--destination /gfs/cluster2/$USER/output@100



  • Google PROFITS US $16M A DAY
  • “Libraries are the predominant way of building programs”
  • Agile Methodologies Used (development iterations, teamwork, collaboration, and process adaptability throughout the life-cycle of the project) #todo: 敏捷开发?
  • An infrastructure handles versioning of applications so they can be release without a fear of breaking things = roll out with minimal QA #todo: 持续集成?