Google Cluster Computing Faculty Training Workshop

Table of Contents

1 Module 1 - Introduction to Distributed Systems Teaching

Parallel vs. Distributed

  • Parallel computing can mean:
    • Vector processing of data (SIMD)
    • Multiple CPUs in a single computer (MIMD)
  • Distributed computing is multiple CPUs across many computers (MIMD)

What is Different in Distributed?

  • Higher inter-CPU communication latency
  • Individual nodes need to act more autonomously
  • Different nodes can be heterogeneous (by function, location…)
  • System reliability is much harder to maintain

“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable” – Leslie Lamport

Reliability Demands

  • Support partial failure. Total system must support graceful decline in application performance rather than a full halt
  • Data Recoverability. If components fail, their workload must be picked up by still-functioning units
  • Individual Recoverability. Nodes that fail and restart must be able to rejoin the group activity without a full group restart
  • Consistency. Concurrent operations or partial internal failures should not cause externally visible nondeterminism
  • Scalability. Adding increased load to a system should not cause outright failure, but a graceful decline. Increasing resources should support a proportional increase in load capacity
  • Security. The entire system should be impervious to unauthorized access. Requires considering many more attack vectors than single-machine systems

“Failure is the defining difference between distributed and local programming” – Ken Arnold, CORBA designer


  • Component Failure. Individual nodes simply stop
  • Data Failure. Packets omitted by overtaxed router, Or dropped by full receive-buffer in kernel, Corrupt data retrieved from disk or net
  • Network Failure. External & internal links can die. Some can be routed around in ring or mesh topology, Star topology may cause individual nodes to appear to halt, Tree topology may cause “split”, Messages may be sent multiple times or not at all or in corrupted form…
  • Timing Failure. Temporal properties may be violated. Lack of “heartbeat” message may be interpreted as component halt. Clock skew between nodes may confuse version-aware data readers
  • Byzantine Failure. Difficult-to-reason-about circumstances arise. Commands sent to foreign node are not confirmed: What can we reason about the state of the system?
  • Malicious Failure. Malicious (or maybe naïve) operator injects invalid or harmful commands into system
  • Correlated Failures. Multiple CPUs/hard drives from same manufacturer lot may fail together. Power outage at one data center may cause demand overload at failover center

Preparing for Failure

  • Distributed systems must be robust to these failure conditions, But there are lots of pitfalls…
  • Dealing With Component Failure
    • Use heartbeats to monitor component availability
    • “Buddy” or “Parent” node is aware of desired computation and can restart it elsewhere if needed
    • Individual storage nodes should not be the sole owner of data
  • Dealing With Data Failure
    • Data should be check-summed and verified at several points. Never trust another machine to do your data validation!
    • Sequence identifiers can be used to ensure commands, packets are not lost
  • Dealing With Network Failure
    • Have well-defined split policy
    • Networks should routinely self-discover topology
    • Well-defined arbitration/leader election protocols determine authoritative components
    • Inactive components should gracefully clean up and wait for network rejoin
  • Dealing With Other Failures
    • Individual application-specific problems can be difficult to envision
    • Make as few assumptions about foreign machines as possible
    • Design for security at each step

2 Module 3 - Nutch

#note: 介绍nutch工作过程



  • Acquisition cycle
    • WeDB (可以认为是LinkDB,保存page meta信息以及links)
      • Contains info on all pages, links
      • URL, last download, # failures, link score, content hash, ref counting
      • Source hash, target URL
    • Fetcher
      • Pre-MapRed: divide “to-fetch list” into k pieces, one for each fetcher machine (对to-fetch list进行划分)
      • URLs for one domain go to same list, otherwise random (从fetch里面取出之后按照url进行partition放在一个reduce完成)
        • “Politeness” w/o inter-fetcher protocols
        • Can observe robots.txt similarly
        • Better DNS, robots caching
        • Easy parallelism
      • Two outputs: pages, WebDB edits (输出网页内容以及对webdb的修改)
  • Index generation
    • Indexing
      • Uses Lucene text indexer
    • Link analysis (maybe)
      • Link analysis is sexy, but importance generally overstated
      • Nutch performs analysis in WebDB
      • Emit a score for each known page
      • At index time, incorporate score into inverted index
  • Serving results

3 Module 4 - MapReduce Theory and Algorithms

#todo: 介绍了许多经典算法如何映射到MapReduce上面