Building Scalable, Highly Concurrent & Fault-Tolerant Systems: Lessons Learned

1. It's All Trade-offs

How do I know if I have a performance problem? If your system is slow for a single user
How do I know if I have a scalability problem? If your system is fast for a single user but slow under heavy load
Latency vs Throughput: you should strive for maximal throughput with acceptable latency
Availability vs Consistency

Shared mutable state, together with threads, leads to
- code that is totally NONDETERMINISTIC
- and is the root of all EVIL
The problem with locks
- Locks do not compose
- Locks break encapsulation
- Taking too few locks
- Taking too many locks
- Taking the wrong locks
- Taking locks in the wrong order
- Error recovery is hard
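
To make the "wrong order" item concrete, here is a minimal sketch in Python. The `Account` class and the `transfer` and `run_demo` functions are hypothetical names invented for illustration, not something from the talk. If each transfer locked its source account first, two transfers running in opposite directions could each hold one lock and wait forever for the other. Acquiring the locks in one fixed global order (here, by account id) avoids that.

```python
import threading

class Account:
    """Hypothetical account guarded by its own lock."""
    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Acquire both locks in a fixed global order (by account_id),
    # regardless of the transfer direction, so two opposite transfers
    # can never each hold one lock while waiting for the other.
    # Assumes src and dst are different accounts.
    first, second = sorted((src, dst), key=lambda a: a.account_id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

def run_demo(rounds=1000):
    a = Account(1, 100)
    b = Account(2, 100)
    threads = [
        threading.Thread(target=lambda: [transfer(a, b, 1) for _ in range(rounds)]),
        threading.Thread(target=lambda: [transfer(b, a, 1) for _ in range(rounds)]),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.balance, b.balance
```

Note that `transfer` has to know about the locks inside both accounts and about the global ordering rule. That is the sense in which locks break encapsulation and do not compose.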
You deserve better tools and practices for high concurrency
- Dataflow Concurrency
  - Deterministic
  - Declarative
  - Data-driven
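
One way to picture dataflow concurrency is with single-assignment variables: a computation that reads a variable simply waits until some other task has bound it. The sketch below approximates this with `asyncio` futures in Python. The function names are hypothetical, and this is only an illustration of the idea, not the dataflow API of any particular library.

```python
import asyncio

async def dataflow_sum():
    loop = asyncio.get_running_loop()
    # Three dataflow variables; each is bound exactly once.
    x, y, z = loop.create_future(), loop.create_future(), loop.create_future()

    async def compute_z():
        # Declarative: z is defined as x + y. Execution is driven by
        # when the data becomes available, not by statement order.
        z.set_result((await x) + (await y))

    async def bind_y():
        await asyncio.sleep(0.01)  # y arrives late
        y.set_result(32)

    async def bind_x():
        x.set_result(10)

    # The consumer is started first on purpose: the result does not
    # depend on the order in which the tasks are scheduled.
    await asyncio.gather(compute_z(), bind_y(), bind_x())
    return await z
```

Because every variable is written once and readers wait for it, the result is the same on every run, which is the "deterministic" property in the list above.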

Never block
- …unless you really have to
- Blocking kills scalability (and performance)
- Never sit on resources you don’t use
- Use non-blocking IO
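
As a rough illustration of why blocking hurts, the following Python sketch runs many simulated I/O waits on a single thread. `asyncio.sleep` stands in for a non-blocking I/O call, and `fetch`, `fetch_all` and `timed_fetch_all` are hypothetical names. While one call is waiting, the thread is free to make progress on the others instead of sitting on a resource it is not using.

```python
import asyncio
import time

async def fetch(i, delay=0.05):
    # Stand-in for a non-blocking I/O call: awaiting yields the
    # thread to other tasks instead of blocking it.
    await asyncio.sleep(delay)
    return i

async def fetch_all(n):
    # Start all n waits at once; gather preserves argument order.
    return await asyncio.gather(*(fetch(i) for i in range(n)))

def timed_fetch_all(n=50):
    start = time.perf_counter()
    results = asyncio.run(fetch_all(n))
    return results, time.perf_counter() - start
```

With a blocking `time.sleep` in place of `asyncio.sleep`, the 50 calls would take roughly 50 times the delay; here the waits overlap, so the total is much closer to a single delay.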
Go Async
- Use asynchronous message passing
- Design reactive event-driven systems
- Use push, not pull or poll #note: reliability is a concern
- Don’t use explicit thread management
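
The points above can be sketched with a tiny actor-like component in Python. This is an assumption-laden illustration, not Akka or any other real actor library: `counter_actor` and `run_actor_demo` are hypothetical names, and `None` is used as a made-up stop signal. The state is private to one coroutine, senders push messages into a mailbox without waiting for a reply, and no thread is created or managed explicitly.

```python
import asyncio

async def counter_actor(mailbox, done):
    total = 0  # private state: only this coroutine ever touches it
    while True:
        msg = await mailbox.get()  # reacts when a message is pushed
        if msg is None:            # hypothetical stop signal
            done.set_result(total)
            return
        total += msg

async def run_actor_demo():
    mailbox = asyncio.Queue()
    done = asyncio.get_running_loop().create_future()
    actor = asyncio.create_task(counter_actor(mailbox, done))
    for i in range(1, 101):
        mailbox.put_nowait(i)      # fire-and-forget: the sender never waits
    mailbox.put_nowait(None)
    result = await done
    await actor
    return result
```

Since messages are handled one at a time by a single owner of the state, no locks are needed.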
How fast is fast enough?
- Measure, measure and measure
- Start with a baseline
- Define “good enough”
- Beware of micro-benchmarks
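
A minimal sketch of the "baseline" and "good enough" steps in Python: collect latency samples, summarize them as percentiles, and compare against an explicit target. The helper names and the choice of p99 as the budgeted metric are assumptions made for illustration.

```python
import statistics
import time

def measure_latencies(fn, runs=200):
    """Call fn repeatedly and return the wall-clock time of each call."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples

def summarize(samples):
    # quantiles(n=100) returns 99 cut points; index 49 is the median
    # and index 98 is the 99th percentile.
    cuts = statistics.quantiles(samples, n=100)
    return {"p50": cuts[49], "p99": cuts[98], "max": max(samples)}

def meets_target(summary, p99_budget_s):
    # "Good enough" expressed as an explicit, checkable budget.
    return summary["p99"] <= p99_budget_s
```

Looking at p99 and max alongside the median matters because an average can hide the slow tail that users actually notice. A loop like this around a tiny function is itself a micro-benchmark, so the numbers are best treated as a baseline for comparison rather than as absolute truth.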

Werner Vogels’ Misconceptions about Reliable Distributed Computing

Worth keeping an eye on

Imperative OO programming (a la Hadoop) doesn't cut it
- Object-Mathematics Impedance Mismatch
- We need functional processing, transformations etc.
- Examples: Crunch/Scrunch, Cascading, Cascalog, Scalding, Scala Parallel Collections
- Raw Hadoop is the assembly language of MapReduce programming
- Watch “Why Big Data Needs To Be Functional” by Dean Wampler
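
To show what "functional processing, transformations" can look like, here is a word count written as a chain of transformations in Python. It is only a small in-memory sketch with a hypothetical `word_count` function; it does not use any of the libraries listed above.

```python
from collections import Counter
from functools import reduce

def word_count(lines):
    # flatMap: each line becomes its lowercase words.
    words = (w.lower() for line in lines for w in line.split())
    # map: each word becomes a one-entry count.
    per_word = map(lambda w: Counter({w: 1}), words)
    # reduce by key: merge the counts; Counter addition sums per key.
    return reduce(lambda a, b: a + b, per_word, Counter())
```

The whole job is a composition of side-effect-free steps, which is the shape that higher-level data processing libraries tend to expose instead of hand-written mapper and reducer classes.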
Batch processing (a la Hadoop) doesn't cut it
- We need real-time data processing
- Examples: Spark, Storm, GridGain, Akka, etc.
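
The difference from batch processing can be sketched in a few lines of Python: instead of waiting for the full data set and computing one answer at the end, an incremental computation emits an updated result for every event as it arrives. `running_mean` is a hypothetical helper, and a plain generator stands in for a real stream processing system here.

```python
def running_mean(events):
    """Yield the mean of all values seen so far, once per incoming event."""
    count, total = 0, 0.0
    for value in events:
        count += 1
        total += value
        yield total / count
```

For example, `list(running_mean([2, 4, 6]))` gives `[2.0, 3.0, 4.0]`: a result is available after every event, and only constant state (a count and a sum) is kept.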