Building Scalable, Highly Concurrent & Fault-Tolerant Systems: Lessons Learned
Table of Contents
1. It's All Trade-offs
- How do I know if I have a performance problem? If your system is slow for a single user
- How do I know if I have a scalability problem? If your system is fast for a single user but slow under heavy load
- Latency vs Throughput You should strive for maximal throughput with acceptable latency
- Availability vs Consistency
2. Go Concurrent
- Shared mutable state Together with threads 线程使用共享可修改状态使得代码不稳定
- code that is totally INDETERMINISTIC
- and the root of all EVIL
- The problem with locks 锁带来的问题
- Locks do not compose 锁不能够进行组合
- Locks breaks encapsulation 破坏封装
- Taking too few locks
- Taking too many locks
- Taking the wrong locks
- Taking locks in the wrong order 错误顺序
- Error recovery is hard 错误恢复处理
- You deserve better tools 高并发更好的工具和做法
- Dataflow Concurrency 基于数据流的并发
- Deterministic
- Declarative
- Data-driven
- Dataflow Concurrency 基于数据流的并发
- Threads are suspended until data is available
Lazy & On-demand
- No difference between
- Concurrent &
- Sequential code
- Actors 轻量线程模式,传递消息方式进行通信
- Share NOTHING
- Isolated lightweight event-based processes
- Each actor has a mailbox (message queue)
- Communicates through asynchronous and non-blocking message passing
- Examples: Akka, Erlang
- Software Transactional Memory (STM) 软件事务内存,更新内存是原子操作,类似DB的transaction实现
- See the memory as a transactional dataset
- Similar to a database
- Transactions are retried automatically upon collision
- Rolls back the memory on abort
- Agents 相当于worker角色,做一些异步操作的工作
- No difference between
3. Go Fault-Tolerant
- Failure management in Java/C/C# etc 在独立的线程里面必须进行错误处理,否则外部没有办法发现错误。这样在线程里面整个错误处理贯穿于逻辑本身
- You are given a SINGLE thread of control
- If this thread blows up you are screwed
- So you need to do all explicit error handling WITHIN this single thread
- To make things worse - errors do not propagate between threads so there is NO WAY OF EVEN FINDING OUT that something have failed
- This leads to DEFENSIVE programming with:
- Error handling TANGLED with business logic
- SCATTERED all over the code base
- The right way 正确的方式是每个线程都有独立的监控线程,所有的错误都会发送到这个监控线程,然后由这个监控线程进行处理。在语义上来说remote和local是一样的,这点可能更容易做错误控制
- Isolated Processes (Units of Computation)
- Process Supervision
- Each running process has a supervising process
- Errors are sent to the supervisor
- Supervisor manages the failure
- Same semantics local as remote
- For example the Actor Model solves it nicely
4. Go Faster & Go More
- Never block
- …unless you really have to
- Blocking kills scalability (and performance)
- Never sit on resources you don’t use
- Use non-blocking IO
- Go Async
- Use asynchronous message passing
- Design reactive event-driven systems
- Use push not pull or poll #note: 可靠性是个问题
- Don’t use explicit thread management
- How fast is fast enough?
- Measure, measure and measure
- Start with a baseline
- Define “good enough”
- Beware of micro-benchmarks
5. Go Distributed
Werner Vogels’ Misconceptions about Reliable Distributed Computing
- Transparency is the ultimate goal
- Automatic object replication is desirable
- All replicas are equal and deterministic
Worth keeping an eye on
- The CALM Conjecture
- Could be the future of Distributed Computing
- Declarative
- Deterministic
- Removes TIME, i.e. the need for ordering
- Check out the BLOOM language
6. Go Big
6.1. Data
- Imperative OO programming (a la Hadoop) doesn't cut it
- Object-Mathematics Impedance Mismatch
- We need functional processing, transformations etc.
- Examples:Crunch/Scrunch,Cascading,Cascalog, Scalding, Scala Parallel Collections
- Is the assembly language of MapReduce programming
- Watch “Why Big Data Needs To Be Functional” by Dean Wampler
- Batch processing (a la Hadoop) doesn't cut it
- We need real-time data processing
- Examples:Spark,Storm,GridGain,Akkaetc.
6.2. DB
- Scaling reads to a RDBMS is hard
- Scaling writes to a RDBMS is impossible