Let’s talk locks!

https://www.infoq.com/presentations/go-locks/

使用spinlokc容易空转CPU,所以为了避免浪费CPU有时候需要陷入到kernel里面挂起,linux下面可以使用futex来做到。

使用CAS/spinlock的开销大致如下

locks-atomic-op-cost.png

Linux futex 使用方式大致如下:

locks-futex-wait.png

locks-futex-wake.png

内核实现上使用阻塞队列进行管理

locks-futex-in-kernel.png

虽然futex不会白白地耗费CPU,但是因为有了context switch, 所以多个线程冲突的时候延迟会非常高:12个冲突线程从130ns -> 900ns(0.9us)

locks-futex-cost.png

go sync.Mutex在上面做了一些优化:

locks-go-sync-mutex.png

Slide结尾给了不少有用的链接:

  1. 如何使用profiling工具定位竞争冲突 https://github.com/iovisor/bcc/issues/892 http://brendangregg.com/offcpuanalysis.html
  2. 避免/减少竞争冲突:lock-free, granular lock, 缩小关键区域 http://www.1024cores.net/ http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
  3. content-aware scheduler https://web.eecs.umich.edu/~mozafari/php/data/uploads/lock-schd-report.pdf

https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw/edit#

最多只有一个spinlock空转等待获取任务,其他线程则block住

Spinning is two-level: (1) an idle M with an associated P spins looking for new G’s, (2) an M w/o an associated P spins waiting for available P’s. There are at most GOMAXPROCS spinning M’s (both (1) and (2)). Idle M’s of type (1) do not block while there are idle M’s of type (2). When a new G is spawned, or M enters syscall, or M transitions from idle to busy, it ensures that there is at least 1 spinning M (or all P’s are busy). This ensures that there are no runnable G’s that can be otherwise running; and avoids excessive M blocking/unblocking at the same time. Spinning is mostly passive (yield to OS, sched_yield()), but may include a little bit of active spinning (loop burnging CPU) (requires investigation and tuning).

在调度方面的考虑:亲和性,延迟创建资源,以及避免过多的线程创建

  1. Try out LIFO scheduling, this will improve locality. However, it still must provide some degree of fairness and gracefully handle yielding goroutines.
  2. Better locality of G-to-P. Try to enqueue an unblocked G to a P on which it was last running.
  3. Throttling of M creation. The scheduler can be easily forced to create thousands of M's per second until OS refuses to create more threads. M’s must be created promptly up to k*GOMAXPROCS, after that new M’s may added by a timer.
  4. Do not allocate G and stack until the goroutine first runs. For a newly created goroutine we need just callerpc, fn, narg, nret and args, that is, about 6 words. This will allow to create a lot of running-to-completion goroutines with significantly lower memory overhead.
  5. Better locality of P-to-M. Try to execute P on the same M it was last running.