Benefitting Power and Performance Sleep Loops

https://software.intel.com/content/www/us/en/develop/articles/benefitting-power-and-performance-sleep-loops.html

如何更好地使用使用自旋锁：

如果只是简单地不断地去check spinlock, 那么会非常占用CPU。
如果使用sleep(0)或者是sched_yield()的话，那么会导致ring3 -> ring0的context switch. 延迟会非常高。
最好的方式检查几轮spinlock, 然后使用sleep(0), sched_yield()切换出去，如此往复。

不过在检查spinlock的时候，可以使用 mm_pause 这个指令。使用这个指令可以告诉CPU, 接下来的指令是是要去check spinlock, 所以不用full-speed地去检查，比如完全填满流水线这样，最后功能也能节省4%。这个指令之后接下来的执行可能会延迟一段时间，但是这个延迟时间是不可控，完全由CPU去决定的，所以我们不能依赖或者是假设这个延迟。

Essentially, the pause instruction delays the next instruction's execution for a finite period of time. By delaying the execution of the next instruction, the processor is not under demand, and parts of the pipeline are no longer being used, which in turn reduces the power consumed by the processor.

The pause instruction can be used in conjunction with a Sleep(0) to construct something similar to an exponential back-off in situations where the lock or more work may become available in a short period of time, and the performance may benefit from a short spin in ring 3. It is important to note that the number of cycles delayed by the pause instruction may vary from one processor family to another. You should avoid using multiple pause instructions, assuming you will introduce a delay of a specific cycle count. Since you cannot guarantee the cycle count from one system to the next, you should check the lock in between each pause to avoid introducing unnecessarily long delays on new systems.

这个指令按照Intel文档解释来说，就是专门给spinlock场景设计的，从文档看上去这个指令的latency很高（不知道这个latency是不是就是到接下来执行指令的延迟）。

void _mm_pause (void) #include <emmintrin.h> Instruction: pause CPUID Flags: SSE2 Provide a hint to the processor that the code sequence is a spin-wait loop. This can help improve the performance and power consumption of spin-wait loops.

Architecture Latency Throughput (CPI) Skylake 140 140

最后实现出来的代码长的是这个样子的：

ATTEMPT_AGAIN:
  if (!acquire_lock())
  {
    /* Spin on pause max_spin_count times before backing off to sleep */
    for(int j = 0; j < max_spin_count; ++j)
    {
      /* pause intrinsic */
      _mm_pause();
      if (read_volatile_lock())
      {
        if (acquire_lock())

        {
          goto PROTECTED_CODE;
        }
      }
    }

    /* Pause loop didn't work, sleep now */
    Sleep(0);
    goto ATTEMPT_AGAIN;
  }
PROTECTED_CODE:
  get_work();
  release_lock();
  do_work();