优化数学函数案例 - Speeding up atan2f by 50x

https://mazzo.li/posts/vectorized-atan2.html

我觉得外围benchmark代码写的很好 https://gist.github.com/bitonic/d0f5a0a44e37d4f0be03d34d47acb6cf 里面直接调用perf_event_open来获取CPU的执行状态，这样就可以比较方便地统计出比如cycles/branches/branch misses per element这些指标。

优化上大致几个思路：

有限项的级数展开. 里面说用到了 Cecil Hastings Jr. 给出的级数展开系数，展开到第六项就可以确保误差在1/10000度内.
使用的级数展开必须确保简单，易于实现，比如类似使用fma这样的操作 https://en.cppreference.com/w/c/numeric/math/fma 这类简单重复的操作可以很容易地被SIMD化
上面级数展开有个要求是x在[-1,1]之间，然后利用arctan(y/x) + arctan(x/y) = pi/2 来处理其他情况。优化数学函数一定要懂数学。
然后就是实现细节：减少分支判断，用float不用double, 使用fmad函数，手工展开到SIMD指令上。

作者在最后面给了个网站 uiCA https://uica.uops.info/ 可以模拟每条指令在执行时候的cycles. 好像这个对于micro-benchmark/optimization很有用，llvm里面也有个类似工具 llvm-mca https://llvm.org/docs/CommandGuide/llvm-mca.html