How to receive a million packets per second

https://blog.cloudflare.com/how-to-receive-a-million-packets/

Note: this tuning walkthrough is very instructive.

On Linux, how hard is it to write a program that receives 1 million UDP packets per second?

Before the experiments, iptables is configured appropriately and 20 IP addresses are assigned on the receiver machine. Packets are sent in batches, with a batch size of 1024.
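
The original post does this batched sending with sendmmsg(2). Below is a minimal sketch of such a batching loop, assuming a UDP socket that has already been connect()ed to the target; the real udpsender differs in its details.

    /* Batched UDP sending with sendmmsg(2): one syscall submits up to
     * BATCH packets. Assumes `fd` is a UDP socket already connect()ed
     * to the destination; error handling omitted. */
    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/socket.h>

    #define BATCH   1024   /* packets handed to the kernel per syscall */
    #define PAYLOAD 32     /* bytes per packet */

    static void send_batch(int fd)
    {
            static char buf[BATCH][PAYLOAD];
            struct mmsghdr msgs[BATCH];
            struct iovec iovs[BATCH];

            memset(msgs, 0, sizeof(msgs));
            for (int i = 0; i < BATCH; i++) {
                    iovs[i].iov_base = buf[i];
                    iovs[i].iov_len  = PAYLOAD;
                    msgs[i].msg_hdr.msg_iov    = &iovs[i];
                    msgs[i].msg_hdr.msg_iovlen = 1;
            }
            /* Returns the number of messages actually sent. */
            sendmmsg(fd, msgs, BATCH, 0);
    }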


The naive approach

Baseline, with no additional tuning:

sender$ ./udpsender 192.168.254.1:4321
receiver$ ./udpreceiver1 0.0.0.0:4321
  0.352M pps  10.730MiB /  90.010Mb
  0.284M pps   8.655MiB /  72.603Mb
  0.262M pps   7.991MiB /  67.033Mb
  0.199M pps   6.081MiB /  51.013Mb
  0.195M pps   5.956MiB /  49.966Mb
  0.199M pps   6.060MiB /  50.836Mb
  0.200M pps   6.097MiB /  51.147Mb
  0.197M pps   6.021MiB /  50.509Mb

The pps fluctuates between 195k and 352k because the scheduler moves the processes onto different cores from run to run. Pinning them to fixed cores with taskset -c lets them make full use of cache locality.

sender$ taskset -c 1 ./udpsender 192.168.254.1:4321
receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.362M pps  11.058MiB /  92.760Mb
  0.374M pps  11.411MiB /  95.723Mb
  0.369M pps  11.252MiB /  94.389Mb
  0.370M pps  11.289MiB /  94.696Mb
  0.365M pps  11.152MiB /  93.552Mb
  0.360M pps  10.971MiB /  92.033Mb

Send more packets

Two sender threads are used to push up the sending pps, but the receiving pps does not increase:

sender$ taskset -c 1,2 ./udpsender \
            192.168.254.1:4321 192.168.254.1:4321
receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.349M pps  10.651MiB /  89.343Mb
  0.354M pps  10.815MiB /  90.724Mb
  0.354M pps  10.806MiB /  90.646Mb
  0.354M pps  10.811MiB /  90.690Mb

The per-queue statistics of the NIC can be inspected with ethtool -S. This NIC has 11 receive queues:

receiver$ watch 'sudo ethtool -S eth2 |grep rx'
     rx_nodesc_drop_cnt:    451.3k/s
     rx-0.rx_packets:     8.0/s
     rx-1.rx_packets:     0.0/s
     rx-2.rx_packets:     0.0/s
     rx-3.rx_packets:     0.5/s
     rx-4.rx_packets:  355.2k/s
     rx-5.rx_packets:     0.0/s
     rx-6.rx_packets:     0.0/s
     rx-7.rx_packets:     0.5/s
     rx-8.rx_packets:     0.0/s
     rx-9.rx_packets:     0.0/s
     rx-10.rx_packets:    0.0/s

Of the incoming traffic, 451k pps are dropped without being processed by the kernel (rx_nodesc_drop_cnt), and the 355.2k pps that do get through all land on the rx-4 receive queue. Normally a given receive queue's interrupts are handled by one particular processor, so at this point one processor shows especially high utilization (and likely a very high context-switch rate as well).

[Pasted-Image-20231225103823.png: per-CPU utilization with a single core saturated]

So what we need to do is spread the packets as evenly as possible across the RX queues, so that no single processor becomes the bottleneck.


Crash course to multi-queue NICs

Historically, network cards had a single RX queue that was used to pass packets between hardware and kernel. This design had an obvious limitation - it was impossible to deliver more packets than a single CPU could handle. To utilize multicore systems, NICs began to support multiple RX queues. The design is simple: each RX queue is pinned to a separate CPU, therefore, by delivering packets to all the RX queues a NIC can utilize all CPUs.

[Pasted-Image-20231225104728.png: multi-queue NIC, each RX queue delivering packets to a separate CPU]

How do we get packets to land evenly on the RX queues? Per-packet round-robin won't work, because it would reorder packets within a single connection. The best approach is to hash (src-ip, src-port, dst-ip, dst-port) and take the result modulo the number of queues.

This hash rule on the NIC can likewise be configured with ethtool; sdfn makes the hash cover both the IP addresses and the UDP ports: ethtool -N eth0 rx-flow-hash udp4 sdfn
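
As a rough illustration of the idea (not the NIC's actual algorithm: real hardware computes a keyed Toeplitz hash for RSS), the RX queue can be thought of as a pure function of the flow tuple:

    #include <stdint.h>

    /* Toy model of flow-to-queue mapping. The important property is that
     * the queue is a deterministic function of the tuple, so packets of
     * one flow are never reordered, while different flows spread out. */
    struct flow {
            uint32_t src_ip, dst_ip;
            uint16_t src_port, dst_port;
    };

    static unsigned pick_rx_queue(const struct flow *f, unsigned nqueues)
    {
            uint32_t h = f->src_ip;
            h = h * 31 + f->dst_ip;
            h = h * 31 + f->src_port;
            h = h * 31 + f->dst_port;
            return h % nqueues;   /* e.g. nqueues == 11 on this NIC */
    }

Because the destination IP and the ports are part of the tuple, sending to several destination IPs (or from several source ports) changes the hash and therefore the queue, which is exactly what the following sections exploit.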


A note on NUMA performance

The setup here is still one RX queue handled by one CPU; now let's look at the performance differences caused by NUMA placement. In the article's measurements, running the receiver on a core in a NUMA node different from the NIC's costs roughly 10% compared with keeping everything local.

While a 10% penalty for running on a different NUMA node may not sound too bad, the problem only gets worse with scale. On some tests I was able to squeeze out only 250kpps per core. On all the cross-NUMA tests the variability was bad. The performance penalty across NUMA nodes is even more visible at higher throughput. In one of the tests I got a 4x penalty when running the receiver on a bad NUMA node. (If it were only the 10% above it would not be too bad, but the variance in the cross-NUMA tests is large, with throughput sometimes dropping by as much as 75%.)

Multiple receive IPs

On top of the multi-queue setup, we can also receive on multiple IP addresses: since the destination IP is part of the flow hash, the traffic then lands on several RX queues and the interrupts are no longer handled by a single processor.

sender$ taskset -c 1,2 ./udpsender 192.168.254.1:4321 192.168.254.2:4321
receiver$ watch 'sudo ethtool -S eth2 |grep rx'
     rx-0.rx_packets:     8.0/s
     rx-1.rx_packets:     0.0/s
     rx-2.rx_packets:     0.0/s
     rx-3.rx_packets:  355.2k/s
     rx-4.rx_packets:     0.5/s
     rx-5.rx_packets:  297.0k/s
     rx-6.rx_packets:     0.0/s
     rx-7.rx_packets:     0.5/s
     rx-8.rx_packets:     0.0/s
     rx-9.rx_packets:     0.0/s
     rx-10.rx_packets:    0.0/s
receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.609M pps  18.599MiB / 156.019Mb
  0.657M pps  20.039MiB / 168.102Mb
  0.649M pps  19.803MiB / 166.120Mb

By receiving on two IP addresses, throughput roughly doubles.

If we keep increasing the number of destination IPs, the receive pps stops growing and rx_nodesc_drop_cnt no longer increases either; instead, netstat shows receive errors:

receiver$ watch 'netstat -s --udp'
Udp:
      437.0k/s packets received
        0.0/s packets to unknown port received.
      386.9k/s packet receive errors
        0.0/s packets sent
    RcvbufErrors:  123.8k/s
    SndbufErrors: 0
    InCsumErrors: 0

This means packets are being delivered to the kernel, but the kernel cannot hand them over to the application: the application is processing packets too slowly, so they are dropped (the 386.9k/s receive errors and 123.8k/s RcvbufErrors above).
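
RcvbufErrors indicates that the socket receive buffer filled up before the application drained it. As a minimal sketch (my own illustration), the buffer can be inspected and raised via SO_RCVBUF, but a larger buffer only absorbs short bursts; it cannot rescue an application that is too slow on average, which is the real problem here.

    #include <stdio.h>
    #include <sys/socket.h>

    /* Sketch: inspect and enlarge a UDP socket's receive buffer.
     * setsockopt(SO_RCVBUF) is capped by net.core.rmem_max unless
     * SO_RCVBUFFORCE is used with CAP_NET_ADMIN. */
    static void bump_rcvbuf(int fd)
    {
            int val = 0;
            socklen_t len = sizeof(val);

            getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, &len);
            printf("current SO_RCVBUF: %d bytes\n", val);

            val = 4 * 1024 * 1024;   /* request 4 MiB */
            setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val));
    }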


Receive from many threads

To make the application faster we need multiple threads doing the receiving. But if every thread reads from the same fd, contention on the socket lock becomes so expensive that it turns into the bottleneck itself.

SO_REUSEPORT solves this. It lets multiple processes/threads bind to the same port, each with its own fd. Under the hood the kernel hashes <src-ip, src-port, dst-ip, dst-port> to assign each incoming packet to one of these sockets (buckets), and each fd only reads the packets hashed to its bucket; if the process owning one of the fds dies, the packets hashed to that bucket are dropped.
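
A minimal sketch of this pattern (my own illustration, not the actual udpreceiver1 code): every worker thread creates its own socket, sets SO_REUSEPORT before bind(), and binds to the same address and port; the kernel then distributes incoming packets across the sockets by flow hash. Compile with -pthread.

    #define _GNU_SOURCE
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <pthread.h>
    #include <string.h>
    #include <sys/socket.h>

    #define PORT     4321
    #define NWORKERS 4

    /* Each thread owns its own fd, so there is no shared socket lock to
     * contend on. Error handling omitted for brevity. */
    static void *worker(void *arg)
    {
            (void)arg;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);
            int one = 1;
            /* Must be set before bind(); all sockets share the same port. */
            setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

            struct sockaddr_in addr;
            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            addr.sin_port = htons(PORT);
            bind(fd, (struct sockaddr *)&addr, sizeof(addr));

            char buf[2048];
            for (;;)   /* each thread only sees packets hashed to its socket */
                    recv(fd, buf, sizeof(buf), 0);
            return NULL;
    }

    int main(void)
    {
            pthread_t t[NWORKERS];
            for (int i = 0; i < NWORKERS; i++)
                    pthread_create(&t[i], NULL, worker, NULL);
            for (int i = 0; i < NWORKERS; i++)
                    pthread_join(t[i], NULL);
            return 0;
    }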

receiver$ taskset -c 1,2,3,4 ./udpreceiver1 0.0.0.0:4321 4 1
  1.114M pps  34.007MiB / 285.271Mb
  1.147M pps  34.990MiB / 293.518Mb
  1.126M pps  34.374MiB / 288.354Mb

Final words

To sum up, if you want perfect performance you need to:

- make sure traffic is distributed evenly across many RX queues and SO_REUSEPORT processes/threads;
- have enough spare CPU capacity to actually pick the packets up from the kernel;
- keep both the RX queues and the receiver processes on a single NUMA node.

While we had shown that it is technically possible to receive 1Mpps on a Linux machine, the application was not doing any actual processing of received packets - it didn't even look at the content of the traffic. Don't expect performance like that for any practical application without a lot more work.