The Secret To 10 Million Concurrent Connections - The Kernel Is The Problem, Not The Solution

Source: http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html

1. Introduction

Don’t let the kernel do all the heavy lifting. Take packet handling, memory management, and processor scheduling out of the kernel and put them into the application, where they can be done efficiently. Let Linux handle the control plane and let the application handle the data plane. If this seems extreme, keep in mind the old saying: scalability is specialization. To do something great you can’t outsource performance to the OS. You have to do it yourself. With a data plane oriented system you can process 10 million packets per second. With a control plane oriented system you only get 1 million packets per second. (In short, this amounts to a purpose-built operating system. The key point this reveals is that, on a general-purpose OS, reaching C1000K should still be possible.)

The result will be a system that can handle 10 million concurrent connections with 200 clock cycles for packet handling and 1,400 clock cycles for application logic. As a main memory access costs 300 clock cycles, it’s key to design in a way that minimizes code and cache misses. (For C10M, memory access overhead must be controlled very carefully.)
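
To see where a 200 + 1,400 cycle budget could come from, here is a back-of-the-envelope calculation. The core count and clock rate below are illustrative assumptions, not figures from the talk; they simply show how a per-packet cycle budget falls out of the target packet rate.

```c
/* Cycle-budget arithmetic for a 10M packets/sec target.
 * Core count and clock rate are assumptions chosen for illustration. */
#include <stdio.h>

int main(void) {
    const double cores       = 8;     /* assumed number of cores */
    const double hz_per_core = 2e9;   /* assumed 2 GHz clock */
    const double packets_sec = 10e6;  /* the C10M packet rate target */

    double cycles_per_packet = (cores * hz_per_core) / packets_sec;
    printf("cycle budget per packet: %.0f\n", cycles_per_packet); /* 1600 */
    /* ~200 of those cycles go to packet handling, leaving ~1400 for the app. */
    return 0;
}
```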

If you are handling 5,000 connections per second and you want to handle 10K, what do you do? Let’s say you upgrade the hardware and double the processor speed. What happens? You get double the performance but you don’t get double the scale. The scale may only go to 6K connections per second. The same thing happens if you keep on doubling. 16x the performance is great, but you still haven’t got to 10K connections. Performance is not the same as scalability. (Performance and scalability must be kept distinct.)

Let Unix handle the network stack, but you handle everything from that point on.

2. Packet Scaling - Write Your Own Custom Driver To Bypass The Stack

  • The problem with packets is they go through the Unix kernel. The network stack is complicated and slow. The path of packets to your application needs to be more direct. Don’t let the OS handle the packets. (The application does its own packet processing.)
  • The way to do this is to write your own driver. All the driver does is send the packet to your application instead of through the stack. You can find drivers: PF_RING, Netmap, Intel DPDK (Data Plane Development Kit). The Intel one is closed source, but there’s a lot of support around it. (A sketch of a user-mode receive loop in this style follows this list.)
  • How fast? Intel has a benchmark where they process 80 million packets per second (200 clock cycles per packet) on a fairly lightweight server. This is through user mode too. The packet makes its way up to user mode and then down again to go out. Linux doesn’t do more than a million packets per second when getting UDP packets up to user mode and out again. That’s an 80:1 performance ratio of a custom driver over Linux. (That gap is rather large.)
  • For the 10 million packets per second goal, if 200 clock cycles are used in getting the packet, that leaves 1,400 clock cycles to implement functionality like a DNS server or an IDS. (To reach C10M, 1,400 clock cycles remain for the application logic.)
  • With PF_RING you get raw packets, so you have to do your own TCP stack. People are doing user mode stacks. For Intel there is an available TCP stack that offers really scalable performance.
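
The receive loop below is a minimal sketch of what “the driver just hands packets to the application” looks like in practice, written against DPDK’s burst-receive API. It assumes the EAL, the NIC port, and the mbuf pools have already been initialized elsewhere, and process_packet() is a hypothetical placeholder for the application handler, not part of DPDK.

```c
/* Minimal user-mode receive loop sketch using DPDK's burst API.
 * Assumes EAL/port/mempool initialization has already been done;
 * process_packet() is a hypothetical application handler. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void process_packet(struct rte_mbuf *m) {
    /* application logic: parse headers, update per-core state, ... */
    (void)m;
}

void rx_loop(uint16_t port_id, uint16_t queue_id) {
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll the NIC directly from user space -- no interrupts,
         * no kernel network stack in the path. */
        uint16_t n = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < n; i++) {
            process_packet(bufs[i]);
            rte_pktmbuf_free(bufs[i]);  /* return the buffer to its pool */
        }
    }
}
```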

3. Multi-Core Scalability

  • Locks in Unix are implemented in the kernel. What happens at 4 cores using locks is that most software starts waiting for other threads to give up a lock. So the kernel starts eating up more performance than you gain from having more CPUs.
  • What we need is an architecture that is more like a freeway than an intersection controlled by a stop light. We want no waiting where everyone continues at their own pace with as little overhead as possible.
  • Solution:
    • Keep data structures per core. Then on aggregation read all the counters.
    • Atomics. Instructions supported by the CPU that can be called from C. Guaranteed to be atomic, never conflict. Expensive, so you don’t want to use them for everything.
    • Lock-free data structures. Accessible by threads that never stop and wait for each other. Don’t do it yourself. It’s very complex to work across different architectures.
    • Threading model. Pipelined vs worker thread model. It’s not just synchronization that’s the problem, but how your threads are architected.
    • Processor affinity. Tell the OS to use the first two cores. Then set which cores your threads run on. You can also do the same thing with interrupts. So you own these CPUs and Linux doesn’t. (A sketch combining per-core counters with core pinning follows this list.)
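
The sketch below combines two of the points above: per-core data structures (each worker increments only its own counter on the hot path) and processor affinity via pthread_setaffinity_np, with a C11 atomic used only for the cold shutdown flag. The core layout (Linux keeps cores 0-1, workers get cores 2+) mirrors the advice above but is an illustrative assumption; this is a minimal Linux/glibc sketch, not the speaker's code. Build with -pthread.

```c
/* Per-core counters plus thread pinning: a minimal Linux/glibc sketch
 * of avoiding shared hot-path state. Compile with: gcc -O2 -pthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NWORKERS 4

/* One padded slot per worker so cores never share a cache line. */
struct percore_stats {
    uint64_t packets;
    char pad[64 - sizeof(uint64_t)];
} stats[NWORKERS];

static atomic_int running = 1;

static void *worker(void *arg) {
    int id = (int)(intptr_t)arg;

    /* Pin this thread to one core; cores 0-1 are left to Linux,
     * so worker i runs on core i + 2 (an illustrative layout). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(id + 2, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    while (atomic_load(&running))
        stats[id].packets++;          /* hot path touches only local data */
    return NULL;
}

int main(void) {
    pthread_t t[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);

    sleep(1);
    atomic_store(&running, 0);

    uint64_t total = 0;
    for (int i = 0; i < NWORKERS; i++) {
        pthread_join(t[i], NULL);
        total += stats[i].packets;    /* aggregate only when reporting */
    }
    printf("total: %llu\n", (unsigned long long)total);
    return 0;
}
```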

4. Memory Scalability

  • The problem is if you have 20 gigs of RAM and, say, 2K per connection, then with only a 20 MB L3 cache none of that data will be in cache. It costs 300 clock cycles to go out to main memory, during which the CPU isn’t doing anything. Think about this against our 1,400 clock cycle budget per packet. Remember the 200 clocks/packet overhead. That leaves room for only about 4 cache misses per packet, and that’s a problem. (In other words, there is only a budget of roughly 4 cache misses per packet.)
  • Co-locate Data (store data contiguously; see the first sketch after this list)
    • Don’t scribble data all over memory via pointers. Each time you follow a pointer it will be a cache miss: [hash pointer] -> [Task Control Block] -> [Socket] -> [App]. That’s four cache misses.
    • Keep all the data together in one chunk of memory: [TCB | Socket | App]. Prereserve memory by preallocating all the blocks. This reduces cache misses from 4 to 1.
  • Paging
    • The page tables for 32 gigs of memory require about 64 MB of page tables, which doesn’t fit in cache. So you have two cache misses: one for the page table and one for what it points to. This is a detail we can’t ignore for scalable software.
    • Solutions: compress data; use cache-efficient structures instead of a binary search tree, which incurs a lot of memory accesses. (Shrink the data footprint to reduce cache misses.)
    • NUMA architectures double the main memory access time. Memory may not be on the local socket but on another socket. (With NUMA, if the memory is not on the local socket, main memory access time doubles.)
  • Memory pools
    • Preallocate all memory all at once on startup.
    • Allocate on a per object, per thread, and per socket basis.
  • Hyper-threading
    • Network processors can run up to 4 threads per processor, Intel only has 2.
    • This masks the latency, for example, from memory accesses because when one thread waits the other goes at full speed.
  • Hugepages (reduces address translation overhead)
    • Reduces page table size. Reserve memory from the start and then your application manages the memory. (See the second sketch after this list.)
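
As the Co-locate Data and Memory pools items above suggest, the idea is to keep each connection’s TCB, socket, and application state in one contiguous block and to hand those blocks out from memory allocated once at startup. The first sketch below shows a hypothetical layout and a simple free-list pool; the field names and contents are placeholders, and in practice you would keep one such pool per worker thread so allocation never needs a lock.

```c
/* Co-located per-connection state carved from a preallocated pool.
 * Fields are placeholders; the point is the layout: one chunk per
 * connection instead of chasing pointers across the heap. */
#include <stdint.h>
#include <stdlib.h>

struct connection {              /* [ TCB | Socket | App ] in one chunk */
    /* TCB-like fields */
    uint32_t snd_nxt, rcv_nxt;
    /* socket-like fields */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    /* application state */
    uint8_t  app_state[64];
};

struct conn_pool {
    struct connection *slots;
    uint32_t          *free_list;   /* indices of free slots */
    uint32_t           free_top;
};

/* Preallocate every connection object at startup; no malloc on the hot path. */
int pool_init(struct conn_pool *p, uint32_t nconn) {
    p->slots     = calloc(nconn, sizeof(*p->slots));
    p->free_list = malloc(nconn * sizeof(uint32_t));
    if (!p->slots || !p->free_list) return -1;
    for (uint32_t i = 0; i < nconn; i++)
        p->free_list[i] = i;
    p->free_top = nconn;
    return 0;
}

struct connection *conn_alloc(struct conn_pool *p) {
    return p->free_top ? &p->slots[p->free_list[--p->free_top]] : NULL;
}

void conn_free(struct conn_pool *p, struct connection *c) {
    p->free_list[p->free_top++] = (uint32_t)(c - p->slots);
}

int main(void) {
    struct conn_pool pool;
    if (pool_init(&pool, 10 * 1000 * 1000) != 0) return 1;  /* 10M slots */
    struct connection *c = conn_alloc(&pool);
    /* ... use c on the hot path without any further allocation ... */
    if (c) conn_free(&pool, c);
    return 0;
}
```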
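For the Hugepages item, Linux lets an application reserve huge pages explicitly with mmap and MAP_HUGETLB; the second sketch below assumes the administrator has already reserved huge pages (e.g. via vm.nr_hugepages), and a pool like the one above could then be placed inside this region so the application manages it from startup.

```c
/* Reserve hugepage-backed memory at startup and manage it yourself.
 * Assumes the system has huge pages configured (vm.nr_hugepages). */
#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

void *reserve_hugepages(size_t bytes) {
    void *mem = mmap(NULL, bytes,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                     -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return NULL;
    }
    return mem;  /* fewer TLB entries / smaller page tables for this region */
}

int main(void) {
    size_t bytes = 1UL << 30;               /* 1 GB backed by huge pages */
    void *region = reserve_hugepages(bytes);
    if (region)
        printf("reserved %zu bytes of hugepage-backed memory\n", bytes);
    return 0;
}
```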

5. Summary

  • NIC
    • Problem: going through the kernel doesn’t work well.
    • Solution: take the adapter away from the OS by using your own driver and manage it yourself.
  • CPU
    • Problem: if you use traditional kernel methods to coordinate your application it doesn’t work well.
    • Solution: Give Linux the first two CPUs and let your application manage the remaining CPUs. No interrupts will happen on the CPUs you don’t allow them on.
  • Memory
    • Problem: Takes special care to make work well.
    • Solution: At system startup allocate most of the memory in hugepages that you manage.