Using Block Prefetch for Optimized Memory Performance

AMD_block_prefetch_paper.pdf 这篇文章比较早,2001年AMD公司写的。里面使用了两个例子来说明如何有效地进行block prefetch来优化内存性能。一个例子是memcpy, 另外一个则是将两个浮点数组相加,这里面就只说第一个例子。

memcpy是个比较经典的例子,文章中也列举出来了各种方法来进行加速,大部分方法也是广为人知。这里我想摘抄一下里面是如何做block prefetch的。最开始他们使用的指令是 "prefetchnta",这个指令对于CPU来说只是一个hint, 在执行的时候其实完全可以忽略的。为了"真实”地进行block prefetch, 我们可以使用mov指令。

Significantly, the MOV instruction is used, rather than the software prefetch instruction. Unlike a prefetch instruction, which is officially only a “hint” to the processor, a MOV instruction cannot be ignored and must be executed in-order. The result is that the memory system reads sequential, back-to-back address blocks, which yields the fastest memory read bandwidth.


这里一个CACHEBLOCK是L2 Cache大小。大致意思就是,在真正地去读src内容之前,先对一个cache block的内容进行“真正”地prefetch, 我们只取每个cache line上的头4字节,这样CPU会将对应的这条cache line也放入进来。注意这里要求src地址必须是和64字节对齐的,否则就没有这个效果了。另外注意在写入到dst的时候,使用的是movntq是不会涉及到cache line的,最后需要做一下sfence(因为我们使用了ntq这样的指令)。


One additional trick is to read the cache lines in descending address order, rather than ascending order. This can improve performance a bit, by keeping the processor’s hardware prefetcher from issuing any redundant read requests.


Block prefetch and three phase processing are general techniques for improving the performance of memory-intensive applications on PCs. In a nutshell, the key points are:

#1 To get the maximum memory read bandwidth, read data into the cache in large blocks (e.g. 1K to 8K bytes), using block prefetch. When creating a block prefetch loop:

#2 To get maximum memory write bandwidth, write data from the cache to main memory in large blocks, using streaming store instructions. When creating a memory write loop:

#3 Whenever possible, code that actually “does the real work” should be reading its data from cache, and writing its output to an in-cache buffer. To enable this to happen, use #1 and #2 above.


Aligning “hot” branch targets to 16 byte boundaries can improve speed, by maximizing the number of instruction fills into the instruction-byte queue. This is especially important for short loops, like a block prefetch loop. This wasn’t shown in the code examples, for the sake of readability. It can be done with the ALIGN pragma, like this:

align 16
  mov ebx, [esi+ecx*8-64]
  mov ebx, [esi+ecx*8-128]
  sub ecx, 16
  dec eax
  jnz  prefetchloop