Computer Architecture Readings - Princeton - Review/Superscalar/VLIW
ELE/COS 475 Computer Architecture
计算机架构在不断变化,有同时来自门电路技术改进以及软件需求。计算机体系结构主要研究中间三层:ISA,微架构,RTL。下面开始的门电路/逻辑电路属于更底层的硬件层,而上面则属于软件层。
ISA和Microarch之间的差别,对于软件开发者只需要关注到ISA这层就行,而MA层则是由芯片设计者来决定如何高效地实现ISA这层语义。
Architecture vs. Microarchitecture “Architecture”/Instruction Set Architecture:
- Programmer visible state (Memory & Register)
- Operations (Instructions and how they work)
- Execution Semantics (interrupts)
- Input/Output
- Data Types/Sizes
Microarchitecture/Organization:
- Tradeoffs on how to implement ISA for some metric (Speed, Energy, Cost)
- Examples: Pipeline depth, number of pipelines, cache size, silicon area, peak power, execution ordering, bus widths, ALU widths
ISA差异很大的原因有下面这些
Technology Influenced ISA
- Storage is expensive, tight encoding important
- Reduced Instruction Set Computer
- Remove instructions until whole computer fits on die
- Multicore/Manycore – Transistors not turning into sequential performance
Application Influenced ISA
- Instructions for Applications
- DSP instructions
- Compiler Technology has improved
- SPARC Register Windows no longer needed – Compiler can register allocate effectively
顺序处理器的性能变化,RISC出现在1986年,2006年开始放弃在单核上做改进转向多核。
现代处理器需要考虑的事情非常多:指令/数据/线程级别并行,超长流水,内存和缓存技术
指令中几种阻碍深流水线的因素:
- Structural Hazard: An instruction in the pipeline needs a resource being used by another instruction in the pipeline (使用到相同的运算/控制单元,解决办法如下)
- Schedule: Programmer explicitly avoids scheduling instructions that would create structural hazards 调整指令熟顺序
- Stall: Hardware includes control logic that stalls until earlier instruction is no longer using contended resource 暂停流水
- Duplicate: Add more hardware to design so that each instruction can access independent resources at the same time 冗余的运算/控制单元
- Data Hazard: An instruction depends on a data value produced by an earlier instruction(多条指令之间存在数据依赖,解决办法如下)
- Schedule: Programmer explicitly avoids scheduling instructions that would create data hazards 调整指令顺序
- Stall: Hardware includes control logic that freezes earlier stages until preceding instruction has finished producing data value 暂停流水
- Bypass: Hardware datapath allows values to be sent to an earlier stage before preceding instruction has left the pipeline 调整流水线结构,可以提前得到数据
- Speculate: Guess that there is not a problem, if incorrect kill speculative instruction and restart 推测执行
- Control Hazard: Whether or not an instruction should be executed depends on a control decision made by an earlier instruction(控制结构比jb/jbe/jmp这些,使用分支预测解决)
几种常见/可预测的内存访问模式,可以看到都是满足时间/空间局部性的:获取指令,堆栈访问,向量/标量化数据的访问。
可视化地观察时间/空间局部性
Cache几种Missing分类:3C, Compulsory(第一次访问), Capacity(容量不够造成的淘汰), Conflict(冲突造成的淘汰, 实际上容量是足够的)
Cache设计上的权衡:N-way, Block Cache, Cache Size. Block Cache在64, N-way上越大越好,Cache Size越大越好。
N-way上, 1-way的访问时间是最短的,但是2/4/8时间其实差别不大很大,但是1-way的miss rate却非常高,所以理论上选择8-way是应该是更好的选择。
Cache Block Size Pros & Cons: 好处就是一次获取数据更多带宽更大,而坏处就是如果数据没有完全访问的话那么就相当于浪费带宽,而且更大的Block Size会导致更少的cache items, 冲突率更大。从下图可以看到几乎Block Size = 64 是个最优值,不过也不好说是不是软件在优化上就使用了block size = 64这个事实。
Cache Size有个法则就是:Cache Size翻倍, miss rate降低30%. (1-1/(2^0.5))
VLIW要求将多条操作打包在一个指令里面,并且操作之间是相互独立的:使用不同的计算/控制单元,不存在数据之间的依赖。从PPT里面来看,每个slot里面还有具体的cycle latency要求,看起来这个对于编译器的要求非常高。
实际上VLIW问题是比较多的(一些点没有看懂):
VLIW Compiler Responsibilities
- Schedule operations to maximize parallel execution
- Guarantees intra-instruction parallelism
- Schedule to avoid data hazards (no interlocks)
- Typically separates operations with explicit NOPs
Problems with “Classic” VLIW
- Object-code compatibility (二进制兼容性)
- have to recompile all code for every machine, even for two machines in same generation
- Object code size (二进制大小)
- instruction padding wastes instruction memory/cache
- loop unrolling/software pipelining replicates code
- Scheduling variable latency memory operations
- caches and/or memory bank conflicts impose statically unpredictable variability
- Knowing branch probabilities
- Profiling requires an significant extra step in build process
- Scheduling for statically unpredictable branches
- optimal schedule varies with branch path
- Precise Interrupts can be challenging – Does fault in one portion of bundle fault whole bundle? – EQ Model has problem with single step, etc.