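Starting a new series: I read the abstracts from the major conferences, and the introductions of the papers that interest me, to pick up new ideas quickly.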
Industry
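《Data Compression Accelerator on IBM POWER9 and z15 Processors : Industrial Product》, IBM Research/IBM Systems
- Hardware compression improves I/O and network performance and reduces storage and memory overhead
- On-chip compression accelerator: 0.5% of chip area buys a 388x speedup (POWER9) and a 23% real-world performance gain
- The focus is on the trade-offs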
《High-Performance Deep-Learning Coprocessor Integrated into x86 SoC with Server-Class CPUs Industrial Product》, Centaur Technology (an x86 CPU design company founded in 1995, based in Texas, USA)
- DL coprocessor (NCORE) + x86 SoC (server-class CPU)
- int8, uint8, int16, bf16; 20T ops/s
- MLPerf Inference v0.5: 1218 IPS and 1.05 ms latency on ResNet-50-v1.5; 0.329 ms latency on MobileNet-V1
《The IBM z15 High Frequency Mainframe Branch Predictor Industrial Product》, IBM Systems Group
- Multi-level look-ahead structure
- Predicts branch directions and target addresses; enhanced with multiple auxiliary direction, target, and power predictors
- Optimized for the specific workloads of enterprise-class systems
《Evolution of the Samsung Exynos CPU Microarchitecture》, authors from SiFive/Centaur/Independent Consultant/ARM/Texas A&M University/AMD/Nuvia/Goodix
- Discusses the design changes in Samsung's Exynos family from M1 to M6
- perceptron-based branch prediction/Spectre v2 security enhancements/micro-operation cache algorithms/prefetcher advancements/memory latency optimizations
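For reference, the general idea behind perceptron-based branch prediction can be sketched in a few lines. This is the textbook scheme from the perceptron-predictor literature, not the Exynos design; the table size, history length, and class and function names are all assumptions chosen for illustration.

```python
class PerceptronPredictor:
    """Textbook perceptron branch predictor (illustrative, not the Exynos design)."""

    def __init__(self, table_size=64, history_len=8):
        # Training threshold commonly quoted in the literature: floor(1.93*h + 14).
        self.threshold = int(1.93 * history_len + 14)
        # One weight vector per table entry; index 0 is the bias weight.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Global history, stored as +1 (taken) / -1 (not taken).
        self.history = [-1] * history_len

    def _output(self, pc):
        w = self.weights[pc % len(self.weights)]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        # Predict taken when the dot product is non-negative.
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        w = self.weights[pc % len(self.weights)]
        # Train on a misprediction or when the output is not yet confident.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w[0] += t
            for i, hi in enumerate(self.history):
                w[i + 1] += t * hi
        # Shift the actual outcome into the global history.
        self.history = self.history[1:] + [t]


def run_pattern(predictor, pc, outcomes):
    """Feed a sequence of outcomes for one branch; return per-branch correctness."""
    correct = []
    for taken in outcomes:
        correct.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return correct


if __name__ == "__main__":
    # A strictly alternating branch is learned after a short warm-up.
    results = run_pattern(PerceptronPredictor(), 0x40, [i % 2 == 0 for i in range(200)])
    print("correct in last 50:", sum(results[-50:]))
```

The weights learn the correlation between each history bit and the branch outcome, which is why simple history-correlated patterns become predictable after warm-up.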
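《Xuantie-910: A Commercial Multi-Core 12-Stage Pipeline Out-of-Order 64-bit High Performance RISC-V Processor with Vector Extension : Industrial Product》, Alibaba T-Head division
- Based on the RISC-V RV64GCV instruction set, plus custom extensions for arithmetic, bit manipulation, load/store, and TLB and cache operations, plus the RISC-V vector extension 0.7.1
- Supports multi-core, multi-cluster SMP (symmetric multiprocessing) with cache coherence; 12-stage pipeline, out-of-order execution, multi-issue superscalar, 2.5 GHz, 12 nm, 0.8 mm^2 per core
- Software and toolchain co-optimization; the best performance among RISC-V processors, and competitive with ARM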
CPU based
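《Divide and Conquer Frontend Bottleneck》, Sharif University of Technology (the "MIT of Iran"), Ali Ansari, Hamid Sarbazi-Azad
- Frontend stalls caused by instruction misses and BTB misses (BTB: the branch target buffer used by branch prediction to fetch ahead) are significant, and existing prefetchers handle them poorly
- The miss penalty of an instruction miss is far larger than that of a BTB miss
- Splits the frontend bottleneck into three classes and handles each separately:
- Sequential misses: SN4L
- Discontinuity misses: Dis
- BTB misses: pre-decoding the prefetched blocks
- 5% improvement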
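To make the sequential/discontinuity split concrete, here is a minimal sketch that labels a stream of missed instruction-cache blocks. The rule used (a miss is "sequential" if it falls within a few blocks after the previous miss) and the `window=4` default, which merely echoes the "4" in SN4L, are my assumptions for illustration and not the paper's definitions; `classify_misses` is a hypothetical helper.

```python
def classify_misses(miss_blocks, window=4):
    """Label each missed block address as 'sequential' or 'discontinuity'.

    Assumption: a miss counts as sequential when it lands 1..window blocks
    after the previous miss; everything else is a discontinuity.
    """
    labels = []
    prev = None
    for block in miss_blocks:
        if prev is not None and 0 < block - prev <= window:
            labels.append("sequential")
        else:
            labels.append("discontinuity")
        prev = block
    return labels


if __name__ == "__main__":
    trace = [10, 11, 12, 40, 41, 100]
    for block, label in zip(trace, classify_misses(trace)):
        print(block, label)
```

Sequential misses are the ones a next-line style prefetcher could cover, while discontinuities (taken branches, calls, returns) need a separate mechanism, which is the motivation for treating the classes separately.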
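《Auto-Predication of Critical Branches*》, Intel Labs (Bengaluru, India and Haifa, Israel)
- H2P (hard-to-predict) branches and mis-speculation limit how well branch prediction scales. Predication (fetching both paths of a branch) replaces the control dependence with a data dependence, which eases the problem but can reduce instruction-level parallelism.
- Analyzes the trade-off between prediction and predication and proposes ACB, which automatically decides whether to predicate a branch based on whether it is critical to performance.
- Uses a sophisticated performance-monitoring mechanism. 8% improvement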
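The "control dependence becomes data dependence" idea behind predication can be shown with a toy example. This is a software analogy only, assuming a simple if/else with cheap bodies; the function names are hypothetical and nothing here models ACB's hardware mechanism.

```python
def with_branch(cond, a, b):
    """Control dependence: which computation runs depends on the branch."""
    if cond:
        return a + 1
    return b * 2


def predicated(cond, a, b):
    """Data dependence: both paths are computed, then one result is selected."""
    taken_result = a + 1        # work from the taken path
    not_taken_result = b * 2    # work from the not-taken path
    mask = int(bool(cond))      # 1 if the condition holds, else 0
    # The select depends on the condition as a data operand, not as a branch.
    return mask * taken_result + (1 - mask) * not_taken_result


if __name__ == "__main__":
    for cond in (True, False):
        print(cond, with_branch(cond, 3, 5), predicated(cond, 3, 5))
```

The predicated version never mispredicts, but it always pays for both paths and the final result must wait for the condition, which is the parallelism cost noted above.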
《Slipstream Processors Revisited: Exploiting Branch Sets》
《Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Spatial Hardware Prefetching》
《Focused Value Prediction*》, Intel Labs (Bengaluru, India/Haifa, Israel)
《Flick: Fast and Lightweight ISA-Crossing Call for Heterogeneous-ISA Environments》
《Efficiently Supporting Dynamic Task Parallelism on Heterogeneous Cache-Coherent Systems》
《T4: Compiling Sequential Code for Effective Speculative Parallelization in Hardware》
Accelerators
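《Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads》, Groq Inc. (Mountain View, California)
- Memory units are interleaved with the vector/matrix compute units to exploit dataflow locality
- Observations: (1) data parallelism in machine learning can be mapped onto tensors in hardware; (2) a stream programming model lets software precisely understand and control the hardware units, which yields better performance
- The TSP exploits parallelism at several levels, including instruction-level parallelism, memory concurrency, and data and model parallelism, while guaranteeing determinism by eliminating all reactive hardware elements (arbiters, caches)
- 20.4K IPS on ResNet50
- Functional slicing: local functional homogeneity but chip-wide (global) heterogeneity. Each tile (a vertical strip) implements a specific function and is stacked vertically; dataflow moves left and right, and instructions are issued separately to each vertical strip.
- Parallel lanes and streams: Streams provide a programming abstraction and are a conduit through which data flows between functional slices.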
《Genesis: A Hardware Acceleration Framework for Genomic Data Analysis》
《DSAGEN: Synthesizing Programmable Spatial Accelerators》
《Bonsai: High-Performance Adaptive Merge Tree Sorting》
《Gorgon: Accelerating Machine Learning from Relational Data》
《A Specialized Architecture for Object Serialization with Applications to Big Data Analytics》
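《SpinalFlow: An Architecture and Dataflow Tailored for Spiking Neural Networks》, University of Utah, Surya Narayanan, Pierre-Emmanuel Gaillardon
- SNN dataflow must track neuron potentials across multiple ticks, which introduces new data structures and new data access patterns.
- Proposes SpinalFlow, which processes inputs and outputs as compressed, time-stamped, sorted sequences; a single neuron runs through its full series of computation steps, which reduces the storage overhead for potentials and gives better data reuse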
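A rough sketch of why a time-sorted spike sequence helps: one neuron can consume all of its input spikes in order while holding a single running potential, instead of storing a potential per tick. The neuron model below (integrate-and-fire with no leak, integer weights, firing at the first threshold crossing) and the name `first_spike_time` are assumptions for illustration, not SpinalFlow's actual dataflow.

```python
def first_spike_time(sorted_spikes, weights, threshold):
    """Return the tick at which the neuron first fires, or None if it never does.

    sorted_spikes: list of (tick, input_index) pairs, already sorted by tick.
    weights:       weights[input_index] is the synaptic weight of that input.
    """
    potential = 0  # a single running potential for this neuron
    for tick, src in sorted_spikes:
        potential += weights[src]
        if potential >= threshold:
            return tick  # output is itself a time-stamped spike
    return None


if __name__ == "__main__":
    spikes = [(1, 0), (2, 1), (5, 2)]   # compressed, time-stamped, sorted input
    weights = [4, 3, 5]
    print(first_spike_time(spikes, weights, threshold=10))
    print(first_spike_time(spikes, weights, threshold=20))
```

Because the output is again a (tick, neuron) pair, the outputs of one layer can be sorted and fed to the next layer in the same format.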
《NEBULA: A Neuromorphic Spin-Based Ultra-Low Power Architecture for SNNs and ANNs》
Security
《MuonTrap: Preventing Cross-Domain Spectre-Like Attacks by Capturing Speculative State》
System-Level
《SysScale: Exploiting Multi-domain Dynamic Voltage and Frequency Scaling for Energy Efficient Mobile Processors》
《The NEBULA RPC-Optimized Architecture》
《CryoCore: A Fast and Dense Processor Architecture for Cryogenic Computing》
《Heat to Power: Thermal Energy Harvesting and Recycling for Warm Water-Cooled Datacenters》
Others
《Printed Microprocessors》
《Déjà View: Spatio-Temporal Compute Reuse for Energy-Efficient 360° VR Video Streaming》
《SOFF: An OpenCL High-Level Synthesis Framework for FPGAs》
《Hardware-Software Co-Design for Brain-Computer Interfaces》