Vector Engine Type 20 micro-architecture overview

Nov 1, 2020, Takuma SAITO (Processor Architecture Engineer), Satoru NAGASE (Processor Architecture Manager), AI Platform Division, NEC Corporation

Vector Engine, the heart of SX-Aurora TSUBASA, has been updated to its 2nd generation (VE20). This document gives an overview of the VE20 micro-architecture.

Vector Architecture

Vector architecture is suited for large-scale data processing

  • High efficiency: a single vector instruction can control the processing of 3 kB of data
  • High usability: vector operations are easily enabled by the auto-vectorizing compiler

Vector Engine is particularly suitable for workloads that process large amounts of data, providing users with both high efficiency and high usability. A key feature of the vector architecture is its large vector registers. Vector Engine has 64 vector registers, each of which can hold 256 elements of 8-byte data (2 kByte in total), and various vector instructions can process 256 elements at once. NEC's dedicated vector compiler automatically finds where vectorization is applicable in your source code and outputs optimal assembly code. Since no special programming skill is required for vectorization, users can easily enjoy the benefits of powerful vector computing.
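To illustrate the point, the plain C loop below (a generic example, not taken from NEC's material) has no loop-carried dependency, so an auto-vectorizing compiler such as NEC's can map each group of up to 256 iterations onto a single set of vector instructions without any changes to the source.

```c
#include <stddef.h>

/* A straightforward DAXPY-style loop: every iteration is independent,
 * so the compiler can process up to 256 elements per vector instruction. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```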

Data Size Comparison of SIMD and Vector

This slide shows a schematic comparison between the SIMD approach generally used in CPUs (such as Intel x86 or Fujitsu A64FX) and the vector architecture used in Vector Engine.

Typical 512-bit SIMD handles 8 elements of double-precision data in one cycle. Vector Engine, on the other hand, can handle up to 256 elements with a single instruction (corresponding to 16384-bit SIMD) and operates on 32 elements per cycle, which is 4 times as many as 512-bit SIMD.

As explained above, a vector processor can handle a large amount of data with just one vector instruction. That is why a vector processor can achieve high power efficiency.
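To make the arithmetic behind this comparison explicit, the small program below simply restates the figures quoted above (it is an illustration, not NEC code):

```c
#include <stdio.h>

int main(void)
{
    /* Figures quoted in the text above. */
    int simd_elems = 512 / 64;   /* 8 DP elements per 512-bit SIMD instruction */
    int ve_elems   = 256;        /* elements per vector instruction            */
    int ve_lanes   = 32;         /* elements processed per cycle (32 VPPs)     */

    printf("Vector instruction width: %d bits (%d elements)\n",
           ve_elems * 64, ve_elems);
    printf("Elements per cycle: VE %d vs. SIMD %d -> %dx\n",
           ve_lanes, simd_elems, ve_lanes / simd_elems);
    return 0;
}
```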

  • Components
    • 10 vector cores
    • 16MB LLC
    • 2D mesh network on chip
    • DMA engine
    • 6 HBM2 controllers and interfaces
    • PCI Express Gen3 x16 interface
  • Specs
    • Core frequency: 1.6 GHz
    • Core performance: 307 GF (DP) / 614 GF (SP)
    • CPU performance: 3.07 TF (DP) / 6.14 TF (SP)
    • Memory bandwidth: 1.53 TB/s
    • Memory capacity: 24/48 GB
  • Technology
    • 16 nm FinFET process

The image on the right shows the layout of the Vector Engine Type 20 processor. The VE20 processor is composed of 10 vector cores (the previous VE10 has 8 vector cores), the LLC, HBM2 I/Fs (HBM2 memory controllers), 6 HBM2 memories, a DMA engine, and a PCIe interface.

The powerful vector core works at a clock of 1.6 GHz and can deliver as much as 307 GF in double precision and 614 GF in single precision. With a total of 10 vector cores, VE20's theoretical peak performance comes to 3.07 TF in DP and 6.14 TF in SP.

The VE20 processor also includes 16 MB of LLC (Last Level Cache), which is divided in half and placed on both sides of the 10 cores. The 10 vector cores and the LLC are connected with high bandwidth by multiple layers of a 2-dimensional mesh network called the NoC (Network on Chip).
This high-bandwidth NoC enables the LLC to feed sufficient data to the 10 vector cores. The detailed structure of the NoC is elaborated on a later slide.

The LLC is divided into multiple banks, and each LLC bank has a connection to an HBM2 memory controller. A total of 6 HBM2 controllers are placed near the 6 HBM2 stacks, each controlling its own HBM2 memory. The HBM2 memory used for VE20 is newly enhanced, so the total memory bandwidth increases to 1.53 TB/s, 25% higher than the 1.22 TB/s of VE10. Following the previous VE10, VE20 is one of the processors with the world's highest level of memory bandwidth. The HBM2 controller is custom-made to deliver high memory load/store performance for various HPC and ML workloads.

One of the notable features of VE20 is that it achieves a high B/F ratio (bytes per FLOP) of 0.5, which is higher than that of other processors. This contributes to high effective performance in real HPC and ML applications. In addition, the PCIe controller is located at the bottom of the ASIC die and provides a PCIe Gen3 x16 interface to the outside. The DMA engine is located on the opposite side; it handles data transfers between HBM2 memory and x86 host memory via the PCIe interface.
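The B/F figure follows directly from the peak numbers above; the short sketch below just reproduces that arithmetic and is purely illustrative:

```c
#include <stdio.h>

int main(void)
{
    double mem_bw_gb_s = 1530.0;   /* 6 HBM2 stacks x 256 GB/s            */
    double peak_dp_gf  = 3070.0;   /* 10 cores x 307 GF double precision  */

    /* Bytes available from memory per double-precision FLOP. */
    printf("B/F = %.2f\n", mem_bw_gb_s / peak_dp_gf);
    return 0;
}
```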

After this slide, the main components of the processor are explained in detail.

Vector Core

  • Vector Processing Unit (VPU)
    • Powerful computing capability
      • 307.2 GFLOPS DP/614.4 GFLOPS SP performance

    • High Bandwidth Memory Access
      • 409.6 GB/s load and store

  • Scalar Processing Unit (SPU)
    • Provides the basic functionality as a processor
      • Fetch, decode, branch, add, exception handling, etc.

    • Controls the status of the complete core

  • Address translation and data forwarding crossbar
    • To support continuous vector memory access
      • 16 elements/cycle vector address generation and translation, 17 requests/cycle issuing
      • 409.6 GB/sec load and 409.6 GB/sec store data forwarding

This slide shows an overview of the vector core. The vector core consists of four major parts: the VPU, the SPU, the address generation and translation block, and the data forwarding crossbar.

Vector Processing Unit (VPU)
The VPU is the most characteristic unit of the vector core. It holds 128 KB of vector registers and provides powerful computing capability. The VPU also has very powerful memory access capability: its theoretical memory bandwidth per core exceeds 400 GB/s for load and for store.

Scalar Processing Unit (SPU)
The SPU provides basic processor functions such as instruction fetch, decode, branch, and exception handling. It also plays a central role in processing and controls all other parts of the vector core, including the VPU.

Address generation and translation / Data forwarding crossbar
The address generation and translation block and the request crossbar build memory load/store packets and forward them to the right port of the memory network, while the reply crossbar forwards reply packets from the memory network to the 32 VPPs. These blocks are designed to support continuous operation of the VPU. In vector processing, the pre-load feature is very important to hide the latency of memory loads and to avoid starving the vector pipelines of data. When the address generation and translation block receives vector load instructions from the SPU in advance, it can perform address translation for multiple vector elements with separate memory addresses all at once, and then build and issue up to 17 memory packets simultaneously. The data forwarding crossbar can transfer memory load/store data at a theoretical rate of 409.6 GB/s for load and store each. These blocks are designed to have the same bandwidth as the VPU's processing rate.
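Per-element address generation matters most for indexed (gather-type) accesses, where every element of a single vector load has its own address. The loop below is a generic C illustration of such an access pattern, not code from the document:

```c
#include <stddef.h>

/* Indirect (gather) access: each of the up-to-256 elements covered by one
 * vector load comes from a different address x[idx[i]], so the hardware
 * must generate and translate many element addresses per cycle to keep
 * the vector pipelines fed. */
void gather_scale(size_t n, double a, const double *x,
                  const int *idx, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[idx[i]];
}
```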

Vector Processing Unit

  • Four pipelines, each 32-way parallel
    • FMA0: FP fused multiply-add, integer multiply
    • FMA1: FP fused multiply-add, integer multiply
    • ALU0/FMA2: Integer add, multiply, mask, FP FMA
      ⇒ FMA0, FMA1, ALU0/FMA2: total 96 FMAs
    • ALU1/Store: Integer add, store, complex operation

  • Doubled SP performance by 32bit x 2 packed vector data support
  • Vector register (VR) renaming with 256 physical VRs
    • 64 architectural VRs are renamed
      • Enhanced preload capability
      • Avoidance of WAR and WAW dependencies

  • OoO scheduling
  • Dedicated complex operation pipeline to prevent pipeline stall
    • Vector sum, divide, mask population count, etc.

The diagram on the right illustrates the structure of the Vector Processing Unit (VPU). The VPU consists of 32 vector pipelines called VPPs and their control blocks, such as the instruction buffer and the renaming and scheduling blocks.

These control blocks decode the instructions from the SPU, perform vector register renaming and scheduling, and finally issue the instruction to all 32 VPPs at the same time.

The VPP structure is very simple. Its main components are six execution pipelines and the transfer block among them. The six execution pipelines are as follows:

  • 3 FMA pipelines (FMA0, FMA1, FMA2)
  • 2 ALU pipelines (ALU0, ALU1)
  • Store/Complex operation pipeline

"FMA2 and ALU0" and "ALU1 and store" share the read port, so the effective number of pipeline is four.

The VPU design aims at two points. The first is to provide powerful computing capability.
To achieve that, three floating-point pipelines that support fused multiply-add (FMA) are implemented. Across the 32 VPPs, the VPU can perform 96 FMA operations per cycle. It also supports a packed data type for vector operations, which doubles the arithmetic performance for single precision compared to double precision.
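Those pipeline counts tie directly to the per-core peak quoted earlier; the sketch below just reproduces the arithmetic (3 FMA pipes x 32 VPPs x 2 FLOP per FMA x 1.6 GHz) and is purely illustrative:

```c
#include <stdio.h>

int main(void)
{
    int    fma_pipes_per_vpp = 3;    /* FMA0, FMA1, ALU0/FMA2     */
    int    vpps_per_core     = 32;
    int    flops_per_fma     = 2;    /* one multiply plus one add */
    double clock_ghz         = 1.6;

    double dp_gflops = fma_pipes_per_vpp * vpps_per_core
                       * flops_per_fma * clock_ghz;
    printf("Peak per core: %.1f GFLOPS DP, %.1f GFLOPS SP (packed)\n",
           dp_gflops, 2.0 * dp_gflops);
    return 0;
}
```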

The other goal is to maintain continuous operation of the VPPs. For that purpose, a register renaming function for vector registers is introduced. The VPU has 256 physical vector registers, and the 64 architectural vector registers are renamed onto these 256 physical registers. This improves the preload capability and avoids unnecessary WAR and WAW dependencies. The VPU also includes out-of-order scheduling and offloads complex operations to a dedicated complex operation pipeline outside the VPPs. This is aimed at preventing pipeline stalls caused by the long latency of complex operations such as vector sum and divide.

Scalar Processing Unit

  • General enhancements
    • 4 instructions/cycle fetch and decode
    • Sophisticated branch prediction
    • OoO scheduling
    • 8-level speculative execution
    • Four scalar instruction pipes
    • Two 32 kB L1 caches + unified 256 kB L2 cache
    • Hardware prefetching

  • Support for continuous vector operation
    • Dedicated vector instruction pipe
    • 16 elements/cycle coherency control for vector store

The SPU block diagram is shown on the right. The components of the SPU are as follows:

  • Fetch
  • Decode
  • Scheduler
  • 5 pipelines
  • 32KB L1 cache x2
    • Instruction cache
    • Operand cache

  • 256KB L2 unified cache

The SPU provides basic processor functions such as instruction fetch, decode, branch, and exception handling. It also controls other parts of the core, for example managing the state of the VPU.

The performance of Vector Engine is largely provided by the VPU, but the performance of the SPU is also important. This is because the SPU must process the sequential parts that cannot be vectorized, and it must also calculate the base addresses for vector memory accesses and dispatch enough vector instructions to the VPU to keep the vector pipelines filled.
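As an example of sequential work that stays on the SPU, the pointer-chasing loop below (a generic illustration, not from the document) cannot be vectorized because each load address depends on the result of the previous load:

```c
#include <stddef.h>

struct node {
    double       value;
    struct node *next;
};

/* Pointer chasing: the address of each element depends on the previous
 * load, so the loop runs on the SPU's scalar pipelines rather than the VPU. */
double list_sum(const struct node *p)
{
    double acc = 0.0;
    for (; p != NULL; p = p->next)
        acc += p->value;
    return acc;
}
```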

The SPU can fetch, decode, and execute 4 instructions in parallel every cycle. Many enhancements are implemented, such as advanced branch prediction and intelligent hardware prefetching into the L2 cache.
In addition, the following enhancements are made to support efficient vector processing.

  • A dedicated vector instruction pipeline to issue vector instructions to the VPU.
  • A vector cache coherency control block to support fast vector memory access.

Memory Subsystem

  • High Bandwidth
    • 409.6 GB/s x2 core bandwidth (load and store)
    • Over 3 TB/s LLC bandwidth
    • 1.53 TB/s memory bandwidth

  • Caches
    • Scalar L1/L2 caches on each core
    • 16 MB shared LLC

  • Two memory networks
    • 2D mesh NoC for core memory access
    • Ring bus for DMA and PCIe traffic

  • DMA engine
    • Used by both vector cores and x86 node
    • Can access VE memory, VE registers, and x86 memory

This slide describes the memory subsystem of Vector Engine. The strength of Vector Engine is its memory bandwidth, so the memory subsystem is designed to provide sufficient bandwidth from the vector cores to memory.

Regarding the cache hierarchy, three levels of cache are implemented in Vector Engine. The L1 and L2 caches are located in the SPU and are private caches for instructions and scalar data. The Last Level Cache (LLC) is shared by all 10 vector cores.

Regarding the memory network, two kinds of networks are implemented in Vector Engine. The first one is the NoC.
Each vector core is connected to the NoC with a bandwidth of 409.6 GB/s for load and store each. The 8 distributed LLC blocks are also connected to the NoC, and the total LLC bandwidth exceeds 3 TB/s. The LLC blocks are connected to the six HBM2 memories via the memory controllers, and the total HBM2 memory bandwidth is 1.53 TB/s (256 GB/s x 6).

The other is the ring bus. A bi-directional ring bus connects all LLC blocks and the DMA engine. The ring bus has a bandwidth of 16 GB/s x2 (bi-directional) and is mainly used for PCIe and DMA traffic. The DMA engine is virtualized and can be used by both the vector cores and the x86 host simultaneously. In addition, the DMA engine can access the whole memory address space, including x86 memory, VE memory, and VE registers mapped into the PCIe address space. Because Vector Engine has a memory protection function, the DMA engine can be used safely from user processes.

Network on Chip (NoC)

  • 2D mesh network
    • Maximizes bandwidth with minimal wiring
    • Minimizes data transfer distance
    • 16 layered mesh

  • Deadlock avoidance
    • Dimension-ordered Routing
    • Virtual channels for request and reply

  • Adaptive flow control
  • Age based QoS control

This diagram shows the physical layout of the Network on Chip (NoC) implemented in Vector Engine. The 2D-mesh topology was chosen for the NoC because it maximizes bandwidth with the limited wiring resources on the die and also minimizes the data transfer distance. The NoC is physically composed of 16 layers of 2D-mesh networks, as shown in the diagram. This makes it possible to transfer a large amount of data in parallel between all vector cores and LLC blocks.

The yellow box is a vector core and the green box is an LLC block. Although not shown in the figure, additional cores exist on the top and bottom sides.

The purple boxes are the embedded routers, which are placed at the cross points of each layer and connect the vector cores and LLC blocks. The green wires indicate the network wires connecting routers and LLC blocks. The 16 layers of routers are arranged in a diamond shape in order to use the limited wiring resources effectively and to minimize the distance between the crossbar in the vector core and the routers. These connecting wires use the upper metal layers of the die so that data can be transferred over long distances with low latency.

Standard functions such as deadlock avoidance, adaptive flow control, and age-based QoS control are also implemented.
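Dimension-ordered routing avoids deadlock by making every packet travel along one dimension first and then the other, so no cyclic channel dependency can form. The function below is a generic XY-routing sketch for a 2D mesh and does not describe NEC's actual router logic:

```c
/* Generic dimension-ordered (XY) routing on a 2D mesh: move along X until
 * the column matches the destination, then move along Y.  Since packets
 * never turn from Y back to X, no cycle of channel dependencies can arise. */
enum port { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH };

enum port route_xy(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (cur_x < dst_x) return PORT_EAST;
    if (cur_x > dst_x) return PORT_WEST;
    if (cur_y < dst_y) return PORT_NORTH;
    if (cur_y > dst_y) return PORT_SOUTH;
    return PORT_LOCAL;   /* arrived at the destination router */
}
```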

Last Level Cache (LLC)

  • Memory side cache
    • Avoiding massive snoop traffic
    • Increasing efficiency of indirect memory access

  • 16 MB, write back
  • Inclusive of L1 and L2
  • High Bandwidth design
    • 128 banks, in total more than 3 TB/S bandwidth

  • Auto data scrubbing
  • Assignable data buffer feature
    • Priority of data can be controlled by a flag for vector memory access instructions

This slide describes the Last Level Cache (LLC). The LLC is a memory-side cache, which avoids massive snoop traffic and increases the efficiency of indirect memory access. The LLC is inclusive of the L1 and L2 caches, and its total capacity is 16 MB. To achieve as much as 3 TB/s of bandwidth, the LLC is divided into 128 banks (8 groups x 16 banks). The LLC supports an automatic data scrubbing function to prevent single-bit errors from accumulating and developing into multi-bit errors. An assignable data buffer feature is also implemented: using it, users can control the replacement priority of cached data via an indication flag on vector memory access instructions.
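In practice this flag is normally reached through the compiler rather than written by hand. Assuming NEC's vector compiler exposes LLC retention through a retain-style directive (an assumption here; the exact spelling should be checked in the compiler manual), marking a frequently reused lookup table might look like the sketch below:

```c
#include <stddef.h>

/* Hypothetical sketch: hint that the reused table 'lut' should be kept in
 * LLC via the assignable data buffer so the indirect accesses keep hitting.
 * The directive spelling is an assumption, not a documented guarantee. */
void lookup_scale(size_t n, const double *lut, const int *idx, double *y)
{
#pragma _NEC retain(lut)
    for (size_t i = 0; i < n; i++)
        y[i] = lut[idx[i]];
}
```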

- Intel and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
- Proper nouns such as product names are registered trademarks or trademarks of individual manufacturers.