Data-parallel kernels dominate the computational workload in a wide variety of demanding application domains, including graphics rendering, computer vision, audio processing, physical simulation, and machine learning. Specialized data-parallel accelerators have long been known to provide greater energy and area efficiency than general-purpose processors for codes with significant amounts of regular data-level parallelism (DLP). With continuing improvements in transistor density and an increasing emphasis on energy efficiency, there has recently been growing interest in DLP accelerators for mainstream computing environments. Surveying the wide range of data-parallel accelerator cores in industry and academia reveals a general tradeoff between programmability (how easy it is to write software for the accelerator) and efficiency (energy/task and tasks/second/area). In this project, we explored a new approach to building data-parallel accelerators based on the vector-thread architectural design pattern, and we used this approach to more broadly examine the tension between programmability and efficiency.
The vector-thread (VT) architectural design pattern includes a control thread that manages a vector of microthreads. The control thread uses vector memory instructions to efficiently move blocks of data between memory and each microthread's registers, and vector fetch instructions to broadcast scalar instructions to all microthreads. These vector mechanisms are complemented by the ability for each microthread to direct its own control flow when necessary. This logical view is implemented by mapping the microthreads both spatially and temporally to a set of vector lanes contained within a vector-thread unit (VTU). A seamless intermixing of vector and threaded mechanisms allows VT to potentially combine the energy efficiency of SIMD accelerators with the flexibility of MIMD accelerators.
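The interplay of vector and threaded mechanisms described above can be made concrete with a small executable model. The sketch below is a hypothetical Python simulation of the VT pattern, not Maven's or Scale's actual interfaces; the names `vector_load`, `vector_fetch`, `vector_store`, the register-file representation, and the four-microthread configuration are all illustrative assumptions. The control thread moves blocks of data with vector memory operations, broadcasts a scalar kernel with a vector fetch, and each microthread may follow its own control-flow path inside that kernel.

```python
# Toy model of the vector-thread (VT) pattern: a control thread manages a
# vector of microthreads. All names here are illustrative, not a real VT ISA.

NUM_MICROTHREADS = 4  # arbitrary illustrative vector length

class Microthread:
    def __init__(self):
        self.regs = {}  # per-microthread register file

def vector_load(microthreads, name, memory, base):
    # Control-thread vector memory instruction: move a block of data from
    # memory into each microthread's registers in one operation.
    for i, ut in enumerate(microthreads):
        ut.regs[name] = memory[base + i]

def vector_fetch(microthreads, kernel):
    # Control-thread vector fetch: broadcast a scalar instruction stream
    # (modeled as a Python function) to every microthread. Each microthread
    # then directs its own control flow through the kernel.
    for ut in microthreads:
        kernel(ut)

def vector_store(microthreads, name, memory, base):
    # Companion vector memory instruction: write results back to memory.
    for i, ut in enumerate(microthreads):
        memory[base + i] = ut.regs[name]

def saturating_double(ut):
    # Scalar kernel with data-dependent control flow: most elements take
    # the fast path; some microthreads diverge into the clamp (irregular DLP).
    x = ut.regs["x"] * 2
    if x > 10:  # microthread-local branch
        x = 10
    ut.regs["y"] = x

# Control thread: set up memory, vector-load inputs, vector-fetch the
# kernel, vector-store results.
memory = [1, 4, 6, 9] + [0] * NUM_MICROTHREADS
uts = [Microthread() for _ in range(NUM_MICROTHREADS)]
vector_load(uts, "x", memory, base=0)
vector_fetch(uts, saturating_double)
vector_store(uts, "y", memory, base=NUM_MICROTHREADS)

print(memory[NUM_MICROTHREADS:])  # -> [2, 8, 10, 10]
```

This model runs the microthreads sequentially for clarity; an actual VTU instead maps them both spatially and temporally onto a set of vector lanes, which is where the SIMD-like efficiency comes from.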
The Scale VT processor was our first implementation of these ideas, specifically targeted for use in embedded devices. Scale used a generic RISC instruction set architecture for the control thread and a microthread instruction set specialized for VT. The Scale microarchitecture included a simple RISC control processor and a complex but efficient four-lane VTU. The Scale programming methodology required either a combination of compiled code for the control processor and hand-written assembly for the microthreads, or a preliminary vectorizing compiler written specifically for Scale. For more information, see the Scale website.
Based on our experiences designing, implementing, and evaluating Scale, we identified three primary directions for improvement to simplify both the hardware and software aspects of the VT architectural design pattern: (1) a unified VT instruction set architecture; (2) a VT microarchitecture more closely based on the vector-SIMD pattern; and (3) an explicitly data-parallel VT programming methodology. These ideas formed the foundation for the Maven VT architecture. To evaluate the Maven VT architecture, we implemented a common set of parameterized synthesizable microarchitectural components, and then composed these components to form complete RTL designs for a variety of architectural design patterns, including multi-core tiles of four MIMD cores, tiles of four single-lane vector-SIMD cores, tiles of four single-lane VT cores, multi-lane vector-SIMD tiles, and multi-lane VT tiles. We generated and analyzed hundreds of complete VLSI layouts for the MIMD, vector-SIMD, and VT tiles in a modern 65 nm technology, evaluating each on a workload of compiled microbenchmarks and application kernels. We pushed each configuration through synthesis and place-and-route (PAR), then performed post-PAR gate-level simulation across all benchmarks to obtain accurate area, energy, and timing estimates.
Our detailed VLSI results confirmed that vector-based microarchitectures are more area- and energy-efficient than scalar-based microarchitectures on regular DLP, and, surprisingly, even on fairly irregular DLP. Our results suggest that the efficiency and programmability of our new Maven VT architecture make it a compelling alternative to traditional vector-SIMD architectures.
This work was supported in part by Microsoft (Award #024263) and Intel (Award #024894, equipment donations) funding and by matching funding from U.C. Discovery (Award #DIG07-10227).