Advances in deep learning have become the main driver of breakthroughs in a wide range of commercial and enterprise applications. Driven by increasing model sizes and ever-larger amounts of training data, recent years have witnessed a phenomenal paradigm shift from general-purpose multicores (many-cores) to domain-specific system architectures.
In general, the design of deep learning accelerators has focused on two problems with different objectives. For inference, accelerators are typically deployed in energy-constrained environments (e.g., mobile devices), so the goal is high energy efficiency. For training, where a large amount of data needs to be processed, the key is to achieve high throughput through parallel and distributed techniques. In this blog post, we would like to emphasize the importance of systematic and structured approaches in developing efficient deep learning accelerator architectures.
As shown in recent studies, data movement consumes significantly more energy than arithmetic operations; it is therefore important to reduce DRAM accesses. One intuitive but important technique is to compress models without reducing accuracy. The early approach was non-structured pruning, which relies on iterative and heuristic methods. It suffers from two problems: i) it achieves only limited and non-uniform compression ratios; and, more importantly, ii) it may hurt performance due to irregular weight accesses. To recover the lost performance, various architectures (e.g., EIE, SCNN, SparTen, ExTensor) have been proposed to accelerate irregular and sparse computations.
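To make the irregularity concrete, here is a small NumPy sketch of one round of magnitude-based non-structured pruning on a toy 8x8 matrix (an illustration, not the exact procedure of any particular paper): the surviving weights land at arbitrary positions, so a sparse format must carry explicit indices alongside the values.

```python
# Toy non-structured (magnitude) pruning: the resulting sparsity pattern is irregular,
# so per-element coordinates must be stored and traversed at runtime.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)).astype(np.float32)

sparsity = 0.75                                  # prune 75% of the weights
threshold = np.quantile(np.abs(W), sparsity)     # global magnitude threshold
mask = np.abs(W) > threshold                     # irregular, data-dependent mask
W_pruned = W * mask

# Storing the result requires index metadata in addition to the surviving values.
rows, cols = np.nonzero(W_pruned)
values = W_pruned[rows, cols]
print(f"kept {values.size}/{W.size} weights, plus {rows.size + cols.size} index entries")
```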
If we take a step back and rethink the problem, these issues indicate a clear trade-off between model compression and hardware efficiency. But is this trade-off fundamental or superficial? Recent studies strongly suggest the latter. In other words, it is entirely possible to generate compressed models that preserve both high accuracy and hardware efficiency. The key insight is to introduce “structures” into the models and enforce these structures during compression. Such structures enable simpler and more efficient hardware implementations, and even novel compiler optimizations that are not possible with non-structured pruning. At the hardware level, the complex architectural support for index or intersection calculations in sparse models can be largely avoided. At the compiler level, with the structures of pruned kernels known in advance, it is possible to increase parallelism by grouping the processing of similar structures together and to eliminate redundant loads based on the known kernel structures (e.g., non-zeros can only occur at certain positions). Figure 1 shows the difference between non-structured and structured weight pruning.
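For contrast with the previous sketch, here is a similarly minimal sketch of structured (row/filter-level) pruning on the same kind of matrix: the survivors form a smaller dense matrix, so no per-element index metadata is needed and the access pattern stays regular. The matrix size and the keep-two-rows criterion are purely illustrative.

```python
# Toy structured pruning: remove whole rows, keep a smaller dense matrix.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)).astype(np.float32)

keep_rows = 2                                    # keep the 2 rows with the largest L2 norm
row_norms = np.linalg.norm(W, axis=1)
kept = np.sort(np.argsort(row_norms)[-keep_rows:])
W_structured = W[kept, :]                        # dense (2, 8) matrix, no index metadata

x = rng.standard_normal(8).astype(np.float32)
y = W_structured @ x                             # an ordinary dense GEMV, fully regular
print(W_structured.shape, y.shape)
```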
A less intuitive question is whether models with fixed structures can still achieve high accuracy. The recent work ADMM-NN showed that the Alternating Direction Method of Multipliers (ADMM) is an extremely powerful tool for achieving both high compression ratios and high accuracy. The key advantage of ADMM is that the structures can be expressed as optimization constraints, so the method naturally generates structured compressed models with negligible accuracy loss. Applicable to both structured and non-structured pruning, ADMM, as a systematic approach, appears to be a more promising alternative to traditional heuristic iterative pruning. A recent article comprehensively compared the potential of structured and non-structured pruning using a common ADMM-based pruning and quantization framework. The results show that, at the same accuracy, non-structured pruning is not competitive in terms of either storage or computation efficiency. I believe the community needs to rethink the value of non-structured pruning and place more emphasis on algorithm-hardware co-design. Indeed, enforcing certain structures in compressed models and leveraging those structures to optimize performance at the hardware and compiler levels are excellent examples of such an effort.
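To give a flavor of how the structure enters the optimization, here is a minimal ADMM sketch on a toy least-squares “layer”. This illustrates the general ADMM recipe, not the actual ADMM-NN training pipeline: the structural constraint (at most k nonzero rows of W) is handled entirely by a projection step, while the loss is handled by an ordinary minimization step.

```python
# ADMM for: minimize ||X W - Y||^2  subject to  W in S (at most k nonzero rows),
# reformulated with an auxiliary variable Z and a scaled dual variable U.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, k, rho = 128, 16, 8, 4, 1.0
X = rng.standard_normal((n, d_in))
Y = rng.standard_normal((n, d_out))

def project_rows(M, k):
    """Project onto {matrices with at most k nonzero rows}: keep the top-k rows by L2 norm."""
    out = np.zeros_like(M)
    kept = np.argsort(np.linalg.norm(M, axis=1))[-k:]
    out[kept] = M[kept]
    return out

W = np.zeros((d_in, d_out))
Z = np.zeros_like(W)          # auxiliary variable carrying the structure
U = np.zeros_like(W)          # scaled dual variable
A = X.T @ X + rho * np.eye(d_in)

for _ in range(50):
    # W-update: minimize loss + (rho/2)||W - Z + U||^2 (closed form for least squares)
    W = np.linalg.solve(A, X.T @ Y + rho * (Z - U))
    # Z-update: project onto the structured-sparse set
    Z = project_rows(W + U, k)
    # dual update
    U = U + W - Z

W_final = project_rows(W, k)  # hard-enforce the structure at the end
print("nonzero rows:", int((np.linalg.norm(W_final, axis=1) > 0).sum()))
```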
The unique aspect of structured pruning is that, essentially, both accuracy and hardware implementation can be considered “functions” of the selected structures. This view unifies the various points in the design space. With relatively restrictive structures (e.g., circulant matrices or permuted diagonal matrices), the hardware can be very simple and efficient, at the cost of modest accuracy degradation. With fine-grained patterns inside coarse-grained structures, accuracy can be largely preserved, but more effort needs to be devoted to compiler optimizations to recover execution efficiency. Clearly, identifying the proper structures in such a large design space is an interesting open problem.
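As a concrete example of a restrictive structure, the sketch below shows why a circulant weight matrix is so hardware-friendly: only its first column needs to be stored (n values instead of n^2), and the matrix-vector product reduces to a circular convolution computed with FFTs in O(n log n). The sizes are illustrative.

```python
# A circulant matrix is fully defined by its first column; the dense form below
# is built only to check the FFT-based product against an ordinary matrix-vector multiply.
import numpy as np

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)                        # first column defines the whole matrix
x = rng.standard_normal(n)

C = np.stack([np.roll(c, j) for j in range(n)], axis=1)   # explicit circulant (check only)

y_dense = C @ x
y_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real   # same product via FFTs
print(np.allclose(y_dense, y_fft))                         # True
```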
Finally, we turn to accelerator architectures for training. The fundamental problem here is how to use multiple accelerators (e.g., GPUs, TPUs, or any specific implementation) to achieve high throughput when processing a large amount of training data. Unlike inference, each accelerator is normally abstracted as a “black box” with a certain computing capability. In this scenario, communication between accelerators becomes the key problem. Specifically, we need to consider the sophisticated interactions between the model architecture (layer types and how layers are connected), the parallelism configuration (data, model, or hybrid parallelism), and the architecture (computation, communication, and synchronization costs). As a first step toward solving the problem, “one weird trick” (OWT) suggested using data parallelism for convolutional layers and model parallelism for fully connected layers. Unfortunately, this empirical solution is far from optimal. With different parallelism configurations, communication can happen in different phases (forward, backward, and gradient), with the amount of data transferred determined by the layer structure. The crux of the problem is to account for all possible partitions of the tensors involved in the three phases, which is why a systematic approach is required.
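The back-of-the-envelope arithmetic below shows why a heuristic like OWT is plausible at all. It uses deliberately simplified per-step traffic counts (number of values exchanged, ignoring the number of workers, reduction algorithms, and communication/computation overlap), and the layer sizes are made up for illustration.

```python
# Rough per-step communication volume (in number of values) for two parallelism choices.
def fc_traffic(batch, d_in, d_out):
    data_par  = d_in * d_out             # data parallelism: all-reduce the weight gradients
    model_par = batch * (d_in + d_out)   # model parallelism: exchange input/output activations
    return data_par, model_par

def conv_traffic(batch, h, w, c_in, c_out, k):
    data_par  = k * k * c_in * c_out
    model_par = batch * h * w * (c_in + c_out)
    return data_par, model_par

# FC layer like a large classifier head: weights dominate -> model parallelism is cheaper.
print("FC   4096x4096, batch 64:", fc_traffic(64, 4096, 4096))
# Early conv layer: activations dominate -> data parallelism is cheaper.
print("Conv 3x3, 56x56, 64->64 channels, batch 64:", conv_traffic(64, 56, 56, 64, 64, 3))
```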
The recent works HyPar and AccPar advanced the state of the art using a systematic approach with a solid mathematical foundation. Based on a clearly defined inter- and intra-layer communication model, they use dynamic programming to systematically search the design space and show significant performance improvements over OWT. Both HyPar and AccPar assume that all accelerators perform parallel training in a layer-by-layer fashion. Figure 2 shows the three tensor partition types that constitute the whole design space. There are still plenty of open problems involving richer parallelism configurations. For example, the recent work PipeDream introduces pipeline parallelism, essentially a form of pipelined model parallelism in which each stage comprises at least one layer. In this scenario, how to achieve the optimal model partition is unclear, especially with the additional requirement of load balancing. Figure 3 illustrates the idea of fine-grained partitioning and straggler mitigation.
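The sketch below captures the spirit of this layer-wise dynamic programming with a toy two-choice version (data vs. model parallelism per layer) and made-up cost numbers; the actual HyPar/AccPar cost models and partition types are more elaborate, but the structure of the search is the same: each layer's choice pays an intra-layer cost plus an inter-layer transition cost that depends on its neighbor's choice.

```python
# Toy layer-wise dynamic program: pick one partition type per layer to minimize
# total (intra-layer + inter-layer) communication cost. All costs are illustrative.
INTRA = [  # intra-layer cost per layer, per partition type
    {"data": 1.0, "model": 6.0},   # conv-like layer: small weights, big activations
    {"data": 2.0, "model": 5.0},
    {"data": 9.0, "model": 1.5},   # FC-like layer: big weights, small activations
]
TRANS = {  # inter-layer cost for adjacent choices (prev_type, cur_type)
    ("data", "data"): 0.0, ("model", "model"): 0.5,
    ("data", "model"): 2.0, ("model", "data"): 2.0,
}

def best_partition(intra, trans):
    types = ["data", "model"]
    # dp[t] = (cost of the best plan so far ending with type t, that plan)
    dp = {t: (intra[0][t], [t]) for t in types}
    for layer in intra[1:]:
        dp = {
            t: min(
                ((cost + trans[(p, t)] + layer[t], plan + [t]) for p, (cost, plan) in dp.items()),
                key=lambda e: e[0],
            )
            for t in types
        }
    return min(dp.values(), key=lambda e: e[0])

cost, plan = best_partition(INTRA, TRANS)
print(cost, plan)   # a hybrid plan: data parallelism early, model parallelism late
```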
In summary, I believe that efficient machine learning accelerator design requires a systematic and structured approach. A systematic method is the key to tackling the complex interactions across the vertical layers of the stack, while structured model pruning provides a unique sweet spot for co-optimizing algorithms and hardware.