Open Access System for Information Sharing


Towards Efficient Neural Network Inference with Model Compression

Title
Towards Efficient Neural Network Inference with Model Compression
Authors
변영훈 (Byun, Younghoon)
Date Issued
2024
Publisher
포항공과대학교 (Pohang University of Science and Technology)
Abstract
Algorithm-hardware co-design is becoming increasingly important as transformers, whose performance scales with model size, are deployed across diverse domains. The ever-increasing demand for deploying complex neural network models in practical applications calls for memory-efficient solutions that maximize memory bandwidth utilization even for compressed models.

In the first part of this thesis, we investigate the memory interface overheads caused by the irregular data access patterns that are prevalent in pruned DNN models. Building on state-of-the-art XOR-gate compression, we introduce a sparsity-aware memory interface architecture and a stacked XORNet scheme. These advances significantly reduce data imbalance and interface cost while maintaining high-speed pruned-DNN inference. Our experimental results show that the proposed algorithm-hardware co-design boosts effective bandwidth at reasonable hardware cost.

In the second part, we extend our investigation from fine-grained pruning to partially structured pruning, which drastically reduces local sparsity fluctuation. Although the earlier stacked XORNet method mitigates this fluctuation, the hardware overhead incurred by XOR-gate compression errors remains hard to ignore. We therefore propose patch-limited XOR-gate compression, partially structured transformer pruning, and bit-wise patch reduction, three techniques tailored to XOR-gate compression. They reduce the number of required patches, simplifying the decompressor architecture and minimizing correction effort. The resulting system reduces the error count, normalizes the error distribution, and achieves 23% higher effective bandwidth than the previous state-of-the-art design.

Our research underscores the importance of memory interface optimization for efficiently deploying pruned neural network models. Through comprehensive investigation and novel solutions, this thesis contributes cost-efficient, high-speed memory interface architectures that bridge the gap between advanced model compression techniques and hardware implementation, enabling the seamless integration of complex neural networks in practical applications.
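The abstract refers repeatedly to XOR-gate compression, in which a short code word is expanded through a fixed network of XOR gates (a linear map over GF(2)) and a small set of "patches" corrects the output bits the network cannot reproduce; limiting those patches is what simplifies the decompressor. As a rough reading aid, here is a minimal toy sketch of that general principle. Everything concrete in it, the sizes K and N, the random wiring matrix, and the brute-force encoder, is an illustrative assumption, not the thesis's actual stacked-XORNet or patch-limited design.

```python
import itertools
import numpy as np

# Toy illustration of the XOR-gate compression idea: a fixed XOR network
# expands a short code word into a longer bit vector, and "patches" flip
# the few positions the network cannot reproduce. Sizes and wiring are
# invented for the demo.

rng = np.random.default_rng(0)
K, N = 8, 24                                          # code bits -> output bits
A = rng.integers(0, 2, size=(N, K), dtype=np.uint8)   # XOR wiring over GF(2)

def xornet(code: np.ndarray) -> np.ndarray:
    """Decompressor: each output bit is the XOR of the code bits selected
    by one row of the wiring matrix (matrix-vector product mod 2)."""
    return (A @ code) % 2

def encode(target: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Encoder: brute-force the code word whose XOR-network output is
    closest to `target`; the remaining mismatches become patch indices."""
    best_code, best_patches = None, None
    for bits in itertools.product((0, 1), repeat=K):  # 2^K candidates
        code = np.array(bits, dtype=np.uint8)
        patches = np.flatnonzero(xornet(code) != target)
        if best_patches is None or len(patches) < len(best_patches):
            best_code, best_patches = code, patches
    return best_code, best_patches

def decode(code: np.ndarray, patches: np.ndarray) -> np.ndarray:
    """Run the XOR network, then flip the patched positions."""
    out = xornet(code)
    out[patches] ^= 1
    return out

# A sparse (pruned) bit pattern: mostly zeros, a few ones.
target = np.zeros(N, dtype=np.uint8)
target[[2, 7, 19]] = 1

code, patches = encode(target)
assert np.array_equal(decode(code, patches), target)
print(f"{N} bits -> {K} code bits + {len(patches)} patch indices")
```

Under this framing, the patch-limited approach described in the second part can be read as capping the number of patches at encoding time so the decompressor's correction logic stays small; practical encoders also solve GF(2) linear systems rather than enumerating all code words as this toy does.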
URI
http://postech.dcollection.net/common/orgView/200000805734
https://oasis.postech.ac.kr/handle/2014.oak/124018
Article Type
Thesis
Files in This Item:
There are no files associated with this item.

