Open Access System for Information Sharing

Thesis
Full metadata record
Files in This Item:
There are no files associated with this item.
dc.contributor.author: 권은지
dc.date.accessioned: 2024-08-23T16:30:53Z
dc.date.available: 2024-08-23T16:30:53Z
dc.date.issued: 2024
dc.identifier.other: OAK-2015-10580
dc.identifier.uri: http://postech.dcollection.net/common/orgView/200000809131 (ko_KR)
dc.identifier.uri: https://oasis.postech.ac.kr/handle/2014.oak/123970
dc.description: Doctor
dc.description.abstract: This thesis delves into two significant aspects of software and hardware optimizations for energy-efficient deep learning systems. The first focus is optimization for AI, which entails developing better hardware systems to enhance AI algorithms. The second focus is optimization by AI, which involves using AI techniques to improve AI model performance. More specifically, this research proposes several post-training quantization (PTQ) methods and their dedicated hardware (HW) optimizations for energy-efficient deep learning systems, specifically for vision transformers.

The transformer, one family of deep learning models, is increasingly used not only in natural language processing but also in computer vision. Compared to convolutional neural networks (CNNs), transformers achieve higher performance with far fewer parameters in computer vision tasks such as image classification and object detection. However, their high memory and computational complexity makes transformer models challenging to deploy.

Techniques such as pruning and model variants are commonly used to reduce parameters while maintaining or improving model accuracy. For deploying deep learning models on resource-constrained devices, or for efficient large-scale deployment, PTQ has also emerged as a crucial technique for compressing neural networks without sacrificing performance. PTQ is advantageous because it reduces a model's memory footprint and computational requirements without retraining, and it can be applied directly to pretrained models, simplifying the optimization process (the first sketch following this record illustrates the basic idea). However, existing PTQ methods often struggle to maintain accuracy, especially when applied to highly pruned models or to model variants that combine convolutional and transformer layers.

For example, after pruning more than 50% of a model's parameters, subsequent quantization without retraining often causes noticeable performance degradation. Furthermore, quantizing models that mix various types of convolution layers and attention layers is more challenging than quantizing standard convolutional or transformer architectures. To overcome these hurdles, this thesis employs advanced PTQ techniques: 1) dynamic fixed-point quantization and 2) arbitrary mixed-precision quantization (both are sketched after this record).

Despite these optimizations, accelerating the refined models on GPUs remains inefficient, because GPUs are typically optimized for standard precisions such as INT4, INT8, and INT16 and support mixed precision only at these power-of-two levels. To address these limitations and enable efficient computation of lightweight models in hardware, we pursued two main approaches: 1) a custom HW accelerator for the optimized models and 2) efficient mapping of arbitrarily quantized models onto a High Bandwidth Memory Processing-In-Memory (HBM-PIM) architecture.

This dissertation accordingly presents two research topics on enabling efficient computation of lightweight models after PTQ in hardware. First, we developed HW-friendly transformer compression methods, which include structured line-wise pruning and dynamic post-training quantization, together with a custom accelerator for the optimized models that contains an efficient dynamic fixed-point arithmetic unit supporting dynamic quantization. As a result, the proposed accelerator improves the energy efficiency of Detection Transformer (DETR) and Vision Transformer (ViT) models by 2.9× and 12.3×, respectively, compared to mobile GPUs, and by 3.0× and 12.4×, respectively, compared to mobile CPUs.

Second, we proposed an autonomous mixed-precision quantization framework based on reinforcement learning (RL). The framework uses RL to group layers by precision and to determine quantization configurations. To efficiently accelerate the resulting arbitrary mixed-precision models, we used the HBM-PIM architecture, which supports bit-serial row-parallel computation and is therefore particularly well suited to the diverse precision levels required by mixed-precision quantization. The arbitrary mixed-precision models improve energy efficiency by approximately 10× compared to the baseline model on the HBM-PIM, with less than a 0.5% accuracy drop.
dc.language: eng
dc.publisher: 포항공과대학교 (Pohang University of Science and Technology, POSTECH)
dc.title: Software and Hardware Optimization for Energy-Efficient Deep Learning Systems
dc.type: Thesis
dc.contributor.college: 전자전기공학과 (Department of Electrical Engineering)
dc.date.degree: 2024-8
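
The abstract describes post-training quantization (PTQ) as reducing a model's memory footprint and compute cost without retraining. The thesis's own PTQ methods are not reproduced on this page; purely as a minimal illustration of the basic idea, the following NumPy sketch applies uniform symmetric quantization to a stand-in "pretrained" weight tensor. The function name, bit width, and weights here are hypothetical, not taken from the thesis.

    import numpy as np

    def quantize_symmetric(w, n_bits=8):
        """Uniform symmetric post-training quantization of one tensor.

        The scale is derived from the pretrained weights alone, so no
        retraining is required (in practice a small calibration set is
        often used for activation ranges).
        """
        qmax = 2 ** (n_bits - 1) - 1                       # e.g. 127 for INT8
        scale = max(float(np.abs(w).max()), 1e-12) / qmax  # per-tensor scale
        q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
        return q, scale

    # Hypothetical pretrained weights standing in for a real layer.
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.05, size=(64, 64)).astype(np.float32)

    q, scale = quantize_symmetric(w, n_bits=8)
    w_hat = q.astype(np.float32) * scale                   # dequantized approximation
    print("max abs quantization error:", float(np.abs(w - w_hat).max()))

Because the only statistic needed is taken from the trained weights themselves, this can run directly on an existing checkpoint, which is the property the abstract highlights.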

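One of the two PTQ techniques named in the abstract is dynamic fixed-point quantization. The sketch below assumes the common formulation in which all values in a tensor share one fractional length (a shared exponent) chosen dynamically from the tensor's current range; the thesis's actual scheme and its hardware arithmetic unit may differ in detail.

    import numpy as np

    def dynamic_fixed_point(x, word_bits=8):
        """Quantize a tensor to fixed point with a dynamically chosen
        fractional length shared by every value in the tensor.

        Because the exponent is shared per tensor, the compute path can
        remain pure integer MAC hardware plus one shift per tensor.
        """
        qmax = 2 ** (word_bits - 1) - 1
        max_abs = max(float(np.abs(x).max()), 1e-12)
        frac_bits = int(np.floor(np.log2(qmax / max_abs)))  # shared exponent
        step = 2.0 ** (-frac_bits)
        q = np.clip(np.round(x / step), -qmax - 1, qmax).astype(np.int32)
        return q, frac_bits

    # Hypothetical activations whose range is only known at run time.
    rng = np.random.default_rng(1)
    act = rng.normal(0.0, 2.0, size=(4, 16)).astype(np.float32)

    q, fl = dynamic_fixed_point(act, word_bits=8)
    print("fractional bits chosen:", fl)
    print("max abs error:", float(np.abs(act - q * 2.0 ** (-fl)).max()))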

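The abstract also notes that the HBM-PIM architecture supports bit-serial row-parallel computation, which is what makes arbitrary (non power-of-two) precisions cheap to execute. The following sketch is a functional model of bit-serial arithmetic only, not of the HBM-PIM hardware: it evaluates an integer matrix-vector product one weight bit plane at a time, so the number of passes grows linearly with the chosen bit width.

    import numpy as np

    def bit_serial_matvec(w_q, x, w_bits):
        """Bit-serial evaluation of w_q @ x for unsigned integer weights.

        Each pass consumes one 1-bit plane of the weights, so a 5-bit
        model costs 5 passes instead of being padded up to INT8.
        """
        acc = np.zeros(w_q.shape[0], dtype=np.int64)
        for b in range(w_bits):
            plane = (w_q >> b) & 1        # 0/1 weight plane for bit b
            acc += (plane @ x) << b       # shift-and-accumulate
        return acc

    rng = np.random.default_rng(2)
    w_bits = 5                            # arbitrary, non power-of-two width
    w_q = rng.integers(0, 2 ** w_bits, size=(8, 32), dtype=np.int64)
    x = rng.integers(-8, 8, size=32, dtype=np.int64)

    assert np.array_equal(bit_serial_matvec(w_q, x, w_bits), w_q @ x)
    print(bit_serial_matvec(w_q, x, w_bits))

This linear scaling with bit width is the efficiency argument behind pairing arbitrary mixed-precision models with a bit-serial PIM substrate.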

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
