Open Access System for Information Sharing


Title
Architectural Support for Heterogeneous System Programming
Authors
김영석 (Youngsok Kim)
Date Issued
2017
Publisher
포항공과대학교 (Pohang University of Science and Technology)
Abstract
Heterogeneous systems consisting of several types of processors have become prevalent. Today, almost every desktop is equipped with a graphics processing unit (GPU), which provides significantly higher throughput than central processing units (CPUs) and can accelerate not only graphics processing but also general-purpose computing. To exploit the high computational throughput of GPUs on heterogeneous systems, programmers rewrite their CPU code to use GPU-specific application programming interfaces (APIs) such as DirectX and OpenGL for graphics processing, and CUDA and OpenCL for general-purpose computing. Although rewriting CPU code merely to guarantee functional correctness may not be difficult, doing so in a way that fully exploits the high computational bandwidth of GPUs is non-trivial due to the characteristics of GPU-specific APIs and microarchitectures. First, GPU-specific APIs require programmers to explicitly manage GPU memory. This hurts programmability, as programmers must write GPU memory management code that is absent from CPU code; it also hurts performance, because memory management code and computation code must execute serially to guarantee functional correctness. Second, current approaches to scaling single-GPU applications across multiple GPUs fail to efficiently utilize the abundant resources of multiple GPUs, making it difficult for programmers to rely on code optimized for single-GPU systems to achieve optimal multi-GPU performance. This forces programmers to re-optimize (and, if needed, rewrite) their GPU code to fully exploit the higher performance potential of multi-GPU systems. Thus, there is a great need for software/hardware support that achieves CPU-like programmability, freeing programmers from explicit memory management and optimizing GPU code under the hood for both single- and multi-GPU systems. 
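The extra management code described above can be illustrated with a minimal sketch. This is not code from the thesis: it models a GPU's separate memory with plain Python stand-ins (all function names here are hypothetical, loosely mirroring CUDA calls such as cudaMalloc and cudaMemcpy) to show how the same computation grows once explicit allocation, copy-in, and copy-out steps are required.

```python
# Hedged sketch (not the thesis's code): models the memory-management
# burden that GPU-specific APIs impose, using plain Python stand-ins.

def cpu_style_saxpy(a, xs, ys):
    # CPU code: operate on the data directly, with no management code.
    return [a * x + y for x, y in zip(xs, ys)]

# --- GPU-API style: the same computation wrapped in explicit steps ---

_device_memory = {}          # stands in for the GPU's separate memory

def device_alloc(name, n):
    _device_memory[name] = [0.0] * n        # cf. cudaMalloc (hypothetical)

def copy_to_device(name, host_data):
    _device_memory[name] = list(host_data)  # cf. cudaMemcpy host-to-device

def copy_to_host(name):
    return list(_device_memory[name])       # cf. cudaMemcpy device-to-host

def gpu_style_saxpy(a, xs, ys):
    # Management code absent from the CPU version; note the transfers
    # serialize with the computation to guarantee functional correctness.
    device_alloc("x", len(xs))
    device_alloc("y", len(ys))
    copy_to_device("x", xs)
    copy_to_device("y", ys)
    _device_memory["y"] = [a * x + y for x, y in
                           zip(_device_memory["x"], _device_memory["y"])]
    return copy_to_host("y")
```

Both versions compute the same result, but the GPU-style version devotes most of its lines to moving data rather than computing on it, which is the programmability gap the thesis targets.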
In this thesis, we propose two novel GPU architectures to address the poor programmability of GPU-specific APIs. First, we introduce GPUdmm, a novel GPU architecture that eliminates the need for GPU memory management code and the serialized execution of memory management and computation. GPUdmm uses GPU memory as a cache of CPU memory, giving programmers a view of a CPU memory-sized programming space. It achieves high performance by exploiting data locality and dynamically transferring data between CPU and GPU memories while effectively overlapping CPU-GPU data transfers with GPU execution, and it can further reduce unnecessary CPU-GPU data transfers by exploiting simple programmer hints. Our carefully designed and validated experiments (e.g., PCIe/DMA timing) on representative benchmarks show that GPUdmm can achieve up to five times higher performance for the same GPU memory size, or reduce the GPU memory size requirement by up to 75% while maintaining the same performance. Second, we propose a novel multi-GPU architecture called GPUpd that achieves scalable multi-GPU graphics processing without any modifications to single-GPU code. With a small hardware extension, GPUpd introduces a new graphics pipeline stage called Cooperative Projection & Distribution, in which all GPUs cooperatively project 3D objects onto the 2D screen and efficiently redistribute the objects to their corresponding GPUs. To minimize the redistribution overheads, GPUpd optimizes inter-GPU communication by batching and runahead-executing draw commands. We evaluate GPUpd with eight real-world game traces on an extended version of the ATTILA simulator. Without requiring any application code modification, GPUpd achieves a geomean speedup of 4.98x in single-frame latency on a 16-GPU system, whereas the state-of-the-art multi-GPU architecture achieves only a 3.07x geomean speedup, which saturates at 4 or more GPUs.
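The core caching idea behind GPUdmm, as described in the abstract, can be sketched in a few lines: GPU memory holds recently used blocks of CPU memory, data is fetched only on demand, and locality keeps CPU-GPU transfers low. This is a conceptual model only; the block size, LRU eviction policy, and all names below are illustrative assumptions, not the thesis's actual mechanism.

```python
# Hedged sketch of "GPU memory as a cache of CPU memory" (GPUdmm's idea
# per the abstract). Block size, LRU eviction, and names are assumptions.
from collections import OrderedDict

class GpuMemoryAsCache:
    def __init__(self, cpu_mem, capacity_blocks, block=4):
        self.cpu_mem = cpu_mem        # backing store (CPU memory)
        self.block = block            # transfer granularity, in elements
        self.capacity = capacity_blocks
        self.cache = OrderedDict()    # block id -> data, kept in LRU order
        self.transfers = 0            # CPU-to-GPU copies performed

    def read(self, addr):
        bid = addr // self.block
        if bid not in self.cache:     # miss: fetch the block on demand
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict LRU block
            start = bid * self.block
            self.cache[bid] = self.cpu_mem[start:start + self.block]
            self.transfers += 1
        self.cache.move_to_end(bid)   # mark block most-recently-used
        return self.cache[bid][addr % self.block]

cpu_mem = list(range(64))
gmem = GpuMemoryAsCache(cpu_mem, capacity_blocks=2)
for _ in range(3):                    # a kernel with data locality:
    for addr in range(8):             # it touches only blocks 0 and 1
        gmem.read(addr)
# 24 reads, but only 2 CPU-to-GPU block transfers
```

The programmer sees the full CPU-memory-sized address space (here, all 64 elements) while only the working set occupies GPU memory, which is the view GPUdmm aims to provide without explicit management code.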
URI
http://postech.dcollection.net/jsp/common/DcLoOrgPer.jsp?sItemId=000002328163
https://oasis.postech.ac.kr/handle/2014.oak/93529
Article Type
Thesis
Files in This Item:
There are no files associated with this item.


