Open Access System for Information Sharing

Department of Creative IT Engineering (창의IT융합공학과), Theses (Ph.D.)

Thesis

Discrete Representation Learning for Visual Image Generation

Title: Discrete Representation Learning for Visual Image Generation

Authors: 이도엽 (Doyup Lee)

Date Issued: 2022

Publisher: Pohang University of Science and Technology (포항공과대학교, POSTECH)

Abstract: This dissertation argues that learning discrete representations of images is a fundamental component of large-scale generative models for image generation. Because generating every pixel of an image directly incurs prohibitive computational cost, large-scale generative models need an efficient representation of images. We therefore propose the Residual-Quantized Variational Auto-Encoder (RQ-VAE) for learning discrete representations of visual images. Using a codebook of fixed size, RQ-VAE precisely approximates the feature map of an image and represents the image as a sequence of discrete code stacks while preserving its visual information. To generate images represented in this form, we propose RQ-Transformer, a generative model that learns to predict the code stacks of RQ-VAE. RQ-Transformer can be trained with either an autoregressive or a masked code stack modeling objective and, once trained, generates the codes of RQ-VAE with a decoding procedure matched to its training objective. In experiments, we show that RQ-Transformer can be trained on large-scale data and, using the discrete representations learned by RQ-VAE, achieves state-of-the-art image generation performance while requiring lower computational cost and sampling faster than previous approaches. In conclusion, combining efficient discrete representation learning of images with an effective generative model is essential for large-scale image generation.
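To make the idea of a "code stack" concrete, here is a minimal pure-Python sketch of residual quantization. It assumes a single fixed codebook shared across all quantization depths, nearest-neighbour lookup by squared Euclidean distance, and raster ordering of the feature map. The learned encoder and decoder, codebook training, and RQ-Transformer are all omitted, and the codebook values, the depth, and the helper names (`nearest_code`, `residual_quantize`, `quantize_feature_map`) are hypothetical illustrations, not taken from the thesis.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(codebook: Sequence[Vector], v: Sequence[float]) -> int:
    """Index of the codebook entry closest to v (the first one wins on ties)."""
    return min(range(len(codebook)), key=lambda k: squared_error(codebook[k], v))


def residual_quantize(
    codebook: Sequence[Vector], z: Sequence[float], depth: int
) -> Tuple[List[int], Vector]:
    """Approximate z with a stack of `depth` codes from one shared codebook.

    Each step quantizes the residual left over by the previous steps, and the
    approximation of z is the sum of the selected code vectors.
    """
    residual = list(z)
    approx = [0.0] * len(z)
    codes: List[int] = []
    for _ in range(depth):
        k = nearest_code(codebook, residual)
        codes.append(k)
        approx = [a + c for a, c in zip(approx, codebook[k])]
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes, approx


def quantize_feature_map(
    codebook: Sequence[Vector], feature_map: Sequence[Sequence[Vector]], depth: int
) -> List[List[int]]:
    """Map an H x W grid of feature vectors to a raster-ordered sequence of H*W code stacks."""
    return [residual_quantize(codebook, z, depth)[0] for row in feature_map for z in row]


if __name__ == "__main__":
    # Hypothetical 2-D codebook with K = 5 entries (index 0 is the zero vector).
    codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.25, 0.25]]

    codes, approx = residual_quantize(codebook, [1.2, 0.9], depth=3)
    print(codes)  # [3, 3, 0]

    # A hypothetical 2 x 2 "feature map" becomes a sequence of 4 code stacks of depth 2.
    fmap = [[[1.2, 0.9], [0.0, 1.0]], [[0.25, 0.25], [1.0, 0.0]]]
    print(quantize_feature_map(codebook, fmap, depth=2))  # [[3, 3], [2, 0], [4, 0], [1, 0]]
```

Each additional depth quantizes what the previous codes missed, so a stack of D codes drawn from a K-entry codebook can express up to K^D distinct combinations per position without enlarging the codebook. In this toy codebook, including the zero vector also guarantees that adding depth never increases the approximation error, since the nearest code is always at least as close to the residual as the zero vector is.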

URI: http://postech.dcollection.net/common/orgView/200000640918
https://oasis.postech.ac.kr/handle/2014.oak/117416

Article Type: Thesis

Files in This Item: There are no files associated with this item.
