Open Access System for Information Sharing

Department of Computer Science & Engineering (컴퓨터공학과) 3. Theses_Ph.D.

Thesis

Neural Automatic Post-Editing for Machine Translation

Title: Neural Automatic Post-Editing for Machine Translation

Authors: 이원기

Date Issued: 2022

Publisher: 포항공과대학교 (Pohang University of Science and Technology, POSTECH)

Abstract: Automatic Post-Editing (APE) is the task of automatically correcting errors in the output of a black-box machine translation (MT) system to improve translation quality. APE can also be framed as a multi-source sequence-to-sequence problem, since it takes both a source text (src) and an MT output (mt) as input and produces a post-edited text (pe). This dissertation presents two research directions for APE. First, we present a method for constructing an APE model based on the Transformer network. We propose a joint representation for encoding mt, in which each mt token is bound to its corresponding src context so that the model can capture the dependency between src and mt. We also propose a multi-source attention strategy that enables the decoder to attend to the src context both locally and globally. Experimental results show that the proposed method improves APE performance. Second, we focus on generating synthetic APE data. We introduce back-APE, a method that adapts the back-translation technique to APE so that the resulting synthetic data simulate the error statistics observed in genuine APE data. Moreover, we apply random sampling in the decoding process of back-APE to maintain output diversity. Finally, we demonstrate that our synthetic data are significantly beneficial for APE training and outperform eSCAPE, the currently dominant synthetic data set.
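Automatic Post-Editing (APE) for machine translation is a line of research that aims to correct the errors contained in the output of a machine translation system and thereby produce translations of improved quality. Because the post-edited text is generated by considering the source text and the translation together, APE can also be cast as a multi-source sequence-to-sequence problem. This dissertation presents two research directions aimed at improving APE performance. The first concerns modeling: we present a way to modify the Transformer network architecture into a form suited to APE. Considering that a translation originates from its source text, we propose building a joint representation in which the representation of each word in the translation also includes the source context corresponding to that word. We further propose a multi-source attention strategy that takes both the global and the local context of the source text into account when generating the post-edited text. We confirmed that the Transformer-based modeling method presented in this dissertation improves APE performance. The second direction addresses data scarcity: we present a method for constructing synthetic APE data from parallel corpora. We introduce the so-called back-APE method, which applies back-translation to this problem. In addition, we apply random sampling to secure the diversity of the synthetic data generated by the back-APE model. The synthetic data generated with the proposed method substantially improved model performance and, moreover, substantially outperformed eSCAPE, the synthetic data set that had been widely used until then.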
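
The abstract's point about random sampling and output diversity can be illustrated with a small sketch. This is not the dissertation's back-APE model: the vocabulary, the per-step scores in `step_logits`, and the function names are hypothetical, and the scores are fixed per position, whereas a real decoder would condition each step on the tokens generated so far. The sketch only contrasts greedy decoding, which always returns the same sequence, with sampling from the softmax distribution, which can return different sequences on different runs.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(step_logits, vocab):
    """Pick the highest-scoring token at every step (deterministic)."""
    return [vocab[max(range(len(l)), key=l.__getitem__)] for l in step_logits]

def sample_decode(step_logits, vocab, rng):
    """Draw each token at random according to its softmax probability."""
    out = []
    for logits in step_logits:
        probs = softmax(logits)
        out.append(rng.choices(vocab, weights=probs, k=1)[0])
    return out

# Hypothetical toy inputs: two decoding steps over a four-word vocabulary.
vocab = ["the", "a", "cat", "dog"]
step_logits = [[2.0, 1.5, 0.1, 0.1], [0.1, 0.1, 1.0, 0.9]]

rng = random.Random(0)  # seeded so that runs are repeatable
print("greedy :", greedy_decode(step_logits, vocab))
for _ in range(3):
    print("sampled:", sample_decode(step_logits, vocab, rng))
```

Greedy decoding collapses every input to a single output, while sampling spreads outputs across the plausible alternatives, which is the property the abstract relies on to keep the synthetic data diverse.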

URI: http://postech.dcollection.net/common/orgView/200000632601
https://oasis.postech.ac.kr/handle/2014.oak/117426

Article Type: Thesis

Files in This Item: There are no files associated with this item.
