Open Access System for Information Sharing

Department of Computer Science & Engineering (컴퓨터공학과) 3. Theses_Ph.D.

Thesis

Title: Data-Driven Dependency Parsing for Morphologically Rich and Head-Final Languages (복잡한 형태적 구성의 핵 후위 언어를 위한 데이터 기반 의존 구문 분석)

Authors: 이용훈

Date Issued: 2011

Publisher: Pohang University of Science and Technology (POSTECH)

Abstract: Dependency parsing has become one of the most active research topics since the CoNLL shared tasks adopted multilingual dependency parsing. Because it does not depend crucially on the position or form of words, dependency parsing provides a common parsing framework for typologically diverse languages. Although many language-independent models have been applied successfully to multiple languages through the shared tasks, previous research has pointed out that dependency parsing for morphologically rich languages remains challenging and needs further study. In this thesis, we propose one text chunking method and two data-driven dependency parsing methods for morphologically rich and head-final languages (MRHFLs). That is, we exploit both the morphological richness of morphologically rich languages (MRLs) and the head-final property of head-final languages (HFLs). Korean and Japanese are used to evaluate the proposed models.

First, a text chunking method using conditional random fields (CRFs) is proposed as a preprocessing step for parsing. Although a rule-based chunking model is predominantly used for Korean because of its simplicity and efficiency, it cannot handle exceptions to its rules. Compared to the rule-based model, the proposed model is more robust and accurate because CRFs can find the globally optimal solution from many correlated, overlapping features of the input sentence.

Second, we present a graph-based dependency parsing model using dynamic features. The dynamic features, which are extracted from the partial trees already built during bottom-up chart parsing, play an important role in finding the correct heads. Exploiting the head-final property of Korean, we propose a variant of the CYK parsing algorithm that finds the maximum spanning tree (MST) among all projective trees in O(n^3) time. When parsing head-final languages, this algorithm is more time-efficient than Eisner's algorithm.
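As a rough illustration of why the head-final property simplifies chart parsing, the following Python sketch finds the highest-scoring projective tree when every head lies to the right of its dependent. It is a minimal sketch under stated assumptions, not the thesis's algorithm: the function name `parse_head_final`, the `score[d][h]` arc-score matrix, and the static arc-factored scoring (no dynamic features) are all hypothetical choices made for the example.

```python
from typing import List, Tuple


def parse_head_final(score: List[List[float]]) -> Tuple[float, List[int]]:
    """Best projective tree in which every head is to the right of its dependent.

    score[d][h] is a hypothetical score for the arc "word d depends on word h"
    (only entries with d < h are read). Returns (total score, heads), where
    heads[d] is the head index of word d and the last word (the root) gets -1.
    """
    n = len(score)
    if n == 0:
        return 0.0, []

    neg_inf = float("-inf")
    # best[i][j]: best score of a subtree covering words i..j, headed by word j.
    best = [[neg_inf] * n for _ in range(n)]
    split = [[-1] * n for _ in range(n)]
    for i in range(n):
        best[i][i] = 0.0

    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # Word k heads the leftmost sub-span i..k and attaches to j;
            # the remaining words k+1..j form a subtree that is also headed by j.
            for k in range(i, j):
                cand = best[i][k] + score[k][j] + best[k + 1][j]
                if cand > best[i][j]:
                    best[i][j] = cand
                    split[i][j] = k

    # Recover the head of each word from the stored split points.
    heads = [-1] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i == j:
            continue
        k = split[i][j]
        heads[k] = j
        stack.append((i, k))
        stack.append((k + 1, j))
    return best[0][n - 1], heads
```

Because the head of a span is always its rightmost word, the chart needs only one entry per span, and the three nested loops give O(n^3) time. Eisner's algorithm, by contrast, keeps separate items for left-headed and right-headed spans.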
We also explain why graph-based approaches are seldom attempted for parsing MRLs and propose an automatically converted morpheme-level dependency representation as a solution.

Third, we propose a new transition-based dependency parsing model using a reinforced dependency check. Traditional transition-based parsing builds a dependency tree by sequentially determining the dependency between two words. Although it takes into account some surrounding context as well as the two target words, it is still local and greedy. If an earlier dependency decision is wrong, the error can propagate and degrade performance. To avoid incorrect dependency predictions, the proposed method reinforces the dependency check by using one additional candidate head, which helps confirm whether the two target words really have a dependency relation. We also use a passive confirmation strategy that makes the final decision from several predictions rather than a single one.

In experimental evaluation on the KAIST Treebank for Korean and the Kyoto Text Corpus for Japanese, the proposed models achieved state-of-the-art performance. We believe they can be readily applied to other MRHFLs and obtain comparable performance. Further improvement by integrating the proposed models with linguistic knowledge, such as semantic information, is left as future research.
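To make the "local and greedy" behaviour of conventional transition-based parsing concrete, the sketch below shows a generic stack-based baseline for head-final input. It is not the reinforced model proposed in the thesis (it has no additional candidate head and no passive confirmation); the function name `greedy_head_final_parse` and the `depends(d, h)` callback, which stands in for a trained pairwise classifier, are hypothetical.

```python
from typing import Callable, List


def greedy_head_final_parse(n: int, depends: Callable[[int, int], bool]) -> List[int]:
    """Greedy left-to-right parsing for a head-final sentence of n words.

    depends(d, h) is a hypothetical stand-in for a classifier that answers
    "does word d depend on word h?". Returns heads, where heads[d] is the head
    index of word d and the last word (the root) gets -1.
    """
    heads = [-1] * n
    stack: List[int] = []  # words that are still waiting for a head
    for j in range(n):
        is_last = j == n - 1
        # Attach waiting words to j while the classifier says so. The last word
        # is the root, so every word still waiting is attached to it.
        while stack and (is_last or depends(stack[-1], j)):
            heads[stack.pop()] = j
        stack.append(j)
    return heads
```

Each attachment is made once from a single pairwise prediction and is never revisited, so one wrong answer from `depends` fixes a wrong arc and can distort later decisions. This is the error propagation that the reinforced dependency check is meant to reduce.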

URI: http://postech.dcollection.net/jsp/common/DcLoOrgPer.jsp?sItemId=000001095209
https://oasis.postech.ac.kr/handle/2014.oak/1228

Article Type: Thesis

Files in This Item: There are no files associated with this item.
