Building a Korean Morphological Analyzer with XFST [Korean-language paper]

Category: free Korean-language papers | Editor: 金一 (teaching assistant) | Updated: 2017-04-27


This study aims to construct a new morphological analyzer that reflects clitics. It starts from the observation that some of the elements analyzed as affixes in existing morphological analyzers are clitics, which behave syntactically as words. Constructing such an analyzer involves two tasks: first, establishing a new part-of-speech system that accounts for the clitics previously treated as derivational affixes; second, implementing a morphological analyzer that reflects this clitic system.

The clitics were separated from the elements analyzed as derivational affixes by examining examples from a morphologically analyzed corpus of 1.6 million word segments. Specifically, an element tagged as an affix in the corpus is classified as a clitic in any of the following cases: ① an external element can be inserted between the root and the affix; ② the root forms a coordination with other external elements; ③ the root alone, excluding the affix, is modified by an external element; ④ the affix combines with a phrase or clause. Based on this classification, a new part-of-speech system that reflects the clitics was set up.

The Xerox Finite State Tool (XFST) was used to implement the morphological analyzer. An XFST-based morphological analyzer is built by composing a Lexicon Finite State Transducer (FST), which implements the dictionary and word formation rules, with a Rule FST, which implements various morphological alternations. The current Lexicon FST is made up of 108 sub-dictionaries, 3,700 entries, and 50 tag sets.
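
The four classification tests ① to ④ can be read as a simple decision rule: an element tagged as an affix counts as a clitic if at least one test holds. The following Python sketch illustrates that reading only. The `AffixEvidence` record, its field names, and the `classify` function are hypothetical names introduced here, and how each test is actually judged from corpus examples is not described in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffixEvidence:
    """Corpus observations for one element tagged as an affix (hypothetical fields)."""
    allows_insertion: bool     # test 1: an external element can sit between root and affix
    root_coordinates: bool     # test 2: the root coordinates with other external elements
    root_modified_alone: bool  # test 3: an external element modifies the root only
    attaches_to_phrase: bool   # test 4: the affix combines with a phrase or clause


def classify(evidence: AffixEvidence) -> str:
    """Return 'clitic' if any of the four tests holds, otherwise 'affix'."""
    tests = (
        evidence.allows_insertion,
        evidence.root_coordinates,
        evidence.root_modified_alone,
        evidence.attaches_to_phrase,
    )
    return "clitic" if any(tests) else "affix"
```

For example, `classify(AffixEvidence(False, False, False, True))` labels an element that attaches to a whole phrase as a clitic, while an element that passes none of the tests stays an affix.
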
The Rule FST was created by defining morphotactic filter rules that constrain morpheme combination, together with phonological and orthographic alternation rules covering irregular verbs, deletion, and abbreviation in Korean.

The resulting analyzer successfully distinguishes derivational affixes from clitics, a distinction overlooked by existing morphological analyzers. Furthermore, the clitics are analyzed as clitic adnominals, clitic nouns, and clitic adjectives according to their distributional characteristics. The analyzer also supports the other operations required of a Korean morphological analysis system: separating postpositions and endings from the stem, restoring the omitted copula ‘i’, separating suffixes, restoring the original form of an irregular verb before alternation, and handling number expressions, including cardinals and ordinals as well as numbers. The Korean morphological analyzer that reflects the clitic system can therefore perform a sophisticated analysis of the elements considered to be words in Korean.
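
To make the division of labor between a lexicon and an alternation rule concrete, here is a toy Python sketch of copula restoration. It is not XFST code and not the analyzer described above: the two-noun lexicon, the tag names `+Noun`, `+Cop`, `+End`, and all function names are assumptions made for illustration. The lexicon side pairs an analysis string with a lexical form (noun + copula 이 + ending 다), the rule side optionally drops 이 after a noun that ends in a vowel, and analysis is the inverse lookup from surface form to analysis.

```python
HANGUL_FIRST = 0xAC00  # first precomposed Hangul syllable (가)
HANGUL_LAST = 0xD7A3   # last precomposed Hangul syllable (힣)

NOUNS = ["학교", "책"]  # toy lexicon: "school" (vowel-final), "book" (consonant-final)


def ends_in_vowel(word: str) -> bool:
    """True if the last character is a Hangul syllable without a final consonant."""
    code = ord(word[-1])
    if not HANGUL_FIRST <= code <= HANGUL_LAST:
        return False
    # Each initial+vowel pair spans 28 code points; offset 0 means no final consonant.
    return (code - HANGUL_FIRST) % 28 == 0


def surface_forms(noun: str) -> set[str]:
    """Rule side: the copula 이 may be dropped after a vowel-final noun."""
    forms = {noun + "이다"}
    if ends_in_vowel(noun):
        forms.add(noun + "다")
    return forms


def build_analyzer() -> dict[str, list[str]]:
    """Combine lexicon and rule, then invert: surface form -> list of analyses."""
    table: dict[str, list[str]] = {}
    for noun in NOUNS:
        analysis = f"{noun}+Noun 이+Cop 다+End"  # lexicon side (hypothetical tags)
        for surface in surface_forms(noun):
            table.setdefault(surface, []).append(analysis)
    return table


ANALYZER = build_analyzer()


def analyze(surface: str) -> list[str]:
    """Return every analysis whose surface realization matches the input."""
    return ANALYZER.get(surface, [])
```

With this table, `analyze("학교다")` and `analyze("학교이다")` both return the analysis that contains the copula, so the omitted 이 is restored, while `analyze("책다")` returns an empty list because the toy rule does not allow deletion after a consonant-final noun. A real finite-state implementation would express the same relation as composed transducers rather than an enumerated table.
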
