Automatic Generation of Sub-word Recognition Units for an Unlimited Vocabulary Korean Continuous Speech Recognition System

This thesis proposes a method for automatically generating sub-word recognition units for an unlimited-vocabulary Korean continuous speech recognition system. Conventional Korean continuous speech recognition systems typically use either word-based (full-word, i.e. eojeol) or morpheme-based (sub-word) recognition units. The main disadvantage of both is that combinations of the units cannot cover every word in the Korean lexicon. The proposed method starts from an initial set of sub-word units consisting of all possible Korean syllables. It then counts how often each pair of current sub-word units occurs in the training data, merges the most frequent pair into a new unit, and adds that unit to the current set. This process is repeated until the number of sub-word units reaches a predefined limit.
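
The merging procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the thesis's implementation: the function names `learn_subword_units` and `merge_pair` are hypothetical, pairs are counted only within space-delimited words, ties are broken lexicographically, and the initial set is seeded with the syllables seen in the corpus rather than all possible Korean syllables.

```python
from collections import Counter


def merge_pair(word, a, b):
    """Replace each adjacent occurrence of (a, b) in a unit sequence with a + b."""
    out = []
    i = 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def learn_subword_units(corpus, max_units):
    """Greedily grow a sub-word unit set by merging the most frequent pair.

    corpus: list of words; each Hangul character is treated as one syllable.
    max_units: stop once the unit set reaches this size (the pre-defined limit).
    Returns (units, merges), where merges lists the merged pairs in order.
    """
    # Every word starts as a sequence of single-syllable units.
    words = [list(w) for w in corpus]
    # Assumption: seed with syllables observed in the corpus only.
    units = {syllable for w in words for syllable in w}
    merges = []
    while len(units) < max_units:
        # Count adjacent unit pairs inside each word (assumed boundary).
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break  # every word is already a single unit
        # Assumption: break frequency ties lexicographically for determinism.
        best_count = max(pairs.values())
        a, b = min(p for p, c in pairs.items() if c == best_count)
        merges.append((a, b))
        units.add(a + b)
        words = [merge_pair(w, a, b) for w in words]
    return units, merges
```

For example, calling `learn_subword_units(["학교", "학교", "학생", "교실"], 5)` starts from four syllables and performs one merge, of 학 and 교, because that pair occurs twice while the other pairs occur once. Since single syllables always remain in the unit set, any word made of seeded syllables can still be spelled out from the units, which is consistent with the coverage motivation stated above.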

System performance was analyzed on about 10,000 test sentences for three kinds of recognition units: full words, morphemes, and the proposed sub-words. Overall accuracy was 64.75%, 71.12%, and 70% for the full-word, morpheme, and proposed sub-word units, respectively. However, for sentences containing two or more out-of-vocabulary (OOV) words, the system based on the proposed sub-word units achieved the best accuracy. This result indicates that the proposed sub-word units are the most robust of the three in the presence of OOV words.
