Recently, many researchers have studied statistical machine translation in order to overcome the limitations of rule-based machine translation. Statistical machine translation is a method that translates an input document written in a source language using probabilities. In training, conditional probabilities between the words of the two languages are estimated from parallel corpora, which consist of pairs of sentences written in different languages but having the same meaning, and context probabilities are estimated from the target language. Good translation quality requires a large amount of parallel corpora, but collecting parallel corpora manually takes a great deal of time. Bilingual corpora, by contrast, are very easy to collect, so sentence alignment is needed to convert bilingual corpora into parallel corpora automatically.
Sentence alignment is the task of finding corresponding sentences between two documents written in different languages. The traditional approach is the length-based method, which relies only on the fact that the lengths of aligned sentences in the source and target languages are highly correlated. It therefore cannot guarantee that the aligned sentences have the same meaning. To solve this problem, the lexical-based method, which uses lexical information within the input documents, was proposed. However, this method is much slower than the length-based method, and it cannot guarantee good results for language pairs with different structures, such as Korean and English. To address this, other approaches use a bilingual dictionary instead of lexical information within the input documents. This method, however, cannot guarantee good results when multiple words in the source language correspond to a single word in the target language, or vice versa.
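As an illustration of the length-based idea described above, the following is a minimal sketch of a dynamic-programming aligner that scores candidate beads only by sentence length. The bead inventory, the penalty values in `BEAD_PENALTY`, and the relative-difference cost in `length_cost` are illustrative assumptions, a simplified stand-in for the probabilistic cost used in practical length-based aligners, and are not taken from this work.

```python
from typing import List, Tuple

# Assumed bead types (source sentences, target sentences) with illustrative
# fixed penalties; these numbers are placeholders, not values from the text.
BEAD_PENALTY = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}

Bead = Tuple[List[int], List[int]]


def length_cost(src_len: int, tgt_len: int) -> float:
    """Relative length mismatch in [0, 1]; 0 when both sides have equal length."""
    total = src_len + tgt_len
    return abs(src_len - tgt_len) / total if total else 0.0


def align_by_length(src: List[str], tgt: List[str]) -> List[Bead]:
    """Return beads as (source indices, target indices) with minimal total cost."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Only character lengths are compared; meaning is never inspected.
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end of both documents to the start.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Because the cost depends on lengths alone, two unrelated sentences of similar length are scored as a good match, which is the weakness noted above.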
In this work, to solve the problems of previous sentence alignment methods, we propose a new method that combines the length-based method with lexical information. The proposed method is as follows: (1) We translate the source document and the target document into English using an existing machine translation system. (2) We apply a monolingual sentence alignment method, in which lexical information is used instead of the case penalty of beads. (3) We convert the result of (2) back into the original source language and target language.
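The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not specify how lexical information is scored, so the Jaccard word overlap in `jaccard`, the bead inventory in `MOVES`, and the cost weighting are all hypothetical choices, and `translate_src` and `translate_tgt` are placeholder callables standing in for the existing machine translation system.

```python
from typing import Callable, List, Tuple

Bead = Tuple[List[int], List[int]]
MOVES = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]  # assumed bead types


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; an assumed proxy for lexical information."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def align_monolingual(src_en: List[str], tgt_en: List[str]) -> List[Bead]:
    """Step (2): align two English texts, scoring beads lexically, not by a fixed penalty."""
    n, m = len(src_en), len(tgt_en)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in MOVES:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                sim = jaccard(" ".join(src_en[i:ni]), " ".join(tgt_en[j:nj]))
                # Weight by the number of sentences covered so that large
                # beads are not favored merely for being fewer in number.
                c = cost[i][j] + (di + dj) * (1.0 - sim)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


def align_via_english(
    src: List[str],
    tgt: List[str],
    translate_src: Callable[[str], str],
    translate_tgt: Callable[[str], str],
) -> List[Tuple[List[str], List[str]]]:
    # Step (1): translate both documents into English (hypothetical MT callables).
    src_en = [translate_src(s) for s in src]
    tgt_en = [translate_tgt(t) for t in tgt]
    # Step (2): monolingual alignment on the English versions.
    beads = align_monolingual(src_en, tgt_en)
    # Step (3): map the aligned indices back to the original sentences.
    return [([src[i] for i in si], [tgt[j] for j in tj]) for si, tj in beads]
```

Since the alignment is expressed as sentence indices, step (3) needs no reverse translation in this sketch; the indices simply select the original sentences.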
As a result, in sentence alignment between Korean and English, the proposed method achieves an F-1 measure of 96.20%, which is higher than all previous methods. In addition, to demonstrate the generality of the method, we experimented on multilingual language pairs, 34 pairs in total. In this experiment, our method scores about 2.27% higher on average than previous length-based methods.
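For reference, the F-1 measure is the harmonic mean of precision and recall. The sketch below computes it over sets of aligned beads, assuming that a predicted bead counts as correct only when it exactly matches a gold bead; the text does not state the matching granularity used in the experiments, so this is an assumption, and the function name `f1_score` is illustrative.

```python
from typing import Iterable, Tuple

# A bead is a pair of index tuples: (source sentence ids, target sentence ids).
Bead = Tuple[Tuple[int, ...], Tuple[int, ...]]


def f1_score(predicted: Iterable[Bead], gold: Iterable[Bead]) -> float:
    """Harmonic mean of precision and recall over exactly matching beads."""
    pred, ref = set(predicted), set(gold)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

For example, with three gold beads and two predicted beads of which one is correct, precision is 1/2 and recall is 1/3, giving an F-1 of 0.4.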