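Against the backdrop of globalization, activities in education, the economy, and culture now extend across national borders. On the one hand, the rapid growth of the Internet is driving globalization forward; on the other hand, language may be becoming the last barrier to it. Automatic language identification has been developing steadily, if unspectacularly, against this background.

The author surveyed in detail the domestic and international research on automatic text classification, machine translation, multilingual information retrieval, and related fields. The field broadly agrees that language identification can be treated as a special case of "classifying text based on certain features." From the appearance of the Bayesian probabilistic classifier in the 1960s to the present, text classification research has passed through roughly three stages: automatic text classification, human-assisted classification, and machine learning. Statistical classification algorithms such as KNN, decision trees, the Rocchio algorithm, naive Bayes, support vector machines, maximum entropy models, genetic algorithms, and neural networks all perform well in current research on automatic text classification. Machine translation, one of the key research areas of machine learning, serves as the core module of the vast majority of current multilingual information retrieval systems. It relies on dictionaries, corpora, and ontologies, and on freely available tools built on them, such as Google's online translation, the Internet Passport MT System, and the Online WorldLingo MT System, to connect query terms with candidate documents written in many languages. Multilingual automatic identification, as a precursor to machine translation, is a research area that is widely overlooked today yet has an important influence on multilingual retrieval results.

The problems faced by automatic language identification belong less to text classification than to natural language processing. The multilingual identification program implemented in this thesis is built on the well-known N-Gram theory from natural language processing. N-Gram is a probabilistic statistical language model, described in the thesis as a first-order Markov chain (strictly, an N-Gram model is a Markov chain of order N-1, so the first-order case is the bigram model). The theory is applied mainly to part-of-speech tagging, phonetic-to-character conversion, and spoken language recognition. In speech recognition in particular, it is regarded as the most successful approach to building fast and accurate recognition systems. This thesis applies it to identifying the language of text. The study covers the seven languages most widely used on the Internet: Chinese, English, French, German, Russian, Japanese, and Korean. The identification experiment has two phases, training a multilingual corpus and identifying the language, and both the training and the test texts come from the Open Directory Project. The results show that the program's average accuracy over long and short texts is highest for English and German, at 100% each, followed by Russian at 94.44%, Simplified Chinese at 94.44%, Traditional Chinese at 83.33%, French at 83.33%, and Korean at 16.67%. If the influence of Chinese word features is excluded, Korean can be identified accurately. The experiment then took two encodings common in Japanese text, EUC-JP and SHIFT-JIS, and, following the same training and identification steps, ran an exploratory check of how well N-Gram theory works for encoding identification, with encouraging results: EUC-JP and SHIFT-JIS were identified correctly in 85% and 95% of cases respectively, with identification error rates below 0.0020 for both. Using N-Gram theory for encoding identification is one of the highlights of this thesis.

The author then introduces the full-text retrieval framework Lucene 3.5 and, with reference to its core code, describes how the indexing and search modules relevant to multilingual identification work and analyzes the built-in Analyzer classes. Based on the interfaces of the indexing and search modules, the identification program was adjusted in detail: results for Simplified and Traditional Chinese are both returned as the type "Chinese", and results for Japanese and Korean are both returned as the type "CJK". The program is thereby extended into a multilingual automatic identification module for Lucene 3.5, with language identification inserted at both the index-building stage and the user-search stage, in the hope of helping Lucene support the development of cross-language retrieval systems and of smoothing the user's cross-language search experience. The author has not found other researchers working on this. Because of limits on space and time, the thesis gives only the design of the module and its interfaces; implementing a Lucene-based multilingual retrieval system is the task for the next stage of research.

Abstract: Against the backdrop of globalization, activities in education, the economy, and culture now extend across national borders. On the one hand, the rapid growth of the Internet is driving globalization forward; on the other hand, language may be becoming the last barrier to it. Automatic language identification has been developing steadily, if unspectacularly, against this background.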
The author surveyed in detail the domestic and international research on automatic text classification, machine translation, multilingual information retrieval, and related fields. The field broadly agrees that language identification can be treated as a special case of "classifying text based on certain features." From the appearance of the Bayesian probabilistic classifier in the 1960s to the present, text classification research has passed through roughly three stages: automatic text classification, human-assisted classification, and machine learning. Statistical classification algorithms such as KNN, decision trees, the Rocchio algorithm, naive Bayes, support vector machines, maximum entropy models, genetic algorithms, and neural networks all perform well in current research on automatic text classification. Machine translation, one of the key research areas of machine learning, serves as the core module of the vast majority of current multilingual information retrieval systems. It relies on dictionaries, corpora, and ontologies, and on freely available tools built on them, such as Google's online translation, the Internet Passport MT System, and the Online WorldLingo MT System, to connect query terms with candidate documents written in many languages. Multilingual automatic identification, as a precursor to machine translation, is widely overlooked today yet has an important influence on multilingual information retrieval results. The problems faced by automatic language identification belong less to text classification than to natural language processing. The multilingual identification program implemented in this thesis is built on the well-known N-Gram theory from natural language processing.
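The N-Gram idea that the identification program builds on can be illustrated with a minimal Python sketch. It assumes a rank-order ("out-of-place") comparison of character trigram profiles in the style of Cavnar and Trenkle; the abstract does not state which scoring rule the thesis actually uses, and the toy corpora, function names, and parameters below are illustrative assumptions, not the thesis's program.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_profile(text, n=3, top_k=300):
    """Training step: map each of the top_k most frequent n-grams to its rank."""
    counts = Counter(char_ngrams(text.lower(), n))
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams unknown to the language cost max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def identify(text, lang_profiles, n=3, top_k=300):
    """Identification step: pick the language whose profile is closest."""
    doc_profile = build_profile(text, n, top_k)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang], top_k),
    )

# Toy training corpora (assumed for illustration; far too small for real use).
TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "sat on the mat with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund "
          "und die katze sitzt auf der matte",
}
profiles = {lang: build_profile(text) for lang, text in TRAIN.items()}

print(identify("the dog and the cat", profiles))
print(identify("der hund und die katze", profiles))
```

Ranking N-Grams instead of storing raw probabilities keeps each language profile small, which is one reason this scheme is often used for short texts; a probability-based score would be an equally reasonable choice.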
N-Gram is a probabilistic statistical language model, described in the thesis as a first-order Markov chain (strictly, an N-Gram model is a Markov chain of order N-1, so the first-order case is the bigram model). The theory is applied mainly to part-of-speech tagging, phonetic-to-character conversion, and spoken language recognition. In speech recognition in particular, it is regarded as the most successful approach to building fast and accurate recognition systems. This thesis applies it to identifying the language of text. The study covers the seven languages most widely used on the Internet: Chinese, English, French, German, Russian, Japanese, and Korean. The identification experiment has two phases, training a multilingual corpus and identifying the language, and both the training and the test texts come from the Open Directory Project. The results show that the program's average accuracy over long and short texts is highest for English and German, at 100% each, followed by Russian at 94.44%, Simplified Chinese at 94.44%, Traditional Chinese at 83.33%, French at 83.33%, and Korean at 16.67%. If the influence of Chinese word features is excluded, Korean can be identified accurately. The experiment then took two encodings common in Japanese text, EUC-JP and SHIFT-JIS, and, following the same training and identification steps, ran an exploratory check of how well N-Gram theory works for encoding identification, with encouraging results: EUC-JP and SHIFT-JIS were identified correctly in 85% and 95% of cases respectively, with identification error rates below 0.0020 for both. Using N-Gram theory for encoding identification is one of the highlights of this thesis. The author then introduces the full-text retrieval framework Lucene 3.5 and, with reference to its core code, describes how the indexing and search modules relevant to multilingual identification work and analyzes the built-in Analyzer classes.
Based on the interfaces of the indexing and search modules, the identification program was adjusted in detail: results for Simplified and Traditional Chinese are both returned as the type "Chinese", and results for Japanese and Korean are both returned as the type "CJK". The program is thereby extended into a multilingual automatic identification module for Lucene 3.5, with language identification inserted at both the index-building stage and the user-search stage, in the hope of helping Lucene support the development of cross-language retrieval systems and of smoothing the user's cross-language search experience. The author has not found other researchers working on this. Because of limits on space and time, the thesis gives only the design of the module and its interfaces; implementing a Lucene-based multilingual retrieval system is the task for the next stage of research.
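The encoding experiment summarized in the abstract (EUC-JP versus SHIFT-JIS) applies the same train-then-identify steps to bytes rather than characters. The Python sketch below is one way this could look, assuming byte bigrams scored as a bag of N-Grams with add-one smoothing; the scoring rule, the sample sentences, and all names are assumptions made for illustration and are not taken from the thesis.

```python
import math
from collections import Counter

def byte_bigrams(data):
    """Return all overlapping 2-byte sequences in a bytes object."""
    return [data[i:i + 2] for i in range(len(data) - 1)]

def train(data):
    """Training step: count byte bigrams in text known to use one encoding."""
    counts = Counter(byte_bigrams(data))
    return counts, sum(counts.values())

def log_score(data, model, vocab_size=65536):
    """Add-one smoothed log-likelihood of the bigrams in `data` under `model`."""
    counts, total = model
    return sum(
        math.log((counts[gram] + 1) / (total + vocab_size))
        for gram in byte_bigrams(data)
    )

def identify_encoding(data, models):
    """Identification step: pick the encoding whose model scores highest."""
    return max(models, key=lambda name: log_score(data, models[name]))

# Toy training sentence (assumed); the same text is encoded both ways so each
# model learns the byte patterns typical of its encoding.
TRAIN_TEXT = "これは日本語の文章です。言語の識別を行います。"
models = {
    name: train(TRAIN_TEXT.encode(name))
    for name in ("euc_jp", "shift_jis")
}

sample = "日本語の文章を識別します。"
print(identify_encoding(sample.encode("euc_jp"), models))
print(identify_encoding(sample.encode("shift_jis"), models))
```

The two encodings place Japanese characters in largely different byte ranges, so even a small byte-bigram model can tell them apart on this toy input; real corpora with mixed ASCII content would need more training data and evaluation.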