Author: MAO Wen-wei (毛文伟) [1]
Affiliation: [1] Shanghai International Studies University, Shanghai 201783, China
Source: Media in Foreign Language Instruction (《外语电化教学》), 2017, No. 3, pp. 10-14 (5 pages)
Funding: This paper is an outcome of the Ministry of Education Humanities and Social Sciences Research Youth Fund project "A Cognitive Linguistic Study of Expression Errors by Chinese Learners of Japanese" (12YJC740076) and the Shanghai International Studies University Young Researchers Innovation Team project "Second Language Acquisition Research Based on a Japanese Learner Corpus" (QJTD11MWW01).
Abstract: The growing maturity of automatic part-of-speech (POS) tagging technology provides strong support for corpus construction. Unlike native-speaker corpora, learner output contains a large number of errors, which inevitably interfere with tagging accuracy. Robustness to such interference is therefore as important a criterion as accuracy itself. This paper measures and compares the accuracy of open-source Japanese automatic POS taggers on learner texts, and examines how the reliability of their tagging correlates with the quality of the texts. The results show that MeCab performs best, ChaSen comes second, and JUMAN is slightly weaker. The study also finds that the accuracy of these open-source taggers on learner texts even exceeds their accuracy on native-speaker texts. They can therefore serve as reliable tools for building learner corpora.
Keywords: corpus; tagging; hidden Markov model; Japanese
Classification: H319.3 [Language and Writing: English]