Coherent Keyphrase Extraction via Web Mining

Category: Free English papers. Editor: Prof. Wang. Updated: 2017-04-25

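Sample paper: "Coherent Keyphrase Extraction via Web Mining". Keyphrases serve many purposes, including summarizing, indexing, labeling, and categorizing. The task of automatic keyphrase extraction is to select keyphrases from the text of a given document. Automatic extraction makes it feasible to generate keyphrases for the large number of documents that have no manually assigned keyphrases. A limitation of earlier keyphrase extraction algorithms is that the selected keyphrases are occasionally incoherent. This computer science paper presents enhancements to a keyphrase extraction algorithm that are intended to improve the coherence of the extracted keyphrases.

The statistical association among candidate keyphrases is used as evidence that they may be semantically related. Experiments show that the enhancements improve the quality of the extracted keyphrases. The algorithm generalizes well when it is trained on one domain and tested on another (physics documents). The paper excerpt follows.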

Abstract 
Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. A limitation of previous keyphrase extraction algorithms is that the selected keyphrases are occasionally incoherent. That is, the majority of the output keyphrases may fit together well, but there may be a minority that appear to be outliers, with no clear semantic relation to the majority or to each other. This paper presents enhancements to the Kea keyphrase extraction algorithm that are designed to increase the coherence of the extracted keyphrases. The approach is to use the degree of statistical association among candidate keyphrases as evidence that they may be semantically related. The statistical association is measured using web mining. Experiments demonstrate that the enhancements improve the quality of the extracted keyphrases. Furthermore, the enhancements are not domain-specific: the algorithm generalizes well when it is trained on one domain (computer science documents) and tested on another (physics documents).

Introduction 
A journal article is often accompanied by a list of keyphrases, composed of about five to fifteen important words and phrases that express the primary topics and themes of the paper. For an individual document, keyphrases can serve as a highly condensed summary, they can supplement or replace the title as a label for the document, or they can be highlighted within the body of the text, to facilitate speed reading (skimming). For a collection of documents, keyphrases can be used for indexing, categorizing (classifying), clustering, browsing, or searching. Keyphrases are most familiar in the context of journal articles, but many other types of documents could benefit from the use of keyphrases, including web pages, email messages, news reports, magazine articles, and business papers.

The vast majority of documents currently do not have keyphrases. Although the potential benefit is large, it would not be practical to manually assign keyphrases to them. This is the motivation for developing algorithms that can automatically supply keyphrases for a document. Section 2.1 discusses past work on this task. This paper focuses on one approach to supplying keyphrases, called keyphrase extraction. In this approach, a document is decomposed into a set of phrases, each of which is considered as a possible candidate keyphrase. A supervised learning algorithm is taught to classify candidate phrases as keyphrases and non-keyphrases. The induced classification model is then used to extract keyphrases from any given document [Turney, 1999, 2017; Frank et al., 1999; Witten et al., 1999, 2017].
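
The decomposition of a document into candidate phrases can be illustrated with a minimal Python sketch. It is not the candidate generator of Kea or of any cited system: the stopword list, the three-word limit, and the function name candidate_phrases are assumptions made only for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to",
             "and", "or", "is", "are", "that", "from"}

def candidate_phrases(text, max_len=3):
    """Return word n-grams (1..max_len words) that neither start nor end
    with a stopword. Phrases never cross punctuation boundaries."""
    candidates = set()
    # Split on punctuation so that a phrase stays within one clause.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        words = fragment.split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                candidates.add(" ".join(gram))
    return candidates

phrases = candidate_phrases(
    "Keyphrase extraction selects phrases from the text of a document."
)
```

In a full system, each candidate would then be described by features and passed to the learned classifier; that step is omitted here.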

A limitation of prior keyphrase extraction algorithms is that the output keyphrases are at times incoherent. For example, if ten keyphrases are selected for a given document, eight of them might fit well together, but the remaining two might be outliers, with no apparent semantic connection to the other eight or to each other. Informal analysis of many machine-extracted keyphrases suggests that these outliers almost never correspond to author-assigned keyphrases. Thus discarding the incoherent candidates might improve the quality of the machine-extracted keyphrases. Section 2.2 examines past work on measuring the coherence of text. The approach used here is to measure the degree of statistical association among the candidate phrases [Church and Hanks, 1989; Church et al., 1991]. The hypothesis is that semantically related phrases will tend to be statistically associated with each other, and that avoiding unrelated phrases will tend to improve the quality of the output keyphrases.
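
The intuition behind the hypothesis can be sketched in Python as follows. This is a simplified illustration, not the method of the paper, which uses statistical association as evidence for the learned classifier rather than as a hard filter. The association scores below are made-up numbers, and the names mean_association and score are hypothetical.

```python
def mean_association(candidates, score):
    """Map each candidate to its mean association with all other candidates."""
    result = {}
    for c in candidates:
        others = [o for o in candidates if o != c]
        if others:
            result[c] = sum(score(c, o) for o in others) / len(others)
        else:
            result[c] = 0.0
    return result

# Hypothetical symmetric association scores (invented for illustration).
SCORES = {
    frozenset({"neural network", "backpropagation"}): 6.0,
    frozenset({"neural network", "learning rate"}): 5.0,
    frozenset({"backpropagation", "learning rate"}): 5.5,
    frozenset({"neural network", "previous section"}): 0.5,
    frozenset({"backpropagation", "previous section"}): 0.3,
    frozenset({"learning rate", "previous section"}): 0.4,
}

def score(a, b):
    """Look up the association of an unordered pair; unknown pairs score 0."""
    return SCORES.get(frozenset({a, b}), 0.0)

candidates = ["neural network", "backpropagation",
              "learning rate", "previous section"]
means = mean_association(candidates, score)
# The candidate least associated with the rest is the likely outlier.
outlier = min(means, key=means.get)
```

With these invented scores, "previous section" has by far the lowest mean association with the other candidates, which is the kind of signal the hypothesis relies on.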

Assignment versus Extraction 
There are two general approaches to automatically supplying keyphrases for a document: keyphrase assignment and keyphrase extraction. Both approaches use supervised machine learning from examples. In both cases, the training examples are documents with manually supplied keyphrases. In keyphrase assignment, there is a predefined list of keyphrases (in the terminology of library science, a controlled vocabulary or controlled index terms). These keyphrases are treated as classes, and techniques from text classification (text categorization) are used to learn models for assigning a class to a given document [Leung and Kan, 1997; Dumais et al., 1998]. 

Usually the learned models will map an input document to several different controlled vocabulary keyphrases. In keyphrase extraction, keyphrases are selected from within the body of the input document, without a predefined list. When authors assign keyphrases without a controlled vocabulary (in library science, free text keywords or free index terms), typically from 70% to 90% of their keyphrases appear somewhere in the body of their documents [Turney, 1999]. This suggests the possibility of using author-assigned free text keyphrases to train a keyphrase extraction system. In this approach, a document is treated as a set of candidate phrases and the task is to classify each candidate phrase as either a keyphrase or non-keyphrase [Turney, 1999, 2017; Frank et al., 1999; Witten et al., 1999, 2017].
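
The proportion of author-assigned keyphrases that appear in the body of a document can be estimated with a short Python sketch. It assumes exact, case-insensitive substring matching, which ignores stemming and word boundaries; the function name fraction_in_body and the example data are hypothetical.

```python
def fraction_in_body(keyphrases, body):
    """Fraction of keyphrases that occur verbatim (case-insensitive) in body."""
    if not keyphrases:
        return 0.0
    # Collapse whitespace so that line breaks do not hide a match.
    normalized_body = " ".join(body.lower().split())
    found = sum(
        1 for k in keyphrases
        if " ".join(k.lower().split()) in normalized_body
    )
    return found / len(keyphrases)

body = "We study keyphrase extraction with supervised learning and web mining."
author_keyphrases = [
    "keyphrase extraction",
    "web mining",
    "supervised learning",
    "information retrieval",
]
coverage = fraction_in_body(author_keyphrases, body)  # 0.75
```

Only the keyphrases that appear in the body can serve as positive training examples for an extraction system, since extraction cannot output a phrase that is absent from the text.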

Coherence 
An early study of coherence in text was the work of Halliday and Hasan [1976]. They argued that coherence is created by several devices: the use of semantically related terms, coreference, ellipsis, and conjunctions. The first device, semantic relatedness, is particularly useful for isolated words and phrases, outside of the context of sentences and paragraphs. Halliday and Hasan [1976] called this device lexical cohesion. Morris and Hirst [1991] computed lexical cohesion by using a thesaurus to measure the relatedness of words. Recent work on text summarization has used lexical cohesion in an effort to improve the coherence of machine-generated summaries. Barzilay and Elhadad [1997] used the WordNet thesaurus to measure lexical cohesion in their approach to summarization.

Keyphrases are often specialized technical phrases of two or three words that do not appear in a thesaurus such as WordNet. In this paper, instead of using a thesaurus, statistical word association is used to estimate lexical cohesion. The idea is that phrases that often occur together tend to be semantically related. There are many statistical measures of word association [Manning and Schütze, 1999]. The measure used here is Pointwise Mutual Information (PMI) [Church and Hanks, 1989; Church et al., 1991]. PMI can be used in conjunction with a web search engine, which enables it to effectively exploit a corpus of about one hundred billion words [Turney, 2017]. Experiments with synonym questions, taken from the Test of English as a Foreign Language (TOEFL), show that word association, measured with PMI and a web search engine, corresponds well to human judgements of synonymy relations between words [Turney, 2017].
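
PMI compares how often two phrases occur together with how often they would co-occur if they were independent: PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The Python sketch below estimates these probabilities from page counts. The counts are invented, no real search engine is queried, and the function name pmi and its parameters are assumptions made for illustration.

```python
import math

def pmi(hits_x, hits_y, hits_xy, total_pages):
    """Pointwise mutual information in bits, estimated from page counts.

    hits_x, hits_y: pages containing each phrase on its own
    hits_xy:        pages containing both phrases
    total_pages:    assumed size of the indexed collection
    """
    if min(hits_x, hits_y, hits_xy) <= 0:
        return float("-inf")  # no observed co-occurrence
    p_x = hits_x / total_pages
    p_y = hits_y / total_pages
    p_xy = hits_xy / total_pages
    return math.log2(p_xy / (p_x * p_y))

# Invented counts: the phrases co-occur 250 times more often than chance.
associated = pmi(hits_x=1000, hits_y=2000, hits_xy=500, total_pages=1_000_000)

# Invented counts: co-occurrence matches chance, so PMI is close to zero.
independent = pmi(hits_x=100, hits_y=100, hits_xy=1, total_pages=10_000)
```

A large positive value suggests that the two phrases are associated, a value near zero suggests independence, and when only a ranking of pairs is needed the constant total_pages does not change the ordering.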

Conclusion 
This paper provides evidence that statistical word association can be used to improve the coherence of keyphrase extraction, resulting in higher quality keyphrases, measured by the degree of overlap with the authors' keyphrases. Furthermore, the new coherence features are not domain-specific.
