Sample paper: "Neural Networks in the Classification of Training Web Pages". Web classification has been attempted through many different technologies. In this study, the focus is on a comparison involving Neural Networks (NN): an enhanced classifier is introduced and the same data sample is run through the classifiers to determine the success rate of the classification. The study shows that the proposed approach not only outperforms traditional classification, but does so despite being trained with fewer samples than the systems encountered. Managing the vast amount of online information and classifying it is an essential step towards being able to make use of that information.
It is therefore no surprise that the popularity of web classification applies not only to the academic need for continuous knowledge growth, but also to industry's need for quick, efficient solutions to gathering and analysing information and keeping it up to date, which is critical to business success. This research is part of a larger research project that helps other organisations identify and analyse their training needs. The sample paper follows.
Abstract
Web classification has been attempted through many different technologies. In this study we concentrate on the comparison of Neural Network (NN), Naïve Bayes (NB) and Decision Tree (DT) classifiers for the automatic analysis and classification of attribute data from training course web pages. We introduce an enhanced NB classifier and run the same data sample through the DT and NN classifiers to determine the success rate of our classifier in the training courses domain. This research shows that our enhanced NB classifier not only outperforms the traditional NB classifier, but also performs as well as, if not better than, some more popular, rival techniques. It also shows that, overall, our NB classifier is the best choice for the training courses domain, achieving an impressive F-Measure value of over 97%, despite being trained with fewer samples than any of the classification systems we have encountered.
Keywords: Web classification, Naïve Bayesian Classifier, Decision Tree Classifier, Neural Network Classifier, Supervised learning
Introduction
Managing the vast amount of online information and classifying it into what could be relevant to our needs is an important step towards being able to use this information. Thus, it comes as no surprise that the popularity of Web Classification applies not only to the academic need for continuous knowledge growth, but also to the needs of industry for quick, efficient solutions to information gathering and analysis, maintaining up-to-date information that is critical to business success. This research is part of a larger research project in collaboration with an independent brokerage organisation, Apricot Training Management (ATM), which helps other organisations to identify and analyse their training needs and recommends suitable courses for their employees.
Currently, the latest prospectuses from different training providers are ordered, catalogued and shelved, and the course information found is manually entered into the company's database. This is a time-consuming, labour-intensive process which does not always guarantee up-to-date results, due to the limited life expectancy of some course information, such as dates and prices, and other limitations in the availability of up-to-date, accurate information on websites and in printed literature. The overall project therefore aims to automate the process of retrieving, extracting and storing course information in the database, guaranteeing that it is always kept up to date.
The research presented in this paper relates to the information retrieval side of the project, in particular to the automatic analysis and filtering of the retrieved web pages according to their relevance. This classification process is vital to the efficiency of the overall system, as only relevant pages will then be considered by the extraction process, thus drastically reducing processing time and increasing accuracy. The underlying technique used for our classifier is based on the NB algorithm, due to the independence observed in the data corpus analysed. The traditional technique is enhanced, however, to analyse not only the visible textual content of web pages, but also important web structures such as META data, TITLE and LINK information. Additionally, a 'believed probability' of features in each category is calculated to handle situations where there is little evidence about the data, particularly in the early stages of the classification process. Experiments have shown that our classifier exceeds expectations, achieving an impressive F-Measure value of over 97%.
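The paper does not spell out the exact weighting or smoothing formulas, so the following Python sketch is only an illustration of the general idea: features found in important web structures receive a higher weight, and a 'believed probability' blends sparse evidence with an assumed prior. All names and numeric values (STRUCTURE_WEIGHTS, believed_probability, score_category, the 0.5 assumed probability) are hypothetical, not taken from the paper.

    import math

    # Hypothetical structure weights: features found in TITLE, META or LINK
    # elements count more heavily than plain body text (assumed values).
    STRUCTURE_WEIGHTS = {"title": 3.0, "meta": 2.0, "link": 2.0, "body": 1.0}

    def believed_probability(feature, category, counts, category_totals,
                             assumed_prob=0.5, weight=1.0):
        """Smoothed ('believed') probability of a feature given a category.

        When a feature has been seen only a few times, its raw conditional
        probability is blended with an assumed prior, so that sparse evidence
        does not dominate early classifications (assumed smoothing scheme).
        """
        seen = counts.get((feature, category), 0.0)
        total = category_totals.get(category, 0.0)
        raw = seen / total if total else 0.0
        evidence = sum(counts.get((feature, c), 0.0) for c in category_totals)
        return (weight * assumed_prob + evidence * raw) / (weight + evidence)

    def score_category(features, category, counts, category_totals, prior):
        """Log-space Naive Bayes score for one category.

        `features` is a list of (feature, structure) pairs produced by the
        indexer; structural features contribute in proportion to their weight.
        """
        score = math.log(prior)
        for feature, structure in features:
            p = believed_probability(feature, category, counts, category_totals)
            score += STRUCTURE_WEIGHTS.get(structure, 1.0) * math.log(p)
        return score

In log space the weights act as exponents on the per-feature likelihoods, so structural features influence the final score more strongly without breaking the independence assumption.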
Related Work
Many ideas have emerged over the years on how to achieve quality results from Web Classification systems, and there are several approaches that can be used, such as Clustering, NB and Bayesian Networks, NNs, DTs, Support Vector Machines (SVMs), etc. We decided to concentrate only on NN, DT and NB classifiers, as they proved most closely applicable to our project. Despite the benefits of other approaches, our research is in collaboration with a small organisation, so we had to consider the organisation's hardware and software limitations before deciding on a classification technique. SVMs and Clustering would be too expensive and processor-intensive for the organisation and were therefore considered inappropriate for this project. The following discusses the pros and cons of NB, DTs and NNs, as well as related research in each field.
Naïve Bayes Models
NB models are popular in machine learning applications due to their simplicity, allowing each attribute to contribute towards the final decision equally and independently of the other attributes. This simplicity equates to computational efficiency, which makes NB techniques attractive and suitable for many domains. However, the very simplicity that makes them popular is also the reason some researchers consider the approach weak. The conditional independence assumption is strong and makes NB-based systems incapable of using two or more pieces of evidence together; however, used in appropriate domains, they offer quick training, fast data analysis and decision making, as well as straightforward interpretation of test results.
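The assumption can be stated concisely: for a page represented by features f_1, ..., f_n, an NB classifier scores each category c by a product of per-feature likelihoods and selects the highest-scoring category:

    P(c \mid f_1, \dots, f_n) \;\propto\; P(c) \prod_{i=1}^{n} P(f_i \mid c),
    \qquad
    \hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)

In practice the product is usually computed as a sum of logarithms to avoid numerical underflow.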
There is some research ([13], [26]) that tries to relax the conditional independence assumption by introducing latent variables into tree-shaped or hierarchical NB classifiers. However, a thorough analysis of a large number of training web pages has shown us that the features used in these pages can be independently examined to compute the category for each page. Thus, the domain for our research can easily be analysed using NB classifiers; nevertheless, in order to increase the system's accuracy, the classifier has been enhanced as described in section 3. Enhancing the standard NB rule, or using it in combination with other techniques, has also been attempted by other researchers. Addin et al. in [1] coupled an NB classifier with K-Means clustering to simulate damage detection in engineering materials. NBTree in [24] induced a hybrid of NB and DTs by using the Bayes rule to construct the decision tree. Other research works ([5], [23]) have modified their NB classifiers to learn from positive and unlabelled examples, on the assumption that finding negative examples is very difficult for certain domains, particularly in the medical industry. Finding negative examples for the training courses domain, however, is not at all difficult, so this is not an issue for our research.
Decision Trees
Unlike NB classifiers, DT classifiers can cope with combinations of terms and can produce impressive results for some domains. However, training a DT classifier is quite complex, and the number of nodes created can get out of hand in some cases. According to [17], with just six Boolean attributes there are 18,446,744,073,709,551,616 distinct Boolean functions that a tree may be required to represent. Decision trees may therefore be computationally expensive for certain domains; however, they make up for it by offering genuinely simple interpretation of models and by helping to consider the most important factors in a dataset first, placing them at the top of the tree.
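The figure quoted from [17] follows from a simple count: a truth table over n Boolean attributes has 2^n rows, and each row can independently map to true or false, giving

    2^{2^{n}} \text{ distinct Boolean functions}, \qquad
    n = 6 \;\Rightarrow\; 2^{2^{6}} = 2^{64} = 18{,}446{,}744{,}073{,}709{,}551{,}616.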
The researchers in [7], [12] and [15] all used DTs to allow both the structure and the content of each web page to determine the category to which it belongs. All achieved an accuracy of under 85%. This idea is very similar to our work, as our classifier also analyses both structure and content. WebClass in [12] was designed to search geographically distributed groups of people who share common interests. WebClass modifies the standard decision tree approach by associating the tree's root node with only the keywords found, depth-one nodes with descriptions and depth-two nodes with the hyperlinks found. The system, however, only achieved 73% accuracy. The second version of WebClass ([2]) implemented various classification models, such as Bayesian networks, DTs, K-Means clustering and SVMs, in order to compare the findings of WebClassII. However, the findings showed that, for increasing feature set sizes, the overall recall fell to just 39.75%.
Neural Networks
NNs are powerful techniques for representing complex relationships between inputs and outputs. Based on the neural structure of the brain ([17]), NNs are complicated, and for certain domains they can be enormous, containing a large number of nodes and synapses. There is research that has managed to convert NNs into sets of rules in order to discover what the NN has learnt ([8], [21]); however, many other works still refer to NNs as a 'black box' approach ([18], [19]), due to the difficulty of understanding the NN's decision-making process, which can make it hard to know whether testing has succeeded. AIRS in [4] used the knowledge acquired during the training of an NN to modify the user's query, making it possible for the adapted query to retrieve more documents than the original query.
However, this process would sometimes give more importance to the knowledge 'learnt', changing the original query until it lost its initial keywords. The researchers in [6] and [14] proposed a term-frequency method to select the feature vectors for the classification of documents using NNs. A later study ([3]) used NNs together with an SVM for better classification performance. The content of each web page was analysed together with the content of its neighbouring pages, and the resulting feature scores were also used by the SVM. Using two powerful techniques may radically improve classification; however, this research did not combine the techniques to create a more sophisticated one. They were simply used one after the other on the same data set, which meant that the system took much longer to produce results.
NB Classifier
Our system involves three main stages (Fig. 1). In stage 1, a CRAWLER was developed to find and retrieve web pages in a breadth-first search manner, carefully checking each link for format accuracy, for duplication and against an automatically updatable rejection list. In stage 2, a TRAINER was developed to analyse a list of relevant (training pages) and irrelevant links and compute probabilities for the feature-category pairs found. After each training session, features become more strongly associated with the different categories.
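The paper does not include the crawler's code; under the assumption that only a standard breadth-first traversal with duplicate checking and a rejection list is needed, a minimal Python sketch might look like the following. The rejection-list entries and all names are illustrative, not taken from the actual system.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    REJECTION_LIST = {"facebook.com", "twitter.com"}  # illustrative entries only

    class LinkExtractor(HTMLParser):
        """Collects href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def acceptable(url, visited):
        parsed = urlparse(url)
        return (parsed.scheme in ("http", "https")       # format check
                and url not in visited                    # duplication check
                and parsed.netloc not in REJECTION_LIST)  # rejection list

    def crawl(seed, max_pages=50):
        """Breadth-first crawl starting from `seed`, returning fetched pages."""
        queue, visited, pages = deque([seed]), set(), {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if not acceptable(url, visited):
                continue
            visited.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except OSError:
                continue
            pages[url] = html
            parser = LinkExtractor()
            parser.feed(html)
            queue.extend(urljoin(url, link) for link in parser.links)
        return pages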
The training results were then used by the NB Classifier developed in stage 3, which takes into account the 'knowledge' accumulated during training and uses it to make intelligent decisions when classifying new, unseen-before web pages. The second and third stages have a very important sub-stage in common, the INDEXER. This is responsible for identifying and extracting all suitable features from each web page. The INDEXER also applies rules to reject HTML formatting and features that are ineffective in distinguishing web pages from one another. This is achieved through sophisticated regular expressions and functions which clean, tokenise and stem the content of each page prior to the classification process.
Features that are believed to be too common or too insignificant in distinguishing web pages from one another, otherwise known as stopwords, are also removed. Care is taken, however, to preserve the information extracted from certain Web structures such as the page TITLE, the LINK and META tag information. These are given higher weights than the rest of the text, as we believe that the information given by these structures is more closely related to the central theme of the web page.
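As an illustration only, the cleaning, tokenising, stemming and structure-tagging steps could be sketched in Python as below. The regular expressions, the small stopword set and the crude suffix stripper are stand-ins for the sophisticated rules mentioned above, not the authors' actual implementation.

    import re

    STOPWORDS = {"the", "and", "a", "of", "to", "in", "for", "is", "on"}  # sample only

    def crude_stem(token):
        """Very rough suffix stripping; a real system would use a proper stemmer."""
        for suffix in ("ing", "ed", "es", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def tokenise(text):
        return [crude_stem(t) for t in re.findall(r"[a-z]+", text.lower())
                if t not in STOPWORDS]

    def index_page(html):
        """Return (feature, structure) pairs for a page.

        Features from TITLE, META and LINK structures are tagged so that the
        classifier can weight them more heavily than plain body text.
        """
        features = []
        title = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
        if title:
            features += [(t, "title") for t in tokenise(title.group(1))]
        for meta in re.findall(r'<meta[^>]+content="([^"]*)"', html, re.I):
            features += [(t, "meta") for t in tokenise(meta)]
        for link in re.findall(r"<a[^>]*>(.*?)</a>", html, re.I | re.S):
            features += [(t, "link") for t in tokenise(link)]
        body = re.sub(r"<[^>]+>", " ", html)  # strip remaining HTML tags
        features += [(t, "body") for t in tokenise(body)]
        return features

The (feature, structure) pairs produced here are the kind of input assumed by the score_category sketch shown earlier.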
Once the probabilities for each category have been calculated, the probability values are compared to each other. The category with the highest probability, and within a predefined threshold value, is assigned to the web page being classified. All the features extracted from this page are also paired up with the resulting category and the information in the database is updated to expand the system’s knowledge.
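Reusing the score_category sketch from above, the final decision step might be approximated as follows. The margin test is only one plausible reading of the 'predefined threshold', the category names are illustrative, and the in-place count updates stand in for the database update described in the text.

    def classify(features, categories, counts, category_totals, priors,
                 margin=0.1, default="irrelevant"):
        """Assign the best-scoring category and update the knowledge base.

        The margin test below (best log-score must exceed the runner-up by a
        fixed amount) is an assumed interpretation of the paper's
        'predefined threshold'; the authors' actual rule is not specified.
        """
        scores = {c: score_category(features, c, counts, category_totals, priors[c])
                  for c in categories}
        best = max(scores, key=scores.get)
        others = [s for c, s in scores.items() if c != best]
        if others and scores[best] - max(others) < margin:
            return default
        # Expand the system's knowledge: pair every feature with the chosen category.
        for feature, _structure in features:
            counts[(feature, best)] = counts.get((feature, best), 0.0) + 1.0
            category_totals[best] = category_totals.get(best, 0.0) + 1.0
        return best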
Results
Performance Measures
The experiments in this research are evaluated using the standard Web Classification metrics of accuracy, precision, recall and F-measure. These were calculated using the predictive classification table known as the Confusion Matrix (Table 1). The F-Measure was used because, despite Precision and Recall being valid metrics in their own right, one can be optimised at the expense of the other ([22]). The F-Measure only produces a high result when Precision and Recall are both high and balanced, which makes it particularly significant. A Receiver Operating Characteristic (ROC) curve analysis was also performed, as it shows the sensitivity (true positive rate) and specificity (true negative rate) of a test. The ROC curve is a comparison of two characteristics: the TPR (true positive rate) and the FPR (false positive rate). The TPR measures the proportion of relevant pages that were correctly identified, while the FPR measures the proportion of irrelevant pages incorrectly classified as relevant.
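For reference, in terms of the confusion-matrix counts (TP, FP, TN, FN) these metrics are:

    \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
    \text{Precision} = \frac{TP}{TP + FP}, \qquad
    \text{Recall} = \text{TPR} = \frac{TP}{TP + FN},

    \text{FPR} = \frac{FP}{FP + TN}, \qquad
    F\text{-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.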