Q45 : Web Document Clustering Using Document Index Graph
Thesis > Central Library of Shahrood University > Computer Engineering > MSc > 2013
Authors:
Narjes Ramazani Oomali [Author], Morteza Zahedi [Supervisor], Prof. Hamid Hassanpour [Advisor]
Abstract: With the tremendous growth of the World Wide Web, many research projects have targeted how to organize its information so that end users can find what they want efficiently and accurately. Document clustering is an important tool for many Information Retrieval (IR) tasks. Most document clustering techniques rely on single-term analysis of the document set, as in the Vector Space Model. More accurate clustering requires more informative features, in particular phrases and their weights. Other techniques such as the suffix tree find common phrases of any length between documents, but they suffer from high redundancy because phrases are stored in the form of suffixes. The phrase-based Document Index Graph model was proposed in 2004. This model incrementally constructs a graph by phrase-based indexing of the document set, which allows matching on informative phrases rather than individual words and provides efficient similarity calculation between documents. The model stores no redundant information and supports any number of documents in clustering. The efficiency of graph construction and phrase matching lends itself to online incremental processing, such as processing the list of documents retrieved by a Web search engine. The quality of the clusters produced using this system was higher than that of clusters produced using traditional clustering methods. This thesis studies different methods of document clustering and discusses their strengths and weaknesses, then focuses on the newly proposed document clustering system and its advantages over previous methods. Since this system can be used in a search engine to cluster retrieved documents, we consider its operation from the search engine's point of view and try to improve its efficiency as part of the search engine structure.
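The incremental construction and phrase matching described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the thesis implementation: the class name `DocumentIndexGraph`, the method `add_document`, the whitespace tokenization, and the choice to store per-document positions directly on edges are all hypothetical. Only phrases of two or more consecutive words are reported, and shorter matches found in other sentences are reported alongside longer ones.

```python
from collections import defaultdict


class DocumentIndexGraph:
    """Simplified sketch: nodes are words, edges link consecutive words."""

    def __init__(self):
        # (word, next_word) -> doc_id -> set of (sentence_no, position of word)
        self.edges = defaultdict(lambda: defaultdict(set))

    def add_document(self, doc_id, sentences):
        """Index one document; return the phrases it shares with earlier ones.

        The result maps each earlier doc_id to a set of shared phrases
        (tuples of two or more consecutive words).
        """
        tokenized = [s.lower().split() for s in sentences]
        shared = defaultdict(set)
        for words in tokenized:
            self._match_sentence(words, shared)
        # Index the new document only after matching, so it never matches itself.
        for s_no, words in enumerate(tokenized):
            for pos in range(len(words) - 1):
                self.edges[(words[pos], words[pos + 1])][doc_id].add((s_no, pos))
        return dict(shared)

    def _match_sentence(self, words, shared):
        # open_runs: (other_doc, other_sentence, expected_next_pos) -> start index
        open_runs = {}
        for i in range(len(words) - 1):
            next_runs = {}
            postings = self.edges.get((words[i], words[i + 1]), {})
            for other, occurrences in postings.items():
                for o_sent, o_pos in occurrences:
                    # Extend a run that ended just before this edge, or start one.
                    start = open_runs.pop((other, o_sent, o_pos), i)
                    next_runs[(other, o_sent, o_pos + 1)] = start
            # Runs that were not extended ended at word i.
            for (other, _, _), start in open_runs.items():
                shared[other].add(tuple(words[start:i + 1]))
            open_runs = next_runs
        # Runs still open reach the end of the sentence.
        for (other, _, _), start in open_runs.items():
            shared[other].add(tuple(words[start:]))


graph = DocumentIndexGraph()
first = graph.add_document("d1", ["river rafting vacation plan"])
second = graph.add_document(
    "d2", ["booking fishing trips", "plan a river rafting vacation"]
)
```

Because each word pair is stored once with a list of occurrences, a phrase shared by many documents is not duplicated per suffix, and each new document is matched against all earlier ones in a single pass over its own sentences.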
Search engines often consider how frequently documents are visited when ordering the search results displayed to the user. By using the proposed system in a search engine and adding weights to the nodes and edges of the graph, we can compute the weight of the search phrase in different documents and sort the documents by phrase weight, which helps users reach the information they want more accurately and quickly. To add the weights, we modify the graph structure: for each document, node weights are computed by counting and edge weights by using a single-layer perceptron, which improves the system's performance as part of the search engine structure.
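Ranking documents by phrase weight might look roughly like the sketch below. The abstract does not say how node and edge weights are combined or how the perceptron is trained, so several parts are assumptions made only for illustration: the function names `phrase_weight` and `rank_documents` are hypothetical, node weights are plain word counts, the edge weight defaults to the word-pair count as a stand-in for the perceptron output, and the phrase weight is taken to be the sum of the node and edge weights along the phrase.

```python
from collections import Counter


def phrase_weight(phrase, sentences, edge_weight=None):
    """Weight of `phrase` in one document (assumed: node weights + edge weights)."""
    tokens = [s.lower().split() for s in sentences]
    # Node weight: how often each word occurs in the document.
    node_w = Counter(w for sent in tokens for w in sent)
    # Edge count: how often each pair of consecutive words occurs.
    edge_count = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))
    if edge_weight is None:
        # Placeholder for the perceptron-based edge weight (assumption).
        edge_weight = lambda pair: float(edge_count[pair])
    words = phrase.lower().split()
    pairs = list(zip(words, words[1:]))
    # Approximation: require every word and every consecutive pair to occur.
    if any(node_w[w] == 0 for w in words) or any(edge_count[p] == 0 for p in pairs):
        return 0.0
    return sum(node_w[w] for w in words) + sum(edge_weight(p) for p in pairs)


def rank_documents(phrase, docs):
    """Return doc ids sorted by descending phrase weight."""
    scores = {doc_id: phrase_weight(phrase, sents) for doc_id, sents in docs.items()}
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


docs = {
    "a": ["river rafting trips", "river rafting is fun"],
    "b": ["river rafting"],
    "c": ["fishing trips"],
}
ranking = rank_documents("river rafting", docs)
```

A trained model could be supplied through the `edge_weight` parameter in place of the count-based default without changing the ranking code.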
Keywords:
#Document clustering #phrase-based indexing #document index graph #fuzzy graph
Keeping place: Central Library of Shahrood University