Q174 : Web Spam Detection using Qualification Features
Thesis > Central Library of Shahrood University > Computer Engineering > PhD > 2020
Authors:
Faeze Asdaghi [Author], Ali Solyemani Aiouri [Supervisor]
Abstract: Identifying web spam is one of the major challenges for search engines, and various methods have been proposed to address it. Some of these methods focus on extracting appropriate features from web pages, while others emphasize designing a suitable classifier to increase accuracy. Apart from methods that examine the content, links, and validity of pages, another group uses the web graph as the basis of its detection methods. One of the most important challenges in this area is that techniques for deceiving ranking algorithms are constantly changing and being updated. For this reason, using features that are harder to forge plays an important role in increasing the detection rate and leverage. Also, because new spam is created every day, designing a ranking algorithm that can be trained and improved with this data will increase its efficiency. To this end, this dissertation presents a model, consisting of a feature extractor and a classifier, for identifying web spam pages. In addition to some common features, the model uses features extracted from sources such as the page address, the HTML code, and the lexical and conceptual content of the page. Then, to increase speed and reduce the size of the data, a feature selection algorithm called Smart-BT was developed. Finally, a method for detecting web spam is presented that uses the concept of detectors and memory cells from the artificial immune system. The model uses only the source code of web pages; the multimedia content they contain (photos, videos, etc.) is not considered. Applied to spam page detection on the WEBSPAM-UK data set, the most famous data set in this field, the model improves balanced accuracy by 16% and reduces the number of features to 65%.
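The abstract reports its result as balanced accuracy. As a minimal sketch, assuming the common definition (the mean of the true positive rate and the true negative rate), the metric can be computed as below. The function name, the label encoding (1 = spam, 0 = non-spam), and the toy labels are illustrative assumptions, not taken from the thesis.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of sensitivity (spam recall) and specificity (non-spam recall).

    Assumes binary labels; `positive` marks the spam class (an assumption
    made for this sketch, not a convention stated in the thesis).
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy example (made-up labels): 2 spam pages among 10, and a classifier
# that labels every page as non-spam.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

In this toy case plain accuracy rewards a classifier that never flags spam, while balanced accuracy does not, which is presumably why the metric suits data sets where spam pages are the minority class.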
Keywords:
#Search Engine #Web Spam #Link-based Feature #Content-based Feature #Feature Selection #Text Coherence #Topic Modeling #Artificial Immune System
Keeping place: Central Library of Shahrood University