Shahrood University Of Technology Thesis

Q270 : Automatic Webpage Content Extraction based on Structural and Semantic Features

Thesis > Central Library of Shahrood University > Computer Engineering > MSc > 2024

Authors:

[Author], Hoda Mashayekhi [Supervisor], [Advisor]

Abstract: The internet is a rich source of textual information. Purposefully extracting data from web pages yields large, suitable datasets for news monitoring, market research aimed at evaluating competitors, training language models, and knowledge extraction. However, the structure of web pages today is diverse and complex, and this complexity has grown with advances in user interface (UI) design technologies. Web pages also often contain irrelevant and sometimes useless information, which introduces noise into the data. A tool that extracts useful content or eliminates useless content is an appropriate solution to this problem. The structural diversity of web pages makes text extraction complex for machines and tedious for human users, so an intelligent tool that efficiently extracts the main text from web pages can be very practical. Existing methods either extract the main content based on rules, which lose performance as page design technologies change and require constant updating, or rely on machine learning models. The latter also do not perform well across a wide range of pages, due to the limited scope of their training datasets or the complexity of the designed models. This research designs an intelligent model for extracting main content using the structural, semantic, and content features of page elements. Several experiments were designed and conducted to arrive at the best-performing model. The proposed method segments the pages into blocks and compresses them, then extracts various features and predicts the final label using a deep neural network. To train the proposed model, a dataset of web pages was collected, and the main content of these pages was manually identified by several volunteers.
The advantages of this model are the high diversity of its training dataset and the improved blocking algorithm, which prevents useful and non-useful text from being merged and avoids generating multiple blocks for a single web page. According to the results, the proposed method outperforms other methods by an average of 3 to 11 percent.
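
As a rough illustration of the blocking and feature-extraction steps described in the abstract, the following minimal Python sketch splits a page into text blocks and computes two simple structural features per block. It is not the thesis's algorithm: the set of block-level tags, the class `BlockSplitter`, the function `split_into_blocks`, and the chosen features (word count and link density) are all assumptions made for this example, and the deep neural network classifier is left out entirely.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption, not the thesis's rules).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "td",
              "article", "section", "nav", "header", "footer"}
# Tags whose text is never page content.
SKIP_TAGS = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Hypothetical helper: splits HTML into text blocks with simple features."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks, in document order
        self._stack = []       # currently open block-level tags
        self._words = []       # words collected for the current block
        self._link_words = 0   # how many of those words sit inside <a>
        self._in_link = 0
        self._skip = 0

    def flush(self):
        """Close the current block (if it has text) and start a new one."""
        if self._words:
            n_words = len(self._words)
            self.blocks.append({
                "tag": self._stack[-1] if self._stack else "body",
                "text": " ".join(self._words),
                "n_words": n_words,
                "link_density": self._link_words / n_words,
            })
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
            if self._stack and self._stack[-1] == tag:
                self._stack.pop()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def split_into_blocks(html):
    """Return a list of feature dictionaries, one per non-empty text block."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last block tag
    return parser.blocks


if __name__ == "__main__":
    page = ('<nav><a href="/">Home</a></nav>'
            '<article><p>Main story text goes here.</p></article>')
    for block in split_into_blocks(page):
        print(block["tag"], block["n_words"], block["link_density"], block["text"])
```

In a pipeline like the one the abstract describes, feature vectors of this kind (alongside semantic and content features) would be passed to a trained classifier that labels each block as main content or not. A high link density is a common signal of navigation or footer text, which is why it is used here as an example feature.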

Keywords:

#Main Content Extractor #Web Scraping #Boilerplate Removal #Text Processing #Data Mining #Deep Neural Networks #Web Page Blocking

Keeping place: Central Library of Shahrood University