Q190 : Semi-supervised text clustering using word embeddings
Thesis > Central Library of Shahrood University > Computer Engineering > MSc > 2021
Authors:
Seyed Mojtaba Sajadi [Author], Hoda Mashayekhi [Supervisor], Prof. Hamid Hassanpour [Advisor]
Abstract: Text clustering is used in various applications of text analysis. In the clustering process, the method of document representation has a significant impact on the results. Some popular document representation methods based on the Bag-of-Words (BoW) model depend on word frequencies and can produce large, sparse document vectors. In addition, embedding-based methods such as Doc2Vec, which preserve proximity information, suffer from low interpretability. These challenges have been largely addressed by the introduction of concept-based methods. However, existing semi-supervised document clustering methods do not use the more recent concept-based representation of documents. Therefore, this paper proposes a concept-based semi-supervised document clustering approach that uses both labeled and unlabeled data to yield higher clustering quality. The documents are represented based on the concepts extracted from the set of embedded words in the corpus. This representation preserves the proximity information of documents and improves interpretability. The semi-supervised clustering process uses unlabeled data to capture the overall structure of the clusters and a small amount of labeled data to adjust the cluster centroids. We also propose the notion of semi-supervised concepts and a new method of clustering documents based on the weights of the concepts. The clustering results of this method are compared with the actual semi-supervised clustering of documents. Experiments on two text datasets, Reuters-21578 and 20-NewsGroup, demonstrate that the proposed method performs at least 10% better in terms of clustering quality and at least 5% better in text classification accuracy than several existing semi-supervised and unsupervised approaches.
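The general idea in the abstract (documents represented by concept weights, unlabeled documents shaping the clusters, and a few labeled documents anchoring the centroids) can be illustrated with a minimal sketch. This is not the thesis's algorithm: it uses a generic seeded k-means as a stand-in, and the word-to-concept mapping, the function names (`concept_vector`, `seeded_kmeans`), and the toy documents are all hypothetical. In practice the concepts would be obtained by grouping word embeddings, which is omitted here.

```python
# Illustrative sketch only: a generic seeded k-means over concept-weight vectors.
# All names and data below are assumptions, not taken from the thesis.

def concept_vector(tokens, word_to_concept, n_concepts):
    """Represent a document as the fraction of its known tokens in each concept."""
    counts = [0.0] * n_concepts
    for tok in tokens:
        if tok in word_to_concept:
            counts[word_to_concept[tok]] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def seeded_kmeans(docs, labels, k, iters=20):
    """Cluster `docs`; `labels` maps a document index to its known cluster id.

    Labeled documents initialise the centroids and stay pinned to their
    cluster, so every cluster keeps at least one member.
    """
    centroids = []
    for c in range(k):
        seeds = [docs[i] for i, lab in labels.items() if lab == c]
        if not seeds:
            raise ValueError(f"cluster {c} has no labeled seed")
        centroids.append(mean(seeds))
    assign = [None] * len(docs)
    for _ in range(iters):
        new_assign = [
            labels[i] if i in labels
            else min(range(k), key=lambda c: sq_dist(d, centroids[c]))
            for i, d in enumerate(docs)
        ]
        if new_assign == assign:
            break  # assignments are stable, so the centroids are too
        assign = new_assign
        for c in range(k):
            centroids[c] = mean([docs[i] for i, a in enumerate(assign) if a == c])
    return assign, centroids

# Hypothetical toy data: two concepts (0 = finance, 1 = sports).
word_to_concept = {"stock": 0, "market": 0, "bank": 0,
                   "goal": 1, "match": 1, "team": 1}
corpus = [
    ["stock", "market", "bank"],   # labeled as cluster 0
    ["goal", "match", "team"],     # labeled as cluster 1
    ["stock", "bank", "team"],     # unlabeled
    ["match", "goal", "market"],   # unlabeled
]
docs = [concept_vector(t, word_to_concept, 2) for t in corpus]
assign, centroids = seeded_kmeans(docs, labels={0: 0, 1: 1}, k=2)
print(assign)
```

Because each dimension of a document vector corresponds to a named concept, the resulting centroids can be read directly as "how much of each concept a cluster contains", which is the interpretability benefit the abstract attributes to concept-based representations.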
Keywords:
#Machine learning #Semi-supervised #Concept extraction #Word embedding #Document clustering #Concept-based representation
Keeping place: Central Library of Shahrood University