Q238 : Developing a word-embedding based method for evaluation of topic models
Thesis > Central Library of Shahrood University > Computer Engineering > MSc > 2022
Authors:
Abstract:
Topic modeling is a powerful tool for analyzing large collections of unstructured text. Topic models are applicable in many fields, such as software engineering, political science, and linguistics. Like any other machine learning model, a topic model requires an evaluation metric so that models can be compared and improved. Each topic in a topic model is represented by the set of its top words. If the topic is coherent, its top words should be semantically related. Whether the words representing a topic are semantically related is ultimately a matter of human judgment; but since the vast majority of topic models are built in an unsupervised manner, measuring their coherence has always been a challenge.
In this thesis, we propose a measure for evaluating the coherence of topic models, based on the similarity of pairs of word vectors. The vectors used in the proposed method are word embeddings generated by training shallow or deep neural networks on very large text corpora, and are therefore a much more accurate representation than count-based vectors, because neural networks, especially deep networks, can learn linguistic features in addition to word frequencies and co-occurrences and reflect them in the resulting vectors. Additionally, to calculate the similarity of word pairs, the proposed coherence metric uses the ranks of the features in the word vectors instead of the raw values of the vector elements, which increases the correlation of the metric with human judgments. Finally, we show that the proposed metric, despite its low computational cost, achieves 82% correlation with human judgments of topic coherence, a significant improvement over previous metrics.
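To make the idea of a rank-based pairwise similarity concrete, the following is a minimal sketch and not the thesis's actual implementation. It assumes that "feature ranks" means replacing each vector component by its rank within the vector, that the similarity of two words is the correlation of their rank vectors (Spearman-style), and that topic coherence is the mean similarity over all pairs of top words. The function names and the `embeddings` dictionary are hypothetical.

```python
import itertools

import numpy as np


def rank_transform(vec):
    """Replace each component by its rank within the vector (0 = smallest)."""
    order = np.argsort(vec, kind="stable")
    ranks = np.empty(len(vec), dtype=float)
    ranks[order] = np.arange(len(vec))
    return ranks


def rank_similarity(u, v):
    """Correlation of the rank vectors (assumed similarity; Spearman-style)."""
    ru, rv = rank_transform(u), rank_transform(v)
    ru -= ru.mean()
    rv -= rv.mean()
    denom = np.linalg.norm(ru) * np.linalg.norm(rv)
    return float(ru @ rv / denom) if denom > 0 else 0.0


def topic_coherence(top_words, embeddings):
    """Mean pairwise rank similarity over a topic's top words (assumed aggregation).

    Words missing from `embeddings` are skipped.
    """
    vectors = [embeddings[w] for w in top_words if w in embeddings]
    pairs = list(itertools.combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(rank_similarity(u, v) for u, v in pairs) / len(pairs)
```

With a pretrained embedding table loaded into a dictionary that maps words to NumPy arrays, a call such as `topic_coherence(["game", "team", "season"], embeddings)` would return a score in the range -1 to 1, with higher values indicating that the top words have more similar rank profiles.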
In addition to this coherence measure, we also propose a method for visualizing topic models, which displays the relationships between topics based on the embedding vectors of their top words.
Keywords:
#topic model #coherence #LDA #word embedding #word vector
Keeping place: Central Library of Shahrood University