Q157 : A new model for text coherence evaluation using statistical characteristics
Thesis > Central Library of Shahrood University > Computer Engineering > PhD > 2019
Authors:
Abstract: Discourse coherence modeling and evaluation is an important subfield of Natural Language Processing. In recent years, text coherence evaluation has attracted increasing interest, but it remains a challenging task. Document coherence evaluation methods fall into two main categories: local and global coherence evaluation. Automatically evaluating or improving coherence is considered one of the most important goals of text processing systems such as document summarization, text generation, text simplification, statistical machine translation, mode detection, question answering, student essay scoring, and the assessment of documents and mixed-topic texts produced by unskilled writers. Therefore, machine-driven NLP tasks tend to measure coherence in order to improve their processing algorithms.
In recent years, there have been several investigations into text coherence evaluation. High-quality systems have also been designed that can produce texts very close to human-written ones. However, most of the proposed models rely on semantic and linguistic concepts of the text. Their main challenges are restriction to a particular domain, limited applicability and extensibility to other languages, complex algorithms, and inaccuracy. Most previous approaches require strong assumptions and specific features to evaluate coherence. Discovering the relationships between different sections of a text and selecting features are often done manually by users. Most proposed methods assess local coherence over only a few adjacent sentences, and their accuracy in evaluating global coherence, especially in long documents, is unacceptably low. Important existing approaches, such as entity-based and graph-based models, depend heavily on semantic and linguistic concepts. Because they limit themselves to word co-occurrence information available in consecutive sentences within a short part of a text, these methods are inaccurate in global coherence evaluation. Few approaches evaluate local and global coherence simultaneously. Those that do achieve acceptable accuracy only for local coherence, not for global coherence. One of their greatest limitations is that they are suitable only for documents with a small number of sentences and perform poorly on long texts.
In this thesis, we attempt to assess coherence and sentence dependency across the whole text using statistical approaches and knowledge hidden in the text. Using Google's word2vec algorithm, the proposed approach converts words into numeric vectors and sentences into numeric matrices. Applying statistical approaches based on recent results in word embeddings, the presented method introduces a simple and efficient model called "ECEM" and studies how to incorporate external word correlation knowledge to assess local and global coherence simultaneously. It also assesses the local topic integrity of the text at the paragraph level, regardless of word meaning and without handcrafted rules. Global coherence in the proposed method is evaluated through the dependency between consecutive paragraphs. The most important feature of the proposed model is its ability to assess local and global coherence simultaneously and with high precision in large texts containing many sentences. The combined local and global coherence evaluation method does not depend on subject matter or word concepts, and it can be extended and applied to other languages.
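The general idea of representing sentences as matrices of word vectors and scoring coherence at two levels can be illustrated with a minimal sketch. This is not the ECEM model itself, whose statistical measures are not specified in this abstract. The toy vectors stand in for trained word2vec embeddings, and the function names, the centroid averaging, and the use of cosine similarity are all assumptions made for illustration only.

```python
import math

# Hypothetical toy embeddings standing in for trained word2vec vectors.
TOY_VECTORS = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.9, 0.2, 0.0],
    "pet": [0.8, 0.3, 0.1],
    "stock": [0.0, 0.1, 1.0],
    "market": [0.1, 0.0, 0.9],
}


def sentence_matrix(sentence, vectors):
    """Return one row (word vector) per known word in the sentence."""
    return [vectors[w] for w in sentence.lower().split() if w in vectors]


def centroid(rows, dim):
    """Column-wise mean of the rows; a zero vector if there are no rows."""
    if not rows:
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def mean_adjacent_similarity(vecs):
    """Average cosine similarity between each vector and its successor."""
    sims = [cosine(a, b) for a, b in zip(vecs, vecs[1:])]
    return sum(sims) / len(sims) if sims else 1.0


def local_coherence(paragraph, vectors, dim=3):
    """Assumed local score: similarity of adjacent sentences in a paragraph."""
    cents = [centroid(sentence_matrix(s, vectors), dim) for s in paragraph]
    return mean_adjacent_similarity(cents)


def global_coherence(document, vectors, dim=3):
    """Assumed global score: similarity of consecutive paragraphs."""
    para_vecs = []
    for paragraph in document:
        rows = [r for s in paragraph for r in sentence_matrix(s, vectors)]
        para_vecs.append(centroid(rows, dim))
    return mean_adjacent_similarity(para_vecs)


if __name__ == "__main__":
    on_topic = [["cat pet", "dog pet"], ["dog cat", "pet dog"]]
    mixed = [["cat pet", "dog pet"], ["stock market", "market stock"]]
    print("global, on-topic:", global_coherence(on_topic, TOY_VECTORS))
    print("global, mixed:   ", global_coherence(mixed, TOY_VECTORS))
```

In this sketch a document is a list of paragraphs and each paragraph is a list of sentence strings. With real embeddings, the vector dimension passed as `dim` would need to match the trained model.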
Keywords:
#Text coherence #local coherence #global coherence #word vector space #language models
Keeping place: Central Library of Shahrood University
Visitor: