Q193 : Authorship attribution with statistical modeling on text data
Thesis > Central Library of Shahrood University > Kharazmi Int. Campus & e-Learning Center > MSc > 2016
Authors:
Samane Vazirian [Author], Morteza Zahedi [Supervisor], Prof. Hamid Hassanpour [Advisor]
Abstract: Authorship attribution (AA), or author identification, is the problem of determining who wrote a disputed or previously unseen text. In the closed-class authorship attribution problem, the unseen text is assigned to one of a set of candidate authors for whom text samples are available as training data. The two main requirements of an authorship attribution system are the features and the attribution method. Features are usually selected using the training data. As the volume of text in different languages grows, there is a clear need for an authorship attribution system that is language independent. Because extracting character n-grams and word n-grams is language independent and requires no special tools, this thesis uses them as features. Language modeling is chosen as the statistical, probabilistic attribution method. We present an approach based on language modeling, called modified language modeling, which addresses the AA problem by combining word-bigram weighting with word-unigram weighting. Moreover, rather than removing stop words and balancing word probabilities, the IDF value multiplied by the corresponding word probability is used as the weight. Four corpora are used for evaluation: two Persian poetry datasets, one Persian prose dataset, and one English prose dataset. AA accuracy is measured for language modeling with character n-grams, language modeling with word n-grams, and modified language modeling on all four datasets. On every dataset, modified language modeling improves accuracy. The best performance, 100 percent, is obtained by modified language modeling on the Persian prose dataset.
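The abstract does not give the thesis's formulas, so the following is only a minimal sketch of the general idea it describes: score an unseen text under a per-author word language model that mixes bigram and unigram probabilities, and weight each word by its IDF. The class and function names, the add-one smoothing, the interpolation weight `lam`, the smoothed IDF formula, and the choice to multiply IDF into the log-probability are all assumptions made for illustration, not the author's method. The toy corpus is invented.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization keeps the sketch language independent.
    return text.lower().split()


class AuthorModel:
    """Add-one-smoothed unigram and bigram word counts for one author."""

    def __init__(self, texts, vocab_size):
        self.vocab_size = vocab_size
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.contexts = Counter()  # how often a word starts a bigram
        for text in texts:
            tokens = tokenize(text)
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
            self.contexts.update(tokens[:-1])
        self.total = sum(self.unigrams.values())

    def p_unigram(self, word):
        return (self.unigrams[word] + 1) / (self.total + self.vocab_size)

    def p_bigram(self, prev, word):
        count = self.bigrams[(prev, word)]
        return (count + 1) / (self.contexts[prev] + self.vocab_size)


def train(corpus):
    """corpus maps an author name to a list of training texts."""
    vocab = {w for texts in corpus.values() for t in texts for w in tokenize(t)}
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    models = {a: AuthorModel(texts, vocab_size) for a, texts in corpus.items()}

    # Document frequency over all training texts (assumed IDF variant).
    docs = [set(tokenize(t)) for texts in corpus.values() for t in texts]
    df = Counter(w for d in docs for w in d)
    n_docs = len(docs)

    def idf(word):
        return math.log((n_docs + 1) / (df[word] + 1)) + 1.0

    return models, idf


def score(text, model, idf, lam=0.5):
    """IDF-weighted log-likelihood of text under one author's model."""
    total = 0.0
    prev = None
    for word in tokenize(text):
        p = model.p_unigram(word)
        if prev is not None:
            # Interpolate bigram and unigram estimates (assumed mixing rule).
            p = lam * model.p_bigram(prev, word) + (1 - lam) * p
        total += idf(word) * math.log(p)
        prev = word
    return total


def attribute(text, models, idf, lam=0.5):
    """Return the candidate author whose model scores the text highest."""
    return max(models, key=lambda a: score(text, models[a], idf, lam))


# Invented toy data, only to show how the pieces fit together.
toy_corpus = {
    "A": ["the sea the ship the sail"],
    "B": ["the mountain the rock the snow"],
}
models, idf = train(toy_corpus)
print(attribute("the ship and the sea", models, idf))
```

Because frequent function words receive a low IDF, they contribute less to the score without being removed outright, which is one plausible reading of why the abstract contrasts IDF weighting with stop-word removal.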
Keywords:
#Authorship Attribution #Authorship Identification #Language Modeling #WMPR-AA2016-A corpora #WMPR-AA2016-B corpora
Keeping place: Central Library of Shahrood University