TK364 : Subject-Analysis and Author-Attribution in Farsi-Arabic Documents Using Structural Information of the Text
Thesis > Central Library of Shahrood University > Electrical Engineering > MSc > 2014
Authors:
Ali Shahnama [Author], Alireza Ahmadifard [Supervisor], Hosein Marvi [Supervisor], Morteza Zahedi [Advisor]
Abstract: Authorship attribution is a subfield of text processing that aims to determine the identity of a text's writer. In other words, the main objective of this subfield is to design a system that can "attribute" an unlabeled text (a text with an unknown author) to one of a set of candidate authors. Designing such a system requires access to some labeled texts (texts with a known author) for each candidate author. All previous research on authorship attribution of Persian texts has relied on NLP-based methods; the main objective of this thesis is to study the performance of NDP methods on Persian authorship attribution problems. These methods are built on "n-grams" and are completely independent of NLP systems. In this thesis the most important NDP methods are studied, and two novel methods based on them are then proposed. The first proposed method (CNG-WIS) uses the "indices" of n-grams rather than their "frequencies". The second proposed method (VNG) uses "variant n-grams" rather than "most-frequent n-grams". To evaluate the studied methods and compare the proposed methods with them, we use four different corpora (i.e. databases). One of them was gathered by the author and contains 145 texts from 6 contemporary Persian authors. The results indicate that, like the studied NDP methods, the two proposed methods are effective at solving authorship attribution problems in Persian and Arabic texts. Finally, two special problems in Persian literature are studied: Golestan's Rivals and Indian-Styled Sonnets. For this purpose, two additional corpora were gathered by the author: GBP (75 Hekayats from 3 authors) and SBH (90 sonnets from 3 poets). The results indicate that, in addition to Persian prose texts, NDP methods perform well on Hekayats (a mixture of prose and poetry) and on poems.
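The abstract describes profile-based attribution built on n-grams. As a rough illustration only, the following Python sketch shows a generic frequency-based character n-gram profile method in the style of the Common N-Grams (CNG) approach. It is not the thesis's CNG-WIS or VNG method, whose details are not given in this record. The function names, the n-gram length, the profile size, and the toy texts are all assumptions made for the example.

```python
from collections import Counter


def profile(text, n=3, size=500):
    """Map the `size` most frequent character n-grams to their relative frequencies."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return {g: c / total for g, c in counts.most_common(size)}


def dissimilarity(p1, p2):
    """CNG-style relative distance between two profiles (0 means identical)."""
    score = 0.0
    for g in set(p1) | set(p2):
        f1 = p1.get(g, 0.0)
        f2 = p2.get(g, 0.0)
        # At least one of f1, f2 is positive, so the denominator is never zero.
        score += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return score


def attribute(unknown_text, labeled_texts, n=3, size=500):
    """Return the candidate author whose profile is closest to the unknown text.

    `labeled_texts` maps an author name to a list of that author's known texts
    (a hypothetical input format chosen for this sketch).
    """
    target = profile(unknown_text, n, size)
    scores = {
        author: dissimilarity(target, profile(" ".join(texts), n, size))
        for author, texts in labeled_texts.items()
    }
    return min(scores, key=scores.get)


# Toy data, invented purely for illustration.
labeled = {"A": ["aaaa bbbb aaaa"], "B": ["xyzxyz xyzxyz"]}
guess = attribute("aaaa bbbb", labeled, n=2)
```

Because the sketch works on raw characters, it needs no tokenizer, stemmer, or other language-specific tooling, which is the kind of NLP independence the abstract refers to.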
Keywords:
#authorship attribution #n-gram #profile-based methods #style marker #CPPT corpus #GBP corpus #SBH corpus
Keeping place: Central Library of Shahrood University