Q56 : A statistical approach to combining multi-part words
Thesis > Central Library of Shahrood University > Computer Engineering > MSc > 2014
Authors:
Arezoo Arjomandzadeh [Author], Morteza Zahedi [Supervisor]
Abstract: Persian contains many words made up of several parts, and standard Persian script writes a half-space between those parts to keep the word together as a single unit. The half-space therefore plays an important role in the readability of a text and in the reader's understanding of its meaning. Moreover, in natural language processing tasks such as machine translation, word boundary detection has a considerable impact on system performance. This thesis presents a new statistical method, based on statistical machine translation, for editing Persian text. In this method, the spaces between the parts of multi-part words are replaced with half-spaces with the aid of statistical machine translation. Linguistic information is extracted from a parallel corpus and then used to identify and edit multi-part words. Training requires a parallel corpus with unedited text on one side and the edited text on the other; this corpus was created as part of the thesis. The results show that the method is effective at detecting the spaces between the parts of multi-part words more accurately and replacing them with half-spaces.
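The parallel-corpus idea in the abstract can be illustrated with a small sketch. It is not the thesis's system: the thesis uses statistical machine translation, whereas the code below swaps in a plain count-based lookup over adjacent token pairs so that it stays short and runnable. It assumes that the half-space is encoded as the zero-width non-joiner (U+200C) and that the unedited side can be derived from the edited side by replacing each half-space with a space; the abstract does not say how the corpus was actually built. The function names `make_parallel_pair`, `learn_joins`, and `apply_joins` are hypothetical.

```python
import re
from collections import Counter

ZWNJ = "\u200c"  # zero-width non-joiner, assumed here to encode the Persian half-space


def make_parallel_pair(edited):
    """Return (unedited, edited). Assumption: unedited text uses a plain space
    wherever the edited text uses a half-space."""
    return edited.replace(ZWNJ, " "), edited


def learn_joins(edited_sentences):
    """Count, for each adjacent pair of parts, how often the edited text
    separates them with a half-space versus a plain space."""
    joined, spaced = Counter(), Counter()
    splitter = re.compile("([ " + ZWNJ + "])")  # keep the separators in the result
    for sentence in edited_sentences:
        pieces = splitter.split(sentence)  # [part, sep, part, sep, ..., part]
        for i in range(1, len(pieces) - 1, 2):
            pair = (pieces[i - 1], pieces[i + 1])
            if pieces[i] == ZWNJ:
                joined[pair] += 1
            else:
                spaced[pair] += 1
    return joined, spaced


def apply_joins(unedited, joined, spaced):
    """Replace a space with a half-space when the pair was seen joined
    more often than spaced. A simplified stand-in for SMT decoding."""
    tokens = unedited.split(" ")
    out = tokens[0]
    for prev, cur in zip(tokens, tokens[1:]):
        sep = ZWNJ if joined[(prev, cur)] > spaced[(prev, cur)] else " "
        out += sep + cur
    return out


# Toy edited corpus: "I go" (prefix + verb stem) and "books" (noun + plural suffix).
edited_corpus = [
    "من " + "می" + ZWNJ + "روم",
    "کتاب" + ZWNJ + "ها",
]
pairs = [make_parallel_pair(s) for s in edited_corpus]
joined, spaced = learn_joins(edited_corpus)
restored = [apply_joins(unedited, joined, spaced) for unedited, _ in pairs]
```

A lookup over exact pairs cannot generalize to unseen words, which is one reason a translation model trained on a large parallel corpus is a more plausible fit for the task the abstract describes.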
Keywords:
#Persian Multi-Part Words #Spacing Rules #Statistical Machine Translation #Persian Parallel Corpora #Combining Persian Multi-part Words
Keeping place: Central Library of Shahrood University