Q173 : Estimating semantic-based text similarity by using statistical methods
Thesis > Central Library of Shahrood University > Computer Engineering > MSc > 2020
Authors:
Fereshteh Riahi [Author], Morteza Zahedi [Supervisor]
Abstract: In today’s world, the growing volume of information on the internet and the multiplicity of digital cultures have, for various reasons, increased the amount of similar textual data. Estimating the degree of similarity between texts is therefore necessary. Finding similarities between textual data is also used in areas such as information retrieval systems, plagiarism detection, data mining, document classification, and more. When a sentence, document, or other text is entered into the system, its similarity to existing documents is checked and used in the field in question. So far, various natural language processing and machine learning methods have been introduced to calculate the similarity of textual data. These methods have achieved different degrees of accuracy, and further research is needed to improve this criterion. Statistical methods are one family of corpus-based methods, and combining them with other methods can yield interesting results. In this research, the data is preprocessed before the text is converted to a readable format, in order to reduce the number of features and increase measurement accuracy. Each word is then assigned a correlation score using Latent Dirichlet Allocation; this process is repeated based on the probabilities to improve the scores assigned to the words, and the sentence is placed in the desired category. Doc2bow is also used to represent sentences. Then, to find the most similar sentences, the Jensen-Shannon distance is used, which is obtained by comparing the divergence of the desired label distributions. Finally, support vector machines with linear and radial basis function (RBF) kernels are used to classify similar sentences. The proposed method achieved a classification accuracy of 89 percent and a correlation score of 0.92 on the SICK dataset, and was 3.8 percent more accurate than the MaLSTM approach.
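As a rough illustration of two steps named in the abstract, the sketch below builds a bag-of-words view of a sentence and compares two topic distributions with the Jensen-Shannon distance. It is not the thesis implementation: the helper names (`to_bow`, `jensen_shannon_distance`), the toy vocabulary, and the topic probabilities are all hypothetical, and the topic distributions are typed in by hand instead of being inferred by Latent Dirichlet Allocation.

```python
import math
from collections import Counter


def to_bow(tokens, vocab):
    """Return sorted (word_id, count) pairs for the tokens found in vocab.

    Hypothetical stand-in for a doc2bow-style representation.
    """
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


def _kl_bits(p, m):
    """KL(p || m) in bits; entries with p_i == 0 contribute nothing.

    Only called with m = (p + q) / 2, so m_i > 0 whenever p_i > 0.
    """
    return sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence, in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * _kl_bits(p, m) + 0.5 * _kl_bits(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding


# Toy vocabulary and sentence (assumed for illustration only).
vocab = {"a": 0, "man": 1, "plays": 2, "guitar": 3}
bow = to_bow("a man plays a guitar".split(), vocab)

# Hand-written topic distributions standing in for LDA output.
topics_a = [0.7, 0.2, 0.1]
topics_b = [0.6, 0.3, 0.1]
topics_c = [0.1, 0.1, 0.8]

# A smaller distance means the two sentences have more similar topic mixes.
closer = jensen_shannon_distance(topics_a, topics_b)
farther = jensen_shannon_distance(topics_a, topics_c)
```

In a fuller pipeline, distances of this kind could serve as features for a support vector classifier. How the thesis actually combines them is not stated in the abstract.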
Keywords:
#textsemanticsimilarity #LatentDirichletAllocation #documenttoBagofwords #JensenShannondistance #Supportvectormachines
Keeping place: Central Library of Shahrood University