Q134 : Scalable opinion mining using cluster computing
Thesis > Central Library of Shahrood University > Computer Engineering > MSc > 2018
Authors:
Mojtaba Asadollahi [Author], Hoda Mashayekhi [Supervisor]
Abstract: The ever-increasing growth of media and social networks, alongside the spread of analytical, commercial, and news webpages, has led to an extensive presence of people from all around the world in such environments. This has, in turn, resulted in enormous propagation and sharing of data and comments. Valuable information can be extracted by analyzing this massive amount of data, but processing it poses several challenges. Conventional methods are too slow to keep up with the rapid growth of the input data flow. To solve this problem, scalable methods and novel big data processing tools may be employed to accelerate processing. Although much research has already been carried out in this area, there is still room for improvement. Low processing speed, inadequate accuracy, high complexity, and domain dependency are among the issues in previous similar studies. Spark is a relatively novel framework for processing big data that performs computation in main memory to increase processing speed. In this research, the support vector machine (SVM) and Spark are used together to classify user opinions. The proposed method maintains desirable accuracy, does not depend on a specific domain, and is simple to implement. To benefit from Spark's distributed processing capability, several parallel SVM methods have been utilized, including cascade SVM, improved cascade SVM, grouped SVM, and voting among multiple SVMs. As a baseline for comparing the efficiency of these methods, the standard SVM has also been implemented. In the present study, the IMDB dataset is used for evaluation. This dataset contains a variety of English-language comments on movies and series. The classification results show a best accuracy of 85.946%.
Moreover, the elapsed time to run the standard SVM algorithm in the Spark environment has dropped by more than a factor of 8 compared to results from similar research. The elapsed time of the parallel SVMs has also improved remarkably over that of the standard SVM. In the best case, the training time of the proposed method is 60 times shorter than the standard SVM training time.
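The abstract names voting among multiple SVMs as one of the parallel schemes. The following is a minimal sketch of that general idea only, not the thesis implementation: the training data is split into partitions, one linear SVM is trained per partition, and predictions are combined by majority vote. Everything here is an assumption made for illustration: the function names, the toy two-cluster data standing in for vectorized reviews, the Pegasos-style trainer, and the omission of a bias term. Plain Python loops stand in for Spark partitions; no Spark API is used.

```python
import random


def make_toy_data(n, seed):
    """Hypothetical stand-in for vectorized reviews: two separated 2-D clusters."""
    rng = random.Random(seed)
    samples, labels = [], []
    for _ in range(n):
        y = rng.choice([-1, 1])  # -1 = negative opinion, +1 = positive opinion
        # Cluster centre at (2, 2) or (-2, -2), with uniform noise in [-1, 1].
        samples.append([y * 2.0 + rng.uniform(-1.0, 1.0) for _ in range(2)])
        labels.append(y)
    return samples, labels


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(samples, labels, epochs=20, lam=0.01, seed=0):
    """Pegasos-style stochastic subgradient descent on the hinge loss.

    Assumption: no bias term, so the data is expected to be roughly centred.
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    order = list(range(len(samples)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            x, y = samples[i], labels[i]
            violated = y * dot(w, x) < 1  # margin check uses the current weights
            w = [(1.0 - eta * lam) * wj for wj in w]  # L2 regularization shrink
            if violated:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def train_partitioned(samples, labels, num_partitions):
    """Train one independent SVM per partition (each could run on its own worker)."""
    models = []
    for p in range(num_partitions):
        idx = range(p, len(samples), num_partitions)  # strided split of the data
        part_x = [samples[i] for i in idx]
        part_y = [labels[i] for i in idx]
        models.append(train_linear_svm(part_x, part_y, seed=p))
    return models


def predict_by_vote(models, x):
    """Majority vote over the per-partition classifiers."""
    votes = sum(1 if dot(w, x) >= 0 else -1 for w in models)
    return 1 if votes >= 0 else -1


train_x, train_y = make_toy_data(200, seed=1)
models = train_partitioned(train_x, train_y, num_partitions=5)
test_x, test_y = make_toy_data(100, seed=2)
correct = sum(predict_by_vote(models, x) == y for x, y in zip(test_x, test_y))
print("accuracy:", correct / len(test_y))
```

Because each partition is trained independently, training cost per model drops as the number of partitions grows, which is the source of the speed-up that such schemes aim for. An odd number of partitions avoids tied votes.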
Keywords:
#opinion mining #Support Vector Machine #parallel processing #Spark
Keeping place: Central Library of Shahrood University