Q263 : Semi-supervised data stream clustering
Thesis > Central Library of Shahrood University > Computer Engineering > MSc > 2022
Authors:
[Author], Hoda Mashayekhi [Supervisor]
Abstract: A data stream is an unbounded, ordered sequence of instances generated at high speed and in high volume by different information sources. Data streams differ from traditional stored data in many respects. In most cases, true class labels are not available for all stream instances, and there is no prior information about the number of classes. Semi-supervised clustering is therefore a suitable data mining and analysis method for data streams. Today's data sets are too large for main memory and must be kept in secondary storage, so the random access patterns used by traditional data mining methods become very costly. Common data mining algorithms need several passes over the data and access to old data; under memory limitations they are inefficient, and they are too slow to be practical for processing huge volumes of stream data. In studies of the online phase, two main data structures are used to store summary information about the data: the grid and the micro-cluster. Our method uses the micro-cluster structure. This thesis presents a fast, improved algorithm for data stream clustering that uses online micro-clusters to summarize stream data in a compact form. We use an online micro-cluster-based learning model that automatically learns the reliability, or importance, of these micro-clusters over time through an error-driven method and selects micro-clusters dynamically. In addition, predictions were made on randomly selected labeled and unlabeled data. Because the k-means clustering method depends on the choice of initial points, we selected suitable initial points using the k-means++ algorithm. Our learning model is also able to detect a new class in the data set. The model learns from the training data it is given and is then evaluated on the test data.
The results show that the proposed method performs better than the other methods.
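As a rough illustration of the micro-cluster summary mentioned in the abstract, the following Python sketch keeps only a count, a per-dimension linear sum, and a per-dimension squared sum for each micro-cluster, so that the centroid and radius can be recovered without storing the raw instances. It is not the thesis's algorithm: the class name `MicroCluster`, the `reliability` field, the additive update rule, and the `step` parameter are all assumptions made for this example, and the abstract does not specify how reliability is actually learned.

```python
import math


class MicroCluster:
    """Compact summary of nearby stream instances (illustrative only)."""

    def __init__(self, point, label=None):
        self.n = 1                          # number of absorbed instances
        self.ls = list(point)               # per-dimension linear sum
        self.ss = [x * x for x in point]    # per-dimension squared sum
        self.label = label                  # None for an unlabeled cluster
        self.reliability = 1.0              # hypothetical importance weight

    def absorb(self, point):
        """Fold one new instance into the summary in O(d) time."""
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square distance of absorbed instances from the centroid."""
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))

    def update_reliability(self, correct, step=0.1):
        """Assumed error-driven rule: reward a correct prediction,
        penalize a wrong one, and clamp the weight to [0, 1]."""
        delta = step if correct else -step
        self.reliability = min(1.0, max(0.0, self.reliability + delta))


def nearest(point, clusters):
    """Return the micro-cluster whose centroid is closest to `point`."""
    return min(clusters, key=lambda mc: math.dist(point, mc.centroid()))


mc_a = MicroCluster([0.0, 0.0], label="a")
mc_a.absorb([2.0, 0.0])
mc_b = MicroCluster([10.0, 10.0], label="b")
closest = nearest([1.5, 0.5], [mc_a, mc_b])
mc_b.update_reliability(correct=False)
```

Because each update touches only the three summary fields, an instance can be discarded as soon as it is absorbed, which matches the single-pass, limited-memory setting the abstract describes.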
Keywords:
#data stream #clustering #labeled data #micro-clusters
Keeping place: Central Library of Shahrood University