Q247 : Dynamic text clustering using word embedding
Thesis > Central Library of Shahrood University > Computer Engineering > PhD > 2023
Authors:
Mahboubeh Soleymanian [Author], Hoda Mashayekhi [Supervisor]
Abstract: With the rapid advancement of technology and the growing reliance on software, large volumes of text data are generated continuously and rapidly in electronic form. This data is difficult to use effectively without meaningful organization, and because it arrives as a continuous, fast-moving stream, analyzing it demands real-time processing that responds promptly to changes in the data. Dynamic clustering, a method responsive to temporal changes in data, is typically employed for evolving data whose progression requires detailed analysis. Its key property is the capacity to adapt to temporal changes, identify new patterns, and update clusters over time. One significant application of dynamic text clustering is efficient data analysis that draws on current, precise information extracted from the structures and patterns in the data. However, the method faces several challenges: handling temporal changes in the data, updating clusters, identifying new patterns, and maintaining clustering efficiency and accuracy. Designing a model based on text data streams that handles concept change and concept evolution simultaneously is therefore important. To this end, this thesis proposes a conceptual clustering approach for text data streams. The approach dynamically learns evolving concepts and captures both concept change and evolution by summarizing and maintaining the statistical structure of the data, which enables dynamic text analysis. In addition, an incremental fuzzy concept model is proposed that uses fuzzy clustering techniques to adapt to changes in the data and in the cluster distribution based on the value and importance of each concept.
The incremental clustering processes adapt to changes in concepts and data characteristics using a two-stage online and offline structure, extracting concise information from each concept without requiring the number of clusters to be set in advance. This yields a representation based on document concepts without clustering all documents at the start of the process. The first proposed method was tested on the R52, 20N, and T89 text datasets. The results show NMI improvements of 29% and 2% on the first two datasets compared with recent methods such as FGSDMM+ and FPCA/packing. The second proposed method was then tested on the News-T and Tweet-T datasets, where NMI improved by 5% and 3% compared with recent methods such as DCSS, MStream, and OSDM. Finally, on the R52 and 20N datasets, the second proposed method improved NMI by 5% and 31% over the first proposed method, indicating that it is a suitable model for concept-based incremental clustering of text data streams.
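The idea of clustering a stream online without fixing the number of clusters in advance can be illustrated with a minimal sketch. This is not the thesis's method: it assumes each document has already been reduced to a vector (for example, an average of word embeddings), it uses a hard cosine-similarity threshold in place of fuzzy concept modeling, and it omits the offline stage. The names `ClusterSummary`, `cluster_stream`, and `threshold`, and the toy 2-D vectors, are hypothetical and chosen only for illustration.

```python
import math


class ClusterSummary:
    """Running summary of one cluster: document count and sum of vectors (hypothetical helper)."""

    def __init__(self, vec):
        self.count = 1
        self.vec_sum = list(vec)

    def centroid(self):
        return [v / self.count for v in self.vec_sum]

    def add(self, vec):
        # Update the summary in place; the documents themselves are not stored.
        self.count += 1
        self.vec_sum = [s + v for s, v in zip(self.vec_sum, vec)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_stream(vectors, threshold=0.8):
    """Assign each incoming vector to its most similar cluster, or open a new cluster.

    The threshold is an assumed tuning parameter, not a value from the thesis.
    """
    clusters, labels = [], []
    for vec in vectors:
        best, best_sim = None, -1.0
        for idx, cluster in enumerate(clusters):
            sim = cosine(vec, cluster.centroid())
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is not None and best_sim >= threshold:
            clusters[best].add(vec)
        else:
            # No existing cluster is similar enough: a new one emerges.
            clusters.append(ClusterSummary(vec))
            best = len(clusters) - 1
        labels.append(best)
    return labels, clusters


# Toy stream of 2-D "document vectors" (illustrative data only).
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
labels, clusters = cluster_stream(stream, threshold=0.8)
print(labels)  # [0, 0, 1, 1, 0]
```

Because each cluster is kept only as a count and a vector sum, memory grows with the number of clusters rather than the number of documents, which is the general motivation for summary-based stream clustering.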
Keywords:
#Text clustering #Dynamic clustering #Incremental clustering #Word embedding #Concept extraction #Concept-based representation #Fuzzy clustering
Keeping place: Central Library of Shahrood University