Q302 : Generalization improvement in the face of out-of-distribution imbalanced data
Thesis > Central Library of Shahrood University > Computer Engineering > MSc > 2025
Authors:
Abstract:
In recent years, textual content analysis and automatic sentence classification in social media have become important topics in Natural Language Processing (NLP). A major challenge in this domain is class imbalance, which degrades model performance, particularly on minority classes. This study proposes a lightweight, optimized model based on the DistilBERT architecture for classifying tweets into three categories: hate speech, offensive language, and neutral. To improve performance, preprocessing steps such as the removal of invalid entries and duplicate samples were applied, along with a cost-sensitive training strategy to address the data imbalance. The model was trained on a real-world dataset of nearly 20,000 tweets and achieved an overall accuracy of 93% and a macro F1-score of 0.79 in the final evaluation. The results show that the proposed model, despite its simplicity and small size, performs competitively with larger models such as BERT, making it a suitable option for real-world applications in resource-constrained environments.
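The abstract does not state how the cost-sensitive strategy weights each class or how the macro F1-score was computed. The following is a minimal sketch, assuming inverse-frequency ("balanced") class weights and the standard unweighted mean of per-class F1 scores. The function names and the ten-label toy set are hypothetical and are not taken from the thesis. It also shows why accuracy alone can look good on imbalanced data while macro F1 stays low.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumption: the thesis does not specify its weighting scheme; this is
    one common choice for cost-sensitive training.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy labels (not the thesis dataset): one rare class.
toy_true = ["hate"] * 1 + ["offensive"] * 7 + ["neutral"] * 2
weights = balanced_class_weights(toy_true)

# A classifier that always predicts the majority class.
toy_pred = ["offensive"] * len(toy_true)
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_macro_f1 = macro_f1(toy_true, toy_pred)
```

On this toy set the rare class receives the largest weight, and the majority-class predictor reaches 70% accuracy while its macro F1 is far lower, because two of the three classes score zero. Weights of this kind are typically passed to the per-class weighting argument of a classification loss during fine-tuning.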
Keywords:
#Natural Language Processing (NLP) #DistilBERT #hate speech #offensive language #data imbalance #F1-score #BERT
Keeping place: Central Library of Shahrood University