Q280 : A multimodal analysis of image-text relation in question answering systems
Thesis > Central Library of Shahrood University > Computer Engineering > MSc > 2024
Authors:
[Author], Mansoor Fateh[Supervisor], Hossein Morshedlou[Advisor]
Abstract: Visual Question Answering (VQA) is an interdisciplinary field that combines computer vision and natural language processing to answer questions about the content of images. This thesis examines the challenges and solutions in this area, focusing on improving the accuracy and performance of VQA models. By analyzing and evaluating existing VQA architectures and introducing a new model, the research aims to present optimized methods for integrating textual and visual information. Among its innovations is the use of deep learning techniques such as Convolutional Neural Networks (CNNs) and advanced language models like BERT to extract semantic and visual features and to enhance the reasoning and answering process. The proposed algorithm was tested on several datasets, including VQA v2, and the results indicate that the proposed architecture, in addition to addressing existing challenges, achieved an accuracy of 73.3%. This is a significant improvement over previous methods such as LoRRA, and the approach can be applied to various human-machine interaction scenarios, including assistance for visually impaired individuals, surveillance systems, and medical image analysis.
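The fusion of visual features (as produced by a CNN backbone) with textual features (as produced by a BERT encoder) described above can be sketched as a minimal late-fusion answer classifier. All dimensions, variable names, and the plain concatenation strategy here are illustrative assumptions for clarity, not the thesis's actual architecture:

```python
import math
import random

random.seed(0)


def linear(x, weights, bias):
    """Compute y = Wx + b with plain-Python lists (stand-in for a dense layer)."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_j
            for row, b_j in zip(weights, bias)]


def softmax(scores):
    """Turn raw answer scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer_probs(visual_feat, text_feat, weights, bias):
    """Late fusion: concatenate modality features, then classify over answers."""
    fused = visual_feat + text_feat  # simple concatenation fusion (illustrative)
    return softmax(linear(fused, weights, bias))


# Toy example: 4-dim visual features, 4-dim text features, 3 candidate answers.
visual = [random.random() for _ in range(4)]   # stand-in for CNN output
text = [random.random() for _ in range(4)]     # stand-in for BERT output
W = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
b = [0.0, 0.0, 0.0]

probs = answer_probs(visual, text, W, b)
```

Real VQA systems typically replace the concatenation step with richer interactions (e.g. attention between question tokens and image regions), but the overall flow — encode each modality, fuse, score candidate answers — matches the pipeline the abstract outlines.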
Keywords:
#Visual Question Answering #Computer Vision #Natural Language Processing #Deep Learning #Convolutional Neural Networks #BERT
Keeping place: Central Library of Shahrood University