
Dr. Shimaa Ibrahim Hassan Rizk :: Publications:

Title:
Performance vs. Complexity Comparative Analysis of Multimodal Bilinear Pooling Fusion Approaches for Deep Learning-Based Visual Arabic-Question Answering Systems
Authors: Sarah Kamel, Mai Fadel, Lamiaa Elrefaei, Shimaa Hassan
Year: 2025
Keywords: Arabic-VQA; deep learning-based VQA; deep multimodal information fusion; multimodal representation learning; VQA of yes/no questions; VQA model complexity; VQA model performance; performance-complexity trade-off
Journal: Computer Modeling in Engineering & Sciences
Volume: 143
Issue: 1
Pages: 373
Publisher: Tech Science Press
Local/International: International
Paper Link:
Full paper Shimaa Ibrahim Hassan Rizk_Performance vs. Complexity Comparative Analysis of Multimodal Bilinear.pdf
Supplementary materials Not Available
Abstract:

Visual question answering (VQA) is a multimodal task, involving a deep understanding of the image scene and the question’s meaning and capturing the relevant correlations between both modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images, in Arabic. To support a robust VQA system, we work in two directions: (1) Using deep neural networks to semantically represent the given image and question in a fine-grained manner, namely ResNet-152 and Gated Recurrent Units (GRU). (2) Studying the role of the utilized multimodal bilinear pooling fusion technique in the trade-off between the model complexity and the overall model performance. Some fusion techniques could significantly increase the model complexity, which seriously limits their applicability for VQA models. So far, there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions. Hence, a comparative analysis is conducted between eight bilinear pooling fusion techniques, in terms of their ability to reduce the model complexity and improve the model performance in this case of VQA systems. Experiments indicate that these multimodal bilinear pooling fusion techniques have improved the VQA model’s performance, until reaching the best performance of 89.25%. Further, experiments have proven that the number of answers in the developed VQA system is a critical factor that affects the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing the model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique has shown the best balance between the model complexity and its performance, for VQA systems designed to answer yes/no questions.
