Visual question answering (VQA) is a multimodal task that requires a deep understanding of the image
scene and the question’s meaning, as well as capturing the relevant correlations between the two modalities to infer the
appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions, posed in Arabic,
about real-world images. To support a robust VQA system, we work in two directions: (1) using deep neural networks,
namely ResNet-152 and Gated Recurrent Units (GRU), to semantically represent the given image and question in a
fine-grained manner; and (2) studying the role of the utilized multimodal bilinear pooling fusion technique in the
trade-off between model complexity and overall model performance. Some fusion techniques can significantly increase
the model complexity, which seriously limits their applicability in VQA models. So far, there is no evidence of how
efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions.
Hence, a comparative analysis is conducted of eight bilinear pooling fusion techniques in terms of their ability to reduce
model complexity and improve model performance in this class of VQA systems. Experiments indicate that these
multimodal bilinear pooling fusion techniques improve the VQA model’s performance, reaching a best
performance of 89.2%. Further, experiments have shown that the number of answers in the developed VQA system
is a critical factor affecting the effectiveness of these multimodal bilinear pooling techniques in achieving their main
objective of reducing model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique
has shown the best balance between model complexity and performance for VQA systems designed to answer
yes/no questions.
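
To make the fusion stage concrete, the following minimal PyTorch sketch fuses ResNet-152 image features with a GRU question encoding using low-rank bilinear pooling (a Hadamard-product fusion from the same bilinear pooling family studied in the paper). The class name, layer names, and dimensions (img_dim, q_dim, joint_dim) are illustrative assumptions, not the paper’s exact configuration or the MLPB technique itself.

import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    # Illustrative low-rank bilinear pooling: project both modalities into a
    # shared space, fuse with an element-wise (Hadamard) product, then map the
    # joint representation to the answer space (here: yes/no). Dimensions are
    # assumptions, not the paper's reported configuration.
    def __init__(self, img_dim=2048, q_dim=512, joint_dim=1024, num_answers=2):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.q_proj = nn.Linear(q_dim, joint_dim)
        self.classifier = nn.Linear(joint_dim, num_answers)

    def forward(self, img_feat, q_feat):
        # img_feat: (batch, 2048) pooled ResNet-152 features
        # q_feat:   (batch, 512)  final GRU hidden state of the question
        joint = torch.tanh(self.img_proj(img_feat)) * torch.tanh(self.q_proj(q_feat))
        return self.classifier(joint)  # logits over {yes, no}

# Example usage with random tensors standing in for real features
fusion = LowRankBilinearFusion()
img = torch.randn(4, 2048)   # batch of 4 image feature vectors
qst = torch.randn(4, 512)    # batch of 4 question encodings
logits = fusion(img, qst)    # shape: (4, 2)

Other bilinear pooling variants compared in such studies differ mainly in how they approximate the full bilinear interaction (e.g., factorized or count-sketch projections), which is where the complexity/performance trade-off discussed above arises.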