Deep Learning Architectures for Multimodal Data Fusion in Natural Language Processing and Computer Vision
Keywords:
Multimodal data fusion, natural language processing, computer vision, deep learning architectures, attention mechanisms, visual question answering
Abstract
Multimodal data fusion combines information from multiple modalities, such as text and images, to build richer representations for natural language processing (NLP) and computer vision (CV) tasks. Deep learning architectures have become a cornerstone of such fusion because they can capture complex patterns within each modality and interactions between modalities. This paper surveys prominent deep learning approaches to multimodal data fusion, including feature concatenation, attention mechanisms, and modality-specific encoders. We also discuss the challenges of integrating heterogeneous data sources, in particular modality imbalance and information alignment. The discussion traces the evolution of multimodal architectures and emphasizes their role in advancing tasks such as visual question answering, image captioning, and text-to-image synthesis.
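As a rough illustration of two of the fusion strategies named above, the following minimal sketch contrasts feature concatenation with text-guided attention over image regions. It is not taken from any specific model discussed in this paper. The function names (`concat_fusion`, `attention_fusion`), the toy two-dimensional vectors, and the choice of scaled dot-product scoring are all assumptions made for illustration, and the inputs are assumed to come from separate modality-specific encoders (for example, a text encoder and an image encoder) that are not shown.

```python
import math


def concat_fusion(text_vec, image_vec):
    """Feature concatenation: join the two feature vectors end to end."""
    return list(text_vec) + list(image_vec)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fusion(text_query, region_feats):
    """Text-guided attention: weight image regions by similarity to the text.

    Uses scaled dot-product scoring (an assumption for this sketch) and
    returns the attention weights and the weighted sum of region features.
    """
    dim = len(text_query)
    scores = [
        sum(q * r for q, r in zip(text_query, region)) / math.sqrt(dim)
        for region in region_feats
    ]
    weights = softmax(scores)
    attended = [
        sum(w * region[i] for w, region in zip(weights, region_feats))
        for i in range(dim)
    ]
    return weights, attended


# Toy features standing in for encoder outputs (hypothetical values).
text_vec = [1.0, 0.0]
regions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

fused = concat_fusion(text_vec, regions[0])
weights, attended = attention_fusion(text_vec, regions)
```

Concatenation treats every input feature equally and leaves cross-modal interaction to later layers, whereas the attention variant lets the text representation decide which image regions contribute most to the fused vector.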