With the emergence of huge amounts of heterogeneous multi-modal data (including images, videos, text, audio, and multi-sensor data), deep learning-based methods have shown promising performance on various computer vision and machine learning tasks, such as visual comprehension, video understanding, visual-linguistic analysis, and multi-modal fusion.