MM-LLMs: Recent Advances in MultiModal Large Language Models (arXiv:2401.13601)
- Model Architecture: A general framework for MM-LLMs consisting of five components: Modality Encoder, Input Projector, LLM Backbone, Output Projector, and Modality Generator (illustrated in the sketches after this list).
- Modality Encoder: Encodes inputs from non-text modalities (e.g., image, video, audio) into features using pre-trained models (e.g., NFNet-F6, ViT, CLIP ViT, C-Former, Mamba).
- Input Projector: Aligns features from other modalities with the text feature space via linear or non-linear projections (e.g., a linear projector, MLP, cross-attention, or Q-Former).
- LLM Backbone: The core component, a pre-trained LLM (e.g., Flan-T5, UL2, LLaMA), which processes the aligned features and produces textual outputs together with signal tokens for other modalities.
- Output Projector: Maps the LLM's signal-token features into the feature space of the target-modality generator (e.g., via a Tiny Transformer or MLP).
- Modality Generator: Produces outputs in different modalities using pre-trained generative models (e.g., Stable Diffusion for images, Zeroscope for video, AudioLDM for audio), conditioned on the projected features.
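
To make the data flow through the five components concrete, here is a minimal PyTorch sketch. Every module, name, and dimension below (`MMLLMSketch`, `enc_dim`, etc.) is a hypothetical stand-in: real MM-LLMs plug in frozen pre-trained models such as CLIP ViT, LLaMA, and Stable Diffusion for the encoder, backbone, and generator, and typically train only the two projectors.

```python
import torch
import torch.nn as nn

class MMLLMSketch(nn.Module):
    """Toy end-to-end pipeline mirroring the five components.
    All submodules are illustrative stand-ins, not the paper's models."""

    def __init__(self, enc_dim=512, llm_dim=768, gen_dim=320):
        super().__init__()
        # 1. Modality Encoder: stand-in for a frozen image encoder (e.g., CLIP ViT).
        self.modality_encoder = nn.Linear(3 * 224 * 224, enc_dim)
        # 2. Input Projector: aligns modality features with the LLM's text space.
        self.input_projector = nn.Sequential(
            nn.Linear(enc_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        # 3. LLM Backbone: stand-in for a frozen pre-trained LLM (e.g., LLaMA).
        self.llm_backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # 4. Output Projector: maps LLM signal tokens to the generator's space.
        self.output_projector = nn.Linear(llm_dim, gen_dim)
        # 5. Modality Generator: stand-in for a frozen generator (e.g., Stable Diffusion).
        self.modality_generator = nn.Linear(gen_dim, 3 * 64 * 64)

    def forward(self, image, text_embeds):
        feats = self.modality_encoder(image.flatten(1))        # (B, enc_dim)
        vis_tok = self.input_projector(feats).unsqueeze(1)     # (B, 1, llm_dim)
        seq = torch.cat([vis_tok, text_embeds], dim=1)         # prepend image token
        hidden = self.llm_backbone(seq)                        # (B, T+1, llm_dim)
        signal = self.output_projector(hidden[:, -1])          # last token as "signal"
        return self.modality_generator(signal)                 # conditioned output stub

model = MMLLMSketch()
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 12288])
```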
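
The Q-Former mentioned under Input Projector (from BLIP-2) deserves a closer look: a small set of learnable query tokens cross-attends to the encoder's output, compressing it into a fixed number of LLM-aligned tokens. A rough sketch of that idea follows, with hypothetical dimensions and a single attention layer rather than BLIP-2's full transformer stack:

```python
import torch
import torch.nn as nn

class QFormerStyleProjector(nn.Module):
    """Learnable queries cross-attend to encoder features, producing a
    fixed number of tokens for the LLM. Illustrative only, not BLIP-2's code."""

    def __init__(self, num_queries=32, enc_dim=1024, llm_dim=768, nhead=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=llm_dim, kdim=enc_dim, vdim=enc_dim,
            num_heads=nhead, batch_first=True,
        )
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, enc_feats):                              # (B, N, enc_dim)
        q = self.queries.unsqueeze(0).expand(enc_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, enc_feats, enc_feats)      # queries attend to features
        return self.norm(out + q)                              # (B, num_queries, llm_dim)

proj = QFormerStyleProjector()
tokens = proj(torch.randn(2, 257, 1024))  # e.g., 257 ViT patch features
print(tokens.shape)  # torch.Size([2, 32, 768])
```

Whatever the projector's form, the design goal is the same: deliver a small, fixed budget of modality tokens the frozen LLM can consume as if they were text embeddings.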