
Multimodal Large Language Models Released in 2024 and the Design Methods They Employ


In today's post, I will review the 2024 literature on multimodal large language models (LLMs), focusing primarily on work published in the past few months to keep the scope manageable.

This is therefore not a historical overview or a comprehensive survey of multimodal LLMs, but rather a brief look at recent advances. I will also try to keep the summaries concise and avoid too much tangential content, since ten studies are covered.

The conclusion section at the end of the article will provide a summary comparing the methods used in these papers.

4.1 Llama 3 model series

Meta AI's Llama 3 model series paper (published July 31, 2024) came out this summer, which already feels like a long time ago in the LLM space. However, considering that they only described their multimodal models in the paper and did not release them until much later, I think it makes sense to include Llama 3 in this list. (The Llama 3.2 models were officially announced and made openly available on September 25, 2024.)

Llama 3.2 comes in 11-billion-parameter and 90-billion-parameter multimodal versions. These models are based on the cross-attention approach described earlier, as shown in the figure below.

Schematic of the multimodal LLM method for Llama 3.2

(Annotated figure from the Llama 3 paper: /abs/2407.21783; the video and speech parts are visually de-emphasized here to highlight the image modality.)

Note that while the figure shows video and speech as possible modalities, as of the time of writing, the published model only supports images and text.

Llama 3.2 uses a cross-attention approach, but it differs somewhat from what I described earlier. Usually in multimodal LLM development, the image encoder's parameters are frozen and only the language model's parameters are updated. Here, the researchers took almost the opposite approach: they updated the image encoder's parameters while leaving the language model unchanged. The researchers note that this was intentional, in order to preserve the text-only capabilities, so that the 11-billion and 90-billion parameter multimodal models can serve as drop-in replacements for Llama 3.1's 8-billion and 70-billion parameter text-only models on text tasks.
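
To make this concrete, here is a minimal PyTorch-style sketch of that freezing scheme. The three modules are hypothetical stand-ins for the parameter groups described above, not the actual Llama 3.2 components:

```python
import torch
from torch import nn

# Hypothetical stand-ins for the three parameter groups; the real Llama 3.2
# modules are far larger and structured differently.
language_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2)
image_encoder = nn.Sequential(nn.Conv2d(3, 512, kernel_size=14, stride=14), nn.Flatten(2))
cross_attention_adapters = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

# Freeze the text model to preserve its text-only abilities ...
for p in language_model.parameters():
    p.requires_grad = False

# ... and train only the image encoder and the newly added adapter layers.
trainable = list(image_encoder.parameters()) + list(cross_attention_adapters.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```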

Training process

Training was done in multiple iterations, starting from the Llama 3.1 text model. After adding the image encoder and projection (i.e., "adapter") layers, the model is pre-trained on paired image-text data. Then, similar to the Llama 3 text-only training process (which I wrote about in a previous post), the model undergoes instruction fine-tuning and preference fine-tuning.

Instead of using a pre-trained model such as CLIP as the image encoder, the researchers pre-trained a vision transformer (ViT) from scratch. Specifically, they used the ViT-H/14 variant (630 million parameters) of the classic vision transformer architecture (Dosovitskiy et al., 2020). They pre-trained it on a dataset of 2.5 billion image-text pairs for 5 epochs, before connecting the image encoder to the LLM. (The image encoder takes 224 × 224 resolution images and splits them into a 16 × 16 grid of patches, each 14 × 14 pixels.)
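
As a quick sanity check of those numbers, the patch arithmetic works out as follows (a back-of-the-envelope sketch, not code from the paper):

```python
image_size = 224
patch_size = 14                      # the "/14" in ViT-H/14
grid = image_size // patch_size      # patches per side
print(f"{grid} x {grid} grid -> {grid * grid} patch tokens per image")  # 16 x 16 -> 256
```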

Since the cross-attention layers add a significant number of parameters, they are only added in every fourth transformer block. (For the 8-billion-parameter model, this adds 3 billion parameters; for the 70-billion-parameter model, it adds 20 billion.)
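
A rough sketch of this interleaving pattern, using stand-in blocks (the real Llama 3.2 blocks include gating and other details not shown here):

```python
from torch import nn

class DecoderBlock(nn.Module):
    """Stand-in for a regular (frozen) transformer block of the LLM."""
    def __init__(self, dim):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
    def forward(self, x, image_feats=None):
        return self.layer(x)

class CrossAttentionBlock(nn.Module):
    """Trainable adapter block that attends from text tokens to image features."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)
    def forward(self, x, image_feats=None):
        if image_feats is None:
            return x                      # text-only inputs skip the adapter
        out, _ = self.attn(x, image_feats, image_feats)
        return self.norm(x + out)

def build_stack(num_blocks=16, dim=512, every=4):
    # Insert one cross-attention block after every 4 regular decoder blocks.
    blocks = []
    for i in range(num_blocks):
        blocks.append(DecoderBlock(dim))
        if (i + 1) % every == 0:
            blocks.append(CrossAttentionBlock(dim))
    return nn.ModuleList(blocks)
```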

4.2 Molmo and PixMo: Exploring the Frontiers of Multimodal Modeling with Open Source Weights and Data

The Molmo and PixMo: Exploring the Frontiers of Multimodal Models with Open-Source Weights and Data paper (September 25, 2024) is noteworthy because it promises to open-source not only the model weights but also the dataset and source code, analogous to the text-only OLMo LLM. (This is great for LLM research, because it allows researchers to inspect the full training process and code, run ablation studies, and reproduce results on the same dataset.)

If you're wondering why there are two names in the paper's title: Molmo (Multimodal Open Language Model) refers to the model, and PixMo (Pixels for Molmo) is the corresponding dataset.


Molmo Decoder-Only Method Schematic (Method A)

(Annotated figure from the Molmo and PixMo paper: /abs/2409.17146)

As shown above, the image encoder uses an off-the-shelf Vision Transformer, specifically the CLIP model. The "Connector" here refers to the "Projector", which serves to align the image features with the language model.

Molmo simplifies training by avoiding multiple pre-training stages in favor of a simpler, unified training pipeline. This approach updates all parameters, including those of the base LLM, the connector, and the image encoder.
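
In code, this unified recipe boils down to putting every component into a single optimizer. A minimal sketch with stand-in modules (the layer sizes and per-module learning rates are illustrative, not Molmo's actual settings):

```python
import torch
from torch import nn

# Hypothetical stand-ins for Molmo's three components (CLIP-style encoder,
# connector/projector, LLM backbone); only the wiring matters here.
vision_encoder = nn.Linear(768, 768)
connector      = nn.Linear(768, 4096)     # projects image features into LLM space
llm_backbone   = nn.Linear(4096, 4096)

# Unlike multi-stage pipelines, everything is trainable in one pass,
# optionally with different learning rates per module.
optimizer = torch.optim.AdamW([
    {"params": vision_encoder.parameters(), "lr": 1e-5},
    {"params": connector.parameters(),      "lr": 1e-4},
    {"params": llm_backbone.parameters(),   "lr": 1e-5},
])
```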

The Molmo team offers several options for the base LLM:

• OLMo-7B-1024 (a fully open-source model backbone, including data and code)

• OLMoE-1B-7B (a mixture-of-experts architecture; the most efficient option)

• Qwen2 7B (an open-weight model that outperforms OLMo-7B-1024)

• Qwen2 72B (an open-weight model, and the best-performing option)

4.3 NVLM: open frontier-level multimodal LLMs

NVIDIA's NVLM: Open Frontier-Level Multimodal LLM paper (September 17, 2024) is very interesting because it doesn't just focus on one approach, but explores both:

- Method A: the Unified Embedding Decoder Architecture ("decoder-only architecture", NVLM-D);

- Method B: the Cross-Modality Attention Architecture ("cross-attention-based architecture", NVLM-X).

In addition, they developed a hybrid method (NVLM-H) and provided a fair comparison of the three methods.

Overview of three multimodal approaches

(Annotated figure taken from the paper "NVLM: Open Frontier-Level Multimodal LLM":/abs/2409.11402)

As summarized in the figure above, NVLM-D corresponds to Method A and NVLM-X to Method B, as discussed in the previous section. The concept of the hybrid model (NVLM-H) is to combine the strengths of both approaches: a thumbnail of the image is fed in first, and then a dynamic number of image patches is passed through cross-attention to capture higher-resolution detail.
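
To illustrate the hybrid idea, here is a toy PyTorch sketch in which coarse thumbnail tokens join the decoder sequence (Method A) while high-resolution tile features are consumed through cross-attention (Method B). All shapes and modules are placeholders, not NVIDIA's actual implementation:

```python
import torch
from torch import nn

class HybridMultimodalBlock(nn.Module):
    """Toy sketch of the NVLM-H idea: thumbnail tokens sit in the sequence,
    high-resolution tiles are attended to via cross-attention."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens, thumbnail_tokens, tile_tokens):
        # Method A path: thumbnail embeddings join the token sequence.
        seq = torch.cat([thumbnail_tokens, text_tokens], dim=1)
        seq, _ = self.self_attn(seq, seq, seq)
        # Method B path: the sequence cross-attends to high-res tile features.
        out, _ = self.cross_attn(seq, tile_tokens, tile_tokens)
        return out

block = HybridMultimodalBlock()
text = torch.randn(1, 32, 512)        # 32 text tokens
thumb = torch.randn(1, 64, 512)       # tokens from a low-res thumbnail
tiles = torch.randn(1, 4 * 256, 512)  # tokens from 4 high-res tiles
print(block(text, thumb, tiles).shape)  # torch.Size([1, 96, 512])
```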

The findings of the research team can be summarized as follows:

  1. The NVLM-X demonstrates superior computational efficiency when processing high-resolution images.

  2. The NVLM-D achieves higher accuracy in OCR (Optical Character Recognition) related tasks.

  3. NVLM-H combines the advantages of both methods.

Training process and methodology

Similar to Molmo and the other approaches, the NVLM team did not pre-train a multimodal model from scratch but started from a text-only LLM (which usually performs better). They also chose an instruction-fine-tuned LLM rather than a base LLM. Specifically, their backbone LLM is Qwen2-72B-Instruct (as far as I know, Molmo uses the Qwen2-72B base model).

In the NVLM-D approach, they trained all LLM parameters, while for NVLM-X they found that freezing the original LLM parameters and training only the cross-attention layers during both pre-training and instruction fine-tuning worked well.

Image encoders and projectors

For the image encoder, instead of using the common CLIP model, they chose InternViT-6B and kept the parameters frozen at all stages.

The projector uses a multilayer perceptron (MLP) rather than a single linear layer.
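
For illustration, here is what the two projector variants might look like in PyTorch; the layer sizes are placeholders rather than values from the NVLM paper:

```python
from torch import nn

def linear_projector(vision_dim=3200, llm_dim=8192):
    # Simplest possible alignment layer: a single linear map.
    return nn.Linear(vision_dim, llm_dim)

def mlp_projector(vision_dim=3200, llm_dim=8192, hidden=8192):
    # Two-layer MLP with a nonlinearity, used instead of a single linear map
    # (the hidden size and activation here are illustrative assumptions).
    return nn.Sequential(
        nn.Linear(vision_dim, hidden),
        nn.GELU(),
        nn.Linear(hidden, llm_dim),
    )
```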

4.4 Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

The two previously mentioned papers and models, Molmo and NVLM, are both built on the Qwen2-72B LLM. In this paper, the Qwen research team releases a multimodal LLM of its own: Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (published October 3, 2024).

At the heart of this work is what they call the "Naive Dynamic Resolution" mechanism (the word "naive" is intentional, not a typo for "native", although "native" would also fit). This mechanism allows the model to process images at different resolutions rather than simply downsampling them, so images can be fed in at their original resolution.

Overview of the multimodal Qwen model

(Annotated figure taken from the Qwen2-VL paper:/abs/2409.12191)

The model implements native-resolution input through a modified vision transformer (ViT): the original absolute position embeddings are removed and 2D rotary position embeddings (2D-RoPE) are introduced.
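
To sketch how rotary embeddings can be extended to two dimensions, here is a toy implementation that rotates one half of each patch embedding by its row index and the other half by its column index. This only illustrates the general idea; Qwen2-VL's actual 2D-RoPE construction differs in its details:

```python
import torch

def rope_1d(x, positions):
    """Standard rotary embedding applied to the last dim of x (..., seq, dim)."""
    dim = x.shape[-1]
    half = dim // 2
    freqs = 1.0 / (10000 ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = positions[:, None].float() * freqs[None, :]   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_2d(patch_embeddings, rows, cols):
    """Toy 2D-RoPE: half the channels encode the row index, half the column index."""
    half = patch_embeddings.shape[-1] // 2
    left = rope_1d(patch_embeddings[..., :half], rows)
    right = rope_1d(patch_embeddings[..., half:], cols)
    return torch.cat([left, right], dim=-1)

# Example: a 3x4 patch grid (the grid size can vary per image), 64-dim embeddings.
rows, cols = torch.meshgrid(torch.arange(3), torch.arange(4), indexing="ij")
x = torch.randn(3 * 4, 64)
print(rope_2d(x, rows.flatten(), cols.flatten()).shape)  # torch.Size([12, 64])
```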

They used a vision encoder with 675M parameters and LLM backbones of different sizes, with the specific parameters shown in the table below.

Components of different Qwen2-VL models

(Annotated figure taken from the Qwen2-VL paper:/abs/2409.12191)

The training process is divided into three stages:

  1. Pre-train only the image encoder;

  2. Unfreeze all parameters (including the LLM) and train;

  3. Freeze the image encoder and perform instruction fine-tuning on the LLM only.

This three-phase process combines efficient visual processing with powerful language comprehension, enabling Qwen2-VL to better perceive and process visual input from the real world.

4.5 Pixtral 12B

Pixtral 12B (September 17, 2024) is Mistral AI's first multimodal model, and it employs Method A: the Unified Embedding Decoder Architecture. Unfortunately, there is no publicly available technical paper or report, but the Mistral team shared some interesting details in their blog.

Interestingly, they chose not to use a pre-trained image encoder, instead training an image encoder with 400 million parameters from scratch. For the LLM backbone, they used the 12-billion-parameter Mistral NeMo model.

Similar to Qwen2-VL, Pixtral also supports variable image sizes natively, as shown in the diagram below.

Diagram of how Pixtral handles images of different sizes

(Annotated image from Pixtral blog post:/news/pixtral-12b/)

4.6 MM1.5: Methods, analysis and insights for multimodal LLM fine-tuning

The MM1.5: Methods, Analysis, and Insights for Multimodal LLM Fine-Tuning paper (September 30, 2024) provides a number of practical suggestions and introduces both a mixture-of-experts multimodal model and dense models similar to Molmo. These models range from 1 billion to 30 billion parameters.

The models described in the paper focus on Method A, the Unified Embedding Decoder Architecture, which organizes the multimodal input as a single token sequence for the decoder.

In addition, the paper includes several interesting ablation studies examining the effects of data mixtures and the use of coordinate tokens (used to represent bounding boxes).

Schematic of the MM1.5 approach, including additional coordinate tokens used to represent bounding boxes

(Annotated figure from MM1.5 paper:/abs/2409.20566)
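
To illustrate the coordinate-token idea, here is a minimal sketch of how bounding-box coordinates might be serialized into text tokens. The function name, binning scheme, and token format are illustrative choices of my own, not taken from the MM1.5 paper:

```python
def box_to_tokens(x1, y1, x2, y2, image_w, image_h, bins=1000):
    """Quantize a bounding box into integer coordinate tokens embedded in text."""
    def q(v, size):
        # Map an absolute pixel coordinate to one of `bins` discrete buckets.
        return min(bins - 1, int(v / size * bins))
    return (f"<box>{q(x1, image_w)},{q(y1, image_h)},"
            f"{q(x2, image_w)},{q(y2, image_h)}</box>")

# e.g. a box covering the right half of a 640x480 image
print(box_to_tokens(320, 0, 640, 480, 640, 480))  # <box>500,0,999,999</box>
```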

4.7 Aria: An Open Multimodal Native Mixture-of-Experts Model

The Aria: An Open Multimodal Native Mixture-of-Experts Model paper (October 8, 2024) introduces another mixture-of-experts model, similar to one of the Molmo variants and some models in the MM1.5 family.

The Aria model has 24.9 billion parameters in total, of which 3.5 billion are activated per text token. The image encoder (SigLIP) has 438 million parameters.
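
To make the "activated parameters per token" idea concrete, here is a minimal top-k mixture-of-experts sketch in PyTorch; the layer sizes, expert count, and routing details are illustrative and not Aria's actual configuration:

```python
import torch
from torch import nn

class TopKMoE(nn.Module):
    """Toy MoE layer: each token is routed to k experts, so only a fraction of
    the total parameters is active per token (the idea behind Aria's
    3.5B-active / 24.9B-total split)."""
    def __init__(self, dim=64, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.k = k

    def forward(self, x):                                  # x: (tokens, dim)
        scores = self.router(x).softmax(dim=-1)            # (tokens, experts)
        weights, idx = scores.topk(self.k, dim=-1)         # keep k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(10, 64)).shape)   # torch.Size([10, 64])
```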

The model is based on the cross-attention approach and the overall training process is as follows:

  1. Train the LLM backbone from scratch.

  2. Pre-train the LLM backbone and the vision encoder together.

4.8 Baichuan-Omni

The Baichuan-Omni Technical Report (October 11, 2024) describes Baichuan-Omni, a 7-billion-parameter multimodal LLM based on Method A, the Unified Embedding Decoder Architecture, as shown in the following figure:

Overview of the Baichuan-Omni model, which can handle multiple input modalities

(Annotated image from Baichuan-Omni paper:/abs/2410.08565)

Baichuan-Omni's training process is divided into three phases (sketched in code after this list):

  1. Projector training: initially, only the projector is trained, while the vision encoder and the language model (LLM) remain frozen.

  2. Vision encoder training: next, the vision encoder is unfrozen and trained, while the LLM remains frozen.

  3. Full model training: finally, the LLM is unfrozen so that the entire model can be trained end to end.
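
A minimal sketch of this freeze/unfreeze schedule, using hypothetical stand-in modules (whether earlier components stay trainable in later stages is my assumption here, not stated in the report):

```python
from torch import nn

# Stand-ins for the three components; only the freeze/unfreeze schedule matters.
modules = {
    "projector":      nn.Linear(1024, 4096),
    "vision_encoder": nn.Linear(768, 1024),
    "llm":            nn.Linear(4096, 4096),
}

STAGES = {
    1: ["projector"],                            # phase 1: projector only
    2: ["projector", "vision_encoder"],          # phase 2: + vision encoder (assumed)
    3: ["projector", "vision_encoder", "llm"],   # phase 3: full model
}

def set_stage(stage):
    for name, module in modules.items():
        trainable = name in STAGES[stage]
        for p in module.parameters():
            p.requires_grad = trainable

set_stage(1)   # then train, call set_stage(2), train, set_stage(3), train
```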

The model uses the SigLIP vision encoder and introduces an AnyRes module, which handles high-resolution images via downsampling.

Although the report does not explicitly name the LLM backbone, the parameter count and naming convention suggest it is based on the Baichuan 7B LLM.

4.9 Emu3: Next-Token Prediction Is All You Need

The Emu3: Next-Token Prediction Is All You Need paper (September 27, 2024) presents a compelling alternative to diffusion models for image generation, based entirely on a transformer decoder architecture. Although it is not a multimodal LLM in the traditional sense (i.e., one focused on image understanding rather than generation), Emu3 is very interesting because it shows that a transformer decoder can be used for image generation, a task typically dominated by diffusion methods. (Note, however, that there have been similar approaches before, e.g., Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation.)

Emu3 is an LLM-based image generation model that can be used as an alternative to diffusion models

(Annotated figure taken from the Emu3 paper:/abs/2409.18869)

The researchers trained Emu3 from scratch and then used Direct Preference Optimization (DPO) to align the model with human preferences.

The architecture includes a vision tokenizer inspired by SBER-MoVQGAN. The core LLM architecture is based on Llama 2, but it is trained entirely from scratch.
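
To illustrate what "image generation as next-token prediction" means in practice, here is a toy autoregressive sampling loop. The dummy model and shapes are placeholders; the real Emu3 pipeline (special tokens, the vision tokenizer's decoder mapping token ids back to pixels) is considerably more involved:

```python
import torch

def generate_image_tokens(llm, prompt_ids, num_image_tokens):
    """Toy loop: an image is produced by sampling its visual tokens one at a
    time, exactly like next-token text generation."""
    ids = prompt_ids
    for _ in range(num_image_tokens):
        logits = llm(ids)[:, -1, :]                    # logits for the next token
        next_id = torch.multinomial(logits.softmax(dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    # a vision tokenizer's decoder would then map these ids back to pixels
    return ids[:, prompt_ids.shape[1]:]

# Dummy "LLM" returning random logits, just to exercise the interface.
vocab_size = 1024
dummy_llm = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab_size)
image_token_ids = generate_image_tokens(dummy_llm, torch.zeros(1, 4, dtype=torch.long), 16)
print(image_token_ids.shape)   # torch.Size([1, 16])
```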

4.10 Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

We previously focused on multimodal LLMs for image understanding, with Emu3 above as an example of image generation. Now, the paper "Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation" (October 17, 2024) introduces a framework that unifies understanding and generation tasks within a single LLM backbone.

A key feature of Janus is the decoupling of the visual encoding paths to meet the different needs of understanding and generation tasks. The researchers note that image understanding requires high-dimensional semantic representations, while image generation requires local detail and global consistency. By separating these paths, Janus handles both requirements effectively.

Overview of the Unified Decoder Only framework used by Janus

(Annotated figure taken from Janus' paper:/abs/2410.13848)

Like Baichuan-Omni, the model uses a SigLIP vision encoder to process visual input. For image generation, it uses a vector-quantized (VQ) tokenizer. The underlying LLM for Janus is DeepSeek-LLM with 1.3 billion parameters.
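
To make the decoupled-encoding idea concrete, here is a toy sketch in which understanding inputs flow through a semantic (SigLIP-like) encoder while generation inputs flow through a VQ-token embedding, with both feeding the same backbone; all modules and sizes are stand-ins rather than Janus's actual components:

```python
import torch
from torch import nn

class JanusStyleRouter(nn.Module):
    """Toy sketch of decoupled visual encoding: one path for understanding,
    one for generation, sharing the same LLM backbone."""
    def __init__(self, dim=512):
        super().__init__()
        self.understanding_encoder = nn.Linear(768, dim)       # SigLIP-like features
        self.generation_embedding  = nn.Embedding(8192, dim)   # VQ codebook ids
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)

    def forward(self, image_features=None, vq_token_ids=None):
        if image_features is not None:          # understanding path
            tokens = self.understanding_encoder(image_features)
        else:                                   # generation path
            tokens = self.generation_embedding(vq_token_ids)
        return self.llm(tokens)

model = JanusStyleRouter()
print(model(image_features=torch.randn(1, 256, 768)).shape)       # understanding
print(model(vq_token_ids=torch.randint(0, 8192, (1, 64))).shape)  # generation
```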

Three-stage training process for Janus model

(Annotated figure taken from Janus' paper:/abs/2410.13848)

The training process is divided into the following three stages:

  1. Stage I: only the projector layers and the image output layer are trained, while the LLM and the understanding and generation encoders remain frozen.

  2. Stage II: the LLM backbone and the text output layer are unfrozen, enabling unified pre-training on understanding and generation tasks.

  3. Stage III: the entire model, including the SigLIP image encoder, is unfrozen for supervised fine-tuning, fully integrating the model and optimizing its multimodal capabilities.

Conclusion

As you may have noticed, I have almost entirely skipped model and computational performance comparisons. First, comparing the performance of LLMs and multimodal LLMs on public benchmarks is very challenging because of widespread data contamination, meaning that the test data may have been included in the training data.

In addition, the architectural components are so different that it is difficult to make a truly fair comparison. So hats off to the NVIDIA team for developing multiple versions of NVLM that at least make comparisons between decoder-only and cross-attention approaches possible.

In any case, the main conclusion of this article is that multimodal LLMs can be successfully constructed by a number of different methods. The following figure summarizes the components and training methods of the different models covered in this article.

Overview of the different models and their subcomponents and training methods covered in this paper

I hope you found this article informative and now have a better understanding of how multimodal LLMs work!