VLMs

#multimodal #vision-language-models

Multimodal VLMs such as CLIP, trained with a contrastive objective, have enabled zero-shot adaptation to novel tasks without the need for fine-tuning.
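
As a minimal illustration of how contrastive pre-training enables zero-shot transfer, the sketch below scores an image against text prompts built from class names; the encoder interfaces and prompt template are assumptions for illustration, not a specific CLIP API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_encoder, text_encoder, image, class_names,
                       prompt="a photo of a {}"):
    """Score one image against text prompts built from class names.

    `image_encoder` and `text_encoder` are assumed (hypothetically) to accept
    raw inputs and return fixed-size embeddings, as CLIP-style dual encoders
    do after contrastive pre-training; no fine-tuning is needed.
    """
    with torch.no_grad():
        img_emb = F.normalize(image_encoder(image), dim=-1)        # (1, d)
        prompts = [prompt.format(name) for name in class_names]
        txt_emb = F.normalize(text_encoder(prompts), dim=-1)       # (C, d)
    # Cosine similarity between the image and every class prompt.
    logits = img_emb @ txt_emb.T                                   # (1, C)
    return logits.softmax(dim=-1)
```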

Modelling

Since the introduction of Frozen (Tsimpoukelli et al., 2021) and Flamingo, most VLMs have been built on top of unimodal pre-trained backbones rather than training entirely new models from scratch.

However, since these models are built on top of pre-trained LMs, they directly inherit those models' weaknesses, such as hallucinations and poor generalisation to long sequence lengths.

Cross-Attention

The cross-attention architecture was introduced in Flamingo. The image hidden states encoded by the vision backbone are used to condition the frozen language model via freshly initialised cross-attention layers interleaved between the pre-trained language model layers. The keys and values in these layers are obtained from the vision features, while the queries are derived from the language inputs.
The resulting model supports in-context few-shot learning from interleaved image-text prompts, which has significant advantages over gradient-based few-shot learning methods.
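
A minimal sketch of one such interleaved block is shown below, assuming pre-computed vision features and text hidden states; it illustrates the query/key/value routing and zero-initialised gating, not Flamingo's exact implementation (which also includes a Perceiver resampler).

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Freshly initialised cross-attention interleaved between frozen LM layers.

    Queries come from the text hidden states; keys and values come from the
    vision features. A tanh gate initialised at zero leaves the frozen LM's
    behaviour unchanged at the start of training.
    """

    def __init__(self, text_dim: int, vision_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=text_dim, kdim=vision_dim, vdim=vision_dim,
            num_heads=num_heads, batch_first=True,
        )
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> no-op at init

    def forward(self, text_hidden, vision_feats):
        # text_hidden: (B, T_text, text_dim), vision_feats: (B, T_img, vision_dim)
        attended, _ = self.attn(query=text_hidden, key=vision_feats, value=vision_feats)
        return text_hidden + torch.tanh(self.gate) * attended
```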

Self-Attention

In the self-attention architecture, introduced by FROMAGe (Koh et al., 2023) and BLIP2 (Li et al., 2023), the output of the vision encoder is treated as a sequence of tokens and concatenated with the text tokens. The combined sequence is then passed as input to the language model. We refer to the layers that map the vision hidden space to the text hidden space as modality projection layers.
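
A minimal sketch of the modality projection step, assuming the vision encoder outputs a sequence of patch embeddings; the single linear layer is the simplest choice (FROMAGe uses a learned linear mapping, while BLIP2 uses a Q-Former here).

```python
import torch
import torch.nn as nn

class VisionToTextProjector(nn.Module):
    """Map vision hidden states into the LM's embedding space and prepend them
    to the text token embeddings (a simple linear modality projection)."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, vision_feats, text_embeds):
        # vision_feats: (B, T_img, vision_dim) -> (B, T_img, text_dim)
        visual_tokens = self.proj(vision_feats)
        # Concatenate along the sequence dimension; the LM then attends over
        # both modalities with ordinary self-attention.
        return torch.cat([visual_tokens, text_embeds], dim=1)
```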


Training

Training VLMs typically occurs in multiple stages due to the following reasons:

  1. Limited availability of high quality data
  2. Memory constraints for efficient training
  3. Stability issues

During these stages, progressively higher-quality data is introduced, the maximum image resolution is gradually increased, and more model parts are unfrozen (a rough schedule is sketched below). Idefics 3 contains a nice graphic showing the key stages of training and the types of datasets used at each stage.
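
As a rough illustration, the staged recipe can be written down as a schedule; all resolutions, data mixes, and unfrozen modules below are placeholders rather than the values used by any particular model.

```python
# Illustrative multi-stage schedule (all values are placeholders).
TRAINING_STAGES = [
    {
        "name": "pre-training",
        "data": ["web-scale image-text pairs (noisier)"],
        "image_resolution": 224,
        "trainable": ["modality projection", "cross-attention"],
    },
    {
        "name": "continued pre-training",
        "data": ["synthetic captions", "OCR / PDF documents"],
        "image_resolution": 768,
        "trainable": ["modality projection", "cross-attention", "vision encoder"],
    },
    {
        "name": "supervised fine-tuning",
        "data": ["curated instruction / task datasets"],
        "image_resolution": 768,
        "trainable": ["all parameters"],
    },
]
```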

1. Pre-Training

The primary goal of pre-training is to align the backbone models and train the newly initialised parameters in the model. To efficiently train on a large number of images, the image resolution is typically kept low at the start of training and gradually increased over time. Once the resolution is sufficiently high, datasets containing large images, such as PDFs, can be incorporated into the training data.
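
A minimal sketch of the freezing step for this stage, assuming a model whose newly initialised modules can be identified by (hypothetical) parameter-name prefixes.

```python
def freeze_backbones(model, trainable_prefixes=("modality_proj", "cross_attn")):
    """Freeze the pre-trained vision and language backbones and leave only the
    newly initialised parameters (e.g. modality projection / cross-attention
    layers) trainable. The prefix names are hypothetical and model-dependent."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    return [p for p in model.parameters() if p.requires_grad]

# The returned list can be passed directly to the optimiser, e.g.
# optimizer = torch.optim.AdamW(freeze_backbones(model), lr=1e-4)
```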

2. Supervised Fine-Tuning

Having trained a general-purpose vision-language representation model during the pre-training phase, we now perform supervised fine-tuning to adapt the model to a range of downstream tasks.

3. Alignment

Why align after SFT? Alignment (typically with preference-based methods such as RLHF or DPO) is applied after SFT to steer the model towards preferred response styles, improve instruction following, and reduce hallucinations, which supervised fine-tuning alone does not fully address.
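
For reference, a minimal sketch of the DPO objective commonly used for preference alignment; the inputs are the summed log-probabilities of the chosen and rejected responses under the trainable policy and a frozen reference (typically the SFT) model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability of the chosen / rejected
    response under the trainable policy or the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximise the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```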

Evaluation

Traditional multimodal benchmarks, such as VQAv2, OKVQA, TextVQA, and COCO Captioning, are mainly open-ended. These benchmarks rely on specific ground-truth answers for each question, so even minor variations in the model's responses can be marked as incorrect. Common mitigations include normalising answers before matching, using benchmark-specific soft metrics (e.g. ANLS or the VQA accuracy metric), reformulating tasks as multiple choice, or scoring responses with an LLM judge.
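
As an illustration, the sketch below applies simple answer normalisation before exact matching, a lightweight mitigation for surface-form mismatches; the helper names are hypothetical and this is not an official benchmark metric.

```python
import string

_ARTICLES = {"a", "an", "the"}

def normalize_answer(text: str) -> str:
    """Lower-case, strip punctuation and articles, collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in _ARTICLES]
    return " ".join(tokens)

def relaxed_match(prediction: str, ground_truths: list[str]) -> bool:
    """Count a prediction as correct if it matches any reference after
    normalisation (a simple mitigation, not an official metric)."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gt) for gt in ground_truths)

# relaxed_match("The cat.", ["cat"]) -> True, whereas raw exact match fails.
```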

Boosting Performance

Data

| Captions  | Avg. Score |
|-----------|------------|
| Alt-texts | 49.8       |
| Synthetic | 52.9       |

| OCR Data | Resolution | Performance on DocVQA |
|----------|------------|-----------------------|
| w/o      | 384        | 22.6                  |
| w/o      | 768        | 42.9                  |
| w        | 768        | 49.9                  |

Training Strategies and Tricks

Pre-Training

Instruction Tuning

Benchmarks