VLMs
#multimodal #vision-language-models
Vision-language models (VLMs) such as CLIP, trained with a contrastive objective, have enabled zero-shot adaptation to novel tasks without the need for fine-tuning.
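To make the contrastive objective concrete, below is a minimal sketch of a symmetric InfoNCE-style loss of the kind CLIP uses, assuming paired image and text embeddings from the two encoders; the function name and temperature value are illustrative, not CLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from the vision and text encoders.
    The matching text for each image sits at the same batch index; every other
    element of the batch serves as a negative.
    """
    # L2-normalise so the dot products below are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```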
Modelling
Since the introduction of Frozen (Tsimpoukelli et al., 2021) and Flamingo, most VLMs have been built on top of unimodal pre-trained backbones rather than training entirely new models from scratch.
- A pre-trained vision encoder transforms an image into a representation that is independent of the user’s prompt, and therefore tries to capture as much information as possible. It can, however, still miss details pertinent to the prompt.
However, because these models build on top of pre-trained LMs, they directly inherit their weaknesses, such as hallucinations and poor generalisation to long sequence lengths.
Cross-Attention
The cross-attention architecture was introduced in Flamingo (Alayrac et al., 2022). The image hidden states encoded by the vision backbone are used to condition the frozen language model via freshly initialised cross-attention layers that are interleaved between the pre-trained language model layers. The keys and values in these layers are obtained from the vision features, while the queries are derived from the language inputs.
This form of in-context learning, where a new task is specified through a few interleaved image-text examples in the prompt, has significant advantages over gradient-based few-shot learning methods: no weight updates are needed at adaptation time.
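As a rough illustration of the mechanism described above, here is a minimal sketch of a gated cross-attention block in the spirit of Flamingo; the class name, dimensions, and the simple linear key/value projection are assumptions made for the example, not Flamingo's exact implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Freshly initialised cross-attention block interleaved between frozen LM layers.

    Queries come from the text hidden states; keys and values come from the vision
    features. The tanh gate starts at zero, so the block initially acts as an
    identity and the frozen LM's behaviour is undisturbed early in training.
    """

    def __init__(self, text_dim, vision_dim, num_heads=8):
        super().__init__()
        # Illustrative linear projection of vision features into the text hidden space
        self.to_kv = nn.Linear(vision_dim, text_dim)
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh gating, initialised to 0

    def forward(self, text_hidden, vision_features):
        # text_hidden: (batch, n_text_tokens, text_dim)
        # vision_features: (batch, n_image_tokens, vision_dim)
        kv = self.to_kv(vision_features)
        attended, _ = self.attn(query=text_hidden, key=kv, value=kv)
        return text_hidden + torch.tanh(self.gate) * attended  # gated residual
```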
Self-Attention
In the self-attention architecture, introduced by FROMAGe (Koh et al., 2023) and BLIP-2 (Li et al., 2023), the output of the vision encoder is treated as a sequence of tokens and concatenated with the text tokens. The entire sequence is then passed as input to the language model. We refer to the layers that map the vision hidden space to the text hidden space as modality projection layers.
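Below is a minimal sketch of a modality projection followed by token concatenation. The two-layer MLP is an illustrative placeholder: the actual mapping differs per model (e.g. a linear layer in FROMAGe, a Q-Former in BLIP-2), and the class and function names here are assumptions.

```python
import torch
import torch.nn as nn

class ModalityProjection(nn.Module):
    """Maps vision-encoder outputs into the LM embedding space so they can be
    concatenated with text token embeddings and processed by ordinary self-attention."""

    def __init__(self, vision_dim, text_dim):
        super().__init__()
        # Illustrative two-layer MLP; the real mapping varies per model
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_features):
        # (batch, n_image_tokens, vision_dim) -> (batch, n_image_tokens, text_dim)
        return self.proj(vision_features)

def build_multimodal_sequence(image_tokens, text_embeddings):
    # Visual "tokens" are simply prepended to the text sequence; the language model
    # then attends over the combined sequence with its usual self-attention.
    return torch.cat([image_tokens, text_embeddings], dim=1)
```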
- Idefics 2: What matters when building vision-language models?
- Idefics 3: Building and better understanding vision-language models: insights and future directions
Training
Training VLMs typically occurs in multiple stages due to the following reasons:
- Limited availability of high quality data
- Memory constraints for efficient training
- Stability issues
During these stages, progressively higher-quality data is introduced, the maximum image resolution is gradually increased, and more model parts are unfrozen. Idefics 3 contains a nice graphic showing the key stages of training and the types of datasets used at each stage.
1. Pre-Training
The primary goal of pre-training is to align the backbone models and train the newly initialised parameters. To train efficiently on a large number of images, the image resolution is typically kept low at the start of training and gradually increased over time. Once the resolution is sufficiently high, datasets containing large images, such as rendered PDF documents, can be incorporated into the training data.
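To make the staging concrete, here is a purely hypothetical schedule; the stage names, resolutions, unfrozen components, and dataset labels are illustrative assumptions, not the recipe of any particular model.

```python
# Purely hypothetical staged pre-training schedule (all numbers and names are
# illustrative): resolution and data quality increase as training progresses,
# and more backbone parameters are unfrozen.
PRETRAINING_STAGES = [
    {
        "name": "stage_1",
        "image_resolution": 224,
        "trainable": ["modality_projection"],
        "data": ["image_text_pairs", "interleaved_web_documents"],
    },
    {
        "name": "stage_2",
        "image_resolution": 448,
        "trainable": ["modality_projection", "vision_encoder"],
        "data": ["image_text_pairs", "interleaved_web_documents"],
    },
    {
        "name": "stage_3",
        "image_resolution": 896,
        "trainable": ["modality_projection", "vision_encoder", "language_model"],
        "data": ["pdf_documents", "synthetic_captions"],
    },
]
```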
2. Supervised Fine-Tuning
Having trained a general-purpose vision-language representation model during the pre-training phase, we now perform supervised fine-tuning to train the model on a range of downstream tasks.
3. Alignment
Why align after SFT?
- Aligns the model’s output with human preferences, making it more intuitive and better at following complex instructions.
- Effectively reduces hallucinations, where the model might describe objects or details not actually present in the image.
- Enhances model safety by minimising the risk of generating harmful content.
Evaluation
Traditional multimodal benchmarks, such as VQAv2, OKVQA, TextVQA, and COCO Captioning, are mainly open-ended. These benchmarks rely on specific ground-truth answers for each question, so even minor variations in the model’s responses can be marked as incorrect. Possible mitigations:
- Few-shot evaluations
- The LAVE metric asks an LLM to judge whether the response generated by the VLM is correct, given the ground truth and the specific question, thereby reducing the exact-match template problem (see the sketch after this list).
- Use benchmarks that include multiple-choice questions (MCQs), where the model selects the correct option by choosing the corresponding letter
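A minimal sketch of a LAVE-style judging prompt is shown below; the template wording and function name are illustrative assumptions, not the official LAVE prompt.

```python
def build_lave_style_prompt(question: str, ground_truth: str, model_answer: str) -> str:
    """Builds a judging prompt in the spirit of LAVE: an LLM decides whether the VLM's
    free-form answer matches the reference, instead of requiring an exact string match.
    The wording here is illustrative, not the official LAVE template."""
    return (
        "You are grading an answer to a visual question.\n"
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"Candidate answer: {model_answer}\n"
        "Does the candidate answer convey the same meaning as the reference? "
        "Reply with 1 for correct or 0 for incorrect."
    )
```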
Boosting Performance
- Li et al. (FLIP), inspired by the sparse computation of masked auto-encoders, propose randomly removing a large portion of image patches during CLIP-style contrastive image-text pre-training. This lets models learn from more image-text pairs in the same wall-clock time and contrast more samples per iteration with a similar memory footprint (see the sketch after this list).
- Sun et al. (EVA-CLIP) use pre-trained EVA models, which combine the high-level semantics of image-text contrastive learning with the geometric and structural information captured by masked image modelling, to improve feature representations and speed up the convergence of CLIP models.
- Chen and Wang, 2022 (PaLI) report a stronger increase in performance from scaling the size of the vision encoder than from scaling the size of the language model, even though scaling the vision encoder adds fewer parameters.
- Because vision encoders are often trained on different datasets and optimised for various tasks, some models, like SPHINX (Lin et al., 2023), combine representations from multiple encoders, such as DINOv2 (Oquab et al., 2023) and CLIP (Radford et al., 2021), to create a richer sequence of visual embeddings, though this comes at the expense of computational efficiency.
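The FLIP-style patch dropping mentioned above can be sketched as follows; the keep ratio and function name are assumptions made for the illustration.

```python
import torch

def random_patch_mask(patch_tokens, keep_ratio=0.25):
    """FLIP-style random masking: keep only a subset of image patch tokens during
    contrastive pre-training, so each step is cheaper and larger batches fit in memory.

    patch_tokens: (batch, n_patches, dim) output of the patch embedding layer.
    Returns the kept subset with shape (batch, n_keep, dim).
    """
    batch, n_patches, dim = patch_tokens.shape
    n_keep = max(1, int(n_patches * keep_ratio))
    # Independent random ordering of patch indices for each image in the batch
    scores = torch.rand(batch, n_patches, device=patch_tokens.device)
    keep_idx = scores.argsort(dim=1)[:, :n_keep]
    return torch.gather(patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
```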
Data
- OBELICS: Open dataset of interleaved image-text documents. It has been reported that interleaved image-text documents are the biggest driving factor in increasing the performance on visual question answering (VQA) tasks.
- LAION COCO: LAION images re-captioned with synthetic COCO-style captions generated by a model trained on COCO; the synthetic captions score higher than the original alt-texts. Source: What matters when building vision-language models?
| Captions | Avg. Score |
|---|---|
| Alt-texts | 49.8 |
| Synthetic | 52.9 |
- It has been shown that a large proportion of VLM mistakes stem from a failure to accurately extract text in images or documents. The dataset should therefore be complemented with text rendered in a wide variety of fonts and colours and on diverse backgrounds.
- Adding PDF documents helps the model learn to read text from images.
| OCR Data | Resolution | Performance on DocVQA |
|---|---|---|
| w/o | 384 | 22.6 |
| w/o | 768 | 42.9 |
| w/ | 768 | 49.9 |
- The Cauldron: a massive collection of 50 vision-language datasets covering a wide range of tasks: general visual question answering, counting, captioning, text transcription, document understanding, chart/figure understanding, table understanding, visual reasoning, geometry, spotting differences between two images, and converting a screenshot to functional code.
Training Strategies and Tricks
Pre-Training
- What matters when building vision-language models?: Break pre-training down into two stages. First train at smaller image resolutions and larger batch sizes on interleaved image-text documents and/or image-text pairs; then decrease the batch size and train on PDFs.
Instruction Tuning
- For multiple question/answer pairs per image, concatenate the pairs into a multi-turn conversation (see the sketch after this list).
- Add text-only instruction datasets
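A minimal sketch of packing several Q/A pairs about one image into a multi-turn conversation; the dict schema is an illustrative convention, not a specific library's chat format.

```python
def qa_pairs_to_conversation(image_id, qa_pairs):
    """Packs several (question, answer) pairs about the same image into one
    multi-turn conversation sample. The schema below is an illustrative
    convention, not a specific library's chat format."""
    first_q, first_a = qa_pairs[0]
    messages = [
        {"role": "user", "content": [{"type": "image", "image": image_id},
                                     {"type": "text", "text": first_q}]},
        {"role": "assistant", "content": first_a},
    ]
    # Remaining pairs become additional turns in the same conversation
    for question, answer in qa_pairs[1:]:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    return {"messages": messages}
```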
Benchmarks
- VQAv2 for general visual question answering
- TextVQA for OCR
- OKVQA for external knowledge
- COCO captioning