Building and better understanding vision-language models: insights and future directions

#multimodal #vision-language-models
Building on top of "What matters when building vision-language models?"

Introduces Docmatix, a synthetic dataset for document understanding. Generating high-quality synthetic data becomes relatively straightforward once the problem is reframed as LLM-based data generation rather than relying solely on VLMs: standard OCR tools can accurately extract text from PDF documents, and an LLM can then generate QA pairs grounded in that text. Concretely, text transcriptions from the English PDFA dataset are used to generate QA pairs with Phi-3-small. The authors report that training on even a small portion of Docmatix yields a nearly 20% relative improvement on DocVQA.
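A minimal sketch of this kind of pipeline, assuming the OCR/transcription step has already produced page text, and using the Hugging Face `transformers` text-generation pipeline. The checkpoint name and prompt wording are illustrative stand-ins, not the exact setup used to build Docmatix:

```python
# Sketch of LLM-based QA generation from OCR transcriptions, in the spirit
# of Docmatix. Assumptions: page text is already extracted; the checkpoint
# below is a stand-in for Phi-3-small; the prompt format is hypothetical.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",  # illustrative stand-in
)

PROMPT_TEMPLATE = (
    "Here is text extracted from a document:\n\n{page_text}\n\n"
    "Write 3 question/answer pairs that can be answered using only this "
    "text. Format each pair as 'Q: ...' and 'A: ...'."
)

def generate_qa_pairs(page_text: str) -> str:
    """Generate QA pairs for one transcribed page."""
    prompt = PROMPT_TEMPLATE.format(page_text=page_text)
    output = generator(
        prompt, max_new_tokens=512, do_sample=False, return_full_text=False
    )
    return output[0]["generated_text"]

# Usage: feed in a transcription (e.g., from the PDFA dataset) and collect
# the generated pairs for later filtering and deduplication.
print(generate_qa_pairs("Invoice #123. Total due: $450. Due date: 2024-05-01."))
```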

Is a vision encoder really necessary?

Instead of employing a vision encoder, Fuyu (Bavishi et al., 2023) feeds image patches directly into the language model after applying a simple linear projection to adjust the dimensions. This architecture offers two main advantages: it does not depend on a separately pre-trained vision model, and it preserves all the information from the original image.

The latter point is crucial, since details of the original image might be necessary to respond accurately to the prompt. A pre-trained vision encoder, in contrast, transforms an image into a representation that is independent of the user's prompt; it therefore has to capture as much information as possible, and can still miss details pertinent to the prompt.
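The patch-to-token path is simple enough to sketch. Below is a minimal PyTorch illustration of the Fuyu-style idea; the patch size and hidden dimension are assumptions loosely inspired by Fuyu-8B, not an exact reproduction:

```python
# Minimal sketch of an encoder-free input path: split the image into
# patches, flatten each patch, and linearly project it to the language
# model's hidden size. The resulting "image tokens" would be concatenated
# with text embeddings before the first transformer layer.
import torch
import torch.nn as nn

class PatchProjector(nn.Module):
    def __init__(self, patch_size: int = 30, hidden_size: int = 4096):
        super().__init__()
        self.patch_size = patch_size
        # One linear layer maps a flattened RGB patch to the LM embedding dim.
        self.proj = nn.Linear(3 * patch_size * patch_size, hidden_size)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W), with H and W divisible by patch_size
        b, c, h, w = image.shape
        p = self.patch_size
        patches = image.unfold(2, p, p).unfold(3, p, p)  # (b, c, h/p, w/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5)      # (b, h/p, w/p, c, p, p)
        patches = patches.reshape(b, -1, c * p * p)      # (b, num_patches, c*p*p)
        return self.proj(patches)                        # (b, num_patches, hidden)

tokens = PatchProjector()(torch.randn(1, 3, 300, 300))
print(tokens.shape)  # torch.Size([1, 100, 4096])
```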

VisFocus (Abramovich et al., 2024) attempts to address this drawback by incorporating the user's prompt into the vision encoder. However, this approach is less natural in interleaved image-text conversations, where prompts may refer back to previous questions. Moreover, handling the image representation inside the language model might degrade its performance on text-only benchmarks; since most VLMs are still not evaluated on text-only benchmarks, it remains unclear whether omitting a vision encoder hurts text performance. Finally, the encoder-free approach has not yet been tested with an efficient pooling strategy that operates directly on raw pixels without significantly reducing information.
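To make the prompt-conditioning idea concrete, here is a minimal sketch of one way a vision encoder block could attend to the prompt via cross-attention. The layer layout and dimensions are assumptions for illustration and do not reproduce the actual VisFocus architecture:

```python
# Sketch of a prompt-conditioned vision encoder block: image tokens first
# self-attend, then cross-attend to embedded prompt tokens, so the image
# representation can emphasize prompt-relevant details.
import torch
import torch.nn as nn

class PromptConditionedBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, img_tokens, prompt_tokens):
        x = img_tokens
        y = self.norm1(x)
        x = x + self.self_attn(y, y, y)[0]
        # Image tokens query the prompt, pulling in question-specific context.
        x = x + self.cross_attn(self.norm2(x), prompt_tokens, prompt_tokens)[0]
        return x + self.mlp(self.norm3(x))

block = PromptConditionedBlock()
img = torch.randn(1, 196, 768)    # e.g., 14x14 patch tokens
prompt = torch.randn(1, 12, 768)  # embedded prompt tokens
print(block(img, prompt).shape)   # torch.Size([1, 196, 768])
```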

Idefics 3