

Introduction

We introduce the Pathways Language and Image model (PaLI), a scalable approach to joint modeling of language and images that reaches new levels of performance on multiple vision-language tasks, across many languages.

PaLI leverages the increased understanding capabilities unlocked by scaling the unimodal image and language components, benefiting especially from scaling up the vision backbone to a 4B-parameter Vision Transformer (ViT). PaLI also saves compute and resources by reusing these pre-trained unimodal models.

We also introduce the WebLI dataset, which includes 10 billion image-text pairs covering 109 languages and enables vision-language capabilities in many languages when combined with large-capacity models.

PaLI uses the same API (input: image + text; output: text) to solve multiple vision-language tasks, across many languages. These tasks include image-language, image-only, and language-only tasks, such as visual question answering, image captioning, classification, OCR, text reasoning and others.
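
To make this interface concrete, the sketch below shows what a single text-in/text-out call could look like. The loader and method names (load_pali, generate) are hypothetical placeholders rather than a released API; only the prompt formats ("Answer in EN: ...", "Generate the alt-text in TH") come from the examples shown later on this page.

  # Minimal sketch of the unified PaLI interface: every task is framed as
  # "image + text prompt in, text out". The names below are hypothetical
  # placeholders, not a released API.
  from PIL import Image

  model = load_pali("pali-17b")   # hypothetical checkpoint loader
  image = Image.open("airplane.jpg")

  # Visual question answering: the prompt names the task and the output language.
  answer = model.generate(image=image,
                          text="Answer in EN: How fast could you travel on this?")

  # Image captioning in Thai uses the exact same call; only the prompt changes.
  caption = model.generate(image=image, text="Generate the alt-text in TH")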

In experiments, we observe that the PaLI model:

  • Achieves performance at or above the previous state of the art on challenging image captioning and visual question answering benchmarks, including COCO-Captions, nocaps, TextCaps, VQAv2, OK-VQA, TextVQA, and VizWiz-QA.
  • Exceeds prior models’ performance on multilingual visual captioning benchmarks, as well as on the multilingual visual question answering benchmarks xGQA and MaXM.
  • Retains or improves capabilities on pure vision tasks, such as image classification, and on pure language tasks, such as question answering and natural language inference.

Animation of the PaLI model showing how it responds to image and text prompts
PaLI is a simple, reusable and scalable architecture based on Transformer encoders, including a Vision Transformer (ViT), a multimodal encoder and a text decoder. It can reuse previously trained models (mT5, ViT), and it is trained on WebLI to perform a wide range of tasks in the image-only, language-only, and image-language domains (e.g., visual question answering, image captioning, scene-text understanding, etc.).
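
As a rough illustration of how these components fit together, the sketch below follows the description above: the ViT turns the image into a sequence of visual tokens, the multimodal encoder (initialized from mT5) consumes the visual tokens together with the embedded text tokens, and the text decoder generates the output string. The function and argument names are illustrative assumptions, not the actual implementation.

  import numpy as np

  # Schematic forward pass mirroring the architecture described above.
  # The callables (vit, embed_text, mt5_encoder, mt5_decoder) stand in for
  # the pre-trained components; names and shapes are illustrative only.
  def pali_forward(image, prompt_ids, vit, embed_text, mt5_encoder, mt5_decoder):
      visual_tokens = vit(image)             # [num_patches, d_model] from the ViT backbone
      text_tokens = embed_text(prompt_ids)   # [prompt_len, d_model] from the mT5 vocabulary
      # Visual and text tokens are combined into a single multimodal sequence.
      multimodal_input = np.concatenate([visual_tokens, text_tokens], axis=0)
      encoded = mt5_encoder(multimodal_input)   # multimodal encoder (initialized from mT5)
      return mt5_decoder(encoded)               # text decoder produces the output tokens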

Scaling properties

We perform detailed comparisons of three sizes of PaLI models (3B, 15B, and 17B):



Plot showing the scaling properties of the PaLI model on various benchmarks

Scaling both the language and the visual components of the PaLI model contributes to improved performance. The plot shows score differences relative to the PaLI-3B model: the CIDEr score is used to evaluate the image captioning tasks, whereas the VQA tasks are evaluated using VQA Accuracy (a simplified sketch of this metric follows the benchmark list below).

Benchmarks used in the figure above:
  • COCO-Captions is a classic image captioning benchmark that contains 1.5+ million human-generated captions describing 330,000+ images.
  • nocaps is a captioning benchmark similar to COCO-Captions, but it focuses on the model’s ability to describe objects outside COCO’s domain.
  • TextCaps asks the model to describe scenes with text in the image.
  • VQAv2 is a classic Visual Question-Answering benchmark that contains open-ended questions about images, which require an understanding of vision, language and commonsense knowledge to answer.
  • OK-VQA, where “OK” stands for “Outside Knowledge”, presents the additional challenge of specifically requiring external knowledge unavailable in the question and image to answer.
  • TextVQA asks questions about scenes with text in the image, similar to TextCaps.
  • VizWiz-QA contains images taken by visually-impaired people and their questions about them.
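
For reference, the sketch below shows the standard VQA Accuracy computation used by VQAv2-style benchmarks in the plot above: a predicted answer receives credit in proportion to how many of the ten human annotators gave the same answer, saturating once three annotators agree. This is a simplified illustration, not the exact evaluation code; the official metric also normalizes answers and averages over annotator subsets.

  # Simplified sketch of VQA Accuracy: an answer is counted as correct in
  # proportion to how many of the (typically ten) human annotators gave it,
  # capped at 1.0 once three annotators agree. Official evaluation also
  # normalizes answers and averages over annotator subsets, omitted here.
  def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
      pred = predicted.strip().lower()
      matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
      return min(matches / 3.0, 1.0)

  # Example: two of ten annotators answered "beef", so the score is 2/3.
  print(vqa_accuracy("beef", ["beef", "meat", "cow", "steak", "beef",
                              "meat", "cow meat", "steak", "meat", "cows"]))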

Visual question answering

VQAv2 is a classic visual question answering benchmark containing open-ended questions about images; answering them requires an understanding of vision, language and commonsense knowledge. The OK-VQA benchmark, where “OK” stands for “Outside Knowledge”, is more difficult in that it specifically requires external knowledge not contained in the question or the image. The TextVQA benchmark, similar to TextCaps, asks questions about scenes with text in the image, and the VizWiz-QA benchmark contains images taken by visually-impaired people along with their questions about them.


  • Image: Cows in a pasture (Author: Tom Check; Source: Flickr; License: CC BY-NC-SA 2.0)
    Prompt: “Answer in EN: What meat do we get from these mammals?”
    PaLI output: Beef
  • Image: Stuffed animal with some hot dogs (Author: Lorianne DiSabato; Source: Flickr; License: CC BY-NC-ND 2.0)
    Prompt: “Answer in EN: What isn't edible?”
    PaLI output: Stuffed animal
  • Image: Adult elephants with two calves (Author: Neil Page; Source: Flickr; License: CC BY-NC-ND 2.0)
    Prompt: “Answer in EN: What is the appropriate name for babies of this species?”
    PaLI output: Calf
  • Image: An airplane in flight (Author: Simone; Source: Flickr; License: CC BY-NC-SA 2.0)
    Prompt: “Answer in EN: People that make use of special frequent usage discounts when traveling on this vehicle are sometimes referred to as frequent what?”
    PaLI output: Frequent flyer
  • Image: Front-facing airplane on tarmac (Author: Halans; Source: Flickr; License: CC BY-NC-SA 2.0)
    Prompt: “Answer in EN: How fast could you travel on this?”
    PaLI output: 500 mph
  • Image: Building exterior and sidewalk lit by neon lights, shoes on a bench, and graffiti art on a door (Author: Ben; Source: Flickr; License: CC BY-NC-ND 2.0)
    Prompt: “Answer in EN: What is making the drawing on the door light up like that?”
    PaLI output: Neon
  • Image: Bird standing on the edge of a cup (Author: rosipaw; Source: Flickr; License: CC BY-NC-SA 2.0)
    Prompt: “Answer in EN: What did this organism evolve from?”
    PaLI output: Dinosaur
  • Image: Sunflowers in a pot (Author: mk97007; Source: Flickr; License: CC BY 2.0)
    Prompt: “Answer in ZH: 桶里放的是什么花?” (Translation: “What flowers are in the bucket?”)
    PaLI output: 向日葵 (Translation: “Sunflower”)

Image captioning

Image captioning is the task of describing an image using natural language. It’s accomplished by first detecting the input image’s salient features, and then generating well-formed sentences that describe the image content in the expected output language.


  • Image: A cellar filled with barrels of wine (Author: Andrea Tosatto; Source: Flickr; License: CC BY 2.0)
    Prompt: “Generate the alt-text in EN”
    PaLI output: A cellar filled with barrels of wine
  • Image: Two helicopters are flying in the sky and one has a yellow stripe on the tail
    Prompt: “Generate the alt-text in EN”
    PaLI output: Two helicopters are flying in the sky and one has a yellow stripe on the tail
  • Image: An electric car parked at a charging station (Author: Per-Olof Forsberg; Source: Flickr; License: CC BY 2.0)
    Prompt: “Generate the alt-text in EN”
    PaLI output: an electric car is charging at a charging station
  • Image: An electric car parked at a charging station (Author: Per-Olof Forsberg; Source: Flickr; License: CC BY 2.0)
    Prompt: “Generate the alt-text in ZH”
    PaLI output: 一辆 电动 汽车 停 在 充电 桩 上 。 (Translation: “An electric car is parked at a charging station”)
  • Image: A gray donkey walks down the street (Author: Nacho; Source: Unknown; License: CC BY 2.0)
    Prompt: “Generate the alt-text in TH”
    PaLI output: ลา สี เทา เดิน ไป ตาม ถนน (Translation: “A gray donkey walks down the street”)
  • Image: A tree standing in a field with a purple sky (Author: teralaser; Source: Flickr; License: CC BY 2.0)
    Prompt: “Generate the alt-text in FR”
    PaLI output: Un arbre debout dans un champ avec un ciel violet (Translation: “A tree standing in a field with a purple sky”)

Scene-text-oriented examples

Scene-text-oriented examples have been selected from VQA and captioning tasks to highlight PaLI's OCR capabilities:


  • Image: A clock on top of a building that says ‘ylvania’ (Author: Alyson Hurt; Source: Flickr; License: CC BY 2.0)
    Prompt: “Generate the alt-text in EN”
    PaLI output: A clock on top of a building that says ‘ylvania’
  • Image: A Liverpool city tour bus (Author: George M. Groutas; Source: Flickr; License: CC BY 2.0)
    Prompt: “Answer in EN: what city name is mentioned on the bus”
    PaLI output: liverpool
  • Image: A table with electronics on it (Author: jwalsh; Source: Flickr; License: CC BY 2.0)
    Prompt: “Answer in EN: What is the time on the device”
    PaLI output: 8:52
  • Image: A book called Unlocking Harry Potter (Author: Listener42; Source: Flickr; License: CC BY 2.0)
    Prompt: “Answer in EN: How much does this book cost?”
    PaLI output: 2.99

Language understanding

Since PaLI is pre-trained on a diverse mixture of multimodal tasks with image and text data, we wanted to understand whether the model would “forget” its language-modeling capability, causing it to perform worse than its unimodal starting checkpoint on language-understanding tasks.

Therefore, we compare PaLI-17B to mT5-XXL on the following range of language understanding benchmarks:
  • English-only: the fine-tuned models (FT) were evaluated on SuperGLUE
  • Multilingual: the models were evaluated in the zero-shot (ZS) transfer setting on 3 benchmarks from the composite XTREME benchmark:
    • XNLI: a textual entailment task, in 14 languages
    • XQuAD: a question-answering task, in 10 languages
    • TyDiQA-GoldP: a question-answering task, in 11 languages

Plot showing the quality of the PaLI model on various language benchmarks, compared to mT5-XXL
PaLI maintains strong language-understanding capability, on par with the state-of-the-art mT5-XXL model. For SuperGLUE, the average score is presented; for the rest, F1 scores are presented. For the three XTREME benchmarks, we evaluate in the zero-shot (ZS) transfer setting, whereas for SuperGLUE the models are fine-tuned (FT).

Acknowledgements

We thank all the authors who conducted this research: Xi Chen, Xiao Wang, Soravit (Beer) Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut.

We also thank Claire Cui, Slav Petrov, Tania Bedrax-Weiss, Joelle Barral, Tom Duerig, Paul Natsev, Fernando Pereira, Jeff Dean, Jeremiah Harmsen, Zoubin Ghahramani, Erica Moreira, Victor Gomes, Sarah Laszlo, Kathy Meier-Hellstern, Susanna Ricco, Rich Lee, Austin Tarango, Emily Denton, Bo Pang, Wei Li, Jihyung Kil, Tomer Levinboim, Julien Amelot, Zhenhai Zhu, Xiangning Chen, Liang Chen, Filip Pavetic, Daniel Keysers, Matthias Minderer, Josip Djolonga, Ibrahim Alabdulmohsin, Mostafa Dehghani, Yi Tay, Elizabeth Adkison, James Cockerille, Eric Ni, Anna Davies, Alex Ku, and Maysam Moussalem for their suggestions, improvements and support.

We thank the Parti Team for developing the template for this site.

We thank Tom Small for providing visualizations.

Sept 16, 2022