

Introduction

We introduce the Pathways Language and Image model (PaLI), a scalable approach to joint modeling of language and images that reaches new levels of performance on multiple vision-language tasks, across many languages.

PaLI leverages the increased understanding capabilities unlocked by scaling the unimodal image and language components, benefiting especially from scaling up the vision backbone to a 4B-parameter Vision Transformer (ViT). PaLI also saves compute and resources by reusing these pre-trained unimodal models.

We also introduce the WebLI dataset, which includes 10 billion image-text pairs covering 109 languages and enables vision-language capabilities in many languages when combined with large-capacity models.

PaLI uses the same API (input: image + text; output: text) to solve multiple vision-language tasks, across many languages. These tasks include image-language, image-only, and language-only tasks, such as visual question answering, image captioning, classification, OCR, text reasoning and others.
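
To make this interface concrete, the sketch below shows what a single text-in/text-out call could look like. The loader and method names (load_pali, generate) are hypothetical placeholders rather than a released API; only the prompt formats ("Answer in EN: ...", "Generate the alt-text in TH") come from the examples shown later on this page.

  # Minimal sketch of the unified PaLI interface: every task is framed as
  # "image + text prompt in, text out". The names below are hypothetical
  # placeholders, not a released API.
  from PIL import Image

  model = load_pali("pali-17b")   # hypothetical checkpoint loader
  image = Image.open("airplane.jpg")

  # Visual question answering: the prompt names the task and the output language.
  answer = model.generate(image=image,
                          text="Answer in EN: How fast could you travel on this?")

  # Image captioning in Thai uses the exact same call; only the prompt changes.
  caption = model.generate(image=image, text="Generate the alt-text in TH")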

In experiments, we observe that the PaLI model:

  • Achieves performance at or above the previous state of the art on challenging image captioning and visual question answering benchmarks, including COCO-Captions, nocaps, TextCaps, VQAv2, OK-VQA, TextVQA, and VizWiz-QA.
  • Exceeds prior models’ performance on multilingual visual captioning benchmarks, as well as on the multilingual visual question answering benchmarks xGQA and MaXM.
  • Retains or improves capabilities on pure vision tasks, such as image classification, and on pure language tasks, such as question answering and natural language inference.

Animation of the PaLI model showing how it responds to image and text prompts
PaLI is a simple, reusable and scalable architecture based on Transformer encoders, including a Vision Transformer (ViT), a multimodal encoder and a text decoder. It can reuse previously trained models (mT5, ViT), and it is trained on WebLI to perform a wide range of tasks in the image-only, language-only, and image-language domains (e.g., visual question answering, image captioning, scene-text understanding, etc.).
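
As a rough illustration of how these components fit together, the sketch below follows the description above: the ViT turns the image into a sequence of visual tokens, the multimodal encoder (initialized from mT5) consumes the visual tokens together with the embedded text tokens, and the text decoder generates the output string. The function and argument names are illustrative assumptions, not the actual implementation.

  import numpy as np

  # Schematic forward pass mirroring the architecture described above.
  # The callables (vit, embed_text, mt5_encoder, mt5_decoder) stand in for
  # the pre-trained components; names and shapes are illustrative only.
  def pali_forward(image, prompt_ids, vit, embed_text, mt5_encoder, mt5_decoder):
      visual_tokens = vit(image)             # [num_patches, d_model] from the ViT backbone
      text_tokens = embed_text(prompt_ids)   # [prompt_len, d_model] from the mT5 vocabulary
      # Visual and text tokens are combined into a single multimodal sequence.
      multimodal_input = np.concatenate([visual_tokens, text_tokens], axis=0)
      encoded = mt5_encoder(multimodal_input)   # multimodal encoder (initialized from mT5)
      return mt5_decoder(encoded)               # text decoder produces the output tokens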

Scaling properties

We perform detailed comparisons of three sizes of PaLI models (3B, 15B, and 17B):



Plot showing the scaling properties of the PaLI model on various benchmarks

Scaling both the language and the visual components of the PaLI model contributes to improved performance. The plot shows score differences relative to the PaLI-3B model: the CIDEr score is used to evaluate the image captioning tasks, whereas the VQA tasks are evaluated using VQA Accuracy (a simplified sketch of this metric follows the benchmark list below).

Benchmarks used in the figure above:
  • COCO-Captions is a classic image captioning benchmark that contains 1.5+ million human-generated captions describing 330,000+ images.
  • nocaps is a captioning benchmark similar to COCO-Captions, but it focuses on the model’s ability to describe objects outside COCO’s domain.
  • TextCaps asks the model to describe scenes with text in the image.
  • VQAv2 is a classic Visual Question-Answering benchmark that contains open-ended questions about images, which require an understanding of vision, language and commonsense knowledge to answer.
  • OK-VQA, where “OK” stands for “Outside Knowledge”, presents the additional challenge of specifically requiring external knowledge unavailable in the question and image to answer.
  • TextVQA asks questions about scenes with text in the image, similar to TextCaps.
  • VizWiz-QA contains images taken by visually-impaired people and their questions about them.
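
For reference, the sketch below shows the standard VQA Accuracy computation used by VQAv2-style benchmarks in the plot above: a predicted answer receives credit in proportion to how many of the ten human annotators gave the same answer, saturating once three annotators agree. This is a simplified illustration, not the exact evaluation code; the official metric also normalizes answers and averages over annotator subsets.

  # Simplified sketch of VQA Accuracy: an answer is counted as correct in
  # proportion to how many of the (typically ten) human annotators gave it,
  # capped at 1.0 once three annotators agree. Official evaluation also
  # normalizes answers and averages over annotator subsets, omitted here.
  def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
      pred = predicted.strip().lower()
      matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
      return min(matches / 3.0, 1.0)

  # Example: two of ten annotators answered "beef", so the score is 2/3.
  print(vqa_accuracy("beef", ["beef", "meat", "cow", "steak", "beef",
                              "meat", "cow meat", "steak", "meat", "cows"]))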

Visual question answering

VQAv2 is a classic visual question answering benchmark containing open-ended questions about images; answering them requires an understanding of vision, language and commonsense knowledge. The OK-VQA benchmark, where “OK” stands for “Outside Knowledge”, is more difficult in that it specifically requires external knowledge not contained in the question or the image. The TextVQA benchmark, similar to TextCaps, asks questions about scenes with text in the image, and the VizWiz-QA benchmark contains images taken by visually-impaired people along with their questions about them.


  • Image: Cows in a pasture (Author: Tom Check; Source: Flickr; License: CC BY-NC-SA 2.0)
    Prompt: “Answer in EN: What meat do we get from these mammals?”
    PaLI output: Beef
  • Image: Stuffed animal with some hot dogs (Author: Lorianne DiSabato; Source: Flickr; License: CC BY-NC-ND 2.0)
    Prompt: “Answer in EN: What isn't edible?”
    PaLI output: Stuffed animal
  • Image: Adult elephants with two calves (Author: Neil Page; Source: Flickr; License: CC BY-NC-ND 2.0)
    Prompt: “Answer in EN: What is the appropriate name for babies of this species?”
    PaLI output: Calf
  • Image: An airplane in flight (Author: Simone; Source: Flickr; License: CC BY-NC-SA 2.0)
    Prompt: “Answer in EN: People that make use of special frequent usage discounts when traveling on this vehicle are sometimes referred to as frequent what?”
    PaLI output: Frequent flyer
  • Image: Front-facing airplane on tarmac (Author: Halans; Source: Flickr; License: CC BY-NC-SA 2.0)
    Prompt: “Answer in EN: How fast could you travel on this?”
    PaLI output: 500 mph
  • Image: Building exterior and sidewalk lit by neon lights, shoes on a bench, and graffiti art on a door (Author: Ben; Source: Flickr; License: CC BY-NC-ND 2.0)
    Prompt: “Answer in EN: What is making the drawing on the door light up like that?”
    PaLI output: Neon
  • Image: Bird standing on the edge of a cup (Author: rosipaw; Source: Flickr; License: CC BY-NC-SA 2.0)
    Prompt: “Answer in EN: What did this organism evolve from?”
    PaLI output: Dinosaur
  • Image: Sunflowers in a pot (Author: mk97007; Source: Flickr; License: CC BY 2.0)
    Prompt: “Answer in ZH: 桶里放的是什么花?” (Translation: “What flowers are in the bucket?”)
    PaLI output: 向日葵 (Translation: “Sunflower”)

Image captioning

Image captioning is the task of describing an image using natural language. It’s accomplished by first detecting the input image’s salient features, and then generating well-formed sentences that describe the image content in the expected output language.


  • Image: A cellar filled with barrels of wine (Author: Andrea Tosatto; Source: Flickr; License: CC BY 2.0)
    Prompt: “Generate the alt-text in EN”
    PaLI output: A cellar filled with barrels of wine
  • Image: Two helicopters are flying in the sky and one has a yellow stripe on the tail
    Prompt: “Generate the alt-text in EN”
    PaLI output: Two helicopters are flying in the sky and one has a yellow stripe on the tail
  • Image: An electric car parked at a charging station (Author: Per-Olof Forsberg; Source: Flickr; License: CC BY 2.0)
    Prompt: “Generate the alt-text in EN”
    PaLI output: an electric car is charging at a charging station
  • Image: An electric car parked at a charging station (Author: Per-Olof Forsberg; Source: Flickr; License: CC BY 2.0)
    Prompt: “Generate the alt-text in ZH”
    PaLI output: 一辆 电动 汽车 停 在 充电 桩 上 。 (Translation: “An electric car is parked at a charging station”)
  • Image: A gray donkey walks down the street (Author: Nacho; Source: Unknown; License: CC BY 2.0)
    Prompt: “Generate the alt-text in TH”
    PaLI output: ลา สี เทา เดิน ไป ตาม ถนน (Translation: “A gray donkey walks down the street”)
  • Image: A tree standing in a field with a purple sky (Author: teralaser; Source: Flickr; License: CC BY 2.0)
    Prompt: “Generate the alt-text in FR”
    PaLI output: Un arbre debout dans un champ avec un ciel violet (Translation: “A tree standing in a field with a purple sky”)

Scene-text-oriented examples

Scene-text-oriented examples have been selected from VQA and captioning tasks to highlight PaLI's OCR capabilities:


  • Image: A clock on top of a building that says ‘ylvania’ (Author: Alyson Hurt; Source: Flickr; License: CC BY 2.0)
    Prompt: “Generate the alt-text in EN”
    PaLI output: A clock on top of a building that says ‘ylvania’
  • Image: A Liverpool city tour bus (Author: George M. Groutas; Source: Flickr; License: CC BY 2.0)
    Prompt: “Answer in EN: what city name is mentioned on the bus”
    PaLI output: liverpool
  • Image: A table with electronics on it (Author: jwalsh; Source: Flickr; License: CC BY 2.0)
    Prompt: “Answer in EN: What is the time on the device”
    PaLI output: 8:52
  • Image: A book called Unlocking Harry Potter (Author: Listener42; Source: Flickr; License: CC BY 2.0)
    Prompt: “Answer in EN: How much does this book cost?”
    PaLI output: 2.99

Language understanding

Since PaLI is pre-trained on a diverse mixture of multimodal tasks with image and text data, we wanted to understand whether the model would “forget” its language-modeling capability, causing it to perform worse than its unimodal starting checkpoint on language-understanding tasks.

Therefore, we compare PaLI-17B to mT5-XXL on the following range of language understanding benchmarks:
  • English-only: the fine-tuned models (FT) were evaluated on SuperGLUE
  • Multilingual: the models were evaluated in the zero-shot (ZS) transfer setting on 3 benchmarks from the composite XTREME benchmark:
    • XNLI: a textual entailment task, in 14 languages
    • XQuAD: a question-answering task, in 10 languages
    • TyDiQA-GoldP: a question-answering task, in 11 languages

Plot showing the quality of the PaLI model on various language benchmarks, compared to mT5-XXL
PaLI maintains strong language-understanding capability, on par with the state-of-the-art mT5-XXL model. For SuperGLUE, the average score is presented; for the rest, F1 scores are presented. For the three XTREME benchmarks, we evaluate in the zero-shot (ZS) transfer setting, whereas for SuperGLUE the models are fine-tuned (FT).

Acknowledgements

We thank all the authors who conducted this research: Xi Chen, Xiao Wang, Soravit (Beer) Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut.

We also thank Claire Cui, Slav Petrov, Tania Bedrax-Weiss, Joelle Barral, Tom Duerig, Paul Natsev, Fernando Pereira, Jeff Dean, Jeremiah Harmsen, Zoubin Ghahramani, Erica Moreira, Victor Gomes, Sarah Laszlo, Kathy Meier-Hellstern, Susanna Ricco, Rich Lee, Austin Tarango, Emily Denton, Bo Pang, Wei Li, Jihyung Kil, Tomer Levinboim, Julien Amelot, Zhenhai Zhu, Xiangning Chen, Liang Chen, Filip Pavetic, Daniel Keysers, Matthias Minderer, Josip Djolonga, Ibrahim Alabdulmohsin, Mostafa Dehghani, Yi Tay, Elizabeth Adkison, James Cockerille, Eric Ni, Anna Davies, Alex Ku, and Maysam Moussalem for their suggestions, improvements and support.

We thank the Parti Team for developing the template for this site.

We thank Tom Small for providing visualizations.

Sept 16, 2022