Research paper
Blog post
GitHub repository
We perform detailed comparisons of three PaLI model sizes, 3B, 15B, and 17B:
VQAv2 is a classic Visual Question Answering benchmark containing open-ended questions about images; answering them requires an understanding of vision, language, and commonsense knowledge. The OK-VQA benchmark, where “OK” stands for “Outside Knowledge”, is more difficult in that it specifically requires external knowledge not contained in the question or the image. The TextVQA benchmark, similar to TextCaps, asks questions about scenes that contain text in the image. The VizWiz-QA benchmark contains images taken by visually impaired people, along with the questions they asked about those images.
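For context on how such open-ended answers are scored, VQAv2-style benchmarks use a soft accuracy that credits a prediction by how many of the ten human annotators gave the same answer. Below is a minimal sketch of that metric; the function name and the simplified exact-match comparison are ours for illustration (the official evaluation additionally normalizes answers and averages over annotator subsets), and this is not taken from the PaLI code.

```python
# Minimal sketch of the soft VQA accuracy used by VQAv2-style benchmarks.
# A predicted answer gets full credit if at least 3 of the 10 annotators gave it.

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Return min(#matching human answers / 3, 1) for a single question."""
    matches = sum(
        1 for a in human_answers
        if a.strip().lower() == predicted.strip().lower()
    )
    return min(matches / 3.0, 1.0)

if __name__ == "__main__":
    answers = ["tennis"] * 7 + ["badminton"] * 3
    print(vqa_accuracy("tennis", answers))     # 1.0 (7 annotators agree)
    print(vqa_accuracy("badminton", answers))  # 1.0 (3 annotators agree)
    print(vqa_accuracy("soccer", answers))     # 0.0 (no annotator agrees)
```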
Image captioning is the task of describing an image using natural language. It’s accomplished by first detecting the input image’s salient features, and then generating well-formed sentences that describe the image content in the expected output language.
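As a concrete, non-PaLI illustration of this encode-then-generate pattern, the sketch below captions an image with a publicly available BLIP checkpoint through the Hugging Face transformers image-to-text pipeline; the model choice and image path are placeholders of ours, and PaLI's own setup is described in the linked paper and repository.

```python
# Illustrative only: captions a local image with a public BLIP checkpoint via
# the Hugging Face "image-to-text" pipeline. This is not PaLI; it simply shows
# the pattern of encoding an image and decoding a natural-language description.
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
image = Image.open("example.jpg")  # placeholder path to any local photo
caption = captioner(image)[0]["generated_text"]
print(caption)
```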
To highlight PaLI's OCR capabilities, we selected scene-text-oriented examples from the VQA and captioning tasks:
We thank all the authors who conducted this research: Xi Chen, Xiao Wang, Soravit (Beer) Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut.
We also thank Claire Cui, Slav Petrov, Tania Bedrax-Weiss, Joelle Barral, Tom Duerig, Paul Natsev, Fernando Pereira, Jeff Dean, Jeremiah Harmsen, Zoubin Ghahramani, Erica Moreira, Victor Gomes, Sarah Laszlo, Kathy Meier-Hellstern, Susanna Ricco, Rich Lee, Austin Tarango, Emily Denton, Bo Pang, Wei Li, Jihyung Kil, Tomer Levinboim, Julien Amelot, Zhenhai Zhu, Xiangning Chen, Liang Chen, Filip Pavetic, Daniel Keysers, Matthias Minderer, Josip Djolonga, Ibrahim Alabdulmohsin, Mostafa Dehghani, Yi Tay, Elizabeth Adkison, James Cockerille, Eric Ni, Anna Davies, Alex Ku, and Maysam Moussalem for their suggestions, improvements, and support.
We thank the Parti Team for developing the template for this site.
We thank Tom Small for providing visualizations.
Sept 16, 2022