We're excited to welcome Google's new vision language model, PaliGemma 2! PaliGemma 2 is a new version of PaliGemma. Like its predecessor, it uses the powerful SigLIP for visual processing, but upgrades the text decoder to the latest Gemma 2.
Model size and input resolution
PaliGemma 2 provides new pre-trained models in 3B, 10B, and 28B parameter sizes. All of them support the following input resolutions:
- 224x224
- 448x448
- 896x896
This diverse portfolio offers great flexibility for different use cases, allowing practitioners to choose the right balance between quality and efficiency. In contrast, the previous PaliGemma generation was only available in a 3B version.
Pre-training and fine-tuning capabilities
These pre-trained models are designed to be easily adapted to downstream tasks. The first PaliGemma model was adopted by the community for a wide range of tasks thanks to its broad adaptability. This iteration introduces higher-quality pre-trained models and more options, further increasing that flexibility.
DOCCI dataset example
This time, Google has also released models fine-tuned on the DOCCI dataset, demonstrating long, detailed, and expressive image captioning. These fine-tuned models come in 3B and 10B versions with an input resolution of 448x448.
This release includes all of the open model repositories, Transformers integration, fine-tuning scripts, and a demo of a visual question-answering model that we fine-tuned on the VQAv2 dataset. These resources give users comprehensive tooling to explore and build innovative applications.
Resource links
This release includes open model repositories, Transformers integration, fine-tuning scripts, and a visual question-answering demo. Here are links to the related resources:
- Release Collection
- Fine-tuning scripts
- Fine-tuning Model Demo
- Technical Report
PaliGemma 2 Introduction
PaliGemma 2 is a new iteration of the PaliGemma vision language model, which Google released in May.
PaliGemma 2 connects the powerful SigLIP image encoder with the Gemma 2 language model.
The new models are based on the 2B, 9B, and 27B Gemma 2 language models, which correspond to the 3B, 10B, and 28B PaliGemma 2 variants respectively; the names take into account the additional parameters of the compact image encoder. As mentioned above, these models support three different resolutions, providing great flexibility for fine-tuning on downstream tasks.
PaliGemma 2 is distributed under the Gemma license, which allows redistribution, commercial use, fine-tuning, and the creation of model derivatives.
This release includes the following checkpoints in bfloat16 precision (a short checkpoint-ID sketch follows the list):
- 9 pre-trained models: 3B, 10B, and 28B, supporting the following resolutions:
  - 224x224
  - 448x448
  - 896x896
- 2 models fine-tuned on the DOCCI dataset (image-text caption pairs), available for the 3B and 10B PaliGemma 2 variants with an input resolution of 448x448.
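If you want to load one of these checkpoints programmatically, the sketch below shows how a Hub checkpoint ID can be assembled from a size and resolution. The `ft-docci` pattern matches the checkpoint used later in this post; the `pt` pattern is an assumption that the pre-trained checkpoints follow the same naming convention.

```python
# Hypothetical helper (not part of the release) that builds a Hub checkpoint ID
# from a model size and input resolution.
def paligemma2_checkpoint(size: str, resolution: int, variant: str = "pt") -> str:
    assert size in {"3b", "10b", "28b"} and resolution in {224, 448, 896}
    if variant == "pt":
        return f"google/paligemma2-{size}-pt-{resolution}"
    if variant == "ft-docci":
        assert size in {"3b", "10b"} and resolution == 448
        return f"google/paligemma2-{size}-ft-docci-{resolution}"
    raise ValueError(f"unknown variant: {variant}")

print(paligemma2_checkpoint("3b", 448))                # google/paligemma2-3b-pt-448
print(paligemma2_checkpoint("10b", 448, "ft-docci"))   # google/paligemma2-10b-ft-docci-448
```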
Model capabilities
As with previous PaliGemma releases, the pre-trained (pt) models perform very well when fine-tuned on downstream tasks.
Pre-training dataset
The pt models were pre-trained on the following data mixture. The diversity of this pre-training mixture allows the models to be fine-tuned on downstream tasks in similar domains with fewer examples.
- WebLI: a large-scale multilingual image-text dataset built from the public web. A wide range of WebLI splits equips the model with versatile capabilities such as visual semantic understanding, object localization, visually-situated text understanding, and multilinguality.
- CC3M-35L: curated English image-alt-text pairs from web pages (Sharma et al., 2018). The captions were translated into 34 additional languages using the Google Cloud Translation API.
- Visual Question Generation with Question Answering Validation (VQ2A): an improved question-answering dataset, also translated into the same 34 additional languages using the Google Cloud Translation API.
- OpenImages: detection and object-aware question-answer pairs (Piergiovanni et al., 2022) generated by handcrafted rules on the OpenImages dataset.
- WIT: Image and text datasets collected from Wikipedia (Srinivasan et al., 2021).
Fine-tuning models and benchmarking
The PaliGemma 2 team internally fine-tuned the pt models on a variety of visual-language understanding tasks and provides benchmark results for these fine-tuned models. Details can be found in the model cards and the technical report.
PaliGemma 2, fine-tuned on the DOCCI dataset, can handle a wide variety of image captioning tasks, including text rendering, capturing spatial relations, and descriptions that incorporate world knowledge.
Performance comparison
The following table compares the performance of the DOCCI fine-tuned models with other models (data from Table 6 of the technical report).
Model | Parameters | Characters (#char) | Sentences (#sent) | NES ↓ |
---|---|---|---|---|
MiniGPT-4 | 7B | 484 | 5.6 | 52.3 |
mPLUG-Owl2 | 8B | 459 | 4.4 | 48.4 |
InstructBLIP | 7B | 510 | 4.0 | 42.6 |
LLAVA-1.5 | 7B | 395 | 4.2 | 40.6 |
VILA | 7B | 871 | 8.6 | 28.6 |
PaliGemma | 3B | 535 | 8.9 | 34.3 |
PaLI-5B | 5B | 1065 | 11.3 | 32.9 |
PaliGemma 2 | 3B | 529 | 7.7 | 28.4 |
PaliGemma 2 | 10B | 521 | 7.5 | 20.3 |
Metric descriptions:
- #char: The average number of characters in the generated description.
- #sent: The average number of sentences.
- NES: Non-entailment sentences (lower is better), used as a measure of factual inaccuracy.
Below are some model outputs from the DOCCI checkpoints, demonstrating the diversity and flexibility of the models.
Input Image | Caption |
---|---|
(image) | The line graph shows the Top-1 accuracy of the ImageNet model after fine-tuning. There are four lines in different colors: blue, orange, green, and black. The blue line is the lowest of the four, representing the worst-performing model result. |
(image) | A close-up shot of a sheet of white paper with content printed in black. The paper is slightly curved in the center and the text is rendered in a typewriter font. The top of the paper reads "Ashley Hotel West Coast", below which is "WiFi Internet Service", then "Username: fqpp", and finally "Password: aaeu". |
(image) | A mural depicting David Bowie as Ziggy Stardust is painted on a white wall. The mural shows three faces side by side, each with red hair and a blue lightning bolt painted over the eyes. The faces' makeup includes blue eye shadow, pink blush, and red lips. Above the center face is a black square window with white text that reads "JAM" in a blue font. A silver car is parked at one side of the image. |
(image) | A view from above of a white marble countertop with four coffee cups on it. There are two gray mugs on the left, a white mug in the lower left corner, and another gray mug on the right. A metal fruit basket with a wooden base, filled with oranges, sits in the upper right corner. There is also a clear glass pitcher filled with water on the left, only partially visible in the image. |
(image) | A close-up of a white book with a white area on the top half and a blue stripe on the bottom. The white area is printed with black text that reads "Visual Concept Learning from User-tagged Web Video". Below the black text is a white box containing five small pictures. The image on the far left shows a man standing in a meadow, and immediately to its right is a picture of a blue ocean. |
Demo
For demonstration purposes, the Hugging Face team fine-tuned the PaliGemma 2 3B model with 448x448 input resolution on a small portion of the VQAv2 dataset. We used LoRA fine-tuning and PEFT; the details are explained in the fine-tuning section below.
The following demo shows the end result. Feel free to look at the code in the Space to see how it works, or clone it to adapt it to your own fine-tuning needs.
How to use with Transformers
You can run inference on the PaliGemma 2 models with the 🤗 Transformers library, using the PaliGemmaForConditionalGeneration and AutoProcessor APIs. Make sure the version of Transformers you have installed is 4.47 or higher:
pip install "transformers>=4.47"
After the installation is complete, you can run inference as shown in the example below. For best results, make sure you follow the task prompt format used to train the model:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
model_id = "google/paligemma2-10b-ft-docci-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
model = model.to("cuda")
processor = AutoProcessor.from_pretrained(model_id)
prompt = "<image>caption en"
image_file = "/datasets/huggingface/documentation-images/resolve/main/"  # fill in the full URL of the image to caption
raw_image = Image.open(requests.get(image_file, stream=True).raw).convert("RGB")
inputs = processor(prompt, raw_image, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=200)
input_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
# A medium shot of two cats laying on a pile of brown fishing nets. The cat in the foreground is a gray tabby cat with white on its chest and paws. The cat is laying on its side with its head facing the bottom right corner of the image. The cat in the background is laying on its side with its head facing the top left corner of the image. The cat's body is curled up, its head is slightly turned to the right, and its front paws are tucked underneath its body. There is a teal rope hanging from the fishing net in the top right corner of the image.
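The checkpoint in this example is a captioning model, but the PaliGemma family is trained with short task-prefix prompts for other tasks as well. The prompts below are illustrative, based on the PaliGemma prompt conventions rather than confirmed for every PaliGemma 2 checkpoint; which of them a given checkpoint responds to depends on how it was fine-tuned.

```python
# Illustrative task prompts used across the PaliGemma family (assumed, not exhaustive):
prompt = "<image>caption en"                        # short caption in English
prompt = "<image>answer en What is on the table?"   # visual question answering
prompt = "<image>detect cat"                        # object detection, returns <loc...> tokens
prompt = "<image>ocr"                               # read the text visible in the image
```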
You can also use bitsandbytes to load the model with quantization. The following example uses 4-bit nf4:
import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map={"":0}
)
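Loading with a quantization config does not change the inference call. Here is a brief usage sketch, reusing the processor, prompt, and raw_image from the earlier example:

```python
# The quantized weights already live on GPU 0 because of device_map={"": 0}.
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=200)
input_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```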
We quickly tested the impact of quantization on performance by evaluating a 3B fine-tuned checkpoint on the textvqa dataset, using 224x224 input images. These are the results we obtained on 5,000 validation set entries:
- bfloat16, no quantization: 60.04% accuracy.
- 8-bit: 59.78%.
- 4-bit, using the configuration in the snippet above: 58.72%.
These results are very encouraging! Of course, quantization makes more sense for larger checkpoints, and we recommend that you always measure results on the domains and tasks you are using.
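For reference, here is a minimal sketch of the kind of comparison described above. It assumes you have already assembled a list of (image, question, reference answers) examples from the textvqa validation split and that the checkpoint answers VQA-style prompts of the form `<image>answer en {question}`; the prompt format, answer normalization, and exact-match scoring are simplifying assumptions, not a reproduction of our evaluation code.

```python
def normalize(text: str) -> str:
    # Very rough normalization; official VQA scoring is more involved.
    return text.strip().lower().rstrip(".")

def vqa_accuracy(model, processor, examples, device="cuda"):
    """examples: iterable of (PIL.Image, question string, list of reference answers)."""
    correct, total = 0, 0
    for image, question, answers in examples:
        prompt = f"<image>answer en {question}"  # assumed VQA prompt format
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
        output = model.generate(**inputs, max_new_tokens=20)
        input_len = inputs["input_ids"].shape[-1]
        prediction = processor.decode(output[0][input_len:], skip_special_tokens=True)
        correct += normalize(prediction) in {normalize(a) for a in answers}
        total += 1
    return correct / total
```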
Fine-tuning
If you have previously fine-tuned PaliGemma, the API for fine-tuning PaliGemma 2 is the same and you can use your existing code directly. We provide fine-tuning scripts and a notebook to help you fine-tune the model, freeze parts of the model, or apply memory-efficient fine-tuning techniques such as LoRA or QLoRA.
For demonstration purposes, we fine-tuned a PaliGemma 2 model with LoRA on half of the VQAv2 validation set. This took half an hour on 3 A100 GPUs (80GB VRAM).
You can find the model here, and this Gradio demo showcases what it can do.
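If you prefer to wire up LoRA yourself instead of using the scripts and notebook linked above, here is a minimal sketch with 🤗 PEFT. The checkpoint ID, target modules, and hyperparameters below are illustrative assumptions, not the exact configuration used for the demo.

```python
from peft import LoraConfig, get_peft_model
from transformers import PaliGemmaForConditionalGeneration

# Assumed pre-trained checkpoint ID following the release naming pattern.
model_id = "google/paligemma2-3b-pt-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

# Illustrative LoRA configuration: adapt the attention projections and keep
# the rest of the weights frozen.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the parameters is trainable
```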
Conclusion
The newly released PaliGemma 2 is even more exciting than the previous version, with different sizes to fit a variety of needs and stronger pre-trained models. We look forward to seeing what the community builds!
We thank the Google team for releasing this amazing and open model family. Special thanks to Pablo Montalvo for integrating the model into Transformers, and to Lysandre, Raushan, Arthur, Yieh-Dar, and the rest of the team for quickly reviewing, testing, and merging the models.
Resources
- Release Collection
- PaliGemma Blog Posts
- Fine-tuning scripts
- Fine-tuning the model on VQAv2
- Demonstration of fine-tuned models
- Technical Report
Original English version: /blog/paligemma2
Original authors: Merve Noyan, Andreas P. Steiner, Pedro Cuenca, Aritra Roy Gosthipaty
Translator: xiaodouzi666