
[Book giveaway] Domestic open-source visual language model CogVLM2 online experience: it even recognizes Black Myth: Wukong


CogVLM2 is a visual language model (VLM) developed by Zhipu AI and Tsinghua KEG. It is an upgraded version of CogVLM, supports image resolutions up to 1344 × 1344, and has an open-source version that supports both Chinese and English.

Models of this kind can handle many cross-modal tasks, such as generating descriptive text for an image, answering questions about an image (known as VQA, Visual Question Answering), or retrieving images that match a given description. To perform these tasks well, CogVLM2 uses more advanced designs and techniques, such as training on larger amounts of data, deeper neural network structures, and smarter training methods.

The progress of CogVLM is mainly attributed to one core idea: "visual first". Earlier multimodal models usually put image features on the same level as text features, with a relatively simple module for processing the image, so the image played a "supporting role" to the text and the results were mediocre. CogVLM, in contrast, gives visual information a more prominent position.

Environment preparation

Local deployment

CogVLM2's open-source code is available on GitHub and supports image inference, video inference, and even model fine-tuning (though fine-tuning has high GPU requirements). GitHub address: /THUDM/CogVLM2

Linux is recommended, with an NVIDIA GPU that has at least 16 GB of VRAM.

For details on how to install and use it, see the official introduction:

/THUDM/CogVLM2/blob/main/basic_demo/README_zh.md
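If you deploy locally, single-image inference follows the pattern of the official basic_demo. Below is a minimal sketch based on that demo; the Hugging Face model id and the build_conversation_input_ids helper come from the official example code, but treat the exact parameters as assumptions and defer to the README above. Also note that loading the 19B model in bf16 needs far more than 16 GB of VRAM; see the README for quantized loading options on smaller cards.

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model id on Hugging Face (taken from the official demo; verify against the README)
MODEL_PATH = "THUDM/cogvlm2-llama3-chinese-chat-19B"
DEVICE = "cuda"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(DEVICE).eval()

image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
query = "Describe this image."

# Pack the question and image into model inputs
# (helper defined by the model's remote code, per the official demo)
inputs = model.build_conversation_input_ids(
    tokenizer, query=query, images=[image], template_version="chat"
)
inputs = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to(DEVICE),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to(DEVICE),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to(DEVICE),
    "images": [[inputs["images"][0].to(DEVICE).to(torch.bfloat16)]],
}

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens and decode only the generated answer
    answer = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
print(answer)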

Using a cloud environment

If you don't have enough GPU resources locally, know nothing about programming, or just want to see the results first, you can use the cloud platform image I have packaged: it starts with one click and runs directly, so you don't waste time on setup.

The cloud platform gives new users free credit, which is enough to try out the app. Platform registration address:

To experience image inference only, with no technical steps involved, open this URL: /applicationMarket/applicationDetails?appId=39&IC=XLZLpI7Q

Once the app has been created, you can open it under Console -> My Apps.

Because of platform limitations, if you also want to use the API or do video inference, open this URL instead: /postDetail/656

Click "Create Example" at the bottom right of the page:

Note that video inference requires more resources, so you need to select 2 GPUs here for it to run:

Once the instance has started successfully, open the JupyterLab interactive tool for the corresponding instance under Console -> Container Instances.

In JupyterLab you can select the function you want to use on the left, start the application on the right, and view the runtime log.

Then go back to the Container Instances page and click "Public Access" to get the program's external access address.

Image Inference WebUI usage instructions

1. After the container instance has started successfully, find it in the instance list and click "JupyterLab" in the Actions column.

2. On the opened page, click "Basic Page Launcher", then click the restart button on the page to start the corresponding program, as shown in the figure below:

3. After the program starts successfully, go back to the instance list page and click "Public Access":

Copy the first link in it and open it in your browser.

4. After opening the app in the browser, at the bottom of the page:

(1) First upload a picture;

(2) Then ask your questions about this picture.

Here's a demonstration using an image from Black Myth: Wukong, with the following result:

To open a new session, click this button in the upper right corner of the page:

Instructions for using the Image Inference API

1. After the container instance has started successfully, find it in the instance list and click "JupyterLab" in the Actions column.

2. On the opened page, click "Basic API Launcher", then click the restart button on the page to start the corresponding program, as shown in the figure below:

3. After the program starts successfully, go back to the instance list page and click "Public Access":

The second of these links is the API access address.

For the code to access the API, please refer to:

/THUDM/CogVLM2/blob/main/basic_demo/openai_api_request.py
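For reference, here is a minimal sketch of such a call, assuming (as the script name suggests) that the service exposes an OpenAI-compatible /v1 chat endpoint. The base_url, model name, and image path below are placeholders: replace the base_url with your instance's public API address and check the official script above for the exact values.

import base64
from openai import OpenAI

# Replace base_url with your instance's public API address (placeholder shown)
client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8000/v1")

# In the OpenAI chat format, the image travels inline as a base64 data URL
with open("example.jpg", "rb") as f:  # hypothetical local image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="cogvlm2",  # placeholder model name; check the official script
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    temperature=0.2,
)
print(response.choices[0].message.content)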

Note: the image inference API is a separate program; on a single GPU, starting it will shut down the WebUI inference program. To run both at the same time you need two GPUs, and you should set CUDA_VISIBLE_DEVICES=1 in CogVLM2/startup/start_basic_api.sh so the API uses the second card.

Video Inference Instructions

1. Video inference requires more video memory: on the cloud platform it takes two 4090D GPUs, so you need to select 2 cards when creating the instance, as shown in the figure below:

2. After the container instance has started successfully, find it in the instance list and click "JupyterLab" in the Actions column.

3. On the opened page, click "Video Recognition Launcher", then click the restart button on the page to start the corresponding program, as shown in the figure below:

4. After the program starts successfully, go back to the instance list page and click "Public Access":

The two links provide access to the web page and the API, respectively.

5. After opening the page in the browser, at the bottom of the page:

(1) First upload a video (1 minute or less);

(2) Then ask your questions about this video.

6. Using the Video Inference API

The reference code is as follows; be sure to replace the API address and the local video file path.

import requests

# Replace with your instance's public API address
url = 'http://127.0.0.1:7861/video_qa'
# Replace with the path to your local video file
video_file = "../resources/videos/lion.mp4"
question = "Describe this video in detail."
temperature = 0.2

# Upload the video as multipart form data along with the question
files = {'video': open(video_file, 'rb')}
data = {'question': question, 'temperature': temperature}
response = requests.post(url, files=files, data=data)
print(response.json()["answer"])

Participate in the book giveaway

To give back to readers, Firefly-jun and China Machine Press are running a book giveaway. The prize is the newly upgraded 3rd edition of "Machine Learning in Action", one of the "four classics" of machine learning, whose Chinese edition has a Douban rating of 9.6!

  • One of the machine learning books readers regard as extremely friendly for getting started and for practice!
  • Concrete examples + minimal theory + Python frameworks usable in production environments
  • Helps you intuitively understand the concepts and tools needed to build intelligent systems
  • Comes with plenty of code examples to help you learn and apply them!

If you want to receive the book, send the message "Machine Learning in Action" to the WeChat public account "Firefly Walking AI" to enter the drawing, which will be held on September 9 at 10:00 am!