
The Secrets Behind Building the FineVideo Dataset


Open video datasets are scarce, which slows down the development of open-source video AI. To address this, we built FineVideo: a dataset of 43,000 videos totaling 3,400 hours, with rich descriptions, narrative details, scene segmentation, and Q&A pairs.

FineVideo contains a highly diverse collection of videos and metadata, making it good material for training models to understand video content, training diffusion models to generate videos from text descriptions, or training computer vision models using its structured data as input.

Haven't seen FineVideo yet? Check it out via the Dataset Exploration Page.

About this blog post

In this blog post, we share the technical details and code for developing FineVideo: a process that started with 1.9 million videos from YouTube-Commons and ended with 44,000 videos with detailed annotations.

A good way to start is to look at the different steps of our journey, which involve content filtering, annotation, and output structuring.


FineVideo Video Filtering and Labeling Pipeline

In the next sections, we will discuss each step and provide references to the relevant parts of the code. If you prefer to browse the code directly, check out the FineVideo repository on GitHub.

First, let's take a look at how we got our initial list of YouTube videos and applied some initial filtering.

Constructing the original dataset

Our journey begins with YouTube-Commons: a collection of audio transcriptions of videos shared on YouTube under a CC-BY license. The project was created and is maintained by PleIAs as part of their corpus collection projects.

Filtering YouTube-Commons

YouTube-Commons contains videos and transcriptions in multiple languages, and our initial task was to narrow its content down to a single language.

We filtered for English-language videos in YouTube-Commons while collecting the relevant metadata. Through this initial filtering, we gathered 1.9 million videos along with their closed captions and metadata.

Here are the details of the filters applied and the metadata fields retained.

Filters

Field  Filter value  Description
original_language  en  English video
transcription_language  en  English transcription

Metadata fields

Field  Description
acodec Audio codec
age_limit Age limit for YouTube videos
categories YouTube Video Categories
channel YouTube channel
channel_follower_count Number of channel subscribers
channel_id YouTube channel identifier
character_count Number of characters in closed captions
comment_count Number of YouTube comments
description YouTube Video Description
duration_string Video duration in hh:mm:ss format
license Video License
like_count Number of YouTube video likes
resolution Video resolution in width x height format
tags Free text labels for YouTube videos
text Closed captions
title YouTube Video Title
upload_date YouTube Upload Date
vcodec Video codec
video_id YouTube Video Identifier
view_count YouTube views
word_count Number of words in closed captions

The code for content filtering and metadata collection can be found here [link].
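
As a rough illustration (not the actual FineVideo code), the language filter boils down to something like the sketch below; the dataset identifier and field names are assumptions based on the tables above.

from datasets import load_dataset

# Assumed dataset identifier and schema; adjust to the actual YouTube-Commons release.
yt_commons = load_dataset("PleIAs/YouTube-Commons", split="train", streaming=True)

# Keep only rows where both the video and its transcription are in English.
english_only = yt_commons.filter(
    lambda row: row["original_language"] == "en"
    and row["transcription_language"] == "en"
)

# Collect the video ids (plus any metadata of interest) for the download step.
video_ids = [row["video_id"] for row in english_only]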

Download Video

Once we had the list of 1.9 million target videos, we managed to download 1.8 million of them (some videos had been deleted by their channel owners and some had had their permissions changed).

We explored two different approaches to distributed downloading.

Option 1: Video2dataset

video2dataset is an open-source project [link] focused on distributed video downloading, conversion, and packaging into different dataset formats. The project natively supports the Slurm workload manager, so we could run it on our CPU cluster.


Source: Video2Dataset GitHub page

Since all our cluster instances access the Internet through the same public IP, we contributed the ability to specify a proxy for video downloads to the project. Although the feature has not been merged yet, you can patch video2dataset with our PR [link] to use the proxy feature.

Option 2: Cloud Batch Job

Most cloud providers make it possible to run batch jobs by simply defining the type of instance that will execute each job, defining queues, and providing a container with the code to be executed.

We used Google Cloud and AWS to run a homemade Docker container that downloads videos and metadata with yt-dlp and pushes the results to S3.

The files for building the Docker container can be found here [code].
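
For illustration, here is a minimal sketch of what such a job could do with yt-dlp and boto3; the bucket name, paths, and format options are assumptions, not the actual FineVideo container code.

import boto3
from yt_dlp import YoutubeDL

def download_and_upload(video_id: str, bucket: str = "finevideo-staging") -> None:
    """Download one video plus its metadata with yt-dlp and push both to S3."""
    url = f"https://www.youtube.com/watch?v={video_id}"
    opts = {
        "format": "mp4",                        # pick an mp4 stream
        "outtmpl": f"/tmp/{video_id}.%(ext)s",  # where to write the video
        "writeinfojson": True,                  # also write <id>.info.json metadata
        "quiet": True,
    }
    with YoutubeDL(opts) as ydl:
        ydl.download([url])

    s3 = boto3.client("s3")
    s3.upload_file(f"/tmp/{video_id}.mp4", bucket, f"videos/{video_id}.mp4")
    s3.upload_file(f"/tmp/{video_id}.info.json", bucket, f"metadata/{video_id}.info.json")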

Our conclusions

While Video2Dataset worked with proxies and allowed us to perform additional processing steps, the number of requests per second we could make to the proxies became a bottleneck. This led us to turn to cloud batch jobs.

Preservation of dynamic content

In our search for the best videos, we narrowed the selection down to content with both visual action and moderately fast speech. We achieved this through word density filtering and visual dynamism filtering.

Word Density Filtering

We use the word density of a video as a proxy for audio dynamism. Word density is defined as:

Word density = number of words in closed captions / total video duration (sec)

By sampling at different density thresholds and visually assessing the quality of the content, we decided to remove all videos with word densities below 0.5 words/second.

Examples:

Word density  Example
0.25 Click to view sample video
0.5 Click to view sample video
0.75 Click to view sample video
1.0 Click to view sample video

The code for the word density filtering and exploration examples can be found here [link].
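
A minimal sketch of this filter is shown below; the field names follow the metadata table above, and the 0.5 words/second threshold comes from the text.

WORD_DENSITY_THRESHOLD = 0.5  # words per second

def duration_to_seconds(duration_string: str) -> int:
    """Convert an hh:mm:ss (or mm:ss) duration string into seconds."""
    seconds = 0
    for part in duration_string.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def keep_video(metadata: dict) -> bool:
    """Word density = words in closed captions / total video duration (seconds)."""
    duration = duration_to_seconds(metadata["duration_string"])
    density = metadata["word_count"] / duration
    return density >= WORD_DENSITY_THRESHOLD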

Visual Dynamics Filtering

We repurposed FFmpeg's freezedetect filter to determine how dynamic a video is. Although this filter is designed to recognize frozen sections of a video (multiple identical frames in a row), setting its noise parameter to a very high value lets us recognize low-motion sections instead.

Instead of running freezedetect on the entire video, we analyze it in time segments and vote on whether the video is static based on the number of segments classified as static. Through manual evaluation, we set a threshold: a video is discarded if 40% of the analyzed segments are low motion.

Some of the types of content that are discarded after this filtering:

Type  Example
Still images with music Click to view sample video
Demo Screen Recording Click to view sample video
Highly static person talking to camera Click to view sample video

The Dockerfile and code for classifying video dynamism can be found here [link].
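
A minimal sketch of this segment-and-vote approach is shown below; the segment length, noise value, and log parsing are illustrative assumptions, while the 40% threshold comes from the text.

import re
import subprocess

SEGMENT_SECONDS = 60        # assumed segment length
NOISE = 0.95                # very high noise tolerance: detects low motion, not just frozen frames
STATIC_SEGMENT_RATIO = 0.4  # discard the video if 40% of the segments are low motion

def segment_is_static(path: str, start: float, length: float = SEGMENT_SECONDS) -> bool:
    """Run freezedetect on one time segment and report whether most of it is low motion."""
    cmd = [
        "ffmpeg", "-hide_banner", "-nostats",
        "-ss", str(start), "-t", str(length), "-i", path,
        "-vf", f"freezedetect=n={NOISE}", "-f", "null", "-",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # freezedetect logs lines such as "lavfi.freezedetect.freeze_duration: 12.3"
    frozen = sum(float(d) for d in re.findall(r"freeze_duration: ([\d.]+)", result.stderr))
    return frozen > length / 2

def video_is_static(path: str, duration: float) -> bool:
    """Vote across all segments and flag the video if too many of them are static."""
    starts = list(range(0, int(duration), SEGMENT_SECONDS))
    static_votes = sum(segment_is_static(path, s) for s in starts)
    return static_votes / max(len(starts), 1) >= STATIC_SEGMENT_RATIO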

Of the 1.8 million videos analyzed, we retained 600,000 dynamic videos after this step. At this stage, we delved deeper into the content of the videos, which is essential for ensuring the diversity of the dataset.

Video Categorization

To achieve the most diverse selection of content, we categorized the 600,000 filtered assets using their closed captions and YouTube metadata. To gain control over the categorization process, we created a taxonomy and guided the annotation process to follow it.

Custom-built taxonomy

We used GPT-4o to bootstrap a custom taxonomy, which was then reviewed and adjusted by information scientists. The taxonomy contains 126 sub-categories organized into multiple levels. This multi-level approach allows FineVideo users to slice and dice the dataset according to their specific use cases.

Taxonomy

The taxonomy is also available in JSON [link].

With the initial version of the taxonomy, we started content labeling and, by looking at the labeling results together with the information scientists, adjusted the taxonomy accordingly.

Content Labeling

We categorized the videos using Llama 3.1 70B served via Text Generation Inference (TGI) [code].

The prompt required multiple iterations to ensure the answer was strictly one of the categories in our taxonomy. During prompt evaluation, we found that removing the existing YouTube tags and categories from the prompt significantly improved the quality of the results: the YouTube metadata biased the text generated by Llama 3.1 towards one of the categories provided by YouTube.

prompt_template = """
Given these categories: {leaves}
Categorize the YouTube video based on its closed captions and some metadata details. Return only the selected category, nothing else!
Title: {title}
Description: {description}
Channel: {channel}
Closed captions: {closed_caption}
"""

Taxonomy Feedback Loops - Content Labeling


Feedback loops for taxonomy adaptation in the content categorization process

One of the roles of the information scientist is to tweak the taxonomy over time to add new categories or to add some additional differentiation if needed.

Using LLMs for content categorization reduced the taxonomy tuning time from months/years to hours. In addition, in some cases we created categories specifically for discarding sensitive videos, such as those belonging to Firearms & Weapons and Substance Use & Drugs.

Contributing descriptive metadata

At this stage, we have three sources of video-level metadata:

  • Video category (inferred with Llama 3.1)
  • YouTube metadata (title, description)
  • Transcription from YouTube-Commons

To contribute to the field of video understanding, we decided to dive into timecode-level metadata, covering activities, objects, narrative, and editing aspects.

While we considered manual annotation as part of an active-learning setup, in which one or more models propose annotations and the QA steps are performed manually, we found Gemini to be a good solution, especially given our constraints on input video length and output format.

Long video & Gemini 1.5 Pro

We dug into Gemini 1.5 Pro, iterating on our prompt and testing it with different content lengths.

Since Gemini 1.5 Pro is limited to 1M tokens, roughly equivalent to ~1 hour of video, we were forced to discard videos longer than 1 hour.

To overcome this limitation, we tried speeding up videos longer than 1 hour so that they would fit into Gemini's context window.


Exploration: speeding up videos to fit Gemini's context window
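
For reference, this kind of speed-up can be done with FFmpeg by scaling the video timestamps and audio tempo; a minimal sketch with an assumed fixed 2x factor:

import subprocess

def speed_up_2x(src: str, dst: str) -> None:
    """Play the video twice as fast: halve video timestamps and double audio tempo."""
    subprocess.run([
        "ffmpeg", "-i", src,
        "-filter_complex", "[0:v]setpts=PTS/2[v];[0:a]atempo=2.0[a]",
        "-map", "[v]", "-map", "[a]",
        dst,
    ], check=True)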

While it seemed to work at a high level, when we started looking at the details, we found that only the first few minutes of the video were accurately labeled.

Noticing that quality dropped on longer videos, we wondered whether this affected the rest of our videos. By sampling videos of different lengths and checking the annotation coverage, we found a drop in quality for videos longer than 10 minutes.

In keeping with our goal of providing the community with high-quality data, we discarded videos longer than 10 minutes.

Content Selection

Since annotating with Gemini costs more than $5 per hour of video, we could not annotate all of the filtered videos. We therefore wanted to make sure we had good coverage of all topics while finding a good balance between content diversity and budget. We set this size constraint at 4,000 hours of video.

To select 4,000 hours of content from 600,000 videos, we prepared an algorithm that balanced content categories, user engagement, and channel representation to achieve the target hours.

Algorithm flowchart

Some key parts of the content selection algorithm:

  • Activity Scoring: We calculate engagement metrics for each video by combining the number of comments, views and likes and assigning different weights. This scoring helps prioritize videos that resonate with viewers.
  • Video Selection: This step iteratively selects videos to reach a target duration while ensuring diversity. It balances high-engagement content with representation of various categories and channels, using a penalty system to avoid over-representation of any single channel.
  • Final Adjustment: We adjust the selection to get as close as possible to the target duration without exceeding it. The algorithm sorts the selected videos by duration and adds them to the final list until the total is as close as possible to the target duration.

The code can be found here [link].
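
The sketch below is a heavily simplified rendition of that selection loop; the engagement weights, channel penalty factor, and data layout are illustrative assumptions rather than the exact algorithm (which also balances content categories).

from collections import defaultdict

TARGET_HOURS = 4000

def engagement_score(video: dict) -> float:
    # Assumed weights combining comments, views, and likes into one engagement metric
    return 1.0 * video["comment_count"] + 0.1 * video["view_count"] + 0.5 * video["like_count"]

def select_videos(videos: list[dict], target_hours: float = TARGET_HOURS) -> list[dict]:
    selected, total_hours = [], 0.0
    per_channel = defaultdict(int)

    def penalized_score(video: dict) -> float:
        # Penalize channels that are already well represented in the selection
        return engagement_score(video) * (0.7 ** per_channel[video["channel_id"]])

    remaining = list(videos)
    while remaining and total_hours < target_hours:
        remaining.sort(key=penalized_score, reverse=True)
        video = remaining.pop(0)
        duration_hours = video["duration_seconds"] / 3600
        if total_hours + duration_hours > target_hours:
            continue  # final adjustment: skip videos that would overshoot the target
        selected.append(video)
        per_channel[video["channel_id"]] += 1
        total_hours += duration_hours
    return selected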

Structured Output Annotation with Gemini 1.5 Pro and GPT-4o

Why do you need structured data?

One of our goals in building FineVideo is to provide structured data to empower our community: if you're working on multimodal LLMs, you can slice and dice the data and decide which categories are suitable for your pre-training or fine-tuning portfolio. If you're more focused on computer vision, you can use the dataset directly to train classifiers based on numerical categories included in FineVideo, such as dynamics scores, scene boundaries, or audio/video relevance scores.

Structured Data and Gemini 1.5

Gemini 1.5 Pro allows JSON-based output to be generated by providing a schema. We explored this feature and quickly ran into two problems:

  • We couldn't fit our original schema into Gemini because it is highly complex
  • When we tried a slightly simplified schema (still quite complex), the quality of the Gemini results dropped significantly: most of the scene-level data (characters, activities, props) was missing. We tried splitting the prompt into multiple prompts and mapping different prompts to different parts of the schema, but without much success.

Our observations are fully consistent with the experience of other researchers: adding specific schema constraints can degrade performance (Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models).

Our solution relies on generating free text using Gemini 1.5 and adding a second processing step to align the Gemini results to our schema.

The Gemini prompt we use is as follows:

Study the video and provide the following details about the video and the semantic scenes that comprise it.

- characterList: a list of characters that appear throughout the video, along with a visual description that should allow me to recognize them just by looking at a picture of them.
- scenes: list of scenes, with the following attributes.
  - the start/end timestamp of the scene
  - a list of all the characters that appear in the scene
  - a list of all activities and their timestamps
  - list of all props and their timestamps
  - A list of all video editing details and their start/end timestamps. Details include transitions, effects, music, and suggestions such as scene clips that can be deleted and the reasons for them
  - Scene moods with descriptions of how visual effects, audio, and context contribute. Use the following taxonomy to return names only: {"moods": {"Positive": [{"name": "Happy", "description": "Feeling joyful, content, or delighted."}, {"name": "Excited", "description": "Feeling enthusiastic, energetic, or eager."}, {"name": "Calm", "description": "Feeling peaceful, relaxed, or serene."}, {"name": "Grateful", "description": "Feeling appreciative or thankful."}, {"name": "Proud", "description": "Feeling satisfied with one's achievements or the achievements of others."}], "Negative": [{"name": "Sad", "description": "Feeling down, unhappy, or sorrowful."}, {"name": "Angry", "description": "Feeling irritated, frustrated, or furious."}, {"name": "Anxious", "description": "Feeling nervous, worried, or uneasy."}, {"name": "Lonely", "description": "Feeling isolated, disconnected, or abandoned."}, {"name": "Bored", "description": "Feeling uninterested, disengaged, or restless."}], "Neutral": [{"name": "Indifferent", "description": "Feeling neither particularly positive nor negative."}, {"name": "Content", "description": "Feeling satisfied but not overly excited."}, {"name": "Curious", "description": "Feeling interested or inquisitive without strong emotion."}, {"name": "Confused", "description": "Feeling uncertain or unclear but without strong negative feelings."}, {"name": "Pensive", "description": "Feeling thoughtful or reflective without strong emotional engagement."}]}}
    - Specific moments of emotional change within scenes, reporting timestamps and transitions in whatever dimension we are in (visual/auditory)
  - Scene Narrative Progression and Plot Development
    - Specific narrative moments within scenes. Reporting timestamps and what happens
  - Character interactions and dynamic descriptions and their start/end timestamps
  - Specific thematic elements and descriptions
  - Specific related events to create deeper meaning and subtext not explicitly stated but contributing to the richness and depth of the content Timestamps and descriptions
  - Scene Dynamics Score. Score ranges from 0 to 1. 1 is highly dynamic
  - Audio and video relevance score. Score ranges from 0 to 1. 0 means what we see is not relevant to the voice, 1 is highly relevant

- storylines: list of different storylines found and which scenes belong to it.
  - If the content presents a narrative story, specify the climax (scene and timestamp); otherwise note that it is more a collection of factual or non-narrative information
  - If there are scenes that don't belong in the story line, explain how they contribute to the video
- In terms of the overall video and story line, what video segments could be trimmed to make them more dynamic?
- q&a: A list of 5 Q&A's about the video focusing on details (objects and/or activities), overall story reasoning, and mood. Focus as much as possible on the Q&A aspects captured in the audio and video that are hard to get by just looking at the transcription.
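
A minimal sketch of sending a video and this prompt to Gemini 1.5 Pro with the google-generativeai SDK; the model name, upload flow, and prompt variable are assumptions based on Google's public API rather than the exact FineVideo code.

import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

def annotate_video(path: str, prompt: str) -> str:
    """Upload a video file and ask Gemini 1.5 Pro for free-text annotations."""
    video_file = genai.upload_file(path=path)
    # Wait until the uploaded file has been processed and can be referenced
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)
    response = model.generate_content([video_file, prompt])
    return response.text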

Adding Instructor

Once Gemini has produced its results, we parse them with Instructor: a Pydantic-based library for producing structured output given a schema. See the table below for an example.

Instructor allows us to use different models to convert Gemini's free text into our defined Pydantic schema. We tried both Gemini and GPT-4o, and eventually chose GPT-4o because of its higher success rate.
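
A minimal sketch of this second step with Instructor and GPT-4o; the Pydantic classes below are a heavily trimmed stand-in for the real FineVideo schema.

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Character(BaseModel):
    name: str
    description: str

class Scene(BaseModel):
    sceneId: int
    title: str
    cast: list[str]
    activities: list[str]
    props: list[str]
    mood: str

class VideoAnnotation(BaseModel):
    title: str
    description: str
    characterList: list[Character]
    scenes: list[Scene]
    dynamismScore: float
    audioVisualCorrelation: float

# Patch the OpenAI client so responses are validated against the Pydantic schema
client = instructor.from_openai(OpenAI())

def align_to_schema(gemini_free_text: str) -> VideoAnnotation:
    """Ask GPT-4o to map Gemini's free-text output onto the schema."""
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=VideoAnnotation,
        messages=[{"role": "user", "content": f"Fit this video description into the schema:\n{gemini_free_text}"}],
    )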

video

Click to view video

Gemini Output

  CharacterList:
  Man Slim build, brown eyes, shaved sides, black hoodie with colorful logo, black pants.
  Scenes
  Scene 1
  Start 0:00 End 0:55
  Characters: [Man]
  Activities:
  Introduces bus
  Describes peaceful location with cows
  Props: Bus, cows, deck.
  Mood: Excited, adventure.
  Narrative Progression:
  Introduction to bus.
  Tour begins outside, highlighting nature and relaxation.
  Dynamism Score 0.7
  Audio-Visual Correlation 1

Instructor output

{
  "title": "Bertie the Bus Tour",
  "description": "Guided tour of converted bus.",
  "characterList": [
    {
      "name": "Narrator",
      "description": "Slim build, brown eyes, shaved sides, black hoodie with colorful logo, black pants."
    }
  ],
  "scenes": [
    {
      "sceneId": 1,
      "title": "Introduction to Bus",
      "timestamps": {
        "start": "0:00",
        "end": "0:55"
      },
      "cast": ["Narrator"],
      "activities": [
        "Narrator speaks in front of bus",
        "Shows outdoor deck with chairs, cows nearby."
      ],
      "props": ["Bus", "Deck", "Cows"],
      "mood": "Excited, adventure."
    }
  ],
  "dynamismScore": 0.7,
  "audioVisualCorrelation": 1
}

It's worth noting that Gemini's content filtering discards some videos, which can happen when using Gemini. In our case, given the amount of content we were targeting, the total number of minutes filtered out by Gemini was negligible.

The full code for annotating the videos can be found here [link].

Fine Alignment and Outlier Filtering

After a video is annotated and its data correctly aligned to our schema, we look at the temporal domain of the data and make sure it is aligned with the video: Gemini 1.5 reads video at 1 frame per second, whereas videos typically have 25-29 frames per second. In our fine alignment step, we make sure the scene boundaries provided by Gemini 1.5 match the correct frames in the video.

We also use this time-domain alignment to discard cases where Gemini stops providing useful data and video sections are mislabeled. Note that since we discard all content longer than 10 minutes early in the pipeline, the number of videos with poor quality data is negligible (less than 0.5%).


Fine Metadata - Video Scene Boundary to Shot Alignment as a Mechanism for Discarding Outliers

The code for the video alignment can be found here [link].
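
A minimal sketch of mapping Gemini's one-second-resolution timestamps onto actual frame indices; the timestamp format and any snapping to shot boundaries are assumptions.

def timestamp_to_seconds(timestamp: str) -> int:
    """Convert an 'mm:ss' or 'hh:mm:ss' timestamp from Gemini into seconds."""
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def scene_boundary_frames(scene: dict, fps: float) -> tuple[int, int]:
    """Map a scene's start/end timestamps (1-second resolution) to frame indices."""
    start_frame = round(timestamp_to_seconds(scene["timestamps"]["start"]) * fps)
    end_frame = round(timestamp_to_seconds(scene["timestamps"]["end"]) * fps)
    return start_frame, end_frame

# Example: a scene labeled 0:00-0:55 in a 25 fps video spans frames 0 to 1375.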

Future work

We are currently preparing to train multimodal LLMs using FineVideo, and we plan to share the model weights and training recipes with the community when we are done.

We're also open to other extensions to FineVideo, so let us know what you'd like to see!


Link to original article: /blog/fine-video

Original authors: Miquel Farré, Andres Marafioti, Lewis Tunstall, Leandro von Werra, Pedro Cuenca, Thomas Wolf

Translator: roseking