Video classification
Video classification is the task of assigning a label or class to an entire video, where each video is expected to have only one class. Video classification models take a video as input and return a prediction about which class the video belongs to. These models can be used to categorize what a video is all about. A real-world application of video classification is action / activity recognition, which is useful for fitness applications. It is also helpful for vision-impaired individuals, especially when they are commuting.
This guide will show you how to:
Fine-tune VideoMAE on a subset of the UCF-101 dataset.
Use your fine-tuned model for inference.
The task illustrated in this tutorial is supported by several model architectures, including VideoMAE, which is used in this guide.
Before you begin, make sure you have all the necessary libraries installed; in particular, this guide relies on Transformers, PyTorchVideo, and Evaluate.
You will use PyTorchVideo (dubbed pytorchvideo) to process and prepare the videos.
We encourage you to log in to your BOINC AI account so you can upload and share your model with the community. When prompted, enter your token to log in:
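A minimal sketch of the login step, assuming the notebook login flow from the huggingface_hub library:

```py
from huggingface_hub import notebook_login

# Prompts for an access token and stores it for later uploads.
notebook_login()
```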
Start by loading a subset of the UCF-101 dataset. This will give you a chance to experiment and make sure everything works before spending more time training on the full dataset.
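The original download snippet was not preserved here, so this is a sketch assuming the subset is published as a single archive on the Hub; the repository ID and filename are placeholders:

```py
from huggingface_hub import hf_hub_download

# Placeholder repository ID and filename; substitute the dataset repo used in the guide.
repo_id = "<user>/<ucf101-subset-repo>"
filename = "UCF101_subset.tar.gz"

file_path = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
```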
After the subset has been downloaded, you need to extract the compressed archive:
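A sketch of the extraction step, assuming the archive is a tarball saved at file_path above:

```py
import tarfile

# Extract the archive into the current working directory.
with tarfile.open(file_path) as t:
    t.extractall(".")
```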
At a high level, the dataset is organized like so:
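The original listing is missing, so here is an illustrative sketch of the layout (the folder and file names are examples): each split contains one folder per class, and each class folder contains the video clips.

```
UCF101_subset/
    train/
        ApplyEyeMakeup/
            v_ApplyEyeMakeup_g01_c01.avi
            ...
        ...
    val/
        ...
    test/
        ...
```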
The (sorted) video paths appear like so:
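The original listing is not preserved; as an illustrative sketch (assuming the extracted folder from the previous step is named UCF101_subset), you could gather and inspect the paths like this:

```py
import pathlib

dataset_root_path = pathlib.Path("UCF101_subset")  # illustrative folder name
all_video_file_paths = sorted(dataset_root_path.glob("**/*.avi"))
print(all_video_file_paths[:5])
```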
You will notice that there are video clips belonging to the same group / scene, where the group is denoted by g in the video file paths: v_ApplyEyeMakeup_g07_c04.avi and v_ApplyEyeMakeup_g07_c06.avi, for example.
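For the validation and evaluation splits, you wouldn't want to have video clips from the same group / scene, to prevent data leakage. The subset that you are using in this tutorial takes this information into account.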
Next up, you will derive the set of labels present in the dataset. Also, create two dictionaries that'll be helpful when initializing the model:
label2id: maps the class names to integers.
id2label: maps the integers to class names.
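A sketch of this step, assuming the folder layout and the all_video_file_paths list from above (the class name is the parent folder of each clip):

```py
class_labels = sorted({path.parent.name for path in all_video_file_paths})
label2id = {label: i for i, label in enumerate(class_labels)}
id2label = {i: label for label, i in label2id.items()}

print(f"Unique classes: {list(label2id.keys())}.")
```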
There are 10 unique classes. For each class, there are 30 videos in the training set.
Instantiate a video classification model from a pretrained checkpoint and its associated image processor. The model's encoder comes with pre-trained parameters, and the classification head is randomly initialized. The image processor will come in handy when writing the preprocessing pipeline for our dataset.
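A sketch of the model loading step. The checkpoint name is an assumption (the VideoMAE base checkpoint, MCG-NJU/videomae-base); any compatible video classification checkpoint follows the same pattern:

```py
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

model_ckpt = "MCG-NJU/videomae-base"  # assumed checkpoint; see the note below for alternatives
image_processor = VideoMAEImageProcessor.from_pretrained(model_ckpt)
model = VideoMAEForVideoClassification.from_pretrained(
    model_ckpt,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # needed if you fine-tune a checkpoint that already has a head
)
```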
While the model is loading, you might notice a warning saying that some weights of the checkpoint were not used (e.g. the weights and bias of the classifier layer) and that some others were randomly initialized (the weights and bias of a new classifier layer). This is expected in this case, because we are adding a new head for which we don't have pretrained weights, so the library warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.
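Note that the MCG-NJU/videomae-base-finetuned-kinetics checkpoint leads to better performance on this task, as it was obtained by fine-tuning on a similar downstream task with considerable domain overlap. A checkpoint obtained by fine-tuning MCG-NJU/videomae-base-finetuned-kinetics on this subset is also available.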
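For preprocessing the videos, you will leverage the PyTorchVideo library. Start by importing the dependencies we need. The exact import list was not preserved here; the sketch below assumes the transforms used later in this guide:

```py
import os

import pytorchvideo.data
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    Normalize,
    RandomShortSideScale,
    UniformTemporalSubsample,
)
from torchvision.transforms import (
    Compose,
    Lambda,
    RandomCrop,
    RandomHorizontalFlip,
    Resize,
)
```

For the training dataset transformations, use a combination of uniform temporal subsampling, pixel normalization, random cropping, and random horizontal flipping. For the validation and evaluation dataset transformations, keep the same transformation chain except for random cropping and horizontal flipping. To learn more about the details of these transformations, check out the official PyTorchVideo documentation.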
Use the image_processor associated with the pre-trained model to obtain the following information:
Image mean and standard deviation with which the video frame pixels will be normalized.
Spatial resolution to which the video frames will be resized.
Start by defining some constants.
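A sketch of this step; the frame-sampling settings (sample rate and fps) are illustrative assumptions:

```py
mean = image_processor.image_mean
std = image_processor.image_std

if "shortest_edge" in image_processor.size:
    height = width = image_processor.size["shortest_edge"]
else:
    height = image_processor.size["height"]
    width = image_processor.size["width"]
resize_to = (height, width)

num_frames_to_sample = model.config.num_frames
sample_rate = 4  # illustrative
fps = 30  # illustrative
clip_duration = num_frames_to_sample * sample_rate / fps
```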
Now, define the dataset-specific transformations and the datasets themselves, starting with the training set:
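A sketch along the lines described above, assuming the UCF-101-style folder layout and the constants just defined:

```py
train_transform = Compose(
    [
        ApplyTransformToKey(
            key="video",
            transform=Compose(
                [
                    UniformTemporalSubsample(num_frames_to_sample),
                    Lambda(lambda x: x / 255.0),
                    Normalize(mean, std),
                    RandomShortSideScale(min_size=256, max_size=320),
                    RandomCrop(resize_to),
                    RandomHorizontalFlip(p=0.5),
                ]
            ),
        ),
    ]
)

train_dataset = pytorchvideo.data.Ucf101(
    data_path=os.path.join(dataset_root_path, "train"),
    clip_sampler=pytorchvideo.data.make_clip_sampler("random", clip_duration),
    decode_audio=False,
    transform=train_transform,
)
```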
The same workflow can be applied to the validation and evaluation sets:
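A sketch for the validation and evaluation (test) sets, dropping the random crop and flip:

```py
val_transform = Compose(
    [
        ApplyTransformToKey(
            key="video",
            transform=Compose(
                [
                    UniformTemporalSubsample(num_frames_to_sample),
                    Lambda(lambda x: x / 255.0),
                    Normalize(mean, std),
                    Resize(resize_to),
                ]
            ),
        ),
    ]
)

val_dataset = pytorchvideo.data.Ucf101(
    data_path=os.path.join(dataset_root_path, "val"),
    clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
    decode_audio=False,
    transform=val_transform,
)

test_dataset = pytorchvideo.data.Ucf101(
    data_path=os.path.join(dataset_root_path, "test"),
    clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
    decode_audio=False,
    transform=val_transform,
)
```

Note: The dataset pipelines above follow the official PyTorchVideo example. We're using the pytorchvideo.data.Ucf101() function because it's tailored for the UCF-101 dataset. Under the hood, it returns a LabeledVideoDataset object. LabeledVideoDataset is the base class for all things video in the PyTorchVideo dataset, so if you want to use a custom dataset not supported off-the-shelf by PyTorchVideo, you can extend the LabeledVideoDataset class accordingly. Refer to the data API documentation to learn more. Also, if your dataset follows a similar structure (as shown above), then pytorchvideo.data.Ucf101() should work just fine.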
You can access the num_videos argument to know the number of videos in the dataset.
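For example, a minimal sketch:

```py
print(train_dataset.num_videos, val_dataset.num_videos, test_dataset.num_videos)
```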
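Leverage Trainer from Transformers for training the model. To instantiate a Trainer, you need to define the training configuration and an evaluation metric. The most important is the TrainingArguments, which is a class that contains all the attributes to configure the training. It requires an output folder name, which will be used to save the checkpoints of the model. It also helps sync all the information in the model repository on the Hub.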
Most of the training arguments are self-explanatory, but one that is quite important here is remove_unused_columns=False. This one will drop any features not used by the model's call function. By default it's True because usually it's ideal to drop unused feature columns, making it easier to unpack inputs into the model's call function. But in this case, you need the unused features ('video' in particular) in order to create pixel_values (which is a mandatory key our model expects in its inputs).
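A sketch of a possible configuration; the hyperparameter values (learning rate, batch size, number of epochs) are illustrative, not the guide's originals:

```py
from transformers import TrainingArguments, Trainer

model_name = model_ckpt.split("/")[-1]
new_model_name = f"{model_name}-finetuned-ucf101-subset"
num_epochs = 4  # illustrative
batch_size = 8  # illustrative

args = TrainingArguments(
    new_model_name,
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,  # illustrative
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
    # The dataset has no __len__, so give the Trainer an explicit step budget.
    max_steps=(train_dataset.num_videos // batch_size) * num_epochs,
)
```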
The dataset returned by pytorchvideo.data.Ucf101() doesn't implement the __len__ method. As such, we must define max_steps when instantiating TrainingArguments.
Next, you need to define a function to compute the metrics from the predictions, which will use the metric you'll load now. The only preprocessing you have to do is to take the argmax of our predicted logits:
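A sketch, assuming accuracy is loaded via the Evaluate library:

```py
import evaluate
import numpy as np

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    # The predictions are logits; take the argmax over the class dimension.
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
```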
A note on evaluation:
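In the VideoMAE paper, the authors use the following evaluation strategy: they evaluate the model on several clips from test videos, apply different crops to those clips, and report the aggregate score. However, in the interest of simplicity and brevity, we don't consider that in this tutorial.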
Also, define a collate_fn, which will be used to batch examples together. Each batch consists of 2 keys, namely pixel_values and labels.
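A sketch of such a collate function, assuming each pytorchvideo example stores the clip as a (num_channels, num_frames, height, width) tensor under the "video" key:

```py
import torch


def collate_fn(examples):
    # Permute to (num_frames, num_channels, height, width), the layout the model expects.
    pixel_values = torch.stack(
        [example["video"].permute(1, 0, 2, 3) for example in examples]
    )
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}
```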
Then you just pass all of this along with the datasets to Trainer:
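A sketch of the Trainer setup, reusing the objects defined above:

```py
trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=image_processor,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
)
```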
You might wonder why you passed along the image_processor as a tokenizer when you preprocessed the data already. This is only to make sure the image processor configuration file (stored as JSON) will also be uploaded to the repo on the Hub.
Now fine-tune our model by calling the train method:
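A minimal sketch of this step:

```py
train_results = trainer.train()
```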
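Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model: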
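A minimal sketch of this step:

```py
trainer.push_to_hub()
```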
Great, now that you have fine-tuned a model, you can use it for inference!
Load a video for inference:
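One simple option, sketched here, is to draw a sample from the test dataset prepared earlier:

```py
sample_test_video = next(iter(test_dataset))
```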
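The simplest way to try out your fine-tuned model for inference is to use it in a pipeline. Instantiate a pipeline for video classification with your model, and pass your video to it: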
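A sketch, assuming your fine-tuned checkpoint was pushed to the Hub; the model ID and video path below are placeholders:

```py
from transformers import pipeline

video_cls = pipeline(
    task="video-classification",
    model="<your-username>/videomae-base-finetuned-ucf101-subset",  # placeholder
)
video_cls("path/to/your_video.avi")  # placeholder path
```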
You can also manually replicate the results of the pipeline if you'd like.
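A sketch of a small helper that mirrors the preprocessing used by collate_fn above (the function name is ours, not the guide's):

```py
def run_inference(model, video):
    # Permute to (num_frames, num_channels, height, width) and add a batch dimension.
    permuted_video = video.permute(1, 0, 2, 3)
    inputs = {"pixel_values": permuted_video.unsqueeze(0)}

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model = model.to(device)

    # Forward pass without gradient tracking.
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    return logits
```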
Now, pass your input to the model and return the logits:
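For example, using the sample loaded earlier:

```py
logits = run_inference(model, sample_test_video["video"])
```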
Decoding the logits, we get:
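A minimal sketch of the decoding step:

```py
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```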