Image classification

Image classification assigns a label or class to an image. Unlike text or audio classification, the inputs are the pixel values that comprise an image. There are many applications for image classification, such as detecting damage after a natural disaster, monitoring crop health, or helping screen medical images for signs of disease.

This guide illustrates how to:

  1. Fine-tune ViT on the Food-101 dataset to classify a food item in an image.

  2. Use your fine-tuned model for inference.

The task illustrated in this tutorial is supported by the following model architectures:

BEiT, BiT, ConvNeXT, ConvNeXTV2, CvT, Data2VecVision, DeiT, DiNAT, DINOv2, EfficientFormer, EfficientNet, FocalNet, ImageGPT, LeViT, MobileNetV1, MobileNetV2, MobileViT, MobileViTV2, NAT, Perceiver, PoolFormer, PVT, RegNet, ResNet, SegFormer, SwiftFormer, Swin Transformer, Swin Transformer V2, VAN, ViT, ViT Hybrid, ViTMSN

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We encourage you to log in to your BOINC AI account to upload and share your model with the community. When prompted, enter your token to log in:

>>> from boincai_hub import notebook_login

>>> notebook_login()

Load Food-101 dataset

Start by loading a smaller subset of the Food-101 dataset from the 🌍 Datasets library. This will give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

>>> from datasets import load_dataset

>>> food = load_dataset("food101", split="train[:5000]")

Split the dataset's train split into a train and test set with the train_test_split method:

>>> food = food.train_test_split(test_size=0.2)

Then take a look at an example:

>>> food["train"][0]
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F52AFC8AC50>,
 'label': 79}

Each example in the dataset has two fields:

  • image: a PIL image of the food item

  • label: the label class of the food item

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name to an integer and vice versa:

>>> labels = food["train"].features["label"].names
>>> label2id, id2label = dict(), dict()
>>> for i, label in enumerate(labels):
...     label2id[label] = str(i)
...     id2label[str(i)] = label

Now you can convert the label id to a label name:

>>> id2label[str(79)]
'prime_rib'
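
The reverse mapping works the same way; note that the ids are stored as strings here, matching how the dictionaries were built above:

>>> label2id["prime_rib"]
'79'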

Preprocess

The next step is to load a ViT image processor to process the image into a tensor:

>>> from transformers import AutoImageProcessor

>>> checkpoint = "google/vit-base-patch16-224-in21k"
>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)

Pytorch

Apply some image transformations to the images to make the model more robust against overfitting. Here you'll use torchvision's transforms module, but you can also use any image library you like.

Crop a random part of the image, resize it, and normalize it with the image mean and standard deviation:

>>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
>>> size = (
...     image_processor.size["shortest_edge"]
...     if "shortest_edge" in image_processor.size
...     else (image_processor.size["height"], image_processor.size["width"])
... )
>>> _transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])
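
As a quick optional check, you can apply the transform to a single raw image and confirm that it produces a normalized tensor (for this checkpoint the crop size is 224x224):

>>> _transforms(food["train"][0]["image"].convert("RGB")).shape
torch.Size([3, 224, 224])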

Then create a preprocessing function to apply the transforms and return the pixel_values - the inputs to the model - of the image:

>>> def transforms(examples):
...     examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     del examples["image"]
...     return examples

To apply the preprocessing function over the entire dataset, use the 🌍 Datasets with_transform method. The transforms are applied on the fly when you load an element of the dataset:

>>> food = food.with_transform(transforms)

Now create a batch of examples using DefaultDataCollator. Unlike other data collators in 🌍 Transformers, the DefaultDataCollator does not apply additional preprocessing such as padding.

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()
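
As an optional sanity check (a minimal sketch, not part of the recipe), you can collate a couple of transformed examples and confirm the batch shape:

>>> batch = data_collator([food["train"][i] for i in range(2)])
>>> batch["pixel_values"].shape  # (batch_size, channels, height, width)
torch.Size([2, 3, 224, 224])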

TensorFlow

To avoid overfitting and to make the model more robust, add some data augmentation to the training part of the dataset. Here we use Keras preprocessing layers to define the transformations for the training data (including data augmentation) and the transformations for the validation data (only center cropping, resizing and normalizing). You can use tf.image or any other library you prefer.

>>> from tensorflow import keras
>>> from tensorflow.keras import layers

>>> size = (image_processor.size["height"], image_processor.size["width"])

>>> train_data_augmentation = keras.Sequential(
...     [
...         layers.RandomCrop(size[0], size[1]),
...         layers.Rescaling(scale=1.0 / 127.5, offset=-1),
...         layers.RandomFlip("horizontal"),
...         layers.RandomRotation(factor=0.02),
...         layers.RandomZoom(height_factor=0.2, width_factor=0.2),
...     ],
...     name="train_data_augmentation",
... )

>>> val_data_augmentation = keras.Sequential(
...     [
...         layers.CenterCrop(size[0], size[1]),
...         layers.Rescaling(scale=1.0 / 127.5, offset=-1),
...     ],
...     name="val_data_augmentation",
... )

Next, create functions to apply appropriate transformations to a batch of images, instead of one image at a time.

>>> import numpy as np
>>> import tensorflow as tf
>>> from PIL import Image


>>> def convert_to_tf_tensor(image: Image):
...     np_image = np.array(image)
...     tf_image = tf.convert_to_tensor(np_image)
...     # `expand_dims()` is used to add a batch dimension since
...     # the TF augmentation layers operate on batched inputs.
...     return tf.expand_dims(tf_image, 0)


>>> def preprocess_train(example_batch):
...     """Apply train_transforms across a batch."""
...     images = [
...         train_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"]
...     ]
...     example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images]
...     return example_batch


>>> def preprocess_val(example_batch):
...     """Apply val_transforms across a batch."""
...     images = [
...         val_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"]
...     ]
...     example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images]
...     return example_batch

Use the 🌍 Datasets set_transform method to apply the transformations on the fly:

food["train"].set_transform(preprocess_train)
food["test"].set_transform(preprocess_val)

As a final preprocessing step, create a batch of examples using DefaultDataCollator. Unlike other data collators in 🌍 Transformers, the DefaultDataCollator does not apply additional preprocessing, such as padding.

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator(return_tensors="tf")

Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🌍 Evaluate library. For this task, load the accuracy metric (see the 🌍 Evaluate quick tour to learn more about how to load and compute a metric):

>>> import evaluate

>>> accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to accuracy.compute to calculate the accuracy:

>>> import numpy as np


>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)
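
To convince yourself the function behaves as expected, you can call it on a couple of hand-made logits for an illustrative 3-class problem:

>>> dummy_logits = np.array([[0.1, 0.8, 0.1], [0.9, 0.05, 0.05]])  # argmax picks classes 1 and 0
>>> compute_metrics((dummy_logits, np.array([1, 0])))
{'accuracy': 1.0}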

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.

Train

Pytorch

If you aren't familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial first!

You're ready to start training your model now! Load ViT with AutoModelForImageClassification. Specify the number of expected labels along with the label mappings:

>>> from transformers import AutoModelForImageClassification, TrainingArguments, Trainer

>>> model = AutoModelForImageClassification.from_pretrained(
...     checkpoint,
...     num_labels=len(labels),
...     id2label=id2label,
...     label2id=label2id,
... )

At this point, only three steps remain:

  1. Define your training hyperparameters in TrainingArguments. It is important you don't remove unused columns because that'll drop the image column. Without the image column, you can't create pixel_values. Set remove_unused_columns=False to prevent this behavior! The only other required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to BOINC AI to upload your model). At the end of each epoch, the Trainer will evaluate the accuracy and save the training checkpoint.

  2. Pass the training arguments to Trainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.

  3. Call train() to fine-tune your model.

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_food_model",
...     remove_unused_columns=False,
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=5e-5,
...     per_device_train_batch_size=16,
...     gradient_accumulation_steps=4,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     warmup_ratio=0.1,
...     logging_steps=10,
...     load_best_model_at_end=True,
...     metric_for_best_model="accuracy",
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     data_collator=data_collator,
...     train_dataset=food["train"],
...     eval_dataset=food["test"],
...     tokenizer=image_processor,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

TensorFlow

If you are unfamiliar with fine-tuning a model with Keras, check out the basic tutorial first!

To fine-tune a model in TensorFlow, follow these steps:

  1. Define the training hyperparameters, and set up an optimizer and a learning rate schedule.

  2. Instantiate a pre-trained model.

  3. Convert a 🌍 Dataset to a tf.data.Dataset.

  4. Compile your model.

  5. Add callbacks and use the fit() method to run the training.

  6. Upload your model to the 🌍 Hub to share with the community.

Start by defining the hyperparameters, optimizer and learning rate schedule:

>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_epochs = 5
>>> num_train_steps = len(food["train"]) * num_epochs
>>> learning_rate = 3e-5
>>> weight_decay_rate = 0.01

>>> optimizer, lr_schedule = create_optimizer(
...     init_lr=learning_rate,
...     num_train_steps=num_train_steps,
...     weight_decay_rate=weight_decay_rate,
...     num_warmup_steps=0,
... )

Then, load ViT with TFAutoModelForImageClassification along with the label mappings:

>>> from transformers import TFAutoModelForImageClassification

>>> model = TFAutoModelForImageClassification.from_pretrained(
...     checkpoint,
...     id2label=id2label,
...     label2id=label2id,
... )

Convert your datasets to the tf.data.Dataset format using to_tf_dataset and your data_collator:

>>> # converting our train dataset to tf.data.Dataset
>>> tf_train_dataset = food["train"].to_tf_dataset(
...     columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator
... )

>>> # converting our test dataset to tf.data.Dataset
>>> tf_eval_dataset = food["test"].to_tf_dataset(
...     columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator
... )

Configure the model for training with compile():

>>> from tensorflow.keras.losses import SparseCategoricalCrossentropy

>>> loss = SparseCategoricalCrossentropy(from_logits=True)
>>> model.compile(optimizer=optimizer, loss=loss)

To compute the accuracy from the predictions and push your model to the 🌍 Hub, use Keras callbacks. Pass your compute_metrics function to KerasMetricCallback, and use PushToHubCallback to upload the model:

>>> from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_eval_dataset)
>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="food_classifier",
...     tokenizer=image_processor,
...     save_strategy="no",
... )
>>> callbacks = [metric_callback, push_to_hub_callback]

Finally, you are ready to train your model! Call fit() with your training and validation datasets, the number of epochs, and your callbacks to fine-tune the model:

>>> model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_epochs, callbacks=callbacks)
Epoch 1/5
250/250 [==============================] - 313s 1s/step - loss: 2.5623 - val_loss: 1.4161 - accuracy: 0.9290
Epoch 2/5
250/250 [==============================] - 265s 1s/step - loss: 0.9181 - val_loss: 0.6808 - accuracy: 0.9690
Epoch 3/5
250/250 [==============================] - 252s 1s/step - loss: 0.3910 - val_loss: 0.4303 - accuracy: 0.9820
Epoch 4/5
250/250 [==============================] - 251s 1s/step - loss: 0.2028 - val_loss: 0.3191 - accuracy: 0.9900
Epoch 5/5
250/250 [==============================] - 238s 949ms/step - loss: 0.1232 - val_loss: 0.3259 - accuracy: 0.9890

Congratulations! You have fine-tuned your model and shared it on the 🌍 Hub. You can now use it for inference!

For a more in-depth example of how to fine-tune a model for image classification, take a look at the corresponding PyTorch notebook.

Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Load an image you'd like to run inference on:

>>> ds = load_dataset("food101", split="validation[:10]")
>>> image = ds["image"][0]

The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Instantiate a pipeline for image classification with your model, and pass your image to it:

>>> from transformers import pipeline

>>> classifier = pipeline("image-classification", model="my_awesome_food_model")
>>> classifier(image)
[{'score': 0.31856709718704224, 'label': 'beignets'},
 {'score': 0.015232225880026817, 'label': 'bruschetta'},
 {'score': 0.01519392803311348, 'label': 'chicken_wings'},
 {'score': 0.013022331520915031, 'label': 'pork_chop'},
 {'score': 0.012728818692266941, 'label': 'prime_rib'}]
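
The pipeline also accepts an image file path or URL directly, so you can classify your own picture without opening it first (the filename below is hypothetical):

>>> classifier("path/to/my_image.jpg")  # returns the same list of score/label dicts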

You can also manually replicate the results of the pipeline if you'd like:

Pytorch

Load an image processor to preprocess the image and return the input as PyTorch tensors:

>>> from transformers import AutoImageProcessor
>>> import torch

>>> image_processor = AutoImageProcessor.from_pretrained("my_awesome_food_model")
>>> inputs = image_processor(image, return_tensors="pt")

Pass your inputs to the model and return the logits:

>>> from transformers import AutoModelForImageClassification

>>> model = AutoModelForImageClassification.from_pretrained("my_awesome_food_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits

Get the predicted label with the highest probability, and use the model's id2label mapping to convert it to a label:

>>> predicted_label = logits.argmax(-1).item()
>>> model.config.id2label[predicted_label]
'beignets'
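
If you also want the class probabilities the pipeline reported above, apply a softmax to the logits; a minimal sketch:

>>> probs = torch.softmax(logits, dim=-1)[0]
>>> top5 = torch.topk(probs, k=5)
>>> [(model.config.id2label[i.item()], round(p.item(), 4)) for p, i in zip(top5.values, top5.indices)]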

TensorFlow

Load an image processor to preprocess the image and return the input as TensorFlow tensors:

>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("MariaK/food_classifier")
>>> inputs = image_processor(image, return_tensors="tf")

Pass your inputs to the model and return the logits:

>>> from transformers import TFAutoModelForImageClassification

>>> model = TFAutoModelForImageClassification.from_pretrained("MariaK/food_classifier")
>>> logits = model(**inputs).logits

Get the predicted label with the highest probability, and use the model's id2label mapping to convert it to a label:

>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
>>> model.config.id2label[predicted_class_id]
'beignets'
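
As in the PyTorch example, a softmax over the logits recovers the class probabilities; a minimal sketch:

>>> probs = tf.nn.softmax(logits, axis=-1)[0]
>>> top5 = tf.math.top_k(probs, k=5)
>>> [(model.config.id2label[int(i)], float(p)) for p, i in zip(top5.values, top5.indices)]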
