Datasets
  • 🌍GET STARTED
    • Datasets
    • Quickstart
    • Installation
  • 🌍TUTORIALS
    • Overview
    • Load a dataset from the Hub
    • Know your dataset
    • Preprocess
    • Evaluate predictions
    • Create a data
    • Share a dataset to the Hub
  • 🌍HOW-TO GUIDES
    • Overview
    • 🌍GENERAL USAGE
      • Load
      • Process
      • Stream
      • Use with TensorFlow
      • Use with PyTorch
      • Use with JAX
      • Use with Spark
      • Cache management
      • Cloud storage
      • Search index
      • Metrics
      • Beam Datasets
    • 🌍AUDIO
      • Load audio data
      • Process audio data
      • Create an audio dataset
    • 🌍VISION
      • Load image data
      • Process image data
      • Create an image dataset
      • Depth estimation
      • Image classification
      • Semantic segmentation
      • Object detection
    • 🌍TEXT
      • Load text data
      • Process text data
    • 🌍TABULAR
      • Load tabular data
    • 🌍DATASET REPOSITORY
      • Share
      • Create a dataset card
      • Structure your repository
      • Create a dataset loading script
  • 🌍CONCEPTUAL GUIDES
    • Datasets with Arrow
    • The cache
    • Dataset or IterableDataset
    • Dataset features
    • Build and load
    • Batch mapping
    • All about metrics
  • 🌍REFERENCE
    • Main classes
    • Builder classes
    • Loading methods
    • Table Classes
    • Logging methods
    • Task templates
Powered by GitBook
On this page
  • Evaluate predictions
  • Load metric
  • Select a configuration
  • Metrics object
  • Compute metric
  1. TUTORIALS

Evaluate predictions

PreviousPreprocessNextCreate a data

Last updated 1 year ago

Evaluate predictions

Metrics is deprecated in 🌍 Datasets. To learn more about how to use metrics, take a look at the library 🌍 ! In addition to metrics, you can find more tools for evaluating models and datasets.

🌍 Datasets provides various common and NLP-specific for you to measure your models performance. In this section of the tutorials, you will load a metric and use it to evaluate your models predictions.

You can see what metrics are available with :

Copied

>>> from datasets import list_metrics
>>> metrics_list = list_metrics()
>>> len(metrics_list)
28
>>> print(metrics_list)
['accuracy', 'bertscore', 'bleu', 'bleurt', 'cer', 'comet', 'coval', 'cuad', 'f1', 'gleu', 'glue', 'indic_glue', 'matthews_correlation', 'meteor', 'pearsonr', 'precision', 'recall', 'rouge', 'sacrebleu', 'sari', 'seqeval', 'spearmanr', 'squad', 'squad_v2', 'super_glue', 'wer', 'wiki_split', 'xnli']

Load metric

It is very easy to load a metric with 🌍 Datasets. In fact, you will notice that it is very similar to loading a dataset! Load a metric from the Hub with :

Copied

>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc')

This will load the metric associated with the MRPC dataset from the GLUE benchmark.

Select a configuration

If you are using a benchmark dataset, you need to select a metric that is associated with the configuration you are using. Select a metric configuration by providing the configuration name:

Copied

>>> metric = load_metric('glue', 'mrpc')

Metrics object

Copied

>>> print(metric.inputs_description)
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:
    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}
    ...
    >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}
    ...

Compute metric

Copied

>>> model_predictions = model(model_inputs)
>>> final_score = metric.compute(predictions=model_predictions, references=gold_references)

Before you begin using a object, you should get to know it a little better. As with a dataset, you can return some basic information about a metric. For example, access the inputs_description parameter in to get more information about a metrics expected input format and some usage examples:

Notice for the MRPC configuration, the metric expects the input format to be zero or one. For a complete list of attributes you can return with your metric, take a look at .

Once you have loaded a metric, you are ready to use it to evaluate a models predictions. Provide the model predictions and references to :

🌍
Evaluate
metrics
list_metrics()
load_metric()
Metric
datasets.MetricInfo
MetricInfo
compute()