How-to: Create automatic metadata quality reports

Webhook guide: Set up an automatic metadata quality review for models and datasets

Webhooks are now publicly available!

This guide will walk you through creating a system that reacts to changes to a user’s or organization’s models or datasets on the Hub and creates a ‘metadata review’ for the changed repository.

What are we building and why?

Before we dive into the technical details involved in this particular workflow, we’ll quickly outline what we’re creating and why.

Model cards and dataset cards are essential tools for documenting machine learning models and datasets. The BOINC AI Hub uses a README.md file containing a YAML header block to generate model and dataset cards. This YAML section defines metadata relating to the model or dataset. For example:

---
language: 
  - "List of ISO 639-1 code for your language"
  - lang1
  - lang2
tags:
- tag1
- tag2
license: "any valid license identifier"
datasets:
- dataset1
---

This metadata contains essential information about your model or dataset for potential users. The license, for example, defines the terms under which a model or dataset can be used. Hub users can also use the fields defined in the YAML metadata as filters for identifying models or datasets that fit specific criteria.
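To make the header block concrete, here is a minimal sketch of how the YAML section could be separated from the rest of a README.md using only the standard library. The function name `split_front_matter` is hypothetical and not part of any Hub library; the sketch assumes the header is delimited by lines containing only `---`, as in the example above, and it returns the raw header text without parsing the YAML.

```python
def split_front_matter(readme_text: str) -> tuple[str, str]:
    """Split README text into (yaml_header, body).

    Hypothetical helper: returns an empty header if the text does not
    start with a '---' line or if the closing '---' is missing.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", readme_text
    # Look for the closing '---' delimiter after the opening one
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            header = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return header, body
    # No closing delimiter: treat the whole file as body
    return "", readme_text


readme = """---
license: mit
tags:
- tag1
---
# My model
"""
header, body = split_front_matter(readme)
print(header)
```

In practice the card-loading methods introduced in the next section handle this step for you; the sketch only illustrates where the metadata lives in the file.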

Since the metadata defined in this block is essential for potential users of our models and datasets, it is important that we complete this section. In a team or organization setting, users pushing models and datasets to the Hub may have differing familiarity with the importance of this YAML metadata block. While someone in a team could take on the responsibility of reviewing this metadata, we can instead automate part of this work. The result will be a metadata review report that is automatically posted or updated when a repository on the Hub changes. The system works much like CI/CD, applied to metadata quality.

You can also find an example review here.

Using the Hub Client Library to create a model review card

boincai_hub is a Python library that allows you to interact with the Hub. We can use this library to download model and dataset cards from the Hub using the DatasetCard.load or ModelCard.load methods. In particular, we’ll use these methods to load a Python dictionary containing the metadata defined in the YAML of our model or dataset card. We’ll create a small Python function to wrap these methods and do some exception handling.

from boincai_hub import DatasetCard, ModelCard
from boincai_hub.utils import EntryNotFoundError


def load_repo_card_metadata(repo_type, repo_name):
    """Return the card's YAML metadata as a dict, or {} if the card is not found."""
    if repo_type == "dataset":
        try:
            return DatasetCard.load(repo_name).data.to_dict()
        except EntryNotFoundError:
            # The card could not be found, so there is no metadata to review
            return {}
    if repo_type == "model":
        try:
            return ModelCard.load(repo_name).data.to_dict()
        except EntryNotFoundError:
            return {}

This function will return a Python dictionary containing the metadata associated with the repository (or an empty dictionary if there is no metadata).

{'license': 'afl-3.0'}

Creating our metadata review report

Once we have a Python dictionary containing the metadata associated with a repository, we’ll create a ‘report card’ for our metadata review. In this particular instance, we’ll review our metadata by defining some metadata fields for which we want values. For example, we may want to ensure that the license field has always been completed. To rate our metadata, we’ll count how many of our desired fields are present and return a percentage score: the share of desired metadata fields that have a value.

Since we have a Python dictionary containing our metadata, we can loop through this dictionary to check if our desired keys are there. If a desired metadata field (a key in our dictionary) is missing, we’ll assign the value as None.

def create_metadata_key_dict(card_data, repo_type: str):
    shared_keys = ["tags", "license"]
    if repo_type == "model":
        model_keys = ["library_name", "datasets", "metrics", "co2", "pipeline_tag"]
        shared_keys.extend(model_keys)
        keys = shared_keys
        return {key: card_data.get(key) for key in keys}
    if repo_type == "dataset":
        # [...]

This function will return a dictionary containing keys representing the metadata fields we require for our model or dataset. The dictionary values will either include the metadata entered for that field or None if that metadata field is missing in the YAML.

{'tags': None,
 'license': 'afl-3.0',
 'library_name': None,
 'datasets': None,
 'metrics': None,
 'co2': None,
 'pipeline_tag': None}
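The percentage score described earlier can be sketched directly from a dictionary like this one. The following is a minimal illustration, not the code used by the guide: the function name `coverage_percentage` is hypothetical, and it assumes a field counts as present whenever its value is not None.

```python
def coverage_percentage(desired_metadata: dict) -> float:
    """Return the percentage of desired fields that have a value.

    Hypothetical helper: a field is treated as present when its value
    is not None. An empty dictionary scores 0.0.
    """
    if not desired_metadata:
        return 0.0
    present = sum(1 for value in desired_metadata.values() if value is not None)
    return round(present / len(desired_metadata) * 100, 2)


example = {
    "tags": None,
    "license": "afl-3.0",
    "library_name": None,
    "datasets": None,
    "metrics": None,
    "co2": None,
    "pipeline_tag": None,
}
print(coverage_percentage(example))  # 1 of 7 fields present -> 14.29
```

You may prefer a stricter definition, for example treating empty lists or empty strings as missing as well.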

Once we have this dictionary, we can create our metadata report. In the interest of brevity, we won’t include the complete code here, but the BOINC AI Spaces repository for this Webhook contains the full code.

First, we define a function that renders the metadata coverage dictionary as a markdown table, a more readable version of the same data.

def create_metadata_breakdown_table(desired_metadata_dictionary):
    # [...]
    return tabulate(
        table_data, tablefmt="github", headers=("Metadata Field", "Provided Value")
    )
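The body of this function is elided above. As a rough idea of what such a table looks like, here is a sketch that builds a GitHub-style markdown table with the standard library only, so it does not reproduce tabulate's exact output. The function name `build_breakdown_table` and the "Field Missing" placeholder are assumptions made for illustration.

```python
def build_breakdown_table(desired_metadata: dict) -> str:
    """Render metadata fields and their values as a GitHub-style markdown table.

    Hypothetical stand-in for the elided function; missing fields are
    shown with a 'Field Missing' placeholder (an assumption).
    """
    rows = ["| Metadata Field | Provided Value |", "| --- | --- |"]
    for field, value in desired_metadata.items():
        # Show missing fields explicitly so gaps are easy to spot
        shown = "Field Missing" if value is None else str(value)
        # Escape pipes so a value cannot break the table layout
        shown = shown.replace("|", "\\|")
        rows.append(f"| {field} | {shown} |")
    return "\n".join(rows)


table = build_breakdown_table({"tags": None, "license": "afl-3.0"})
print(table)
```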

We also have a Python function that generates a score (representing the percentage of the desired metadata fields present):

def calculate_grade(desired_metadata_dictionary):
    # [...]
    return round(score, 2)

Finally, a Python function creates a markdown report for our metadata review. This report contains both the score and the metadata table, along with some explanation of what the report contains.

def create_markdown_report(
    desired_metadata_dictionary, repo_name, repo_type, score, update: bool = False
):
    # [...]
    return report

How to post the review automatically?

We now have a markdown-formatted metadata review report. We’ll use the boincai_hub library to post this review. We define a function that takes the Webhook data received from the Hub, parses the data, and creates the metadata report. Depending on whether a report has previously been created, the function either creates a new report or posts an update to the existing metadata review thread.

def create_or_update_report(data):
    if parsed_post := parse_webhook_post(data):
        repo_type, repo_name = parsed_post
    else:
        return Response("Unable to parse webhook data", status_code=400)
    # [...]
    return True

`:=` is the assignment expression operator, added to Python in version 3.8 and colloquially known as the walrus operator. It assigns a value to a name and returns that value in a single expression. People have mixed opinions on this syntax; the code behaves the same if you write the assignment and the check as two separate statements. You can read more about this operator in this [Real Python article](https://realpython.com/python-walrus-operator/).
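The parse_webhook_post helper is not shown in this guide. The sketch below illustrates one way it could work and shows the walrus form next to its two-statement equivalent. The payload shape, a `repo` object carrying `type` and `name` keys, is an assumption made for illustration; check the Webhooks documentation for the actual payload before relying on it.

```python
from typing import Optional, Tuple


def parse_webhook_post(data: dict) -> Optional[Tuple[str, str]]:
    """Return (repo_type, repo_name), or None if the payload lacks them.

    Assumes a payload shaped like {"repo": {"type": ..., "name": ...}};
    this shape is an assumption, not taken from the guide.
    """
    if not isinstance(data, dict):
        return None
    repo = data.get("repo")
    if not isinstance(repo, dict):
        return None
    repo_type, repo_name = repo.get("type"), repo.get("name")
    # Only models and datasets are reviewed in this guide
    if repo_type not in ("model", "dataset") or not repo_name:
        return None
    return repo_type, repo_name


# Hypothetical example payload
payload = {"event": {"action": "update"}, "repo": {"type": "model", "name": "user/my-model"}}

# With the walrus operator: assign and test in one expression
if parsed_post := parse_webhook_post(payload):
    repo_type, repo_name = parsed_post

# Equivalent form without it: assign first, then test
parsed_post = parse_webhook_post(payload)
if parsed_post:
    repo_type, repo_name = parsed_post
```

Returning None for anything unexpected keeps the caller simple: a single truthiness check decides whether to continue or to answer with a 400 response.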

Creating a Webhook to respond to changes on the Hub

We’ve now got the core functionality for creating a metadata review report for a model or dataset. The next step is to use Webhooks to respond to changes automatically.

Create a Webhook in your user profile

First, create your Webhook by going to https://boincai.com/settings/webhooks.

Input a few target repositories that your Webhook will listen to (you will likely want to limit this to your own repositories or the repositories of the organization you belong to).
Input a secret to make your Webhook more secure (if you don’t know what to choose for this, you may want to use a password generator to generate a sufficiently long random string for your secret).
We can pass a dummy URL for the Webhook URL parameter for now.
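If you would rather generate the secret locally than use a password generator, Python's standard `secrets` module can produce a suitable random string. This is a small sketch; the length of 32 bytes is an arbitrary choice, not a requirement stated by the Hub.

```python
import secrets

# 32 random bytes, URL-safe Base64 encoded (roughly 43 characters)
webhook_secret = secrets.token_urlsafe(32)
print(webhook_secret)
```

Copy the printed value into the Webhook's secret field and keep it somewhere safe, since the same value is needed later when deploying the listener.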

Your Webhook will look like this:

Create a new Bot user profile

This guide creates a separate user account that will post the metadata reviews.

When creating a bot that will interact with other users on the Hub, we ask that you clearly label the account as a "Bot" (see profile screenshot).

Create a Webhook listener

We now need some way of listening to Webhook events. There are many possible tools you can use to listen to Webhook events. Many existing services, such as Zapier and IFTTT, can use Webhooks to trigger actions (for example, they could post a tweet every time a model is updated). In this case, we’ll implement our Webhook listener using FastAPI.

FastAPI is a Python web framework. We’ll use FastAPI to create a Webhook listener. In particular, we need to implement a route that accepts POST requests on /webhook. For authentication, we’ll compare the X-Webhook-Secret header with a WEBHOOK_SECRET secret that can be passed to our Docker container at runtime.

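import os

from fastapi import FastAPI, Request, Response

# Shared secret, set in the Webhook settings and passed to the container at runtime
KEY = os.environ.get("WEBHOOK_SECRET")

app = FastAPI()


@app.post("/webhook")
async def webhook(request: Request):
    # Reject the request if no secret is configured or the header does not match
    if not KEY or request.headers.get("X-Webhook-Secret") != KEY:
        return Response("Invalid secret", status_code=401)
    data = await request.json()
    # create_or_update_report returns True on success or an error Response
    result = create_or_update_report(data)
    if isinstance(result, Response):
        return result
    return "Webhook received!"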

The above function receives Webhook events and creates or updates the metadata review report for the changed repository.
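When comparing secrets, a constant-time comparison is a common hardening step, because an ordinary string comparison can return sooner the earlier the two values differ. The sketch below uses `hmac.compare_digest` from the standard library; the helper name `is_valid_secret` is hypothetical, and it could stand in for the inequality check on the X-Webhook-Secret header in the listener.

```python
import hmac
from typing import Optional


def is_valid_secret(header_value: Optional[str], expected: Optional[str]) -> bool:
    """Return True only if both values are set and match.

    Hypothetical helper: rejects missing values outright so that an
    unset secret never matches a missing header.
    """
    if not header_value or not expected:
        return False
    # compare_digest is designed to avoid leaking where the values differ
    return hmac.compare_digest(header_value.encode(), expected.encode())


print(is_valid_secret("s3cret", "s3cret"))
print(is_valid_secret(None, "s3cret"))
```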

Use Spaces to deploy our Webhook app

Our main.py file contains all the code we need for our Webhook app. To deploy it, we’ll use a Space.

For our Space, we’ll use Docker to run our app. The Dockerfile copies our app file, installs the required dependencies, and runs the application. To populate the KEY variable, we’ll also set a WEBHOOK_SECRET secret for our Space with the secret we generated earlier. You can read more about Docker Spaces here.

Finally, we need to update the URL in our Webhook settings to the URL of our Space. We can get our Space’s “direct URL” from the contextual menu. Click on “Embed this Space” and copy the “Direct URL”.

Once we have this URL, we can pass this to the Webhook URL parameter in our Webhook settings. Our bot should now start posting reviews when monitored repositories change!

Conclusion and next steps

We now have an automatic metadata review bot! Here are some ideas for how you could build on this guide:

The metadata review done by our bot was relatively crude; you could add more complex rules for reviewing metadata.
You could use the full README.md file for doing the review.
You may want to define ‘rules’ which are particularly important for your organization and use a webhook to check these are followed.
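As one possible starting point for organization-specific rules, each rule can be a small function that inspects the metadata dictionary and returns a message when the rule is broken. Everything in this sketch is hypothetical, including the approved license list and the rule names; it also assumes the license field holds a single string.

```python
from typing import Callable, List, Optional

# Hypothetical policy: licenses this organization allows
ALLOWED_LICENSES = {"apache-2.0", "mit", "afl-3.0"}


def check_license(metadata: dict) -> Optional[str]:
    """Return a failure message if the license is missing or not approved."""
    license_id = metadata.get("license")
    if license_id is None:
        return "license is missing"
    if license_id not in ALLOWED_LICENSES:
        return f"license '{license_id}' is not on the approved list"
    return None


def check_tags(metadata: dict) -> Optional[str]:
    """Return a failure message if no tags are provided."""
    if not metadata.get("tags"):
        return "at least one tag is required"
    return None


RULES: List[Callable[[dict], Optional[str]]] = [check_license, check_tags]


def run_rules(metadata: dict) -> List[str]:
    """Run every rule and collect the failure messages."""
    return [message for rule in RULES if (message := rule(metadata)) is not None]


print(run_rules({"license": "afl-3.0"}))
```

The list of messages could then be appended to the review report alongside the coverage score.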

If you build a metadata quality app using Webhooks, please tag me @davanstrien; I would love to know about it!
