How-to: Create automatic metadata quality reports
Webhook guide: Set up an automatic metadata quality review for models and datasets
Webhooks are now publicly available!
This guide will walk you through creating a system that reacts to changes to a user’s or organization’s models or datasets on the Hub and creates a ‘metadata review’ for the changed repository.
What are we building and why?
Before we dive into the technical details involved in this particular workflow, we’ll quickly outline what we’re creating and why.
Model cards and dataset cards are essential tools for documenting machine learning models and datasets. The BOINC AI Hub uses a README.md file containing a YAML header block to generate model and dataset cards. This YAML section defines metadata relating to the model or dataset. For example:
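As an illustration of what such a header looks like, the sketch below holds a hypothetical README.md as a Python string and pulls out its YAML header. The field values are invented for this example, and the small parser only handles flat `key: value` lines; it is not a general YAML parser, and a real application would use a YAML library or the Hub client library instead.

```python
# A hypothetical README.md; the field values are invented for illustration.
README = """---
license: mit
language: en
pipeline_tag: text-classification
---

# My model

A short description of the model.
"""


def extract_yaml_header(readme: str) -> str:
    """Return the text between the opening and closing '---' lines."""
    lines = readme.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:index])
    return ""  # no closing delimiter, so no valid header


def parse_flat_header(header: str) -> dict:
    """Parse flat 'key: value' lines only (not a general YAML parser)."""
    metadata = {}
    for line in header.splitlines():
        key, separator, value = line.partition(":")
        if separator:
            metadata[key.strip()] = value.strip()
    return metadata


metadata = parse_flat_header(extract_yaml_header(README))
print(metadata)
```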
This metadata contains essential information about your model or dataset for potential users. The license, for example, defines the terms under which a model or dataset can be used. Hub users can also use the fields defined in the YAML metadata as filters for identifying models or datasets that fit specific criteria.
Since the metadata defined in this block is essential for potential users of our models and datasets, it is important that we complete this section. In a team or organization setting, users pushing models and datasets to the Hub may differ in how familiar they are with this YAML metadata block and why it matters. Someone on the team could take on the responsibility of reviewing this metadata, but we can also automate part of the work. The result will be a metadata review report that is automatically posted or updated when a repository on the Hub changes. For metadata quality, this system works much like CI/CD does for code.
You can also find an example review here.
Using the Hub Client Library to create a model review card
boincai_hub is a Python library that allows you to interact with the Hub. We can use this library to download model and dataset cards from the Hub using the DatasetCard.load or ModelCard.load methods. In particular, we’ll use these methods to load a Python dictionary containing the metadata defined in the YAML of our model or dataset card. We’ll create a small Python function to wrap these methods and do some exception handling.
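A minimal sketch of such a wrapper is shown below. To keep it runnable without network access, the card loaders are passed in as a parameter rather than imported; in the real app they would be ModelCard.load and DatasetCard.load. The sketch also assumes that a loaded card exposes its metadata as `card.data.to_dict()`. The `FakeCard` classes and the `loaders` parameter are hypothetical stand-ins introduced only for this example.

```python
from typing import Callable, Dict


def load_repo_card_metadata(
    repo_type: str,
    repo_id: str,
    loaders: Dict[str, Callable[[str], object]],
) -> dict:
    """Return the card metadata for a repo as a dict, or {} if unavailable.

    `loaders` maps a repo type ("model" or "dataset") to a function that
    loads a card. Passing them in is an assumption of this sketch so that
    it can run offline.
    """
    loader = loaders.get(repo_type)
    if loader is None:
        return {}
    try:
        card = loader(repo_id)
        # Assumption: the card keeps its YAML metadata in `card.data`.
        return card.data.to_dict()
    except Exception:
        # A missing README, invalid YAML, or a network error all count
        # as "no metadata" for the purposes of the review.
        return {}


# Hypothetical stand-ins for real card objects, used only for the demo.
class FakeCardData:
    def __init__(self, values: dict):
        self._values = values

    def to_dict(self) -> dict:
        return dict(self._values)


class FakeCard:
    def __init__(self, values: dict):
        self.data = FakeCardData(values)


def fake_model_loader(repo_id: str) -> FakeCard:
    if repo_id == "user/missing":
        raise FileNotFoundError(repo_id)
    return FakeCard({"license": "mit"})


loaders = {"model": fake_model_loader}
print(load_repo_card_metadata("model", "user/repo", loaders))
print(load_repo_card_metadata("model", "user/missing", loaders))
```

Catching every exception is a deliberately blunt choice here: for a review bot, any failure to read the card is treated the same as a card without metadata.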
This function will return a Python dictionary containing the metadata associated with the repository (or an empty dictionary if there is no metadata).
Creating our metadata review report
Once we have a Python dictionary containing the metadata associated with a repository, we’ll create a ‘report card’ for our metadata review. In this particular instance, we’ll review our metadata by defining some metadata fields for which we want values. For example, we may want to ensure that the license field has always been completed. To rate our metadata, we’ll count how many of our desired fields are present and return a percentage score reflecting that coverage.
Since we have a Python dictionary containing our metadata, we can loop through this dictionary to check if our desired keys are there. If a desired metadata field (a key in our dictionary) is missing, we’ll assign the value None.
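The check described above can be sketched as follows. The function name and the list of desired keys are assumptions chosen for this example, not necessarily the ones used in the full app.

```python
from typing import Any, Dict, List, Optional


def create_metadata_key_dict(
    card_data: Dict[str, Any], desired_keys: List[str]
) -> Dict[str, Optional[Any]]:
    """Map each desired key to its value in the card data, or None if absent."""
    coverage: Dict[str, Optional[Any]] = {}
    for key in desired_keys:
        # dict.get returns None when the key is missing from the metadata.
        coverage[key] = card_data.get(key)
    return coverage


# Hypothetical list of fields we want every repository to fill in.
DESIRED_KEYS = ["license", "language", "tags"]

card_data = {"license": "mit", "tags": ["text-classification"]}
coverage = create_metadata_key_dict(card_data, DESIRED_KEYS)
print(coverage)
# {'license': 'mit', 'language': None, 'tags': ['text-classification']}
```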
This function will return a dictionary containing keys representing the metadata fields we require for our model or dataset. The dictionary values will either include the metadata entered for that field or None if that metadata field is missing in the YAML.
Once we have this dictionary, we can create our metadata report. In the interest of brevity, we won’t include the complete code here, but the BOINC AI Spaces repository for this Webhook contains the full code.
We create one function that builds a markdown table, a more readable version of the data we have in our metadata coverage dictionary.
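One way to build such a table with only the standard library is sketched below. The column headings and the "Field missing" label are choices made for this example; the full app may format its table differently.

```python
def create_metadata_table(coverage: dict) -> str:
    """Render the coverage dictionary as a two-column markdown table."""
    rows = ["| Metadata field | Value |", "| --- | --- |"]
    for field, value in coverage.items():
        shown = "Field missing" if value is None else str(value)
        # Escape pipes so a value cannot break the table layout.
        shown = shown.replace("|", "\\|")
        rows.append(f"| {field} | {shown} |")
    return "\n".join(rows)


table = create_metadata_table({"license": "mit", "language": None})
print(table)
```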
We also have a Python function that generates a score representing the percentage of the desired metadata fields present.
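A minimal sketch of the scoring step, assuming the coverage dictionary maps each desired field to its value or to None when the field is missing. The function name and the rounding to two decimal places are choices made for this example.

```python
def calculate_score(coverage: dict) -> float:
    """Return the percentage of desired fields that have a value."""
    if not coverage:
        return 0.0  # avoid dividing by zero when no fields are required
    present = sum(value is not None for value in coverage.values())
    return round(present / len(coverage) * 100, 2)


score = calculate_score(
    {"license": "mit", "language": None, "tags": ["nlp"], "datasets": None}
)
print(score)  # 50.0
```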
A further Python function creates a markdown report for our metadata review. This report contains both the score and metadata table, along with some explanation of what the report contains.
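The report assembly might look roughly like the sketch below. It takes an already computed score and table as inputs so that it stands on its own; the headings and wording of the report are invented for this example.

```python
def create_markdown_report(repo_id: str, score: float, table: str) -> str:
    """Combine the score, the table, and a short explanation into one report."""
    sections = [
        f"# Metadata review for {repo_id}",
        "This is an automatically generated review of the metadata "
        "defined in the YAML header of this repository's card.",
        f"Metadata score: {score:.0f}% of the desired fields are filled in.",
        "## Field breakdown",
        table,
    ]
    # Blank lines between sections keep the markdown rendering predictable.
    return "\n\n".join(sections)


example_table = "| Metadata field | Value |\n| --- | --- |\n| license | mit |"
report = create_markdown_report("user/repo", 50.0, example_table)
print(report)
```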
How to post the review automatically?
We now have a markdown-formatted metadata review report. We’ll use the boincai_hub library to post this review. We define a function that takes the Webhook data received from the Hub, parses the data, and creates the metadata report. Depending on whether a report has previously been created, the function either creates a new report or posts a new comment to the existing metadata review thread.
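The create-or-update logic can be sketched as follows. Several things here are assumptions made so the example runs on its own: the payload is assumed to carry the repository under `data["repo"]` with `type` and `name` keys, the report is passed in ready-made rather than built inside the function, and `client` is a hypothetical stand-in for the Hub API whose three methods are not a documented interface. The thread title and bot username are also invented.

```python
from typing import Optional, Tuple

REVIEW_TITLE = "Metadata Quality Review"  # hypothetical thread title
BOT_USERNAME = "metadata-review-bot"  # hypothetical bot account name


def parse_webhook_post(data: dict) -> Optional[Tuple[str, str]]:
    """Return (repo_type, repo_id), or None if the payload is not usable."""
    repo = data.get("repo") or {}
    repo_type, repo_id = repo.get("type"), repo.get("name")
    if repo_type not in ("model", "dataset") or not repo_id:
        return None
    return repo_type, repo_id


def create_or_update_report(data: dict, report: str, client) -> Optional[str]:
    """Post `report`; return "created", "commented", or None if unparsable."""
    if parsed_post := parse_webhook_post(data):
        repo_type, repo_id = parsed_post
    else:
        return None
    for discussion in client.get_repo_discussions(repo_id, repo_type):
        if discussion["title"] == REVIEW_TITLE and discussion["author"] == BOT_USERNAME:
            # A review thread already exists, so add the new report to it.
            client.comment_discussion(repo_id, discussion["num"], report, repo_type)
            return "commented"
    client.create_discussion(repo_id, REVIEW_TITLE, report, repo_type)
    return "created"


class InMemoryClient:
    """Hypothetical in-memory replacement for the Hub API, for the demo only."""

    def __init__(self):
        self.discussions = {}

    def get_repo_discussions(self, repo_id, repo_type):
        return list(self.discussions.get(repo_id, []))

    def create_discussion(self, repo_id, title, description, repo_type):
        threads = self.discussions.setdefault(repo_id, [])
        threads.append(
            {"num": len(threads) + 1, "title": title,
             "author": BOT_USERNAME, "comments": [description]}
        )

    def comment_discussion(self, repo_id, discussion_num, comment, repo_type):
        for thread in self.discussions[repo_id]:
            if thread["num"] == discussion_num:
                thread["comments"].append(comment)


client = InMemoryClient()
payload = {"repo": {"type": "model", "name": "user/repo"}}
print(create_or_update_report(payload, "first report", client))   # created
print(create_or_update_report(payload, "second report", client))  # commented
```

Matching on both the thread title and the bot's username keeps the bot from commenting on a thread that a human happened to open under the same title.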
`:=` is the Python syntax for an assignment expression, added to the language in version 3.8 and colloquially known as the walrus operator. People have mixed opinions on this syntax, and the code behaves the same if you write it without the operator. You can read more about this operator in this [Real Python article](https://realpython.com/python-walrus-operator/).
Creating a Webhook to respond to changes on the Hub
We’ve now got the core functionality for creating a metadata review report for a model or dataset. The next step is to use Webhooks to respond to changes automatically.
Create a Webhook in your user profile
First, create your Webhook by going to https://boincai.com/settings/webhooks.
Input a few target repositories that your Webhook will listen to (you will likely want to limit this to your own repositories or the repositories of the organization you belong to).
Input a secret to make your Webhook more secure (if you don’t know what to choose for this, you may want to use a password generator to generate a sufficiently long random string for your secret).
We can pass a dummy URL for the Webhook URL parameter for now.
Your Webhook will look like this:
Create a new Bot user profile
This guide creates a separate user account that will post the metadata reviews.
When creating a bot that will interact with other users on the Hub, we ask that you clearly label the account as a "Bot" (see profile screenshot).
Create a Webhook listener
We now need some way of listening to Webhook events. There are many possible tools you can use to listen to Webhook events. Many existing services, such as Zapier and IFTTT, can use Webhooks to trigger actions (for example, they could post a tweet every time a model is updated). In this case, we’ll implement our Webhook listener using FastAPI.
FastAPI is a Python web framework. We’ll use FastAPI to create a Webhook listener. In particular, we need to implement a route that accepts POST requests on /webhook. For authentication, we’ll compare the X-Webhook-Secret header with a WEBHOOK_SECRET secret that can be passed to our Docker container at runtime.
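The sketch below shows the request-handling logic on its own, without the web framework, so that it runs with only the standard library. It is not the FastAPI route itself: in the real app, a FastAPI route on /webhook would perform these checks on the incoming request. The function names, status messages, and the `on_event` callback are assumptions introduced for this example.

```python
import hmac
from typing import Callable, Mapping, Tuple


def is_authorized(headers: Mapping[str, str], expected_secret: str) -> bool:
    """Compare the X-Webhook-Secret header with the configured secret."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}
    received = normalised.get("x-webhook-secret", "")
    if not expected_secret:
        return False  # refuse everything if no secret has been configured
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(received.encode(), expected_secret.encode())


def handle_webhook(
    method: str,
    path: str,
    headers: Mapping[str, str],
    payload: dict,
    expected_secret: str,
    on_event: Callable[[dict], None],
) -> Tuple[int, str]:
    """Return an (HTTP status code, message) pair for an incoming request."""
    if path != "/webhook":
        return 404, "Not found"
    if method != "POST":
        return 405, "Method not allowed"
    if not is_authorized(headers, expected_secret):
        return 401, "Invalid or missing secret"
    on_event(payload)  # e.g. create or update the metadata review
    return 200, "Webhook received"


received_events = []
status = handle_webhook(
    "POST",
    "/webhook",
    {"X-Webhook-Secret": "s3cret"},
    {"repo": {"type": "model", "name": "user/repo"}},
    "s3cret",
    received_events.append,
)
print(status)  # (200, 'Webhook received')
```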
The above function receives Webhook events and creates or updates the metadata review report for the changed repository.
Use Spaces to deploy our Webhook app
Our main.py file contains all the code we need for our Webhook app. To deploy it, we’ll use a Space.
For our Space, we’ll use Docker to run our app. The Dockerfile copies our app file, installs the required dependencies, and runs the application. To populate the KEY variable, we’ll also set a WEBHOOK_SECRET secret for our Space with the secret we generated earlier. You can read more about Docker Spaces here.
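Assuming the secret reaches the container as an environment variable named WEBHOOK_SECRET, populating KEY could look like the sketch below. The helper name and the decision to fail at startup when the variable is missing are choices made for this example; the demo passes a plain dictionary in place of the real environment.

```python
import os
from typing import Mapping


def load_webhook_secret(environ: Mapping[str, str] = os.environ) -> str:
    """Read WEBHOOK_SECRET, failing loudly if it has not been set."""
    secret = environ.get("WEBHOOK_SECRET", "").strip()
    if not secret:
        raise RuntimeError(
            "WEBHOOK_SECRET is not set; add it as a secret in the Space settings."
        )
    return secret


# In the app this would be `KEY = load_webhook_secret()`.
KEY = load_webhook_secret({"WEBHOOK_SECRET": "example-secret"})
print(len(KEY))
```

Failing at startup makes a misconfigured Space visible immediately, instead of silently rejecting every Webhook request later.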
Finally, we need to update the URL in our Webhook settings to the URL of our Space. We can get our Space’s “direct URL” from the contextual menu. Click on “Embed this Space” and copy the “Direct URL”.
Once we have this URL, we can pass this to the Webhook URL parameter in our Webhook settings. Our bot should now start posting reviews when monitored repositories change!
Conclusion and next steps
We now have an automatic metadata review bot! Here are some ideas for how you could build on this guide:
The metadata review done by our bot was relatively crude; you could add more complex rules for reviewing metadata.
You could use the full README.md file for doing the review.
You may want to define ‘rules’ that are particularly important for your organization and use a Webhook to check that they are followed.
If you build a metadata quality app using Webhooks, please tag me @davanstrien; I would love to know about it!