Share



Share a dataset using the CLI

At BOINC AI, we are on a mission to democratize good Machine Learning and we believe in the value of open source. That’s why we designed 🌍 Datasets so that anyone can share a dataset with the greater ML community. There are currently thousands of datasets in over 100 languages on the BOINC AI Hub, and the BOINC AI team always welcomes new contributions!

Dataset repositories offer features such as:

  • Free dataset hosting

  • Dataset versioning

  • Commit history and diffs

  • Metadata for discoverability

  • Dataset cards for documentation, licensing, limitations, etc.

This guide will show you how to share a dataset that can be easily accessed by anyone.

Add a dataset

You can share your dataset with the community with a dataset repository on the BOINC AI Hub. It can also be a private dataset if you want to control who has access to it.

In a dataset repository, you can host all your data files and configure your dataset to define which file goes to which split. The following formats are supported: CSV, TSV, JSON, JSON Lines, text, Parquet, Arrow, SQLite. Many kinds of compressed file types are also supported, such as GZ, BZ2, LZ4, LZMA or ZSTD. For example, your dataset can be made of .json.gz files.
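For instance, gzip-compressed JSON Lines files, one per split, are valid data files on their own. Here is a minimal stdlib sketch of what such files look like (the file names, records, and directory are hypothetical):

```python
# Sketch: write gzip-compressed JSON Lines data files, one per split.
# On the Hub, files laid out like this can back a dataset without a
# loading script.
import gzip
import json
import os
import tempfile

records = {
    "train": [{"text": "hello", "label": 0}, {"text": "world", "label": 1}],
    "test": [{"text": "goodbye", "label": 0}],
}

repo_dir = tempfile.mkdtemp()
for split, rows in records.items():
    path = os.path.join(repo_dir, f"{split}.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

# Read one file back to confirm the format round-trips.
with gzip.open(os.path.join(repo_dir, "train.json.gz"), "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["text"])  # hello
```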

On the other hand, if your dataset is not in a supported format or if you want more control over how your dataset is loaded, you can write your own dataset script.

When loading a dataset from the Hub, all the files in the supported formats are loaded, following the repository structure. However, if there’s a dataset script, it is downloaded and executed to download and prepare the dataset instead.

For more information on how to load a dataset from the Hub, take a look at the load a dataset from the Hub tutorial.

Create the repository

  1. Make sure you are in the virtual environment where you installed Datasets, and run the following command:

Copied

huggingface-cli login
  2. Login using your BOINC AI Hub credentials, and create a new dataset repository:

Copied

huggingface-cli repo create your_dataset_name --type dataset

Add the --organization flag to create a repository under a specific organization:

Copied

huggingface-cli repo create your_dataset_name --type dataset --organization your-org-name

Clone the repository

Copied

# Make sure you have git-lfs installed
# (https://git-lfs.github.com/)
git lfs install

git clone https://huggingface.co/datasets/namespace/your_dataset_name

Here the namespace is either your username or your organization name.

Prepare your files

  1. Now is a good time to check your directory to ensure the only files you’re uploading are:

  • The data files of the dataset

  • The dataset card README.md
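The dataset card README.md typically starts with a YAML metadata block that powers discoverability on the Hub. A minimal sketch (the license, language, and description are placeholder values):

```
---
license: apache-2.0
language:
- en
---

# Dataset Card for your_dataset_name

A short description of the dataset, how it was collected, and its intended use.
```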

Upload your files

You can directly upload your files to your repository on the BOINC AI Hub, but this guide will show you how to upload the files from the terminal.

  1. It is important to add the large data files first with git lfs track or else you will encounter an error later when you push your files:

Copied

cp /somewhere/data/*.json .
git lfs track "*.json"
git add .gitattributes
git add *.json
git commit -m "add json files"
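git lfs track records the pattern in .gitattributes, which is why that file is added and committed alongside the data. After the command above, .gitattributes should contain a line like:

```
*.json filter=lfs diff=lfs merge=lfs -text
```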
  2. (Optional) Add the dataset loading script:

Copied

cp /somewhere/data/load_script.py .
git add --all
  3. Verify the files have been correctly staged. Then you can commit and push your files:

Copied

git status
git commit -m "First version of the your_dataset_name dataset."
git push

Congratulations, your dataset has now been uploaded to the BOINC AI Hub where anyone can load it in a single line of code! 🥳

Copied

from datasets import load_dataset

dataset = load_dataset("namespace/your_dataset_name")

Ask for help and reviews

If your script is ready and you would like the BOINC AI team to review it, you can open a discussion in the Community tab of your dataset with this message:

Copied

# Dataset review request for <Dataset name>

## Description

<brief description of the dataset>

## Files to review

- file1
- file2
- ...

cc @lhoestq @polinaeterna @mariosasko @albertvillanova

Members of the BOINC AI team will be happy to review your dataset script and give you advice.

Datasets on GitHub (legacy)

Datasets used to be hosted on our GitHub repository, but all datasets have now been migrated to the BOINC AI Hub.

The legacy GitHub datasets were originally added to our GitHub repository and therefore don’t have a namespace on the Hub: “squad”, “glue”, etc., unlike the other datasets, which are named “username/dataset_name” or “org/dataset_name”.

The distinction between a Hub dataset with or without a namespace only comes from the legacy sharing workflow. It does not imply any ranking, judgment, or opinion regarding the contents of the dataset itself.

Those datasets are now maintained on the Hub: if you think a fix is needed, please use their “Community” tab to open a discussion or create a Pull Request. The code of these datasets is reviewed by the BOINC AI team.

If your data files are already in a supported format (csv/jsonl/json/parquet/txt), the your_dataset_name.py loading script is optional. To create a dataset script, see the dataset script page.

Finally, don’t forget to enrich the dataset card to document your dataset and make it discoverable! Check out the Create a dataset card guide to learn more.

If you need help with a dataset script, feel free to check the datasets forum: it’s possible that someone had similar issues and shared how they managed to fix them.
