Inference API

Please refer to the Inference API documentation for detailed information.

What technology do you use to power the inference API?

For πŸ€— Transformers models, Pipelines power the API (a minimal sketch follows the list below).

On top of Pipelines, and depending on the model type, the API applies several production optimizations, such as:

  • compiling models to optimized intermediate representations (e.g. ONNX),

  • maintaining a Least Recently Used cache, ensuring that the most popular models are always loaded,

  • scaling the underlying compute infrastructure on the fly depending on the load constraints.

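The following is a minimal, illustrative sketch of that Pipelines abstraction used locally; the task and model name are examples only, and this is not the hosted API's actual serving code:

    # Minimal local sketch of the Pipelines abstraction that powers the API
    # for Transformers models. The model name below is an example, not a requirement.
    from transformers import pipeline

    classifier = pipeline(
        "text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    print(classifier("Pipelines make model serving straightforward."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
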
For models from other libraries, the API uses Starlette and runs in Docker containers. Each library defines the implementation of different pipelines.
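
As a rough illustration of that setup (not the actual server code; the route and the run_pipeline helper below are hypothetical), a library-specific pipeline can be exposed through a small Starlette app and packaged in a Docker container:

    # Hypothetical sketch: a minimal Starlette app exposing a prediction route,
    # in the spirit of how third-party library pipelines are containerized.
    from starlette.applications import Starlette
    from starlette.responses import JSONResponse
    from starlette.routing import Route


    def run_pipeline(text: str) -> dict:
        # Placeholder for a library-specific pipeline call (hypothetical).
        return {"label": "POSITIVE", "score": 0.99}


    async def predict(request):
        payload = await request.json()
        return JSONResponse(run_pipeline(payload["inputs"]))


    app = Starlette(routes=[Route("/", predict, methods=["POST"])])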

How can I turn off the inference API for my model?

Specify inference: false in your model card’s metadata.
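
For example, the model card metadata lives in the YAML block at the top of the repository's README.md; adding the flag there turns off hosted inference for that model:

    ---
    inference: false
    ---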

Why don’t I see an inference widget or why can’t I use the inference API?

Some tasks may not be supported by the Inference API, in which case no widget is displayed. For all libraries (except πŸ€— Transformers), there is a mapping of each library to its supported tasks in the API. When a model repository has a task that is not supported by its library, the repository defaults to inference: false.

Can I send large volumes of requests? Can I get accelerated APIs?

If you are interested in accelerated inference, higher volumes of requests, or an SLA, please contact us at api-enterprise@huggingface.co.

How can I see my usage?

You can head to the Inference API dashboard. Learn more about it in the Inference API documentation.

Is there programmatic access to the Inference API?

Yes, the huggingface_hub library has a client wrapper documented here.
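
For example, one way to do this is with the InferenceClient class from huggingface_hub (a minimal sketch; the model name and input text are examples only, and a token is only needed for authenticated requests):

    # Sketch of programmatic access through huggingface_hub's InferenceClient.
    # The model name and input text are examples only.
    from huggingface_hub import InferenceClient

    client = InferenceClient()  # optionally pass token="hf_..." to authenticate
    result = client.text_classification(
        "I love this!",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    print(result)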
