Inference API

Please refer to the Inference API documentation for detailed information.

What technology do you use to power the inference API?

For πŸ€— Transformers models, Pipelines power the API (a minimal sketch follows the list below).

On top of Pipelines, and depending on the model type, the API applies several production optimizations, such as:

  • compiling models to optimized intermediate representations (e.g. ONNX),

  • maintaining a Least Recently Used cache, ensuring that the most popular models are always loaded,

  • scaling the underlying compute infrastructure on the fly depending on the load constraints.

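The following is a minimal, illustrative sketch of that Pipelines abstraction used locally; the task and model name are examples only, and this is not the hosted API's actual serving code:

    # Minimal local sketch of the Pipelines abstraction that powers the API
    # for Transformers models. The model name below is an example, not a requirement.
    from transformers import pipeline

    classifier = pipeline(
        "text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    print(classifier("Pipelines make model serving straightforward."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
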
For models from other libraries, the API uses Starlette and runs in Docker containers. Each library defines the implementation of different pipelines.
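
As a rough illustration of that setup (not the actual server code; the route and the run_pipeline helper below are hypothetical), a library-specific pipeline can be exposed through a small Starlette app and packaged in a Docker container:

    # Hypothetical sketch: a minimal Starlette app exposing a prediction route,
    # in the spirit of how third-party library pipelines are containerized.
    from starlette.applications import Starlette
    from starlette.responses import JSONResponse
    from starlette.routing import Route


    def run_pipeline(text: str) -> dict:
        # Placeholder for a library-specific pipeline call (hypothetical).
        return {"label": "POSITIVE", "score": 0.99}


    async def predict(request):
        payload = await request.json()
        return JSONResponse(run_pipeline(payload["inputs"]))


    app = Starlette(routes=[Route("/", predict, methods=["POST"])])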

How can I turn off the inference API for my model?

Specify inference: false in your model card’s metadata.
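
For example, the model card metadata lives in the YAML block at the top of the repository's README.md; adding the flag there turns off hosted inference for that model:

    ---
    inference: false
    ---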

Why don’t I see an inference widget or why can’t I use the inference API?

Some tasks may not be supported by the Inference API, in which case no widget is displayed. For all libraries (except πŸ€— Transformers), there is a mapping of each library to its supported tasks in the API. When a model repository has a task that is not supported by its library, the repository defaults to inference: false.

Can I send large volumes of requests? Can I get accelerated APIs?

If you are interested in accelerated inference, higher volumes of requests, or an SLA, please contact us at api-enterprise@huggingface.co.

How can I see my usage?

You can head to the Inference API dashboard. Learn more about it in the Inference API documentation.

Is there programmatic access to the Inference API?

Yes, the huggingface_hub library has a client wrapper documented here.
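
For example, one way to do this is with the InferenceClient class from huggingface_hub (a minimal sketch; the model name and input text are examples only, and a token is only needed for authenticated requests):

    # Sketch of programmatic access through huggingface_hub's InferenceClient.
    # The model name and input text are examples only.
    from huggingface_hub import InferenceClient

    client = InferenceClient()  # optionally pass token="hf_..." to authenticate
    result = client.text_classification(
        "I love this!",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    print(result)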
