Accelerating Inference
Gaudi offers several possibilities to make inference faster.
Lazy Mode
Two execution modes are proposed:
- Lazy mode, where operations are accumulated in a graph whose execution is triggered lazily. This allows the graph compiler to optimize the device execution of these operations.
- Eager mode, where operations are executed one at a time.
In lazy mode, the graph compiler generates optimized binary code that implements the given model topology on Gaudi. It performs operator fusion, data layout management, parallelization, pipelining and memory management, as well as graph-level optimizations.
To execute inference in lazy mode, you must set use_lazy_mode=True (together with use_habana=True) in your GaudiTrainingArguments.
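A minimal sketch of such a configuration, assuming the optimum-habana package is installed; output_dir is a placeholder and the other arguments are the same as in Transformers' TrainingArguments:

```python
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./results",  # placeholder output directory
    use_habana=True,         # run on Gaudi devices
    use_lazy_mode=True,      # accumulate ops in a graph before execution
)
```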
In lazy mode, the last batch may trigger an extra graph compilation because it can be smaller than the previous batches. To avoid this, you can discard the last batch with dataloader_drop_last=True.
HPU Graphs
GaudiTrainer needs the training argument use_hpu_graphs_for_inference to be set to True as follows:
Copied
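For instance, a hedged sketch of the corresponding training arguments (output_dir is a placeholder; the remaining arguments are the usual GaudiTrainingArguments):

```python
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./results",             # placeholder output directory
    use_habana=True,                    # run on Gaudi devices
    use_lazy_mode=True,                 # HPU Graphs require lazy mode
    use_hpu_graphs_for_inference=True,  # wrap inference in HPU Graphs
)
```

These arguments are then passed to GaudiTrainer as usual, along with your model and Gaudi configuration.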
GaudiStableDiffusionPipeline needs its argument use_hpu_graphs to be set to True as follows:
Copied
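A possible instantiation, assuming a Stable Diffusion checkpoint and a Gaudi configuration available on the Hugging Face Hub (both names below are illustrative placeholders):

```python
from optimum.habana.diffusers import GaudiStableDiffusionPipeline

pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",        # placeholder model checkpoint
    use_habana=True,                        # run on Gaudi devices
    use_hpu_graphs=True,                    # capture and replay HPU Graphs
    gaudi_config="Habana/stable-diffusion", # assumed Gaudi configuration name
)
```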
With HPU Graphs and in lazy mode, the first couple of iterations may be slower due to graph compilations.