Server-side Audio Processing in Node.js

A major benefit of writing code for the web is access to the many APIs available in modern browsers. Server-side code has no such luxury, so we have to find another way. In this tutorial, we will build a simple Node.js application that uses Transformers.js for speech recognition with Whisper and, in the process, learn how to process audio on the server.

The main problem we need to solve is that the Web Audio API is not available in Node.js, meaning we can’t use the AudioContext class to process audio. So, we will need to install third-party libraries to obtain the raw audio data. For this example, we will only consider .wav files, but the same principles apply to other audio formats.
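
To make "raw audio data" concrete: a typical .wav file stores samples as 16-bit signed integers, while Whisper consumes floating-point samples. The sketch below shows that one conversion in plain JavaScript. The helper name pcm16ToFloat32 is hypothetical, and a real decoder also has to parse the file header and handle other bit depths, which is why this tutorial relies on a library instead.

```javascript
// Hypothetical helper: convert 16-bit signed PCM samples to floats in [-1, 1).
function pcm16ToFloat32(pcm) {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; ++i) {
    // 32768 = 2^15, the magnitude of the most negative int16 value.
    out[i] = pcm[i] / 32768;
  }
  return out;
}

// Silence, half scale, and the two int16 extremes.
const floats = pcm16ToFloat32(Int16Array.from([0, 16384, -32768, 32767]));
console.log(floats);
```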

This tutorial is written as an ES module, but you can easily adapt it to use CommonJS instead. For more information, see the Node.js tutorial.

Prerequisites

Getting started

Let’s start by creating a new Node.js project and installing Transformers.js via npm:

npm init -y
npm i @xenova/transformers

Remember to add "type": "module" to your package.json to indicate that your project uses ECMAScript modules.

Next, let’s install the wavefile package, which we will use for loading .wav files:

npm i wavefile

Creating the application

Start by creating a new file called index.js, which will be the entry point for our application. It needs two imports: the pipeline function from @xenova/transformers and the default export of the wavefile package.

For this tutorial, we will use the Xenova/whisper-tiny.en model, but feel free to choose one of the other Whisper models from the Hugging Face Hub. Let’s create an automatic-speech-recognition pipeline with it.
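
A minimal sketch of this step, assuming the pipeline export of @xenova/transformers and its automatic-speech-recognition task name. The wrapper createTranscriber is a hypothetical name. It uses a dynamic import() so the snippet runs unchanged in both ES modules and CommonJS; in an ES module, a static import at the top of index.js works just as well.

```javascript
// Hypothetical wrapper around the Transformers.js `pipeline` factory.
async function createTranscriber(model = 'Xenova/whisper-tiny.en') {
  // Dynamic import: valid in both ES modules and CommonJS files.
  const { pipeline } = await import('@xenova/transformers');
  // Downloads the model on first use, loads it, and resolves to a callable pipeline.
  return pipeline('automatic-speech-recognition', model);
}
```

In index.js this would be called once, for example `const transcriber = await createTranscriber();`, and the resulting function reused for every audio clip.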

Next, let’s load an audio file and convert it to the format Transformers.js expects for Whisper: single-channel floating-point samples at a 16,000 Hz sampling rate.
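
The conversion might look like the sketch below. It assumes wavefile's WaveFile class with its toBitDepth, toSampleRate, and getSamples methods. The names decodeWav and toMono are hypothetical helpers, and averaging the channels is only one simple way to downmix to mono.

```javascript
// Hypothetical helper: collapse wavefile's output to a single Float32Array.
// getSamples() yields one typed array for mono files and an array of
// per-channel typed arrays for multi-channel files.
function toMono(samples) {
  if (!Array.isArray(samples)) return Float32Array.from(samples);
  const mono = new Float32Array(samples[0].length);
  for (const channel of samples) {
    for (let i = 0; i < mono.length; ++i) {
      // Add this channel's share of the average.
      mono[i] += channel[i] / samples.length;
    }
  }
  return mono;
}

// Hypothetical helper: decode the bytes of a .wav file for Whisper.
async function decodeWav(bytes) {
  // Assumes the package's default export exposes the WaveFile class.
  const wavefile = (await import('wavefile')).default;
  const wav = new wavefile.WaveFile(bytes);
  wav.toBitDepth('32f');   // 32-bit floating-point samples
  wav.toSampleRate(16000); // Whisper expects 16 kHz audio
  return toMono(wav.getSamples());
}
```

The bytes can come from anywhere, such as fs.readFile for a local file or fetch for a remote one.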

Finally, let’s run the model and measure execution duration.
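
One way to time the call is with performance.now(), which Node.js exposes as a global. The sketch below wraps any async function. The name timed is a hypothetical helper, and the stub stands in for the real pipeline call so that the snippet runs on its own.

```javascript
// Hypothetical helper: run an async function and report how long it took.
async function timed(fn) {
  const start = performance.now();
  const output = await fn();
  const seconds = (performance.now() - start) / 1000; // milliseconds to seconds
  return { output, seconds };
}

// Stand-in for `() => transcriber(audioData)`; the stub text is made up.
const fakeTranscribe = async () => ({ text: ' stub transcription' });

const run = timed(fakeTranscribe).then(({ output, seconds }) => {
  console.log(`Execution duration: ${seconds} seconds`);
  console.log(output);
  return { output, seconds };
});
```

In index.js, with the pipeline and samples from the previous steps, the call would be `const { output, seconds } = await timed(() => transcriber(audioData));`.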

You can now run the application with node index.js. Note that when running the script for the first time, it may take a while to download and cache the model. Subsequent runs will use the cached model, and model loading will be much faster.

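The script prints the execution duration in seconds, followed by the pipeline output: an object whose text property contains the transcription.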

That’s it! You’ve successfully created a Node.js application that uses Transformers.js for speech recognition with Whisper. You can now use this as a starting point for your own applications.
