Fast piano transcription on AWS - Part 2


Previously, I tested a few architectures for music transcription. I concluded that separating the preprocessing, inferencing and postprocessing steps for parallel processing is crucial to improving model speed — refer to the architecture below:

Implemented architecture

In this article, I will explore further optimisations.


Improving inference speed with ONNX

Replacing PyTorch with ONNX improves inference speed by 2x.

The benchmark metric I will use is the average inference time for a 3-second audio frame (FIT). The FIT for PyTorch on the M1 Pro CPU is around 1.1 seconds, using around 3 CPU cores.
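Concretely, here is a minimal sketch of how such a per-frame average could be measured; run_inference and frames are hypothetical placeholders for the backend under test and a batch of preprocessed 3-second frames:

import time

def average_fit(frames, run_inference):
    # Average wall-clock inference time per 3-second audio frame
    start = time.perf_counter()
    for frame in frames:
        run_inference(frame)
    return (time.perf_counter() - start) / len(frames)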

The 172 MB PTH model weight file can be reduced to a 151 MB ONNX file, which also embeds some of the preprocessing steps, such as generating a spectrogram. Running the ONNX file on Microsoft’s ONNXruntime reduces the FIT to 0.56 seconds using 3 CPU cores — a 2x improvement!

Converting PTH to ONNX only takes a couple of lines:

import torch

pytorch_model = ...   # trained PyTorch model in eval mode
dummy_input = ...     # example input tensor with the expected shape
modelPath = ...       # where to save the .onnx file
input_names = ['input']
output_names = [...]
torch.onnx.export(pytorch_model, dummy_input, modelPath,
  input_names=input_names,
  output_names=output_names
)

Running ONNX on ONNXruntime:

import onnxruntime as ort

# Inference with CPU
onnx_path = ...
model = ort.InferenceSession(onnx_path, providers=['CPUExecutionProvider'])

# Outputs will be a list of tensors
input_data = ...   # preprocessed audio frame as a NumPy array
outputs = model.run(None, {'input': input_data})

# You can convert the outputs to a dict
output_names = [...]
output_dict = {}
for i in range(len(output_names)):
    output_dict[output_names[i]] = outputs[i]
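The FIT figures above were measured with around 3 CPU cores in use. If you want to cap ONNXruntime to a fixed thread count, session options control this; a minimal sketch, assuming the same CPU execution provider as above:

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 3   # threads used within a single operator
sess_options.inter_op_num_threads = 1   # threads used across independent operators

onnx_path = ...
model = ort.InferenceSession(onnx_path, sess_options,
                             providers=['CPUExecutionProvider'])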

Quantising the model is also possible. Converting to float16 reduces the file to 72 MB. However, this did increase the FIT to 0.65 seconds:

import onnx
from onnxconverter_common import float16

onnx_path = ...
onnx_model = onnx.load(onnx_path)

onnx_f16_path = ...
onnx_model_f16 = float16.convert_float_to_float16(onnx_model)
onnx.save(onnx_model_f16, onnx_f16_path)
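One caveat, assuming the default conversion settings: the converted graph’s inputs are also float16, so the input tensor needs casting before calling run on a session loaded from the float16 file. A minimal sketch:

import numpy as np

# 'model' here is an InferenceSession loaded from onnx_f16_path
input_data = ...   # same preprocessed frame as before
input_f16 = np.asarray(input_data, dtype=np.float16)
outputs = model.run(None, {'input': input_f16})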

ONNXruntime ARM Lambda issues

Ensure the CPU architecture and OS of your CI/CD and cloud servers match. This avoids incompatibilities between the development and cloud environments.

At the time of writing, ARM Lambda doesn’t expose CPU information under “/sys/devices”, which ONNXruntime uses to optimise its calculations. I suspect this information is not available for the proprietary Graviton chips.
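A quick way to see this for yourself is to log what the runtime exposes from inside a Lambda handler. A minimal diagnostic sketch; the exact paths ONNXruntime reads are an assumption, and this only probes whether they are visible:

import os

for path in ("/sys/devices", "/proc/cpuinfo"):
    if os.path.isdir(path):
        print(path, os.listdir(path))
    elif os.path.isfile(path):
        with open(path) as f:
            print(path, f.readline().strip())
    else:
        print(path, "not available")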

Currently, x86_64 Lambdas do work with ONNXruntime. A CI/CD workflow on an x86_64 server is an easy way to deploy x86_64 Lambdas from Apple Silicon — or any ARM machine. Some Python libraries, like ONNXruntime, ship compiled binaries, so the machine the code is packaged and deployed from matters.

Why not GPU acceleration?

For on-demand, parallelised inferencing, consider Step Function Maps with CPU.

GPU acceleration seemed like an obvious solution. However, the bottleneck for inferencing in this application is concurrency. Scaling up hundreds of GPU inferences quickly is challenging. Furthermore, GPUs are expensive.

A quantised ONNX model on x86_64 Lambda with 1.7 vCPUs achieves a FIT of 2.3 seconds. The cold start FIT is around 10 seconds.

On Google Colab, a V100 GPU achieves a FIT of around 2.5 seconds. Combined with the difficulty of scaling GPUs quickly, serverless CPU inferencing is the more suitable solution.


From Docker to zipping static binaries

Use Docker images for rapid prototyping, then zipped files for optimisation.

Docker speeds up deployments since only changed layers need to be uploaded. But the downside is massive images that increase cold start time. My PyTorch Docker image was 1.9 GB and the ONNX one was 1.1 GB.

In contrast, zipped deployments are small, leading to short cold start times. However, installing Python packages, zipping and uploading takes a lot of time, especially when you are debugging.

Packaging static FFmpeg

Package static binaries instead of apt installs where possible to reduce size.

The naive method is installing FFmpeg through apt install, which adds over 300 MB, even with no recommended installs and multi-stage builds:

# Adds over 300 MB to the Docker image (Debian)
RUN apt update && \
    apt install -y --no-install-recommends ffmpeg && \
    apt clean

While lazy-loaded FFmpeg libraries do exist, Lambda restricts file writing to the “/tmp” folder. An alternative strategy is to use an FFmpeg static binary tailored to the OS and CPU architecture. Simply call the binary by its path:

# The 300 MB apt install is equivalent to a 78 MB FFmpeg binary
path/to/ffmpeg/binary ... arg1 arg2

# To use the 'ffmpeg' command, create an alias, assuming it's not taken
echo 'alias ffmpeg="path/to/ffmpeg/binary"' >> ~/.bashrc
source ~/.bashrc

ffmpeg ... arg1 arg2

The binary is around 78 MB, so be sure to maximise its usage. For example, there is no need to write audio downsampling code when FFmpeg can downsample much faster.
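For instance, here is a minimal sketch of calling the packaged static binary from Python to downsample audio to mono 16 kHz; the binary path, sample rate and file names are placeholders for whatever your pipeline uses:

import subprocess

FFMPEG = "path/to/ffmpeg/binary"   # the packaged static binary

def downsample(src, dst, sample_rate=16000):
    # -ac 1: mono, -ar: target sample rate, -y: overwrite the output file
    subprocess.run(
        [FFMPEG, "-i", src, "-ac", "1", "-ar", str(sample_rate), "-y", dst],
        check=True,
    )

downsample("/tmp/input.mp3", "/tmp/input_16k.wav")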


Double/float64 is faster than float32 or float16

Prefer a type that has native hardware support.

Float16 is much smaller than float64, so it should be faster? NOT NECESSARILY! Whatever the hardware natively supports is faster. For instance, Lambda x86_64 runs on Intel Xeon and AMD EPYC processors, which natively support float64 calculations and matrix dot products.

To illustrate, postprocessing example.mp3 on the M1 Pro completes in 0.107 seconds using np.float32. This is reduced to 0.014 seconds using np.float64. Float64 turned out to be around 7.5x faster than float32 or float16.
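If you want to verify this on your own pipeline, a simple timing loop over the dtypes is enough. A minimal sketch, where postprocess and frame_output are hypothetical placeholders for the actual postprocessing function and the raw model output:

import time
import numpy as np

postprocess = ...      # hypothetical: the postprocessing function under test
frame_output = ...     # hypothetical: raw model output as a NumPy array

for dtype in (np.float16, np.float32, np.float64):
    data = np.asarray(frame_output, dtype=dtype)
    start = time.perf_counter()
    postprocess(data)
    print(dtype.__name__, round(time.perf_counter() - start, 3), "seconds")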

Even when deployed to the cloud, an 8-minute audio file is transcribed in 60 seconds using float16. Float64 not only has higher precision but also transcribes the same file in 28 seconds.


Conclusion

The most important lesson is to measure performance and not blindly follow others. For all you know, GPU inferencing may be faster for one use case and CPU for another. Questioning assumptions and believing that something can be improved eventually led to a 2x inferencing improvement — 5x faster than inferencing on the M1 Pro for an 8-minute audio file!