Fast piano transcription on AWS - Part 1


"The best way to learn jazz is to listen to it.”— Oscar Peterson

Sometimes I marvel at beautiful music, wondering how to reproduce it. This process is called “Playing by Ear” or “Transcription”. Just as a child learns to speak by listening first, music can be learnt the same way.

But there’s a problem…

Transcribing music is time-consuming. It takes a professional musician roughly 4–60 minutes to transcribe 1 minute of music. But the beginning musicians who would benefit most from transcriptions may not yet have the skills to transcribe.

We need a computational transcription method

To illustrate the transcription process, I will use Bach’s Passacaglia in C minor (BWV 582) as an example:

Modified image from David Abbot’s Understanding Sound

A spectrogram lets us see the energy of each frequency over time, but the actual notes played are shown in green. Somehow we need to identify the fundamental frequency of each note played.

Modified image from David Abbot’s Understanding Sound


AI to the rescue!

Luckily, some employees at ByteDance (TikTok’s parent company) were working on a piano transcription algorithm. The result is a Python module for inference that converts audio files to MIDI: https://github.com/qiuqiangkong/piano_transcription_inference
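Per the project’s README, transcribing a file takes only a few lines of Python; the input and output paths below are placeholders:

```python
from piano_transcription_inference import PianoTranscription, sample_rate, load_audio

# Load audio at the sample rate the model expects
audio, _ = load_audio('input.mp3', sr=sample_rate, mono=True)

# Build the transcriptor (the pretrained checkpoint is fetched on first use)
transcriptor = PianoTranscription(device='cpu')  # or 'cuda' if a GPU is available

# Transcribe and write the result out as a MIDI file
transcriptor.transcribe(audio, 'output.mid')
```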

How does this module work?

  1. Preprocess: downsample the original audio, then separate it into segments (the preprocessing is sketched below).

  2. Inference: generate a spectrogram from each audio segment, then run CNN inference on the spectrogram.

  3. Postprocess: stitch the rough MIDI output together, then perform regression to find the most likely MIDI events.
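As a rough illustration of the first two steps, here is a sketch using librosa rather than the module’s own internals; the 16 kHz sample rate, 10-second segment length, and mel settings are assumptions for illustration only:

```python
import librosa
import numpy as np

TARGET_SR = 16000       # assumed working sample rate
SEGMENT_SECONDS = 10    # assumed segment length

def preprocess(audio: np.ndarray, original_sr: int) -> list[np.ndarray]:
    # 1. Downsample the original audio to the working sample rate...
    audio = librosa.resample(audio, orig_sr=original_sr, target_sr=TARGET_SR)
    # ...then separate it into fixed-length segments.
    hop = SEGMENT_SECONDS * TARGET_SR
    return [audio[i:i + hop] for i in range(0, len(audio), hop)]

def to_spectrogram(segment: np.ndarray) -> np.ndarray:
    # 2. Each segment becomes a log-mel spectrogram for the CNN to consume.
    mel = librosa.feature.melspectrogram(y=segment, sr=TARGET_SR, n_mels=229)
    return librosa.power_to_db(mel)

# Quick demo on a synthetic 30-second A4 sine tone instead of a real recording.
t = np.linspace(0, 30, 30 * 44100, endpoint=False)
fake_audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)

segments = preprocess(fake_audio, original_sr=44100)
spectrograms = [to_spectrogram(s) for s in segments]
print(len(segments), spectrograms[0].shape)
```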

Core backend architecture

An ideal piano transcription service would be blazingly fast, so speed is the main focus of the architecture exploration below.

Local server on M1 Pro — v0

A few observations when running locally:
– Enabling torch multicore processing speeds up inference (see the sketch after this list).
– Up to 3 cores are used; beyond that there is no improvement in speed or utilisation.
– A transcription rate of 0.5 seconds of compute per 1 second of audio.
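For reference, capping PyTorch’s thread count is a one-liner; this sketch just mirrors the observations above, and the file paths are placeholders:

```python
import time

import torch
from piano_transcription_inference import PianoTranscription, sample_rate, load_audio

torch.set_num_threads(3)  # beyond 3 threads there was no further speed-up locally

audio, _ = load_audio('input.mp3', sr=sample_rate, mono=True)  # placeholder path
transcriptor = PianoTranscription(device='cpu')

start = time.time()
transcriptor.transcribe(audio, 'output.mid')
elapsed = time.time() - start
print(f'{elapsed / (len(audio) / sample_rate):.2f}s of compute per 1s of audio')
```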

Running locally gives some idea of how performance could be improved before moving to the cloud. I find that iterating in the cloud takes much longer than local development.

Naive server — v1

Naive server architecture
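A rough sketch of what the v1 Lambda handler behind API Gateway could look like; the handler shape and base64 handling are assumptions for illustration, not the actual implementation:

```python
import base64

from piano_transcription_inference import PianoTranscription, sample_rate, load_audio

# Model is built once per warm container so repeated invocations skip the load.
transcriptor = PianoTranscription(device='cpu')

def handler(event, context):
    # API Gateway delivers the MP3 as a base64-encoded request body.
    in_path, out_path = '/tmp/input.mp3', '/tmp/output.mid'
    with open(in_path, 'wb') as f:
        f.write(base64.b64decode(event['body']))

    audio, _ = load_audio(in_path, sr=sample_rate, mono=True)
    transcriptor.transcribe(audio, out_path)

    with open(out_path, 'rb') as f:
        midi_b64 = base64.b64encode(f.read()).decode('utf-8')

    return {
        'statusCode': 200,
        'isBase64Encoded': True,
        'headers': {'Content-Type': 'audio/midi'},
        'body': midi_b64,
    }
```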

Although this works, there are many limitations:

  1. API Gateway limits payloads to 6 MB, which is roughly a 6-minute MP3 file.
  2. Lambda’s computation is slow: 3 seconds of compute per 1 second of audio.
  3. API Gateway enforces a 30-second timeout, so at that speed it can only transcribe about 10 seconds of audio.

Monolithic server — v2

Monolithic server

Presigned POST URLs allow clients to upload files of up to 5 GB directly to S3. Lambda can then download the file (only into its /tmp folder) and upload results back to S3 using the AWS SDK.
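A sketch of the two S3 interactions with boto3; the bucket and key names are made up:

```python
import boto3

s3 = boto3.client('s3')
BUCKET = 'piano-transcription-uploads'  # hypothetical bucket name

def create_upload_url(object_key: str) -> dict:
    """Presigned POST lets the client upload a file of up to 5 GB straight to S3."""
    return s3.generate_presigned_post(
        Bucket=BUCKET,
        Key=object_key,
        Conditions=[['content-length-range', 0, 5 * 1024 ** 3]],  # cap at 5 GB
        ExpiresIn=3600,  # URL valid for one hour
    )

def transcribe_from_s3(object_key: str, midi_key: str) -> None:
    """Inside the transcription Lambda: files must pass through /tmp."""
    s3.download_file(BUCKET, object_key, '/tmp/input.mp3')
    # ... run the transcription here, writing /tmp/output.mid ...
    s3.upload_file('/tmp/output.mid', BUCKET, midi_key)
```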

Lambda has a 900-second timeout, so it can transcribe up to 5 minutes of audio.

Step Function server — v3

Step Function server

A Step Functions orchestrator can run the inference Lambdas in parallel. This prevents Lambda timeouts, so over an hour of audio can be transcribed (a sketch of the state machine follows the list):
– Inference takes 12.5 seconds per 5-second segment.
– Overall, a transcription rate of 0.25 seconds per 1 second of audio is achievable.
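One way to express that fan-out is a Map state in the state machine definition, shown here as a Python dict of Amazon States Language; the function ARNs and the `$.segments` input shape are assumptions:

```python
# Hypothetical state machine: fan the audio segments out to parallel inference
# Lambdas with a Map state, then stitch the results in a postprocess Lambda.
state_machine_definition = {
    "StartAt": "TranscribeSegments",
    "States": {
        "TranscribeSegments": {
            "Type": "Map",
            "ItemsPath": "$.segments",  # one entry per 5-second segment
            "MaxConcurrency": 0,        # 0 = no limit on parallel branches
            "Iterator": {
                "StartAt": "Inference",
                "States": {
                    "Inference": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:inference",
                        "End": True,
                    }
                },
            },
            "Next": "Postprocess",
        },
        "Postprocess": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:postprocess",
            "End": True,
        },
    },
}
```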


Credits: