An AI-based Method for Student Piano Performance Assessment
by Joseph Aquino, Peter Chudinov, and Mohamed Sahnoun
Learning to play the piano (or any other musical instrument, for that matter) is a daunting task that requires mentor support and many hours of guided practice. Piano teachers are scarce and their services are costly, yet a feedback loop is essential for student success [1]. In the twenty-first century, computational tools that facilitate music research, together with advances in machine learning, can help both students and teachers make faster progress and provide quantitative insight into the process of learning a piece.
As a part of Fellowship.AI, we were tasked to develop a system that “moderates music” - a very open-ended task. Given a short timeframe to complete the project, our first step was to explore the literature and resources that already exist in the field. The first technology that drew our attention was automatic music transcription (AMT) - deep learning algorithms that convert audio files into MIDI sequences. Current AMT algorithms are capable of detecting 95%+ of notes correctly [2], and the most advanced can process a multitude of instruments on one track [3]. The MIDI format, being the de facto standard in digital music processing and synthesis, naturally has a large community and well-maintained tools around it, such as processing packages [4] and visualization libraries [5]. Raw audio, essentially being a time series, also has a multitude of processing tools available, such as librosa [6]. Sheet music can also be converted into MIDI files using optical music recognition (OMR), the sheet-music counterpart of optical character recognition: just as written text from a paper scanner can be read into a computer, a picture of sheet music can be converted into a machine-readable format [7].
Armed with these tools, and a digital piano, we recorded a few sample audio files, downloaded some professional piano performances and set sail to create as many visualizations as possible - we needed to see visual patterns not only for music that pleases the ear, but for that which sounds flawed.
The first thing we delivered was an improved tempo analysis algorithm, since we found existing tempo detection methods to be less than ideal for classical and student piano. Using the predominant local pulse (PLP) [8], we locate the beats of the recording and then estimate the tempo in BPM from the time between consecutive beats.
To better understand the process, let’s go over the steps. First, from the generated mel spectrogram (left image), we extract the onset envelope (center image, blue), whose peaks correspond to the moments when notes are triggered [6]. Then a pulse curve is computed from the onset envelope (center image, orange; right image, blue), and at the peak of each pulse we find our beats (red dots in the right image).
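As a rough sketch of how this can be done in Python with librosa (the file name below is a placeholder, and the exact parameters we used may differ):

```python
import numpy as np
import librosa

# Load the recording (path is a placeholder).
y, sr = librosa.load("student_recording.wav")

# Onset strength envelope: its peaks mark moments where notes are triggered.
onset_env = librosa.onset.onset_strength(y=y, sr=sr)

# Predominant local pulse (PLP) curve computed from the onset envelope.
pulse = librosa.beat.plp(onset_envelope=onset_env, sr=sr)

# Beats are the local maxima of the PLP curve.
beat_frames = np.flatnonzero(librosa.util.localmax(pulse))
beat_times = librosa.times_like(pulse, sr=sr)[beat_frames]

# Local tempo in BPM from the time between consecutive beats.
local_bpm = 60.0 / np.diff(beat_times)
print(f"Median tempo: {np.median(local_bpm):.1f} BPM")
```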
A challenge with using PLP to extract tempo is that the estimate can come out multiplied or divided by a factor of 2 or 3. Thus, we compared the reference BPM with the PLP result rescaled by these factors to find a more reliable BPM estimate.
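A minimal sketch of this disambiguation step (the set of candidate factors and the use of a single reference BPM are assumptions about our implementation):

```python
def correct_tempo(plp_bpm: float, reference_bpm: float) -> float:
    """Pick the PLP estimate, rescaled by a factor of 2 or 3,
    that lies closest to the reference tempo."""
    factors = (1 / 3, 1 / 2, 1, 2, 3)  # possible halvings/doublings of the pulse
    candidates = [plp_bpm * f for f in factors]
    return min(candidates, key=lambda bpm: abs(bpm - reference_bpm))

# e.g. a PLP estimate of 160 BPM against a reference of 78 BPM yields 80 BPM
print(correct_tempo(160, 78))
```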
Using our improved algorithm, we then decided to compare the tempo of a See Siang Wong performance of “River Flows In You” to that of a version we synthesized from a score found online.
Right away we saw that the virtuoso performance has an inconsistent tempo, speeding up and slowing down, which gives the piece a subjectively lively feel, while the synthesized version sounds quite monotonous, even with the right dynamics programmed into the MIDI file. It became obvious that comparing expressive performances to synthetic audio files to find errors did not make sense. We needed a way to judge a piano performance against the composer’s original intent while still allowing room for expression. So we decided to revisit the idea of transcribing student recordings into MIDI files with the use of automatic music transcription algorithms.
Our initial experiments with existing AMT models ran into some problems. Many papers we read failed to provide either source code or pre-trained models. And when the necessary code was available, the correct working environment was often poorly explained. This even applied to some high-profile papers in the field.
After a week of searching, we came across Kwon et al. [9], an AMT algorithm with a pretrained model capable of working in real time. Tests with our digital piano showed very good results.
Live operation of Kwon’s algorithm was fantastic, showing the exact notes triggered as they were played, but when saving the recording as a MIDI file, we noticed that some notes were missing. Unfortunately, detection of note offsets (think releasing the piano key) is still a challenging task even for state-of-the-art AMT models, with Kwon’s model only achieving an onset + offset F-score of 79.4%. However, the detection accuracy for onsets (the triggering of the note) is exceptional, with an F-score of 94.7%. So we decided to focus only on onset detection and assign each transcribed note a fixed length of a crotchet (quarter note). This way, we could transcribe piano recordings very accurately.
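A sketch of this post-processing step using pretty_midi (the tempo used to derive the quarter-note duration and the file names are placeholders):

```python
import pretty_midi

def fix_note_lengths(midi_path: str, out_path: str, bpm: float = 120.0) -> None:
    """Keep only onset information: give every transcribed note a
    fixed quarter-note duration at the given tempo."""
    quarter = 60.0 / bpm                     # seconds per crotchet
    midi = pretty_midi.PrettyMIDI(midi_path)
    for instrument in midi.instruments:
        for note in instrument.notes:
            note.end = note.start + quarter  # ignore the detected offsets
    midi.write(out_path)

fix_note_lengths("transcription.mid", "transcription_fixed.mid", bpm=100)
```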
Now that we could accurately transcribe piano recordings, we could compare them to reference scores - sheet music (with the use of OCR) or simply MIDI files (obtained from websites like MuseScore). But we couldn’t just take a student transcription and compare it to a reference score side by side - pauses, note length variations, and, most importantly, missed and extra notes add up. We needed a more sophisticated way to make a comparison.
Dynamic time warping (DTW) is a mathematical method for aligning two time series by matching their values, often visualized as line connections between the matched points [10]. If we plot the note onsets of the two sequences against time, we can find corresponding notes.
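A toy sketch of this idea using librosa’s DTW implementation, aligning two short pitch sequences (the example pitches are made up, and the features we actually used may have differed):

```python
import numpy as np
import librosa

# MIDI pitches in onset order (hypothetical example values).
performance = np.array([[60, 64, 67, 67, 72, 60]], dtype=float)
reference = np.array([[60, 64, 67, 72, 60]], dtype=float)

# DTW returns the cumulative cost matrix D and the warping path wp,
# a list of matched (performance index, reference index) pairs.
D, wp = librosa.sequence.dtw(X=performance, Y=reference, metric="euclidean")

for i, j in wp[::-1]:  # wp is returned end-to-start, so reverse it
    print(f"performance note {i} <-> reference note {j}")
```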
Although this approach seemed straightforward, we ran into note alignment issues, particularly when processing chords. In the process of troubleshooting and digging through the literature, we found a paper that would become a cornerstone of our project.
In 2017, Eita Nakamura et al. [11] published a paper along with a C++ implementation of a MIDI-to-MIDI alignment algorithm. It can take two MIDI files (the piano performance and the reference score), connect matching notes, and identify missing and extra notes in the performance through the use of hidden Markov models (HMMs). And it does all of this with little computational cost and astonishing accuracy - between 99.2% and 99.8% on various datasets.
With the addition of this powerful tool, we could now properly compare a recorded piano performance to a reference score, allowing us to create a system that gives quantifiable and actionable feedback to a student. However, we were building our pipeline in Python, while the alignment tool was built in C++ and produced poorly structured text files as output. Therefore, we had to create an interface to embed the alignment tool into our pipeline. This is where pandas came in - a Python package that lets you easily store and manipulate data in DataFrames. It allowed us to handle the output files with relative ease.
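An illustrative sketch of this glue code (the file name and column names below are assumptions made for the example, not the alignment tool’s documented output schema):

```python
import pandas as pd

# Hypothetical column names for the whitespace-separated alignment output;
# the real layout should be checked against the tool's documentation.
columns = ["note_id", "onset_time", "offset_time", "pitch", "velocity",
           "channel", "match_status", "score_time", "score_note_id"]

match = pd.read_csv("performance_match.txt", sep=r"\s+",
                    names=columns, skiprows=1)  # skip a header line if present

# Once the output is in a DataFrame, filtering and counting are one-liners.
print(match["match_status"].value_counts())
```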
With AMT and the alignment tool in place, the last part of our pipeline would be an API. Using Flask we spun up a small web server that takes an audio recording of a user playing piano and a reference MIDI file as input. It then responds with a list of notes played by the user, labeled as “reference” (if the note is correct), “incorrect”, “missing” or “extra”. The API will be available for free use on the Launchpad.AI website soon.
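A minimal Flask sketch of the kind of endpoint described above (the route name, field names, and the transcribe/align helpers are placeholders, not our exact implementation):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def transcribe(audio_file):
    """Placeholder for the AMT step (Kwon et al.'s model in our pipeline)."""
    raise NotImplementedError

def align(performance_midi, score_file):
    """Placeholder for the MIDI-to-MIDI alignment and note labeling step."""
    raise NotImplementedError

@app.route("/assess", methods=["POST"])      # route name is illustrative
def assess():
    audio = request.files["recording"]       # user's piano recording
    score = request.files["score"]           # reference MIDI file

    performance_midi = transcribe(audio)
    labeled_notes = align(performance_midi, score)

    # e.g. [{"pitch": 60, "onset": 1.52, "label": "reference"}, ...]
    return jsonify(labeled_notes)

if __name__ == "__main__":
    app.run(port=5000)
```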
Of course, an API on its own is not very user-friendly, so we decided to make a mobile app that would handle the audio recording, the API interaction and the display of annotated sheet music. Since we did not have a Swift developer on our team, we opted to drop real-time functionality from the app and build it with React Native. The result is minimal, yet fully functional. Here are some videos of our app in action:
The first video shows a correctly played piece, and as expected, the app detects no errors:
In the second video, two chords are played incorrectly (the second only partially), and the app properly detects the misplayed notes:
And the last video shows the performer missing the second-to-last chord - our app correctly detects these errors as well:
Finally, we still needed to test the effectiveness of our process and compare it to the current state-of-the-art solutions. But first we had to find the right dataset: it would have to include several recordings of piano performances containing errors (extra/missed notes), each with its own accurate transcription (MIDI file), along with a reference MIDI score. We found such a dataset while looking for comparable papers: Benetos et al. [12] created a dataset of 7 well-known songs meeting all of these specifications.
Now that we had our dataset for testing in hand, we needed to find similar papers to compare our results against. After an extensive search, we found two papers that focused on detecting note errors in piano performance recordings: Benetos et al. [12] and Wang et al. [13]. And since we used the same dataset (in an amended version) for testing, the results would be relatively easy to compare.
In relation to our process, the two comparable papers use very different methods to detect errors in piano performances. While we use neural-network-based AMT and hidden-Markov-model-based alignment (MIDI-to-MIDI), they both use non-negative matrix factorization (NMF) for AMT and modified forms of dynamic time warping (DTW) for alignment (MIDI-to-audio). Another difference is that Benetos et al. and Wang et al. use information from the musical score to improve the transcription of the recorded performance.
In Benetos et al., the score is first aligned to the audio, and then, after synthesizing the score, both the real and the synthesized audio are transcribed using a novel NMF-based AMT method. When both the transcription of the synthesized score and that of the performance contain notes not present in the score (or, conversely, both miss notes that are present), the error is likely due to the AMT algorithm itself (often octave errors). By changing the transcription of the performance to match that of the score in these situations, the number of transcription errors is reduced.
While Benetos et al. uses the score to post-process transcription results obtained via AMT, Wang et al. integrates the score information into the transcription process itself. In Wang et al., after the score is aligned to the audio, the given recording is transcribed with an NMF-based AMT method that uses the score information to learn an optimized dictionary of spectral templates for each pitch.
All three methods yield a transcription of the performance along with a classification into correct, extra, and missing notes (in reference to the score). By comparing this to the ground-truth transcription of the performance provided by the test dataset (also annotated as correct, extra, and missing notes), the classification accuracy of each method can be evaluated.
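A sketch of how such a per-class comparison can be scored, assuming each note in a performance can be referred to by an identifier (the exact matching and evaluation protocol of the cited papers may differ):

```python
def prf(predicted: set, truth: set) -> tuple:
    """Precision, recall and F-score for one note class (e.g. the set of
    note identifiers a method labels as 'extra') against the ground truth."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical example: notes 3 and 7 are truly extra; a method flags 3 and 9.
print(prf({3, 9}, {3, 7}))  # -> (0.5, 0.5, 0.5)
```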
When evaluated on the test dataset, Benetos et al. has an F-score of 92.9% while Wang et al. achieves an F-score of 98.0%. With an F-score of 95.8%, we soundly outperform Benetos et al. but are unable to match the state-of-the-art. However, unlike Wang et al., our pipeline will work with any AMT model, so as AMT technology improves, our classification accuracy will as well. But note classification accuracy does not tell the whole story.
When considering computational complexity, our method outperforms both Benetos et al. and Wang et al. Since they both use NMF to perform AMT, their methods need to be trained for each individual song [14]. Wang et al. has particularly high computation times (approximately 1 minute per minute of recorded performance), since a large number of spectral templates must be learned for each pitch. In contrast, our neural network based AMT model is pretrained (on the MAESTRO Dataset [15] - the largest dataset for piano-based AMT research), resulting in significantly shorter computation times.
In conclusion, our mission to moderate music has culminated in the creation of an accurate piano performance assessment method for detecting note errors. And in order to make our method accessible to piano students, we also developed an API and a mobile app. While it does not quite reach the accuracy of Wang et al., our method makes up for this with reduced computation time - an important aspect of a music tutoring system. Going forward, if we were to replace our AMT model with that of Ou et al. [2] (onset F-score of 96.8%) or add score-informed post-processing similar to that of Benetos et al., there is little doubt that our note classification accuracy would surpass that of Wang et al.
We thoroughly enjoyed working on this fascinating project. Along the way, we learned a significant amount about many aspects of AI, ML and music information retrieval, and we hope that our work can be used to improve upon music education. We would like to thank everyone at Fellowship.AI for their support and guidance including the founder, Arshak Navruzyan, our mentor, Amit Borundiya, and all of the excellent fellows we worked alongside.
REFERENCES
- Kim, Hyon, et al. "An overview of automatic piano performance assessment within the music education context." (2022).
- Ou, Longshen, et al. "Exploring Transformer’s Potential on Automatic Piano Transcription." ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
- Gardner, Joshua P., et al. "MT3: Multi-Task Multitrack Music Transcription." International Conference on Learning Representations. 2021.
- Cuthbert, Michael Scott, and Christopher Ariza. "music21: A toolkit for computer-aided musicology and symbolic music data." (2010).
- Raffel, Colin, and Daniel P. W. Ellis. "Intuitive Analysis, Creation and Manipulation of MIDI Data with pretty_midi." 15th International Society for Music Information Retrieval Conference (ISMIR), Late Breaking and Demo Papers, 2014.
- librosa, Version 0.9.2, Zenodo, 2022.
- oemer: An End-to-end Optical Music Recognition Tool, Version 1.1, Zenodo, 2022.
- Grosche, Peter, and Meinard Müller. "Extracting predominant local pulse information from music recordings." IEEE Transactions on Audio, Speech, and Language Processing 19.6 (2010): 1688-1701.
- Kwon, Taegyun, Dasaem Jeong, and Juhan Nam. "Polyphonic Piano Transcription Using Autoregressive Multi-State Note Model." The 21st International Society for Music Information Retrieval Conference (ISMIR). International Society for Music Information Retrieval, 2020.
- Sankoff, D., and J. Kruskal. "The symmetric time-warping problem: from continuous to discrete." Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison (1983): 125-161.
- Nakamura, Eita, Kazuyoshi Yoshii, and Haruhiro Katayose. "Performance Error Detection and Post-Processing for Fast and Accurate Symbolic Music Alignment." ISMIR. 2017.
- Benetos, Emmanouil, Anssi Klapuri, and Simon Dixon. "Score-informed transcription for automatic piano tutoring." 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO). IEEE, 2012.
- Wang, Siying, Sebastian Ewert, and Simon Dixon. "Identifying missing and extra notes in piano recordings using score-informed dictionary learning." IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.10 (2017): 1877-1889.
- “Non-negative matrix factorization.” Wikipedia, Wikimedia Foundation, 2022, https://en.wikipedia.org/wiki/Non-negative_matrix_factorization.
- Hawthorne, Curtis, et al. "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset." International Conference on Learning Representations. 2019.