Master's Thesis

Feature extraction and machine learning techniques for musical genre determination

Since 2015, the music industry has experienced a resurgence driven by online music sales and streaming, which in turn rely on very large archives of musical data. These large musical archives, however, remain challenging to search and index effectively, both because of the scale of the data involved and because of the subjective, perceptual nature of how humans relate to music. Contemporary research in music information retrieval seeks to bridge this gap by applying algorithmic analysis to features extracted from the underlying audio in order to automatically classify and identify perceptual features in music. This project applied three machine learning techniques (support vector classification, traditional neural networks, and convolutional neural networks) to two sets of audio features (Mel-frequency cepstral coefficients and the discrete wavelet transform) for genre classification. Because convolutional neural networks have been applied to images to great effect, the discrete wavelet transform data was used to map audio into the image domain, allowing publicly available, pre-trained weight sets for four large, sophisticated image recognition networks to be leveraged. For all tasks, two subsets of a large, publicly available musical dataset were used, along with multiple training and optimization techniques. While all models met or exceeded some pre-existing benchmarks for the genre classification task, support vector classification yielded the best results, with a best overall test set accuracy of 61%, compared with 51.4% for traditional neural networks and 40.5% for convolutional neural networks on an eight-genre multi-class classification task. Applying the pre-trained image recognition networks to audio wavelet data reduced training time, but did not yield accuracies comparable to those the same networks achieve on image data. The small size of the dataset relative to datasets in other domains, the reuse of data augmentation techniques intended for images, and sub-optimal feature extraction techniques are suggested as factors in the inability of the machine learning models evaluated in this project to match the quality of results observed in the image domain. Audio-native augmentation techniques and the use of ensemble models present worthwhile avenues for future investigation.
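To make the described pipeline concrete, the following is a minimal, illustrative sketch of MFCC and wavelet feature extraction feeding a support vector classifier. It is not the thesis's actual code: the library choices (librosa, PyWavelets, scikit-learn), parameter values, and toy random-noise "clips" are assumptions introduced only to show the general shape of the approach.

# Illustrative sketch only; libraries, parameters, and toy data are assumptions,
# not the configuration used in the thesis.
import numpy as np
import librosa
import pywt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def mfcc_features(y, sr, n_mfcc=13):
    # Summarize a clip as the per-coefficient mean and standard deviation
    # of its MFCCs over time.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def dwt_features(y, wavelet="db4", level=5):
    # Summarize a clip by the average energy in each DWT sub-band.
    coeffs = pywt.wavedec(y, wavelet, level=level)
    return np.array([np.sum(c ** 2) / len(c) for c in coeffs])

sr = 22050
# Toy stand-ins for labeled audio clips: 16 five-second noise signals
# with random labels drawn from eight genres.
clips = [np.random.randn(sr * 5) for _ in range(16)]
X = np.vstack([np.concatenate([mfcc_features(c, sr), dwt_features(c)])
               for c in clips])
labels = np.random.randint(0, 8, size=len(clips))

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, labels)
print(clf.score(X, labels))

In practice each clip would be a real excerpt from the dataset and the labels its annotated genres; the sketch only shows how fixed-length feature vectors from both representations can be combined and passed to a standard classifier.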
