<?xml version="1.0" encoding="UTF-8"?>
<record>
  <title>Lyrics Analysis from Polyphonic Music Using Sparse Autoencoders and Pitch Salience Analysis</title>
  <journal>Journal of Intelligent Computing</journal>
  <author>Xiaoyu Zhang</author>
  <volume>16</volume>
  <issue>4</issue>
  <year>2025</year>
  <doi>https://doi.org/10.6025/jic/2025/16/4/156-162</doi>
  <url>https://www.dline.info/jic/fulltext/v16n4/jicv16n4_3.pdf</url>
  <abstract>This work presents a study on extracting vocal melodies from polyphonic music using advanced signal
processing and deep learning techniques. It emphasizes the importance of accurately identifying the main
melody, typically carried by the human voice, for applications such as music information retrieval, cover
song identification, and copyright protection. The proposed method involves several stages: signal
preprocessing (including downsampling, normalization, and the Short-Time Fourier Transform), note segmentation
using the DIS algorithm, and multiple fundamental frequency (F0) estimation within the 70-1000
Hz range. A key innovation is the use of a sparse autoencoder neural network (SAENN) combined with a
softmax classifier to distinguish the vocal melody from the instrumental accompaniment. Trained and tested on the
MIR-1K dataset, the model achieves a recognition accuracy above 85.1%. Experimental results show that the
approach reduces melody localization errors and shortens the average extraction time by 0.12 seconds
compared with traditional methods. The study also introduces Average Extraction Time (AET) as a new
metric for evaluating computational efficiency. Overall, integrating improved pitch salience computation
with deep learning significantly enhances both accuracy and processing speed in melody extraction, offering
a robust solution for real-world audio analysis tasks. The findings suggest promising directions for future
work in music signal processing and vocal separation technologies.</abstract>
</record>
