<?xml version="1.0" encoding="UTF-8"?>
<record>
  <title>Lyrics Analysis from Polyphonic Music Using Sparse Autoencoders and Pitch Salience Analysis</title>
  <journal>Journal of Intelligent Computing</journal>
  <author>Xiaoyu Zhang</author>
  <volume>16</volume>
  <issue>4</issue>
  <year>2025</year>
  <doi>https://doi.org/10.6025/jic/2025/16/4/156-162</doi>
  <url>https://www.dline.info/jic/fulltext/v16n4/jicv16n4_3.pdf</url>
  <abstract>This work presents a study on extracting vocal melodies from polyphonic music using advanced signal
processing and deep learning techniques. It emphasizes the importance of accurately identifying the main
melody, typically carried by the human voice, for applications such as music information retrieval, cover
song identification, and copyright protection. The proposed method involves several stages: signal
preprocessing (including downsampling, normalization, and the Short-Time Fourier Transform), note segmentation
using the DIS algorithm, and multiple fundamental frequency (F0) estimation within the 70-1000
Hz range. A key innovation is the use of a sparse autoencoder neural network (SAENN) combined with a
softmax classifier to distinguish the vocal melody from the instrumental accompaniment. Trained and tested on the
MIR-1K dataset, the model achieves a recognition accuracy above 85.1%. Experimental results show that the
approach reduces melody localization errors and shortens the average extraction time by 0.12 seconds
compared with traditional methods. The study also introduces Average Extraction Time (AET) as a new
metric for evaluating computational efficiency. Overall, integrating improved pitch salience computation
with deep learning significantly enhances both accuracy and processing speed in melody extraction, offering
a robust solution for real-world audio analysis tasks. The findings suggest promising directions for future
work in music signal processing and vocal separation technologies.</abstract>
</record>
