Journal of Digital Information Management


Volume 20, Issue 4, 2022, pp. 131–147

Deep Learning Model CNN With LSTM For Speaker Recognition
Bassel Alkhatib, Mohammad Madian Kamal Eddin
1Web Master Director, Syrian Virtual University, Damascus, Syria, and Faculty of Information Technology Engineering, Damascus University, Syria; 2Student in the PhD program, Syrian Virtual University, Damascus, Syria
Abstract: Speech recognition is one of the most important research fields today because of its role in daily life and in raising security to the highest level. It is a task of speech processing, and the main scope of this paper is speaker verification: identifying persons from their voices. The process depends on digitizing the sound waves into a form the system can work with. Verification is based on the characteristics of the speaker's voice (voice biometrics), which are passed to a feature-extraction stage and then to AI techniques that perform the identification task. MFCC is used for feature extraction and produces the spectrogram of a given voice signal, which represents a bank of information about the voice; this is sent to the CNN model, which is trained on the signal to verify whether the voice belongs to an existing user of the system or is a new enrollment.
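As a rough illustration of the MFCC front-end described in the abstract (framing, windowing, power spectrum, mel filterbank, log, DCT), the sketch below is a minimal self-contained NumPy implementation. The parameter values (16 kHz sampling, 512-point FFT, 160-sample hop, 26 mel bands, 13 coefficients) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Sketch of MFCC extraction: frame -> window -> |FFT|^2 -> mel filterbank -> log -> DCT."""
    # Frame the signal and apply a Hamming window to each frame
    frames = np.array([signal[s:s + n_fft] * np.hamming(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank spanning 0 .. sr/2
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel energies; keep the first n_ceps coefficients
    n = log_mel.shape[1]
    dct = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    return log_mel @ dct.T   # shape: (num_frames, n_ceps)

# One second of random "audio": the result is a (frames x coefficients) matrix,
# which can be stacked into a spectrogram-like image and fed to the CNN.
feats = mfcc(np.random.randn(16000))
print(feats.shape)
```

In a full system, such frame-level coefficient matrices would serve as the two-dimensional input that the CNN (and, per the title, the LSTM layers) consume for verification.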
Keywords: ASR, Speaker Verification, MFCC, CNN
DOI: https://doi.org/10.6025/jdim/2022/20/4/131-147
References:

[1] Speaker independent connected speech recognition. Fifthgen.com. Retrieved 15 June 2013.

[2] Bimbot, F., Bonastre, J.-F., Fredouille, C., Gravier, G. & Magrin-Chagnolleau, I. (2004) A tutorial on text-independent speaker verification. EURASIP Journal on Applied Signal Processing, 430–451.

[3] Reynolds, D.A. (1992) A Gaussian mixture modeling approach to text-independent speaker identification.

[4] Kinnunen, T. & Li, H. (2010) An overview of text-independent speaker recognition from features to super vectors. Speech Communication, 52, 12–40.

[5] Torfi, A., Dawson, J. & Nasrabadi, N.M. (2018) Text-independent speaker verification using 3D convolutional neural networks. IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, pp. 1–6.

[6] Jung, J.-w., Heo, H.-S., Kim, J.-h., Shim, H.-j. & Yu, H.-j. (2019) RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification. arXiv Preprint ArXiv:1904.08104.

[7] Anand, P. et al. (2019) Few shot speaker recognition using deep neural networks. arXiv Preprint ArXiv:1904.08775.

[8] Vélez, I., Rascon, C. & Fuentes-Pineda, G. (2018) One-shot speaker identification for a service robot using a CNN-based generic verifier. arXiv Preprint ArXiv:1809.04115.

[9] Mostafa, E. (2019) Advanced Intelligent Systems for Sustainable Development (AI2SD 2018). Springer: Berlin.

[10] Rudresh, M.D., Latha, A.S., Suganya, J. & Nayana, C.G. (2017) Performance analysis of speech digit recognition using cepstrum and vector quantization. In: IEEE International Conference on Electrical, Electronics, Communication, Computer and Optimization Techniques (ICEECCOT), 15 Dec 2017.

[11] Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N. & Kingsbury, B. (2012) Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine. IEEE Publications, 29, 82–97.

[12] Glorot, X. & Bengio, Y. (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.

[13] Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE. IEEE Publications, 86, 2278–2324.

[14] https://www.mathworks.com/discovery/convolutional-neural-network-matlab.html.

[15] A Hybrid of Deep CNN and Bidirectional LSTM for Automatic Speech Recognition. De Gruyter (degruyter.com).

[16] Mostafa, E. (2019) Advanced Intelligent Systems for Sustainable Development. Springer. Available from: https://www.springer.com/gp/book/9783030119270.

[17] Karpov, E. (2003) Real-Time Speaker Identification. Master's Thesis, University of Joensuu, Department of Computer Science.

[18] https://theaisummer.com/speech-recognition/convolutional-models.

[19] Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., Deng, L., Penn, G. & Yu, D. (2014) Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22, 1533–1545.

[20] Lu, W.-k. & Zhang, Q. (2009) Deconvolutive short-time Fourier transform spectrogram. IEEE Signal Processing Letters, 16, 576–579.

[21] Li, B. (2011) On identity authentication technology of distance education system based on voiceprint recognition. In: Proceedings of the 30th Chinese Control Conference, Yantai, China, Vols. 22–24, pp. 5718–5721.

[22] Gowdy, J.N. & Tufekci, Z. (2000) Mel-scaled discrete wavelet coefficients for speech recognition. In: Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No. 00CH37100), Istanbul, Turkey, 5–9 June 2000, Vol. 3, pp. 1351–1354.

[23] Huangkang, C. & Ying, C. (2019) Speaker identification based on multimodal long short-term memory with depth-gate. Laser & Optoelectronics Progress, 56, 031007.

[24] Miao, X. & McLoughlin, I. (2019) LSTM-TDNN with convolutional front-end for dialect identification in the 2019 Multi-Genre Broadcast Challenge.

[25] https://medium.com/techiepedia/binary-image-classifier-cnn-using-tensorflow-a3f5d6746697.

[26] https://www.gosmar.eu/machinelearning/2020/05/25/neural-networks-and-speech-recognition/.

[27] Shan, S., Liu, J. & Dun, Y. (2021) Prospect of voiceprint recognition based on deep learning. Journal of Physics: Conference Series, 1848, 012046.

[28] Tucci, L. & Lutkevich, B. (2021) A guide to artificial intelligence in the enterprise: Natural language processing (NLP). Tech Accelerator, 09 Jul 2021.

[29] Farrell, K.R., Mammone, R.J. & Assaleh, K.T. (1994) Speaker recognition using neural networks and conventional classifiers. IEEE Transactions on Speech and Audio Processing, 2, 194–205.

[30] Martinez, J., Perez, H. & Escamilla, E. (2018) Speaker recognition using Mel frequency cepstral coefficients (MFCC) and vector quantization (VQ) techniques. IEEE Publications.

[31] Ma, Z., Yu, H., Tan, Z.H. & Guo, J. (2016) Text-independent speaker identification using the histogram transform model. IEEE Access, 4, 9733–9739.

[32] Almaadeed, N., Aggoun, A. & Amira, A. (2015) Speaker identification using multimodal neural networks and wavelet analysis. IET Biometrics, 4, 18–28.

[33] Çakır, E., Parascandolo, G., Heittola, T., Huttunen, H. & Virtanen, T. (2017) Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25, 1291–1303.

[34] Ranjan Kumari, R., Mahto, K., Kumari, D. & Solanki, S.S. (2017) Singer identification using MFCC and LPC and its comparison for ANN and Naïve Bayes classifiers. International Journal of Latest Engineering Research and Applications (IJLERA), 2(4), 25–30.

[35] Singh, T. (2019) MFCCs made easy. Medium. https://medium.com/@tanveer9812/mfccs-made-easy-7ef383006040.

[36] Mowlaee, P., Kulmer, J., Stahl, J. & Mayer, F. (2016) Single Channel Phase-Aware Signal Processing in Speech Communication: Theory and Practice. Wiley: Chichester, UK, 53–55.

[37] Verteletskaya, E. & Sakhnov, K. (2010) Voice activity detection for speech enhancement applications. Acta Polytechnica, 50(4).