Non-parametric Approach to Voice Morphing

Voice conversion is a technique for changing one speaker’s voice to sound like that of another. A speaker’s voice identity is typically characterized at several levels, including the spectral, prosodic, and lexical levels. Many studies have focused on frame-by-frame spectral mapping, i.e. converting the timbre characteristics of the source speaker to those of the target speaker. However, there have been few studies on the conversion of prosodic features. Prosodic characteristics are manifested over a longer range of the speech signal than a single frame, so their transfer requires an effective supra-segmental model, which remains a challenge.
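
To make the frame-by-frame idea concrete, the sketch below pairs source and target frames of parallel utterances by dynamic time warping and fits a per-frame spectral map. It is a minimal illustration under assumed MFCC-like features and a least-squares linear mapping (a stand-in for the GMM or neural mappings used in the literature), not this project’s method.

```python
# A minimal sketch of frame-by-frame spectral mapping on parallel data.
# The feature choice (MFCC-like), DTW metric and linear least-squares map
# are illustrative simplifications, not this project's method.
import numpy as np
import librosa

def align_parallel(src_feats, tgt_feats):
    """DTW-align two (n_dims, n_frames) feature matrices; return paired frames."""
    _, wp = librosa.sequence.dtw(X=src_feats, Y=tgt_feats, metric='euclidean')
    wp = wp[::-1]                                  # warping path, start-to-end
    return src_feats[:, wp[:, 0]].T, tgt_feats[:, wp[:, 1]].T

def fit_linear_map(X, Y):
    """Least-squares frame mapping Y ~ [X, 1] @ W, standing in for GMM/DNN maps."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def convert(src_feats, W):
    """Apply the learned per-frame map to new source features."""
    X = src_feats.T
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ W).T                              # converted (n_dims, n_frames)
```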

In this project, we study voice conversion frameworks that learn from both parallel and non-parallel utterances, and we investigate how to optimize voice quality in terms of spectral and prosodic similarity. We also study deep learning approaches to both monolingual and cross-lingual voice conversion.
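
The prosodic side is harder to capture frame by frame. A common reference point, which the prosody conversion work listed below goes beyond, is a log-F0 mean-variance transform that shifts and scales the source speaker’s pitch contour to match the target speaker’s statistics. A minimal sketch, assuming precomputed log-F0 statistics for both speakers:

```python
# A minimal sketch of the common log-F0 mean-variance baseline for prosody
# in voice conversion; the voicing handling and precomputed statistics are
# illustrative assumptions, not this project's supra-segmental model.
import numpy as np

def convert_f0(f0_src, src_stats, tgt_stats):
    """Shift and scale source log-F0 to match target speaker statistics.

    src_stats, tgt_stats: (mean, std) of log-F0 over each speaker's voiced
    frames, assumed precomputed from training data. Unvoiced frames
    (f0 == 0) are passed through unchanged.
    """
    mu_s, sigma_s = src_stats
    mu_t, sigma_t = tgt_stats
    f0_out = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    lf0 = np.log(f0_src[voiced])
    f0_out[voiced] = np.exp((lf0 - mu_s) / sigma_s * sigma_t + mu_t)
    return f0_out
```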

Project Duration: 1 October 2016 – 30 September 2019

Funding Source: National University of Singapore

PUBLICATIONS

Journal Articles

  • Chitralekha Gupta, Haizhou Li and Ye Wang, “Automatic Leaderboard: Evaluation of Singing Quality Without a Standard Reference,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 2020, pp. 13-26. [link]
  • Berrak Sisman, Mingyang Zhang and Haizhou Li, “Group Sparse Representation with WaveNet Vocoder Adaptation for Spectrum and Prosody Conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(6), June 2019, pp. 1085-1097. [link]
  • Karthika Vijayan, Haizhou Li and Tomoki Toda, “Speech-to-Singing Voice Conversion: The Challenges and Strategies for Improving Vocal Conversion Processes,” IEEE Signal Processing Magazine, 36(1), January 2019, pp. 95-102. [link]
  • Xiaohai Tian, Siu Wa Lee, Zhizheng Wu, Eng Siong Chng and Haizhou Li, “An Exemplar-Based Approach to Frequency Warping for Voice Conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(10), October 2017, pp. 1863-1876. [link]

Conference Articles

  • Yi Zhou, Xiaohai Tian, Emre Yılmaz, Rohan Kumar Das and Haizhou Li, “A Modularized Neural Network with Language-Specific Output Layers for Cross-Lingual Voice Conversion,” in Proc. IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop 2019, Sentosa Island, Singapore, December 2019. [link]
  • Yi Zhou, Xiaohai Tian, Rohan Kumar Das and Haizhou Li, “Many-to-many Cross-lingual Voice Conversion with a Jointly Trained Speaker Embedding Network,” in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC) 2019, Lanzhou, China, November 2019. [link]
  • Karthika Vijayan, Kodukula Sri Rama Murty and Haizhou Li, “Allpass Modeling of Phase Spectrum of Speech Signals for Formant Tracking,” in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC) 2019, Lanzhou, China, November 2019. [link]
  • Chitralekha Gupta, Emre Yılmaz and Haizhou Li, “Acoustic Modeling for Automatic Lyrics-to-Audio Alignment,” in Proc. INTERSPEECH, Graz, Austria, September 2019, pp. 2040-2044. [link]
  • Xiaohai Tian, Eng Siong Chng and Haizhou Li, “A Speaker-Dependent WaveNet for Voice Conversion with Non-Parallel Data,” in Proc. INTERSPEECH, Graz, Austria, September 2019, pp. 201-205. [link]
  • Xiaoxue Gao, Xiaohai Tian, Rohan Kumar Das, Yi Zhou and Haizhou Li, “Speaker-Independent Spectral Mapping for Speech-to-Singing Conversion,” in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC) 2019, Lanzhou, China, November 2019. [link]
  • Bidisha Sharma and Haizhou Li, “A Combination of Model-based and Feature-based Strategy for Speech-to-Singing Alignment,” in Proc. INTERSPEECH, Graz, Austria, September 2019, pp. 624-628. [link]
  • Chitralekha Gupta, Karthika Vijayan, Bidisha Sharma, Xiaoxue Gao and Haizhou Li, “NUS Speak-to-Sing: A Web Platform for Personalized Speech-to-Singing Conversion,” in Proc. INTERSPEECH, Graz, Austria, September 2019, pp. 2376-2377. [link]
  • Yi Zhou, Xiaohai Tian, Haihua Xu, Rohan Kumar Das and Haizhou Li, “Cross-Lingual Voice Conversion with Bilingual Phonetic Posteriorgram and Average Modeling,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, United Kingdom, May 2019, pp. 6790-6794. [link]
  • Karthika Vijayan, Xiaoxue Gao and Haizhou Li, “Analysis of Speech and Singing Signals for Temporal Alignment,” in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC) 2018, Honolulu, Hawaii, USA, November 2018, pp. 1893-1898. [link]
  • Mingyang Zhang, Berrak Sisman, Sai Sirisha Rallabandi, Haizhou Li and Li Zhao, “Error Reduction Network for DBLSTM-based Voice Conversion,” in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC) 2018, Honolulu, Hawaii, USA, November 2018, pp. 823-828. [link]
  • Xiaoxue Gao, Berrak Sisman, Rohan Kumar Das and Karthika Vijayan, “NUS-HLT Spoken Lyrics and Singing (SLS) Corpus,” in Proc. International Conference on Orange Technologies (ICOT) 2018, Bali, Indonesia, October 2018. [link]
  • Berrak Sisman and Haizhou Li, “Wavelet Analysis of Speaker Dependent and Independent Prosody for Voice Conversion,” in Proc. INTERSPEECH, Hyderabad, India, September 2018, pp. 52-56. [link]
  • Berrak Sisman, Mingyang Zhang and Haizhou Li, “A Voice Conversion Framework with Tandem Feature Sparse Representation and Speaker-Adapted WaveNet Vocoder,” in Proc. INTERSPEECH, Hyderabad, India, September 2018, pp. 1978-1982. [link]
  • Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li and Satoshi Nakamura, “Adaptive WaveNet Vocoder for Residual Compensation in GAN-Based Voice Conversion,” in Proc. IEEE Spoken Language Technology (SLT) Workshop, Athens, Greece, December 2018, pp. 282-289. [link]
  • Berrak Sisman, Grandee Lee and Haizhou Li, “Phonetically Aware Exemplar-Based Prosody Transformation,” in Proc. Odyssey 2018: The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, June 2018, pp. 267-274. [link]
  • Xiaohai Tian, Juchao Wang, Haihua Xu, Eng Siong Chng and Haizhou Li, “Average Modeling Approach to Voice Conversion with Non-Parallel Data,” in Proc. Odyssey 2018: The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, June 2018, pp. 227-232. [link]
  • Karthika Vijayan, Haizhou Li, Hanwu Sun and Kong-Aik Lee, “On the Importance of Analytic Phase of Speech Signals in Spoken Language Recognition,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Alberta, Canada, April 2018, pp. 5194-5198. [link]
  • Berrak Sisman, Haizhou Li and Kay Chen Tan, “Transformation of Prosody in Voice Conversion,” in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC), Kuala Lumpur, Malaysia, December 2017, pp. 1537-1546. [link]
  • Berrak Sisman, Haizhou Li and Kay Chen Tan, “Sparse Representation of Phonetic Features for Voice Conversion with and Without Parallel Data,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, December 2017, pp. 677-684. [link]
  • Berrak Sisman, Grandee Lee, Haizhou Li and Kay Chen Tan, “On the Analysis and Evaluation of Prosody Conversion Techniques,” in Proc. International Conference on Asian Language Processing (IALP), Singapore, December 2017, pp. 44-47. [link]
  • Karthika Vijayan, Minghui Dong and Haizhou Li, “A Dual Alignment Scheme for Improved Speech-to-Singing Voice Conversion,” in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC), Kuala Lumpur, Malaysia, December 2017, pp. 1547-1555. [link]
  • Jie Wu, D.-Y. Huang, Lei Xie and Haizhou Li, “Denoising Recurrent Neural Network for Deep Bidirectional LSTM Based Voice Conversion,” in Proc. INTERSPEECH, Stockholm, Sweden, August 2017, pp. 3379-3383. [link]