Human-Robot Interaction Phase 1

In this project, we study novel algorithms that integrate machine listening intelligence into robotic audition. These include audio-visual sound localization, which provides both accurate direction and distance information; speaker recognition, so that robots respond only to intended speakers; and integrated audition solutions for far-field speech acquisition. We also investigate novel end-to-end speech-to-action conversion, which allows users to speak to robots in free-form speech, and even in different languages.
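
To ground the localization component described above, the sketch below implements the classic GCC-PHAT estimator for the time difference of arrival (TDOA) between two microphones, the conventional signal-level cue for sound direction that one of the conference papers listed below ("GCC-PHAT with Speech-oriented Attention for Robotic Sound Source Localization", ICRA 2021) builds upon. This is a minimal sketch under assumed parameters (function name, sampling rate, microphone geometry), not the project's implementation.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=16):
    """Estimate the time difference of arrival (TDOA, in seconds) between
    two microphone signals with GCC-PHAT. Illustrative sketch only."""
    # Zero-pad so the circular cross-correlation does not wrap around.
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    # Cross-power spectrum, whitened by its magnitude (the PHAT weighting):
    # only phase is retained, which sharpens the correlation peak.
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=interp * n)
    # Restrict the peak search to physically plausible delays.
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)

# Hypothetical usage: a 3-sample inter-microphone delay at 16 kHz.
fs = 16000
rng = np.random.default_rng(0)
source = rng.standard_normal(fs)     # stand-in for 1 s of speech
delayed = np.roll(source, 3)         # mic 2 hears the source 3 samples late
tau = gcc_phat(delayed, source, fs, max_tau=1e-3)   # approx. 1.875e-4 s
# With microphone spacing d and speed of sound c (approx. 343 m/s), the
# direction of arrival follows as theta = arcsin(tau * c / d).
```

A delay estimate of this kind yields direction only; the project's audio-visual localization work (see the publications below) combines such acoustic cues with visual information to recover distance as well.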

Project Duration: 1 October 2019 – 30 March 2023.

Funding Source: National Robotics Programme – Robotic Enabling Capabilities and Technologies, Grant No. 192 25 00054.

Acknowledgment: This work was supported by the Science and Engineering Research Council, Agency for Science, Technology and Research (A*STAR), Singapore, through the National Robotics Programme under Grant No. 192 25 00054.

PUBLICATIONS

Journal Articles

  • Kun Zhou, Berrak Sisman, Rajib Rana, Björn W. Schuller and Haizhou Li, “Speech Synthesis with Mixed Emotions”, IEEE Transactions on Affective Computing, 2023. [Article In-process]
  • Xinyuan Qian, Zhengdong Wang, Jiadong Wang, Guohui Guan and Haizhou Li, “Audio-Visual Cross-Attention Network for Robotic Speaker Tracking”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 550-562, 2023, DOI: 10.1109/TASLP.2022.3226330.
  • Qiquan Zhang, Xinyuan Qian, Zhaoheng Ni, Aaron Nicolson, Eliathamby Ambikairajah and Haizhou Li, “A Time-Frequency Attention Module for Neural Speech Enhancement”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 462-475, 2023, DOI: 10.1109/TASLP.2022.3225649.
  • Zexu Pan, Meng Ge and Haizhou Li, “USEV: Universal Speaker Extraction with Visual Cue”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 3032-3045, 2022, DOI: 10.1109/TASLP.2022.3205759. [link]
  • Kun Zhou, Berrak Sisman, Rajib Rana, Björn W. Schuller and Haizhou Li, “Emotion Intensity and its Control for Emotional Voice Conversion”, IEEE Transactions on Affective Computing, 2022, DOI: 10.1109/TAFFC.2022.3175578. [Article In-process] [link]
  • Zexu Pan, Xinyuan Qian and Haizhou Li, “Speaker Extraction with Co-Speech Gestures Cue”, IEEE Signal Processing Letters, vol. 29, pp. 1467-1471, 2022, DOI: 10.1109/LSP.2022.3175130. [link]
  • Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li, "Decoding Knowledge Transfer for Neural Text-to-Speech Training", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1789-1802, 2022, DOI: 10.1109/TASLP.2022.3171974. [link]
  • Zexu Pan, Ruijie Tao, Chenglin Xu and Haizhou Li, “Selective Listening by Synchronizing Speech With Lips”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1650-1664, 2022, DOI: 10.1109/TASLP.2022.3153258. [link]
  • Chenglin Xu, Wei Rao, Jibin Wu and Haizhou Li, “Target Speaker Verification with Selective Auditory Attention for Single and Multi-talker Speech”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, July 2021. [link]
  • Xinyuan Qian, Qi Liu, Jiadong Wang, and Haizhou Li, “Three-dimensional Speaker Localization: Audio-refined Visual Scaling Factor Estimation”, IEEE Signal Processing Letters, July 2021. [link] [Article In-process]
  • Chen Zhang, Grandee Lee, Luis Fernando D’Haro and Haizhou Li, “D-score: Holistic Dialogue Evaluation without Reference”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, April 2021. [link] [Article In-process]
  • Rui Liu, Berrak Sisman, Guanglai Gao and Haizhou Li, “Expressive TTS Training with Frame and Style Reconstruction Loss”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, April 2021, pp. 1-13. [link] [Article In-process]
  • Rui Liu, Berrak Sisman, Yixing Lin and Haizhou Li, “FastTalker: A Neural Text-to-Speech Architecture with Shallow and Group Autoregression”, Neural Networks, April 2021. [link] [Article In-process]
  • Mingyang Zhang, Yi Zhou, Li Zhao and Haizhou Li, “Transfer learning from speech synthesis to voice conversion with non-parallel training data”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1290-1302, March 2021. [link] [Article In-process]
  • Jichen Yang, Hongji Wang, Rohan Kumar Das and Yanmin Qian, “Modified Magnitude-phase Spectrum Information for Spoofing Detection”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1065-1078, February 2021. [link] [Article In-process]
  • Zhixuan Zhang and Qi Liu, “Spike-event-driven deep spiking neural network with temporal encoding”, IEEE Signal Processing Letters, vol. 28, pp. 484-488, 2021. [link] [Article In-process]
  • Berrak Sisman, Junichi Yamagishi, Simon King and Haizhou Li, “An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 132-157, 2021. [link] [Article In-process]
  • Rui Liu, Berrak Sisman, Feilong Bao, Jichen Yang, Guanglai Gao and Haizhou Li, “Exploiting morphological and phonological features to improve prosodic phrasing for Mongolian speech synthesis”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 274-285, 2021. [link] [Article In-process]
  • Qi Liu and Jibin Wu, “Parameter tuning-free missing-feature reconstruction for robust sound recognition”, IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 1, pp. 78-89, January 2021. [link]
  • Yi Zhou, Xiaohai Tian and Haizhou Li, “Multi-Task WaveRNN with an Integrated Architecture for Cross-lingual Voice Conversion”, IEEE Signal Processing Letters, vol. 27, pp. 1310-1314, 2020. [link]
  • Mingyang Zhang, Berrak Sisman, Li Zhao and Haizhou Li, “DeepConversion: Voice conversion with limited parallel training data”, Speech Communication, vol. 122, pp. 31-43, 2020. [link]

Conference Articles

  • Peiwen Li, Enze Su, Jia Li, Siqi Cai, Longhan Xie and Haizhou Li, “ESAA: An EEG-Speech Auditory Attention Detection Database”, in Proc. 25th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), Hanoi, Vietnam, November 24-26, 2022, pp. 1-6, DOI: 10.1109/O-COCOSDA202257103.2022.9997944.
  • Rui Liu, Berrak Sisman, Björn W. Schuller, Guanglai Gao and Haizhou Li, “Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning”, in Proc. INTERSPEECH, Incheon, Korea, September 18-22, 2022.
  • Zexu Pan, Meng Ge and Haizhou Li, “A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction”, in Proc. INTERSPEECH, Incheon, Korea, September 18-22, 2022.
  • Zongyang Du, Berrak Sisman, Kun Zhou and Haizhou Li, “Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion”, in Proc. INTERSPEECH, Incheon, Korea, September 18-22, 2022.
  • Bin Wang, C.-C. Jay Kuo and Haizhou Li, “Rethinking Evaluation with Word and Sentence Similarities”, in Proc. 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, May 22-27, 2022, pp. 6060-6077. [link]
  • Chen Zhang, Luis Fernando D’Haro, Thomas Friedrichs and Haizhou Li, “MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation”, in Proc. Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22), Virtual Event, 2022.
  • Yan Zhang, Ruidan He, Zuozhu Liu, Lidong Bing, and Haizhou Li, “Bootstrapped Unsupervised Sentence Representation Learning”, ACL, August 2021, pp. 5168–5180. [link]
  • Xinyuan Qian, Bidisha Sharma, Amine El Abridi and Haizhou Li, “SLoClas: A Database for Joint Sound Localization and Classification”, in Proc. O-COCOSDA 2021, Singapore, 18-20 November 2021. Best Paper Award. [link]
  • Yi Ma, Kong Aik Lee, Ville Hautamäki and Haizhou Li, “PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction”, in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, December 2021.
  • Jiadong Wang, Xinyuan Qian, Zihan Pan, Malu Zhang and Haizhou Li, “GCC-PHAT with Speech-oriented Attention for Robotic Sound Source Localization”, in Proc. IEEE International Conference on Robotics and Automation (ICRA), Xi'an, China, 2021.
  • Chen Zhang, Yiming Chen, Luis Fernando D’Haro, Yan Zhang, Thomas Friedrichs, Grandee Lee and Haizhou Li, “DynaEval: Unifying Turn and Dialogue Level Evaluation”, in Proc. Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), August 2021. [link]
  • Qu Yang, Jibin Wu and Haizhou Li, “Rethinking Benchmarks for Neuromorphic Learning Algorithms”, in Proc. International Joint Conference on Neural Networks (IJCNN), Virtual Event, July 2021. [link]
  • Rohan Kumar Das, Jichen Yang, and Haizhou Li, “Data Augmentation with Signal Companding for Detection of Logical Access Attacks” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toronto, Ontario, Canada, June 2021. [link]
  • Kun Zhou, Berrak Sisman, and Haizhou Li, “VAW-GAN for disentanglement and recomposition of emotional elements in speech,” in Proc. IEEE Spoken Language Technology (SLT), Shenzhen, China, January 2021. [link]
  • Hongqiang Du, Xiaohai Tian, Lei Xie, and Haizhou Li, “Optimizing voice conversion network with cycle consistency loss of speaker identity” in Proc. IEEE Spoken Language Technology (SLT), Shenzhen, China, January 2021. [link]
  • Meidan Ouyang, Rohan Kumar Das, Jichen Yang and Haizhou Li, “Capsule Network based End-to-end System for Detection of Replay Attacks”, in Proc. International Symposium on Chinese Spoken Language Processing (ISCSLP) 2021, Hong Kong, January 2021, pp. 1-5. [link]
  • Rohan Kumar Das and Haizhou Li, “Classification of Speech with and without Face Mask using Acoustic Features” in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC), Auckland, New Zealand, December 2020, pp. 747-752. [link]
  • Rohan Kumar Das, Ruijie Tao, Jichen Yang, Wei Rao, Cheng Yu, and Haizhou Li, “HLT-NUS Submission for NIST 2019 Multimedia Speaker Recognition Evaluation”, in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC), Auckland, New Zealand, December 2020, pp. 605-609. [link]
  • Junchen Lu, Kun Zhou, Berrak Sisman and Haizhou Li, “VAW-GAN for Singing Voice Conversion with Non-parallel Training Data”, in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC), Auckland, New Zealand, December 2020, pp. 514-519. [link]
  • Zongyang Du, Kun Zhou, Berrak Sisman, and Haizhou Li, “Spectrum And Prosody Conversion for Cross-Lingual Voice Conversion with Cyclegan”, in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC), Auckland, New Zealand, December 2020, pp. 507-513. [link]
  • Biswajit Dev Sarma and Rohan Kumar Das, “Emotion Invariant Speaker Embeddings for Speaker Identification with Emotional Speech” in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC), Auckland, New Zealand, December 2020, pp. 610-615. [link]
  • Yi Zhou, Xiaohai Tian, Xuehao Zhou, Mingyang Zhang, Grandee Lee, Rui Liu, Berrak Sisman, and Haizhou Li, “NUS-HLT System for Blizzard Challenge 2020”, in Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge, Shanghai, China, October 2020, pp. 44-48. [link]
  • Xiaohai Tian, Zhichao Wang, Shan Yang, Xinyong Zhou, Hongqiang Du, Yi Zhou, Mingyang Zhang, Kun Zhou, Berrak Sisman, Lei Xie, and Haizhou Li, “The NUS & NWPU system for Voice Conversion Challenge 2020”, in Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, Shanghai, China, October 2020, pp. 170-174. [link]
  • Zhao Yi, Wen-Chin Huang, Xiaohai Tian, Junichi Yamagishi, Rohan Kumar Das, Tomi Kinnunen, Zhenhua Ling, and Tomoki Toda, “Voice Conversion Challenge 2020 – Intra-lingual semi-parallel and cross-lingual voice conversion –”, in Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge, Shanghai, China, October 2020, pp. 80-98. [link]
  • Rohan Kumar Das, Tomi Kinnunen, Wen-Chin Huang, Zhenhua Ling, Junichi Yamagishi, Yi Zhao, Xiaohai Tian, and Tomoki Toda, “Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions”, in Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge, Shanghai, China, October 2020, pp. 99-120. [link]
  • Xinyuan Zhou, Emre Yılmaz, Yanhua Long, Yijie Li and Haizhou Li, “Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition,” in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 1042-1046. [link]
  • Xinyuan Zhou, Grandee Lee, Emre Yılmaz, Yanhua Long, Jiaen Liang and Haizhou Li, “Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-based LVCSR,” in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 5016-5020. [link]
  • Kun Zhou, Berrak Sisman, Mingyang Zhang and Haizhou Li, “Converting Anyone’s Emotion: Towards Speaker-Independent Emotional Voice Conversion”, in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 3416-3420. [link]
  • Shoufeng Lin and Xinyuan Qian, “Audio-Visual Multi-Speaker Tracking Based On the GLMB Framework”, in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 3082-3086. [link]
  • Xiaoyi Qin, Ming Li, Hui Bu, Wei Rao, Rohan Kumar Das, Shrikanth Narayanan and Haizhou Li, “The INTERSPEECH 2020 Far-Field Speaker Verification Challenge”, in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 3456-3460. [link]
  • Zhenzong Wu, Rohan Kumar Das, Jichen Yang and Haizhou Li, “Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks”, in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 1101-1105. [link]
  • Ruijie Tao, Rohan Kumar Das and Haizhou Li, “Audio-visual Speaker Recognition with a Cross-modal Discriminative Network”, in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 2242-2246. [link]
  • Tianchi Liu, Rohan Kumar Das, Maulik Madhavi, Shengmei Shen and Haizhou Li, “Speaker-Utterance Dual Attention for Speaker and Utterance Verification”, in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 4293-4297. [link]
  • Nana Hou, Chenglin Xu, Joey Tianyi Zhou, Eng Siong Chng and Haizhou Li, “Multi-task Learning for End-to-end Noise-robust Bandwidth Extension”, in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 4069-4073. [link]
  • Nana Hou, Chenglin Xu, Van Tung Pham, Joey Tianyi Zhou, Eng Siong Chng and Haizhou Li, “Speaker and Phoneme-Aware Speech Bandwidth Extension with Residual Dual-Path Network”, in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 4064-4068. [link]
  • Grandee Lee and Haizhou Li, “Modeling Code-Switch Languages Using Bilingual Parallel Corpus”, in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), July 2020, pp. 860-870. [link]
  • Berrak Sisman and Haizhou Li, “Generative Adversarial Networks for Singing Voice Conversion with and without Parallel Data” in Proc. Speaker Odyssey, Tokyo, Japan, November 2020, pp. 238-244. [link]
  • Kun Zhou, Berrak Sisman and Haizhou Li, “Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data” in Proc. Speaker Odyssey, Tokyo, Japan, November 2020, pp. 230-237. [link]
  • Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao and Haizhou Li, “WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss” in Proc. Speaker Odyssey, Tokyo, Japan, November 2020, pp. 245-251. [link]
  • Xiaohai Tian, Rohan Kumar Das and Haizhou Li, “Black-box Attacks on Automatic Speaker Verification using Feedback-controlled Voice Conversion” in Proc. Speaker Odyssey, Tokyo, Japan, November 2020, pp. 159-164. [link]
  • Xiaoxue Gao, Xiaohai Tian, Yi Zhou, Rohan Kumar Das and Haizhou Li, “Personalized Singing Voice Generation Using WaveRNN” in Proc. Speaker Odyssey, Tokyo, Japan, November 2020, pp. 252-258. [link]
  • Rui Liu, Berrak Sisman, Jingdong Li, Feilong Bao, Guanglai Gao and Haizhou Li, “Teacher-Student Training for Robust Tacotron-based TTS”, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 2020, pp. 6274-6278. [link]