AI Speech Lab: Automatic Speech Recognition for Public Services

AI Singapore (AISG) has set up an AI Speech Lab to develop a speech recognition system that can interpret and process the unique vocabulary used by Singaporeans – including Singlish and dialects – in daily conversations. This automatic speech transcription system could be deployed at government agencies and companies to help frontline officers capture relevant and actionable information while they focus on interacting with customers or service users to address their queries and concerns.

Established as part of our 100 Experiments (100E) Programme, this new speech lab marks AISG’s first major collaboration with multiple government agencies to design an AI system that could be deployed government-wide and, in future, nation-wide. Discussions with companies on deploying the system are also underway. Located in the Innovation 4.0 building on the National University of Singapore’s Kent Ridge campus, the lab commenced operations on 1 July 2018.

“The AI Speech Lab came about as we had, over the last few months, received multiple 100E requests from agencies and companies for a colloquial Singaporean English (Singlish) speech-to-text engine. This is a challenge that is unique to Singapore and the region which is currently not addressed by existing speech engines offered commercially or by major cloud-based AI providers,” said Professor Ho Teck Hua, Executive Chairman of AI Singapore.

“The Government is keen to harness artificial intelligence to serve our citizens better. GovTech is collaborating with AI Singapore to develop solutions that can improve planning and service delivery. We are working with the AI Speech Lab on a speech-to-text engine for multi-language speech, for example, to transcribe 995 calls on-the-fly for faster response,” said Mr Tan Kok Yam, Deputy Secretary (Smart Nation and Digital Government).

The new research lab is led by Professor Li Haizhou, a world-renowned expert in Speech, Text and Natural Language Processing from the National University of Singapore, and Associate Professor Chng Eng Siong from the Nanyang Technological University.

They recently developed the world’s first code-switch (mixed-lingual) speech recognition engine using deep learning technology. This technological breakthrough represents a paradigm shift in speech recognition and understanding. The AI Speech Lab will adopt this advanced speech recognition technology to benefit Singapore.

The novel code-switch speech recognition engine can recognise speech that mixes English and Chinese words in the same sentence, as if they belonged to the same language. Furthermore, to adapt to the local context, dialect phrases such as ‘jiak ba bueh’ (‘have you eaten?’) and ‘hoh boh’ (‘how are you?’) in Hokkien are also included in the engine’s lexicon.
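To illustrate the idea in simplified form (this is not the lab’s actual implementation), a code-switch recogniser typically decodes against a single shared lexicon and output vocabulary covering English, Chinese and locally added dialect entries, so a mixed-language utterance can be emitted as one transcript instead of being handed off between separate engines. The short Python sketch below shows what such a merged lexicon might look like; all entries and pronunciations are hypothetical placeholders.

```python
# Minimal sketch of a shared, code-switch-friendly pronunciation lexicon.
# All words, phrases and pronunciations are hypothetical examples for
# illustration only, not the AI Speech Lab's actual lexicon.

english_lexicon = {
    "have": "HH AE V",
    "you": "Y UW",
    "eaten": "IY T AH N",
}

mandarin_lexicon = {
    "吃饭": "ch ix f an",
}

# Locally relevant Hokkien phrases added so the decoder can emit them directly.
hokkien_phrases = {
    "jiak ba bueh": "jh iy aa k b aa b uw eh",  # "have you eaten?"
    "hoh boh": "hh ao b ao",                    # "how are you?"
}

# One merged lexicon lets a single decoder treat a code-switched sentence
# as an ordinary token sequence instead of switching between engines.
combined_lexicon = {**english_lexicon, **mandarin_lexicon, **hokkien_phrases}


def out_of_vocabulary(tokens, lexicon=combined_lexicon):
    """Return the tokens in a candidate transcript that the lexicon cannot cover."""
    return [tok for tok in tokens if tok not in lexicon]


if __name__ == "__main__":
    # A mixed English/Hokkien utterance checks out against the same lexicon.
    print(out_of_vocabulary(["have", "you", "eaten", "jiak ba bueh"]))  # []
    print(out_of_vocabulary(["吃饭", "already", "hoh boh"]))            # ['already']
```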

“An automatic speech recognition system that is able to recognise a mix of languages in one conversation is currently not commercially available. This is because training a computer system to recognise different languages is a very complex and challenging task. Our recent technological breakthrough is the outcome of several years of research efforts in Singapore. This technology performs better than commercial engines as it can accurately recognise conversations comprising words from different languages and solves a problem unique to Singapore,” Prof Li explained.

AI Singapore is investing S$1.25 million to set up the new lab. Agencies and companies are expected to match this investment, bringing the total funding to S$2.5 million over the next three years. The lab, which occupies a floor area of 125 square metres, will initially be staffed by five AI engineers.

Maiden collaboration with SCDF to manage emergency calls
The AI Speech Lab has secured its first collaborator – the Singapore Civil Defence Force (SCDF).

“The Singapore Civil Defence Force’s 995 Operations Centre receives close to 200,000 calls for assistance every year. When a call is received, our dispatchers need to ask some questions to determine the nature and severity of the case, to facilitate the deployment of appropriate emergency medical resources. In an emergency, every minute counts. The new speech recognition system, if successful, will help reduce the time needed to log the information. This will improve how SCDF’s emergency medical resources are dispatched and enhance the overall health outcomes of those in need,” remarked Assistant Commissioner Daniel Seet, SCDF Director of Operations.

“This collaboration between the AI Speech Lab and SCDF is an important first step. The knowledge and experience acquired through this project will enable AI Singapore to expand the deployment of this novel speech recognition engine to address other needs in the public service. Whilst we are first rolling out this system to the public service, we are confident that the solution will also benefit companies, as the system can be customised according to their business needs,” said Prof Ho.

Project Duration: 1 July 2018 – 30 June 2023

Funding Source: AI Singapore 100 Experiments (100E) Programme

PUBLICATIONS

Journal Articles

  • Kun Zhou, Berrak Sisman, Rajib Rana, Björn Schuller and Haizhou Li, “Speech Synthesis with Mixed Emotions”, IEEE Transactions on Affective Computing, 2023. [Article In-process]
  • Kun Zhou, Berrak Sisman, Rajib Rana, Björn W. Schuller and Haizhou Li, “Emotion Intensity and its Control for Emotional Voice Conversion”, IEEE Transactions on Affective Computing, 2022, DOI 10.1109/TAFFC.2022.3175578. [Article In-process] [link]
  • Jibin Wu, Chenglin Xu, Xiao Han, Daquan Zhou, Malu Zhang, Haizhou Li and Kay Chen Tan, “Progressive Tandem Learning for Pattern Recognition with Deep Spiking Neural Networks”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, DOI 10.1109/TPAMI.2021.3114196. [link] [Article In-process]
  • Chenglin Xu, Wei Rao, Jibin Wu, and Haizhou Li, “Target Speaker Verification with Selective Auditory Attention for Single and Multi-talker Speech”, July 2021. [link] [Article In-process]
  • Rui Liu, Berrak Sisman, Guanglai Gao and Haizhou Li, “Expressive TTS Training with Frame and Style Reconstruction Loss”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, April 2021, pp. 1-13. [link] [Article In-process]
  • Rui Liu, Berrak Sisman, Yixing Lin and Haizhou Li, “FastTalker: A Neural Text-to-Speech Architecture with Shallow and Group Autoregression”, Neural Networks, April 2021. [link] [Article In-process]
  • Mingyang Zhang, Yi Zhou, Li Zhao, and Haizhou Li, “Transfer learning from speech synthesis to voice conversion with non-parallel training data,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, March 2021, pp. 1290-1302. [link] [Article In-process]
  • Berrak Sisman, Junichi Yamagishi, Simon King, and Haizhou Li, “An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 2021, pp. 132-157. [link] [Article In-process]
  • Rui Liu, Berrak Sisman, Feilong Bao, Jichen Yang, Guanglai Gao and Haizhou Li, “Exploiting morphological and phonological features to improve prosodic phrasing for Mongolian speech synthesis”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 2021, pp. 274-285. [link] [Article In-process]
  • Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao and Haizhou Li, “Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS”, IEEE Signal Processing Letters, 27, 2020, pp. 1470-1474. [link] [Article In-process]
  • Yi Zhou, Xiaohai Tian and Haizhou Li, “Multi-Task WaveRNN with an Integrated Architecture for Cross-lingual Voice Conversion”, IEEE Signal Processing Letters, 27, 2020, pp. 1310-1314. [link] [Article In-process]
  • Chang Huai You and Jichen Yang, “Device Feature Extraction Based on Parallel Neural Network Training for Replay Spoofing Detection”, IEEE/ACM Transactions on Audio, Speech and Language Processing, 28, 2020, pp. 2308-2318. [link] [Article In-process]
  • Mingyang Zhang, Berrak Sisman, Li Zhao and Haizhou Li, “DeepConversion: Voice conversion with limited parallel training data”, Speech Communication, 122, 2020, pp. 31-43. [link] [Article In-process]
  • Chenglin Xu, Wei Rao, Eng Siong Chng and Haizhou Li, “SpEx: Multi-Scale Time Domain Speaker Extraction Network”, IEEE/ACM Transaction on Audio, Speech, and Language Processing, 28, 2020, pp. 1370-1384. [link] [Article In-process]
  • Chitralekha Gupta, Haizhou Li and Ye Wang, “Automatic Leaderboard: Evaluation of Singing Quality Without a Standard Reference,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 2020, pp. 13-26. [link] [Article In-process]

Conference Articles

  • Peiwen Li, Enze Su, Jia Li, Siqi Cai, Longhan Xie, and Haizhou Li, “ESAA: An EEG-Speech Auditory Attention Detection Database”, in Proc. 25th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), Hanoi, Vietnam, November 2022, pp. 1-6, DOI 10.1109/O-COCOSDA202257103.2022.9997944.
  • Rui Liu, Berrak Sisman, Björn W. Schuller, Guanglai Gao and Haizhou Li, “Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning”, in Proc. INTERSPEECH, Incheon, Korea, September 2022.
  • Jiadong Wang, Xinyuan Qian, Zihan Pan, Malu Zhang, and Haizhou Li, “GCC-PHAT with Speech-oriented Attention for Robotic Sound Source Localization”, in Proc. IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 2021.
  • Kun Zhou, Berrak Sisman, and Haizhou Li, “VAW-GAN for disentanglement and recomposition of emotional elements in speech,” in Proc. IEEE Spoken Language Technology (SLT), Shenzhen, China, January 2021. [link]
  • Hongqiang Du, Xiaohai Tian, Lei Xie, and Haizhou Li, “Optimizing voice conversion network with cycle consistency loss of speaker identity”, in Proc. IEEE Spoken Language Technology (SLT), Shenzhen, China, January 2021. [link]
  • Zongyang Du, Kun Zhou, Berrak Sisman, and Haizhou Li, “Spectrum and Prosody Conversion for Cross-Lingual Voice Conversion with CycleGAN”, in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC), Auckland, New Zealand, December 2020, pp. 507-513. [link]
  • Junchen Lu, Kun Zhou, Berrak Sisman, and Haizhou Li, “VAW-GAN for Singing Voice Conversion with Non-parallel Training Data”, in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC), Auckland, New Zealand, December 2020, pp. 514-519. [link]
  • Yi Zhou, Xiaohai Tian, Xuehao Zhou, Mingyang Zhang, Grandee Lee, Rui Liu, Berrak Sisman, and Haizhou Li, “NUS-HLT System for Blizzard Challenge 2020”, in Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge, Shanghai, China, October 2020, pp. 44-48. [link]
  • Xiaohai Tian, Zhichao Wang, Shan Yang, Xinyong Zhou, Hongqiang Du, Yi Zhou, Mingyang Zhang, Kun Zhou, Berrak Sisman, Lei Xie, and Haizhou Li, “The NUS & NWPU system for Voice Conversion Challenge 2020”, in Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, Shanghai, China, October 2020, pp. 170-174. [link]
  • Xinyuan Zhou, Emre Yılmaz, Yanhua Long, Yijie Li and Haizhou Li, “Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition,” in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 1042-1046. [link]
  • Xinyuan Zhou, Grandee Lee, Emre Yılmaz, Yanhua Long, Jiaen Liang and Haizhou Li, “Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-based LVCSR,” in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 5016-5020. [link]
  • Nana Hou, Chenglin Xu, Joey Tianyi Zhou, Eng Siong Chng and Haizhou Li, “Multi-task Learning for End-to-end Noise-robust Bandwidth Extension”, in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 4069-4073. [link]
  • Nana Hou, Chenglin Xu, Van Tung Pham, Joey Tianyi Zhou, Eng Siong Chng and Haizhou Li, “Speaker and Phoneme-Aware Speech Bandwidth Extension with Residual Dual-Path Network”, in Proc. INTERSPEECH, Shanghai, China, October 2020, pp. 4064-4068. [link]
  • Xiaoxue Gao, Xiaohai Tian, Yi Zhou, Rohan Kumar Das and Haizhou Li, “Personalized Singing Voice Generation Using WaveRNN” in Proc. Speaker Odyssey, Tokyo, Japan, November 2020, pp. 252-258. [link]
  • Xianghu Yue, Grandee Lee, Emre Yılmaz, Fang Deng and Haizhou Li, “End-to-End Code-Switching ASR for Low-Resourced Language Pairs”, in Proc. IEEE Automatic Speech Recognition Understanding (ASRU) Workshop, Sentosa Island, Singapore, December 2019, pp. 972-979. [link]
  • Qinyi Wang, Emre Yılmaz, Adem Derinel and Haizhou Li, “Code-Switching Detection Using ASR-Generated Language Posteriors”, in Proc. INTERSPEECH, Graz, Austria, September 2019, pp. 3740-3744. [link]
  • Chenglin Xu, Wei Rao, Eng Siong Chng and Haizhou Li, “Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss”, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, United Kingdom, May 2019, pp. 6990-6994. [link]
  • Grandee Lee and Haizhou Li, “Word and Class Common Space Embedding for Code-switch Language Modeling”, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, United Kingdom, May 2019, pp. 6086-6090. [link]
  • Chenglin Xu, Wei Rao, Eng Siong Chng and Haizhou Li, “Time-Domain Speaker Extraction Network”, in Proc. IEEE Automatic Speech Recognition Understanding (ASRU) Workshop, Sentosa Island, Singapore, December 2019, pp. 327-334. [link]