Vision and Machine Learning Laboratory

Welcome to the Vision and Machine Learning Lab at the National University of Singapore!
More up-to-date information can be found at: https://sites.google.com/view/showlab

We aim to build a multimodal AI assistant for various platforms — such as social media apps, AR glasses, robots, and video/audio editing tools — with the ability to understand video, audio, and language collectively. This involves techniques such as:

  • Video understanding, e.g. action detection, video pre-training, object detection & tracking, person re-ID, and segmentation in space and time.

  • Multi-modal learning, e.g. video+language and video+audio.

  • AI-Human cooperation and interaction


Lab Supervisor and PI: Assistant Professor Mike Shou

Bio: Prof. Shou is a tenure-track Assistant Professor at the National University of Singapore. He was previously a Research Scientist at Facebook AI in the Bay Area. He obtained his Ph.D. degree at Columbia University in the City of New York, working with Prof. Shih-Fu Chang. He was awarded the Wei Family Private Foundation Fellowship and received a best student paper nomination at CVPR'17. His team won first place in the International Challenge on Activity Recognition (ActivityNet) 2017. He is a Fellow of the National Research Foundation (NRF) Singapore, Class of 2021.

Activities:

  • Recent Professional Services:

    • Area Chair / Meta-reviewer / Senior Program Committee:

      • CVPR 2022

      • ACM MM 2021, IJCAI 2021

      • ACM MM 2020, ACM MM Asia 2020

    • Reviewers:

      • Conferences: CVPR, ICCV, ECCV, AAAI, IJCAI, ICLR, NeurIPS, ICML, ACM MM, etc.

      • Journals: International Journal of Computer Vision, Transactions on Pattern Analysis and Machine Intelligence, etc.

  • Recent Talks:

    • April 2021, Invited talk at the University of Bristol, "Generic Event Boundary Detection: A Benchmark for Event Segmentation"

Full list of publications: [Google Scholar]

  • Generic Event Boundary Detection: A Benchmark for Event Segmentation.
    Mike Zheng Shou, Stan W. Lei, Deepti Ghadiyaram, Weiyao Wang, Matt Feiszli.
    International Conference on Computer Vision (ICCV), 2021. [arxiv]
    The first large-scale taxonomy-free event segmentation benchmark. A stepping stone to addressing long-form video understanding. We organised a workshop called LOVEU at CVPR’21 along with competitions built upon this dataset. The competitions attracted 20+ participants!
  • Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection.
    Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li.
    ACM Multimedia, 2021. [AVA challenge report]
    Leverages video + audio to detect active speakers. Secured 3rd place in the AVA challenge organised by Google Research at the CVPR'21 ActivityNet workshop.
  • On Pursuit of Designing Multi-modal Transformer for Video Grounding.
    Meng Cao, Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou.
    Preprint. 2021.
  • Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization.
    Junting Pan, Siyu Chen, Mike Zheng Shou, Yu Liu, Jing Shao, Hongsheng Li.
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021. [arxiv]
    Localizes actions in space and time. Core technique behind the 1st-place entry in the AVA challenge organised by Google Research at the CVPR'20 ActivityNet workshop.
  • SF-Net: Single-Frame Supervision for Temporal Action Localization.
    Fan Ma, Linchao Zhu, Yi Yang, Shengxin Zha, Gourab Kundu, Matt Feiszli, Zheng Shou.
    European Conference on Computer Vision (ECCV), 2020. Spotlight, acceptance rate top 5%. [arxiv]
    A new form of weak supervision that achieves results comparable to its fully supervised counterpart at a much lower annotation cost.
  • DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition.
    Zheng Shou, Xudong Lin, Yannis Kalantidis, Laura Sevilla-Lara, Marcus Rohrbach, Shih-Fu Chang, Zhicheng Yan.
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [arxiv]
    A video model that learns discriminative motion cues directly from compressed video — fast and accurate.
  • AutoLoc: Weakly-supervised Temporal Action Localization in Untrimmed Videos.
    Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, Shih-Fu Chang.
    European Conference on Computer Vision (ECCV), 2018. [arxiv]
  • Online Detection of Action Start in Untrimmed, Streaming Videos.
    Zheng Shou*, Junting Pan*, Jonathan Chan, Kazuyuki Miyazawa, Hassan Mansour, Anthony Vetro, Xavi Giró-i-Nieto, Shih-Fu Chang.
    European Conference on Computer Vision (ECCV), 2018. [arxiv]
  • CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos.
    Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, Shih-Fu Chang.
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [arxiv] oral presentation, acceptance rate 2.6%, best student paper nomination.
  • ConvNet Architecture Search for Spatiotemporal Feature Learning.
    Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, Manohar Paluri.
    Technical Report, 2017. [arxiv] [github]
    An open-source Res3D video backbone model that supports many video applications.
  • Single Shot Temporal Action Detection.
    Tianwei Lin, Xu Zhao, Zheng Shou.
    ACM Multimedia, 2017. [paper] [challenge report]
    Won first place in both the Temporal Action Proposal and Temporal Action Localization tracks at the ActivityNet Challenge.
  • Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs.
    Zheng Shou, Dongang Wang, and Shih-Fu Chang.
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [arxiv]
    A pioneering work that proposes the first deep learning framework for temporal action localization in video.