Visual-Aware End-to-End Speech Recognition in Noisy Setting

Human communication is multi-modal and includes both visual and audio cues. Modern technology makes it possible to capture both aspects of communication in natural settings. This work is a step towards rich transcription of audio, covering speech, audio events, and video description. The project enriches transcription in noisy settings by understanding the environment through audio and visual cues: given a predefined collection of audio cues and visual scenes, it integrates these elements with the video's subtitles.

This is a weekly blog that I maintain and update as I progress through the project. Note that understanding this blog requires reading my proposal first:

https://drive.google.com/file/d/1vAMdR_4jx5hktFh8cXaUHwpEj_PV0-Sv/view?usp=sharing

Week 1 (May 29, 2024 to June 7, 2024)

```python
# Label IDs excluded here
EXCLUDED_LABELS = [
    '/m/02zsn', '/m/05zppz', '/m/07qfr4h', '/m/09x0r', '/m/0brhx',
    '/m/0ytgt', '/t/dd00005', '/t/dd00004', '/t/dd00003',
    '/m/0dgw9r', '/m/015lz1',
]
```
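The label IDs above look like AudioSet-style ontology identifiers. As a minimal sketch of how such an exclusion list might be applied, the snippet below filters out clips that carry any of the avoided labels. The clip metadata format (`id` and `labels` fields) is a hypothetical assumption for illustration, not the project's actual data schema.

```python
# Labels to avoid, copied from the list above.
EXCLUDED_LABELS = {
    '/m/02zsn', '/m/05zppz', '/m/07qfr4h', '/m/09x0r', '/m/0brhx',
    '/m/0ytgt', '/t/dd00005', '/t/dd00004', '/t/dd00003',
    '/m/0dgw9r', '/m/015lz1',
}

def keep_clip(clip_labels):
    """Return True if the clip has none of the excluded labels."""
    return not EXCLUDED_LABELS.intersection(clip_labels)

# Hypothetical clip metadata, for illustration only.
clips = [
    {'id': 'clip_a', 'labels': ['/m/09x0r', '/m/04rlf']},
    {'id': 'clip_b', 'labels': ['/m/04rlf']},
]
kept = [c['id'] for c in clips if keep_clip(c['labels'])]
print(kept)
```

Here `clip_a` is dropped because one of its labels appears in the exclusion set, while `clip_b` is kept.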

Week 2 (June 8, 2024 to June 14, 2024)