Visual-Aware End-to-End Speech Recognition in Noisy Setting
Human communication is multi-modal and includes both visual and audio cues. Modern technology makes it possible to capture both aspects of communication in natural settings. This work is a step towards rich transcription that covers speech, audio events, and video description. The project enriches transcription in noisy settings by understanding the environment through audio and visual cues: given a predefined collection of audio cues and visual scenes, this study integrates these elements with the video's subtitles.
This is a weekly blog that I maintain and update as I progress through the project. Note that understanding this blog requires going through my proposal first:
https://drive.google.com/file/d/1vAMdR_4jx5hktFh8cXaUHwpEj_PV0-Sv/view?usp=sharing
Week 1 (May 29, 2024 to June 7, 2024)
- This week I primarily focused on downloading dataset samples using `yt-dlp` and filtering out samples based on the tags assigned to them.
- After going through all the labels mentioned in the CSV, I decided to exclude labels related to speech, since they could confuse the model: these noise clips are overlaid on actual speech samples from the Peoples-Speech-Dataset.
# labels avoided here
'/m/02zsn','/m/05zppz','/m/07qfr4h','/m/09x0r','/m/0brhx','/m/0ytgt','/t/dd00005','/t/dd00004', '/t/dd00003','/m/0dgw9r','/m/015lz1'
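As a sketch of this filtering step, the snippet below drops any AudioSet segment whose label set intersects the excluded IDs. The function name and CSV layout assumptions are mine: AudioSet segment CSVs list `video_id, start_seconds, end_seconds, positive_labels` with a quoted comma-separated label field and `#`-prefixed header lines.

```python
import csv

# Speech-related AudioSet label IDs excluded in Week 1
EXCLUDED_LABELS = {
    '/m/02zsn', '/m/05zppz', '/m/07qfr4h', '/m/09x0r', '/m/0brhx',
    '/m/0ytgt', '/t/dd00005', '/t/dd00004', '/t/dd00003',
    '/m/0dgw9r', '/m/015lz1',
}

def filter_segments(csv_path):
    """Yield (video_id, start, end) for rows with no excluded label."""
    with open(csv_path, newline='') as f:
        for row in csv.reader(f, skipinitialspace=True):
            if not row or row[0].startswith('#'):   # skip header/comment lines
                continue
            video_id, start, end = row[0], row[1], row[2]
            # The label field is a quoted, comma-separated list of IDs
            labels = set(','.join(row[3:]).strip('"').split(','))
            if labels.isdisjoint(EXCLUDED_LABELS):
                yield video_id, float(start), float(end)
```

The surviving segments can then be passed to `yt-dlp` for downloading.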
Week 2 (June 8, 2024 to June 14, 2024)
- Final dataset structure, plan, and execution.
- Since all clips in the AudioSet dataset are 10 seconds long, while both the mode and the average length of PS dataset clips is 14 seconds, we plan to consider only the first 10 seconds of each audio sample.
- This avoids a sudden change in visual context, which doesn't occur in the real world.
- It also gives all data samples a constant length, enabling effective training without padding.
- For iteration 1 of the dataset we only consider samples with a single noise, even though more than one noise may occur in a sample.
- This simplifies the task: the model only has to predict which noise is occurring and where it ends, with no second noise tag starting within the span of the first.
- Since the dataset is heavily skewed with respect to the number of occurrences of samples and the binned lengths of noises, we propose to reuse the same noise with different PS audio samples to reduce the skew and obtain a balanced dataset for iteration 1.
- This work is primarily planned for next week; all the stats, skewness analysis, and assumptions for the iteration 1 dataset will appear in next week's entry.