Spot a given query keyword in a silent video of a talking face
These leaderboards are used to track progress in Visual Keyword Spotting
Use these libraries to find Visual Keyword Spotting models and implementations
No subtasks available.
A novel convolutional architecture, KWS-Net, uses a similarity-map intermediate representation to separate the task into sequence matching and pattern detection, deciding whether and when a word of interest is spoken by a talking face, with or without the audio.
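The similarity-map idea can be illustrated with a short, self-contained sketch: compare every keyword character embedding against every video frame feature to form a 2-D map, then let a small CNN look for the characteristic matching pattern. This is a hedged toy example in PyTorch assuming precomputed features; the class name, layer sizes, and pooling are illustrative, not KWS-Net's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityMapKWS(nn.Module):
    """Toy similarity-map keyword spotter: build a (keyword-length x video-length)
    cosine-similarity map, then run a small CNN over it to decide whether and
    when the keyword occurs (illustrative, not the published KWS-Net)."""

    def __init__(self):
        super().__init__()
        self.detector = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
        )

    def forward(self, video_feats, keyword_feats):
        # video_feats: (T, D) frame-level visual features (assumed precomputed)
        # keyword_feats: (K, D) per-grapheme or per-phoneme keyword embeddings
        v = F.normalize(video_feats, dim=-1)
        k = F.normalize(keyword_feats, dim=-1)
        sim = k @ v.t()                               # (K, T) cosine-similarity map
        score_map = self.detector(sim[None, None])    # (1, 1, K, T) pattern evidence
        frame_scores = score_map.squeeze(0).squeeze(0).max(dim=0).values  # (T,)
        return frame_scores.sigmoid()                 # per-frame keyword probability

# Example with illustrative sizes: 100 video frames, 5-character keyword, 256-d features
probs = SimilarityMapKWS()(torch.randn(100, 256), torch.randn(5, 256))
```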
This paper proposes the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciation, and demonstrates that this system obtains very promising visual-only KWS results on the challenging LRS2 database, for keywords unseen during training.
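The grapheme-to-phoneme component can be sketched as a small encoder-decoder that maps a word's spelling to its phoneme sequence, which is what allows unseen keywords to be queried by pronunciation. The sketch below is a minimal, hedged PyTorch illustration with toy vocabulary sizes, not the paper's model.

```python
import torch
import torch.nn as nn

class G2PSeq2Seq(nn.Module):
    """Minimal grapheme-to-phoneme encoder-decoder: encode a word's spelling,
    decode its phoneme sequence, so keywords unseen during training can still
    be represented by their pronunciation (illustrative sketch)."""

    def __init__(self, n_graphemes=30, n_phonemes=45, dim=128):
        super().__init__()
        self.g_embed = nn.Embedding(n_graphemes, dim)
        self.p_embed = nn.Embedding(n_phonemes, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_phonemes)

    def forward(self, graphemes, phonemes_in):
        # graphemes: (B, Lg) character ids; phonemes_in: (B, Lp) teacher-forced phoneme ids
        _, h = self.encoder(self.g_embed(graphemes))       # summarize the spelling
        dec_out, _ = self.decoder(self.p_embed(phonemes_in), h)
        return self.out(dec_out)                           # (B, Lp, n_phonemes) logits

# Example: batch of 2 words, 6 graphemes each, teacher-forced with 7 phonemes
logits = G2PSeq2Seq()(torch.randint(0, 30, (2, 6)), torch.randint(0, 45, (2, 7)))
```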
LipLearner leverages contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort, and exhibits high robustness to different lighting, posture, and gesture conditions on an in-the-wild dataset.
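Two generic pieces of this recipe can be sketched: a contrastive loss that pulls together embeddings of the same lip sequence, and nearest-prototype matching that turns a handful of user-recorded examples into a classifier. The functions below are a hedged sketch of those standard techniques (an NT-Xent-style loss and prototype matching), not LipLearner's code.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """Contrastive loss between two views of the same lip sequences:
    matching pairs are pulled together, all other pairs pushed apart."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                     # (B, B) pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # i-th row matches i-th column
    return F.cross_entropy(logits, targets)

def few_shot_classify(query, support, support_labels):
    """Nearest-prototype few-shot classification: average the few recorded
    examples per command and assign each query to the closest prototype."""
    classes = support_labels.unique()                      # sorted command ids
    protos = torch.stack([support[support_labels == c].mean(0) for c in classes])
    sims = F.normalize(query, dim=-1) @ F.normalize(protos, dim=-1).t()
    return classes[sims.argmax(dim=-1)]                    # predicted command id per query

# Example: 3 commands x 5 shots of 256-d lip embeddings, 4 query clips
support, labels = torch.randn(15, 256), torch.arange(3).repeat_interleave(5)
pred = few_shot_classify(torch.randn(4, 256), support, labels)
```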
A novel Transformer-based architecture, the Transpotter, is proposed: it ingests two streams, a visual encoding of the video and a phonetic encoding of a keyword, applies full cross-modal attention between them, and outputs the temporal location of the keyword if present.
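A hedged sketch of the joint-attention idea: concatenate video tokens and keyword phoneme tokens, let a standard Transformer encoder attend across both streams, and read off per-frame localization plus a clip-level presence score. Class and head names below are illustrative, not the Transpotter implementation.

```python
import torch
import torch.nn as nn

class CrossModalSpotter(nn.Module):
    """Joint visual-phonetic attention sketch: every video token can attend to
    every phoneme token (and vice versa); the video positions are then scored
    to localize the keyword (illustrative, not the published Transpotter)."""

    def __init__(self, dim=256, heads=4, layers=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.frame_head = nn.Linear(dim, 1)   # per-frame keyword localization
        self.clip_head = nn.Linear(dim, 1)    # whole-clip "is the keyword present?"

    def forward(self, video_tokens, phoneme_tokens):
        # video_tokens: (B, T, dim) visual encodings; phoneme_tokens: (B, K, dim)
        T = video_tokens.size(1)
        joint = self.encoder(torch.cat([video_tokens, phoneme_tokens], dim=1))
        frame_probs = self.frame_head(joint[:, :T]).squeeze(-1).sigmoid()       # (B, T)
        clip_prob = self.clip_head(joint[:, T:].mean(dim=1)).squeeze(-1).sigmoid()
        return clip_prob, frame_probs

# Example: 2 clips, 75 frames, 8 keyword phonemes, 256-d tokens (illustrative sizes)
clip_p, frame_p = CrossModalSpotter()(torch.randn(2, 75, 256), torch.randn(2, 8, 256))
```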