3260 papers • 126 benchmarks • 313 datasets
Detecting activities in extended videos.
Novel inference algorithms for an end-to-end Recurrent Neural Network trained with the Connectionist Temporal Classification loss function are developed which allow the model to achieve high accuracy on both keyword spotting and voice activity detection without retraining.
A new model, Region Convolutional 3D Network (R-C3D), is introduced, which encodes the video streams using a three-dimensional fully convolutional network, then generates candidate temporal regions containing activities, and finally classifies selected regions into specific activities.
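The "generate candidates, then classify selected regions" step can be illustrated with a minimal sketch of how overlapping temporal candidates might be pruned before classification. Greedy non-maximum suppression over temporal intersection-over-union is a common choice in detection pipelines, but the summary above does not say which selection rule R-C3D uses; the function names, the `(start, end, score)` tuple layout, and the 0.5 threshold are all assumptions made for illustration.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_regions(proposals, iou_threshold=0.5):
    """Greedy temporal non-maximum suppression (hypothetical helper).

    proposals: list of (start, end, score) tuples.
    Keeps the highest-scoring proposal, then drops any remaining proposal
    whose IoU with an already kept one exceeds iou_threshold.
    """
    kept = []
    for start, end, score in sorted(proposals, key=lambda p: p[2], reverse=True):
        if all(temporal_iou((start, end), (k[0], k[1])) <= iou_threshold for k in kept):
            kept.append((start, end, score))
    return kept


# Two heavily overlapping candidates and one distant candidate.
candidates = [(0.0, 10.0, 0.9), (1.0, 11.0, 0.8), (20.0, 30.0, 0.7)]
selected = select_regions(candidates)
```

In this toy input the second candidate overlaps the first too strongly and is dropped, so only the first and third regions would go on to the classification stage.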
A modified version of rVAD is presented in which computationally intensive pitch extraction is replaced by a computationally efficient spectral flatness calculation. This significantly reduces computational complexity at the cost of moderately inferior VAD performance, a worthwhile trade-off when processing large amounts of data or running on low-resource devices.
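For readers unfamiliar with spectral flatness, the following is a minimal sketch using the standard definition: the geometric mean of a frame's power spectrum divided by its arithmetic mean. It is not the rVAD implementation; the framing, thresholds, and decision logic of that method are not reproduced here, and the helper names and the `eps` floor are assumptions for illustration. A naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive O(n^2) DFT power spectrum of a real frame (non-negative frequencies)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_flatness(power, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum, in (0, 1].

    eps is a small floor (an assumption) that avoids log(0) on empty bins.
    """
    power = [p + eps for p in power]
    log_mean = sum(math.log(p) for p in power) / len(power)
    arith_mean = sum(power) / len(power)
    return math.exp(log_mean) / arith_mean


# An impulse has a flat spectrum; a pure tone concentrates power in one bin.
impulse = [1.0] + [0.0] * 63
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
flat_score = spectral_flatness(power_spectrum(impulse))
tonal_score = spectral_flatness(power_spectrum(tone))
```

Noise-like frames score near 1 and strongly tonal or harmonic frames score near 0. The measure needs only one spectrum per frame, which is consistent with the lower cost relative to pitch extraction described above.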
This paper experimentally compares recognition approaches that capture temporal structure in activity videos, first by classifying segmented videos and then by extending those approaches to continuous videos, and finds that learning temporal structure is valuable for fine-grained activity recognition.
This work introduces pyannote.audio, an open-source toolkit written in Python for speaker diarization, which provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.
This work creates simulated conversations (SC) with multiple speakers per conversation and shows that they allow for substantially better performance than simulated mixtures (SM), while also reducing the dependence on a fine-tuning stage.
This system is useful for gating the inputs to a streaming on-device speech recognition system so that it triggers only for the target user, which helps reduce computational cost and battery consumption, especially in scenarios where a keyword detector is not preferred.
The proposed deep-learning-based RF sensing achieves near-perfect presence detection across multiple extended test periods and outperforms leading-edge passive infrared sensors.