3260 papers • 126 benchmarks • 313 datasets
Generation of gestures as a sequence of 3D poses
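To make this output representation concrete, here is a minimal sketch of how most systems on this page represent a gesture clip: a fixed-rate sequence of per-joint 3D values (positions or axis-angle rotations). The frame rate and joint count below are illustrative assumptions and vary across datasets; this is not taken from any specific paper listed here.

```python
# Minimal sketch: a gesture clip as a (frames, joints, 3) array.
# FPS and N_JOINTS are illustrative assumptions, not dataset constants.
import numpy as np

FPS = 30        # assumed frame rate
N_JOINTS = 55   # e.g. a full-body skeleton; the exact count varies

def empty_gesture(seconds: float) -> np.ndarray:
    """Allocate a zero-filled gesture clip of shape (frames, joints, 3)."""
    frames = int(round(seconds * FPS))
    return np.zeros((frames, N_JOINTS, 3), dtype=np.float32)

clip = empty_gesture(4.0)   # a 4-second clip
print(clip.shape)           # (120, 55, 3)
```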
These leaderboards are used to track progress in Gesture Generation
Use these libraries to find Gesture Generation models and implementations
The key system modules and the benchmark environments of the new robosuite v1.0 release are discussed.
A method for cross-modal translation from "in-the-wild" monologue speech of a single speaker to that speaker's conversational gesture motion is presented; it significantly outperforms baseline methods in a quantitative comparison.
A statistical analysis of BEAT demonstrates that conversational gestures correlate with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity; a baseline model, Cascaded Motion Network (CaMN), is proposed, which models these six modalities in a cascaded architecture for gesture synthesis.
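The cascaded idea can be sketched as chaining modality encoders so that each stage conditions on the output of the previous one. The sketch below is a simplified illustration of that pattern only; the layer sizes, modality order, and feature dimensions are assumptions, not CaMN's actual configuration.

```python
# Sketch of cascaded multimodal fusion in the spirit of CaMN: each
# modality encoder also receives the hidden features of the previous
# stage. All dimensions and the modality order are illustrative.
import torch
import torch.nn as nn

class CascadedFusion(nn.Module):
    def __init__(self, in_dims, hidden=128, pose_dim=165):
        super().__init__()
        # One encoder per modality; stage i sees its raw input plus the
        # hidden state produced by stage i - 1 (zeros for the first stage).
        self.encoders = nn.ModuleList(
            nn.Linear(d + hidden, hidden) for d in in_dims
        )
        self.head = nn.Linear(hidden, pose_dim)  # per-frame pose output
        self.hidden = hidden

    def forward(self, inputs):  # inputs: list of (batch, frames, dim)
        b, t = inputs[0].shape[:2]
        h = inputs[0].new_zeros(b, t, self.hidden)
        for enc, x in zip(self.encoders, inputs):
            h = torch.relu(enc(torch.cat([x, h], dim=-1)))
        return self.head(h)

# Six modalities (e.g. audio, text, facial, emotion, semantics, speaker
# identity) with made-up feature sizes:
dims = [128, 300, 64, 8, 16, 32]
model = CascadedFusion(dims)
feats = [torch.randn(2, 120, d) for d in dims]
print(model(feats).shape)   # (2, 120, 165)
```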
This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation, in which participating teams used the same speech and motion dataset to build gesture-generation systems, which were then evaluated for both the human-likeness of the gestures and their appropriateness for the speech.
A novel speech-to-motion generation framework is proposed in which the face, body, and hands are modeled separately, together with a cross-conditional autoregressive model that generates body poses and hand gestures, leading to coherent and realistic motions.
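A minimal sketch of cross-conditional autoregressive decoding follows: at each frame the body step sees the previous hand pose, and the hand step sees the freshly predicted body pose. The GRU-based design, module names, and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: body and hand streams condition on each other step by step.
import torch
import torch.nn as nn

class CrossConditionalDecoder(nn.Module):
    def __init__(self, audio_dim=128, body_dim=63, hand_dim=90, hidden=256):
        super().__init__()
        self.body_rnn = nn.GRUCell(audio_dim + body_dim + hand_dim, hidden)
        self.hand_rnn = nn.GRUCell(audio_dim + hand_dim + body_dim, hidden)
        self.body_out = nn.Linear(hidden, body_dim)
        self.hand_out = nn.Linear(hidden, hand_dim)
        self.hidden = hidden

    def forward(self, audio, body, hand):
        # audio: (batch, frames, audio_dim); body/hand: seed poses.
        b = audio.size(0)
        hb = audio.new_zeros(b, self.hidden)
        hh = audio.new_zeros(b, self.hidden)
        bodies, hands = [], []
        for t in range(audio.size(1)):
            a = audio[:, t]
            # Body step conditions on the previous hand pose...
            hb = self.body_rnn(torch.cat([a, body, hand], dim=-1), hb)
            body = self.body_out(hb)
            # ...and the hand step conditions on the new body pose.
            hh = self.hand_rnn(torch.cat([a, hand, body], dim=-1), hh)
            hand = self.hand_out(hh)
            bodies.append(body)
            hands.append(hand)
        return torch.stack(bodies, 1), torch.stack(hands, 1)

dec = CrossConditionalDecoder()
audio = torch.randn(2, 120, 128)
body, hand = dec(audio, torch.zeros(2, 63), torch.zeros(2, 90))
print(body.shape, hand.shape)   # (2, 120, 63) (2, 120, 90)
```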
This paper presents an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures that are human-like and that match the speech content and rhythm.
Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and is flexible in accepting predefined spatial-temporal gesture inputs, generating complete, audio-synchronized results.
A large span in human-likeness between challenge submissions is found, with a few systems rated close to human motion capture; in addition, a dyadic system being highly appropriate for the agent's speech does not necessarily imply high appropriateness for the interlocutor.
A novel framework for automatic speech-driven gesture generation is presented, applicable to human-agent interaction with both virtual agents and robots, using a denoising autoencoder neural network together with a novel encoder network.
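The two-stage idea behind such a framework can be sketched as follows: a denoising autoencoder first learns a compact motion representation, and a separate encoder then maps speech features into that latent space. All dimensions, layer sizes, and the noise level below are illustrative assumptions rather than the paper's configuration.

```python
# Sketch: (1) denoising autoencoder over motion, (2) speech-to-latent
# regression. Dimensions and noise level are illustrative assumptions.
import torch
import torch.nn as nn

pose_dim, latent_dim, speech_dim = 45, 32, 26

motion_enc = nn.Sequential(nn.Linear(pose_dim, 64), nn.ReLU(),
                           nn.Linear(64, latent_dim))
motion_dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                           nn.Linear(64, pose_dim))
speech_enc = nn.Sequential(nn.Linear(speech_dim, 64), nn.ReLU(),
                           nn.Linear(64, latent_dim))

def dae_loss(pose):
    # Corrupt the pose, then reconstruct it through the autoencoder.
    noisy = pose + 0.1 * torch.randn_like(pose)
    return nn.functional.mse_loss(motion_dec(motion_enc(noisy)), pose)

def speech_loss(speech, pose):
    # Regress speech features onto the (frozen) motion latent space.
    with torch.no_grad():
        target = motion_enc(pose)
    return nn.functional.mse_loss(speech_enc(speech), target)

# At inference time: gesture = motion_dec(speech_enc(speech_features)).
```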
This work presents a model designed to produce arbitrary beat and semantic gestures together, which takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.