Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity (2020-09-04)

TL;DR

This paper presents an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate human-like gestures that match the content and rhythm of speech.

Abstract

For human-like agents, including virtual avatars and social robots, making proper gestures while speaking is crucial in human-agent interaction. Co-speech gestures enhance interaction experiences and make the agents look alive. However, it is difficult to generate human-like gestures due to the lack of understanding of how people gesture. Data-driven approaches attempt to learn gesticulation skills from human demonstrations, but the ambiguous and individual nature of gestures hinders learning. In this paper, we present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures. By incorporating a multimodal context and an adversarial training scheme, the proposed model outputs gestures that are human-like and that match speech content and rhythm. We also introduce a new quantitative evaluation metric for gesture generation models. Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model performs better than existing end-to-end generation models. We further confirm that our model is able to work with synthesized audio in a scenario where contexts are constrained, and show that different gesture styles can be generated for the same speech by specifying different speaker identities in the style embedding space that is learned from videos of various speakers. All code and data are available at https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.

Authors

Minsu Jang

Jaehong Kim

Youngwoo Yoon
