Paper Presentations
Find your poster board ID here: Poster Board Assignment

1. T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences
Abstract: In this paper, we address the challenging problem of long-term 3D human motion generation. Specifically, we aim to generate a long sequence of smoothly connected actions from a stream of multiple sentences (i.e., a paragraph). Previous long-term motion generation approaches were mostly based on recurrent methods, using previously generated motion chunks as input for the next step. However, this approach has two drawbacks: 1) it relies on sequential datasets, which are expensive; 2) these methods yield unrealistic gaps between the motions generated at each step. To address these issues, we introduce T2LM, a simple yet effective continuous long-term generation framework that can be trained without sequential data. T2LM comprises two components: a 1D-convolutional VQ-VAE, trained to compress motion into sequences of latent vectors, and a Transformer-based Text Encoder that predicts a latent sequence given an input text. At inference, a sequence of sentences is translated into a continuous stream of latent vectors. This is then decoded into a motion by the VQ-VAE decoder; the use of 1D convolutions with a local temporal receptive field avoids temporal inconsistencies between training and generated sequences. This simple constraint on the VQ-VAE allows it to be trained with short sequences only and produces smoother transitions. T2LM outperforms prior long-term generation models while overcoming the constraint of requiring sequential data; it is also competitive with SOTA single-action generation models.
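A minimal sketch of the inference idea described in this abstract, with hypothetical module names and dimensions (not the authors' code): per-sentence latent sequences are concatenated along time and decoded once by a 1D-convolutional decoder whose local receptive field keeps transitions between sentences smooth.

```python
import torch
import torch.nn as nn

class Conv1dMotionDecoder(nn.Module):
    """Stand-in for the VQ-VAE decoder: strictly local 1D convolutions over time."""
    def __init__(self, latent_dim=256, motion_dim=263, upsample=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 512, kernel_size=upsample, stride=upsample),
            nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(512, motion_dim, kernel_size=3, padding=1),
        )

    def forward(self, z):           # z: (batch, latent_dim, T_latent)
        return self.net(z)          # -> (batch, motion_dim, T_latent * upsample)

def generate_long_motion(sentences, text_encoder, decoder):
    """Encode each sentence to a latent sequence, concatenate, decode once."""
    latents = [text_encoder(s) for s in sentences]   # each: (1, latent_dim, T_i)
    stream = torch.cat(latents, dim=-1)              # continuous latent stream
    return decoder(stream)                           # one decode, no per-chunk seams

# Toy usage with a dummy "text encoder" that just returns random latents.
decoder = Conv1dMotionDecoder()
dummy_encoder = lambda s: torch.randn(1, 256, 8)
motion = generate_long_motion(["walk forward", "sit down"], dummy_encoder, decoder)
print(motion.shape)  # torch.Size([1, 263, 64])
```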

2. Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs
Abstract: We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters using RGB video data captured using commodity cameras. Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions. Given a speech audio waveform and a token sequence of the speaker's face landmark motion and body-joint motion computed from a video, our method synthesizes the full sequence of motions for the speaker's face landmarks and body joints that match the content and the affect of the speech. To this end, we design a generator consisting of a set of encoders to transform all the inputs into a multimodal embedding space capturing their correlations, followed by a pair of decoders to synthesize the desired face and pose motions. To enhance the plausibility of our synthesized motions, we use an adversarial discriminator that learns to differentiate between the face and pose motions computed from the original videos and our synthesized motions based on their affective expressions. To evaluate our approach, we extend the TED Gesture Dataset to include view-normalized, co-speech face landmarks in addition to body gestures. We demonstrate the performance of our method through thorough quantitative and qualitative experiments on multiple evaluation metrics and via a user study, and observe that our method results in low reconstruction error and produces synthesized samples with diverse facial expressions and body gestures for digital characters. We will release the extended dataset as the TED Gesture+Face Dataset consisting of 250K samples and the relevant source code.
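An illustrative sketch of the generator layout the abstract describes (assumed architecture and placeholder dimensions, not the paper's code): per-modality encoders project speech audio, face-landmark motion, and body-joint motion into one embedding space, and two decoders synthesize face and pose motion from it.

```python
import torch
import torch.nn as nn

class CoSpeechGenerator(nn.Module):
    def __init__(self, audio_dim=128, face_dim=68 * 2, pose_dim=10 * 3, embed_dim=256):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, embed_dim, batch_first=True)
        self.face_enc = nn.GRU(face_dim, embed_dim, batch_first=True)
        self.pose_enc = nn.GRU(pose_dim, embed_dim, batch_first=True)
        self.face_dec = nn.GRU(3 * embed_dim, face_dim, batch_first=True)
        self.pose_dec = nn.GRU(3 * embed_dim, pose_dim, batch_first=True)

    def forward(self, audio, face_seed, pose_seed):
        a, _ = self.audio_enc(audio)
        f, _ = self.face_enc(face_seed)
        p, _ = self.pose_enc(pose_seed)
        joint = torch.cat([a, f, p], dim=-1)        # shared multimodal embedding
        face_motion, _ = self.face_dec(joint)
        pose_motion, _ = self.pose_dec(joint)
        return face_motion, pose_motion

# Toy forward pass over 60 frames of random inputs.
gen = CoSpeechGenerator()
T = 60
face, pose = gen(torch.randn(1, T, 128), torch.randn(1, T, 136), torch.randn(1, T, 30))
print(face.shape, pose.shape)  # (1, 60, 136) (1, 60, 30)
```

The adversarial discriminator mentioned in the abstract would sit on top of these outputs; it is omitted here to keep the sketch short.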

3. Exploring Text-to-Motion Generation with Human Preference
Abstract: This paper presents an exploration of preference learning in text-to-motion generation. We find that current improvements in text-to-motion generation still rely on datasets requiring expert labelers with motion capture systems. Instead, learning from human preference data does not require motion capture systems; a labeler with no expertise simply compares two generated motions. This is particularly efficient because evaluating the model's output is easier than gathering the motion that performs a desired task (e.g. backflip). To pioneer the exploration of this paradigm, we annotate 3,528 preference pairs generated by MotionGPT, marking the first effort to investigate various algorithms for learning from preference data. In particular, our exploration highlights important design choices when using preference data. Additionally, our experimental results show that preference learning has the potential to greatly improve current text-to-motion generative models. Our code and dataset will be publicly available to further facilitate research in this area.
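One common way to learn from preference pairs is a Bradley-Terry-style objective; the sketch below shows that generic loss as an illustration of the paradigm (one of several algorithms such a study could compare, not necessarily the paper's exact objective). A reward model scores each generated motion, and the loss pushes the preferred ("winner") motion to score above the rejected one.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_winner: torch.Tensor, reward_loser: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_w - r_l), averaged over the batch of annotated pairs
    return -F.logsigmoid(reward_winner - reward_loser).mean()

# Toy usage with random reward scores for a batch of 4 preference pairs.
r_w = torch.randn(4, requires_grad=True)
r_l = torch.randn(4, requires_grad=True)
loss = preference_loss(r_w, r_l)
loss.backward()
print(float(loss))
```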

4. Two-Person Interaction Augmentation with Skeleton Priors
Abstract: Close and continuous interaction with rich contacts is a crucial aspect of human activities (e.g. hugging, dancing) and of interest in many domains like activity recognition, motion prediction, character animation, etc. However, acquiring such skeletal motion is challenging. While direct motion capture is expensive and slow, motion editing/generation is also non-trivial, as complex contact patterns with topological and geometric constraints have to be retained. To this end, we propose a new deep learning method for two-body skeletal interaction motion augmentation, which can generate variations of contact-rich interactions with varying body sizes and proportions while retaining the key geometric/topological relations between two bodies. Our system can learn effectively from a relatively small amount of data and generalize to drastically different skeleton sizes. Through exhaustive evaluation and comparison, we show it can generate high-quality motions, has strong generalizability and outperforms traditional optimization-based methods and alternative deep learning solutions.
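As a rough illustration of "retaining geometric relations between two bodies", the snippet below shows one plausible ingredient (an assumption for illustration, not the paper's actual formulation): a penalty on changes in cross-body joint distances, weighted towards joint pairs that were close, and therefore likely in contact, in the original interaction.

```python
import torch

def interaction_distance_loss(orig_a, orig_b, new_a, new_b, contact_radius=0.15):
    """orig_a/orig_b, new_a/new_b: (T, J, 3) joint positions of the two people."""
    d_orig = torch.cdist(orig_a, orig_b)            # (T, J, J) cross-body distances
    d_new = torch.cdist(new_a, new_b)
    contact_weight = (d_orig < contact_radius).float() + 0.05  # emphasize contacts
    return (contact_weight * (d_new - d_orig) ** 2).mean()

# Toy usage: compare an original interaction against a uniformly scaled-up copy.
T, J = 30, 22
a, b = torch.randn(T, J, 3), torch.randn(T, J, 3)
print(float(interaction_distance_loss(a, b, a * 1.1, b * 1.1)))
```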

5. Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation
Abstract: Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from short prompts and specified durations. However, using a single text prompt as input lacks the fine-grained control needed by animators, such as composing multiple actions and defining precise durations for parts of the motion. To address this, we introduce the new problem of timeline control for text-driven motion synthesis, which provides an intuitive, yet fine-grained, input interface for users. Instead of a single prompt, users can specify a multi-track timeline of multiple prompts organized in temporal intervals that may overlap. This enables specifying the exact timings of each action and composing multiple actions in sequence or at overlapping intervals. To generate composite animations from a multi-track timeline, we propose a new test-time denoising method. This method can be integrated with any pre-trained motion diffusion model to synthesize realistic motions that accurately reflect the timeline. At every step of denoising, our method processes each timeline interval (text prompt) individually, subsequently aggregating the predictions with consideration for the specific body parts engaged in each action. Experimental comparisons and ablations validate that our method produces realistic motions that respect the semantics and timing of given text prompts.
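A small sketch of the aggregation idea described above, with hypothetical tensors and masks (not the authors' implementation): at a denoising step, each timeline interval produces its own prediction for the frames it covers, and the predictions are merged with per-body-part weights so that overlapping prompts control only the parts they involve.

```python
import torch

def aggregate_interval_predictions(preds, frame_masks, part_masks, num_frames, dim):
    """preds[i]: (num_frames, dim) denoiser output for prompt i,
    frame_masks[i]: (num_frames,) 1 where interval i is active,
    part_masks[i]: (dim,) 1 for the motion features (body parts) prompt i controls."""
    acc = torch.zeros(num_frames, dim)
    weight = torch.zeros(num_frames, dim)
    for pred, fm, pm in zip(preds, frame_masks, part_masks):
        w = fm[:, None] * pm[None, :]
        acc += w * pred
        weight += w
    return acc / weight.clamp(min=1e-6)   # averaged where prompts overlap

# Toy example: two overlapping prompts over 8 frames and a 6-D motion feature.
T, D = 8, 6
preds = [torch.randn(T, D), torch.randn(T, D)]
frame_masks = [torch.tensor([1, 1, 1, 1, 1, 0, 0, 0.]), torch.tensor([0, 0, 0, 1, 1, 1, 1, 1.])]
part_masks = [torch.tensor([1, 1, 1, 0, 0, 0.]), torch.tensor([0, 0, 0, 1, 1, 1.])]
merged = aggregate_interval_predictions(preds, frame_masks, part_masks, T, D)
print(merged.shape)  # torch.Size([8, 6])
```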

6. DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures
Abstract: Audio-driven talking video generation has advanced significantly, but existing methods often depend on video-to-video translation techniques and traditional generative networks like GANs, and they typically generate talking heads and co-speech gestures separately, leading to less coherent outputs. Furthermore, the gestures produced by these methods often appear overly smooth or subdued, lacking in diversity, and many gesture-centric approaches do not integrate talking head generation. To address these limitations, we introduce DiffTED, a new approach for one-shot audio-driven TED-style talking video generation from a single image. Specifically, we leverage a diffusion model to generate sequences of keypoints for a Thin-Plate Spline motion model, precisely controlling the avatar's animation while ensuring temporally coherent and diverse gestures. This innovative approach utilizes classifier-free guidance, empowering the gestures to flow naturally with the audio input without relying on pre-trained classifiers. Experiments demonstrate that DiffTED generates temporally coherent talking videos with diverse co-speech gestures.
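Classifier-free guidance as it is typically applied in diffusion sampling, shown as a generic sketch (not DiffTED's code; `model`, `x_t`, `t`, and `audio` are placeholders): the denoiser is queried with and without the audio condition, and the two noise predictions are blended with a guidance weight.

```python
import torch

def cfg_noise_prediction(model, x_t, t, audio, guidance_weight=2.5):
    eps_cond = model(x_t, t, cond=audio)      # conditioned on the speech audio
    eps_uncond = model(x_t, t, cond=None)     # unconditional (condition dropped)
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)

# Toy usage with a dummy "model" that ignores its inputs.
dummy = lambda x, t, cond=None: torch.randn_like(x)
eps = cfg_noise_prediction(dummy, torch.randn(1, 60, 50 * 2), t=10, audio=None)
print(eps.shape)  # (1, frames, keypoint dims)
```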

7. A Cross-Dataset Study for Text-based 3D Human Motion Retrieval
Abstract: We provide results of our study on text-based 3D human motion retrieval and particularly focus on cross-dataset generalization. Due to practical reasons such as dataset-specific human body representations, existing works typically benchmark by training and testing on partitions from the same dataset. Here, we employ a unified SMPL body format for all datasets, which allows us to perform training on one dataset, testing on the other, as well as training on a combination of datasets. Our results suggest that there exist dataset biases in standard text-motion benchmarks such as HumanML3D, KIT Motion-Language, and BABEL. We show that text augmentations help close the domain gap to some extent, but the gap remains. We further provide the first zero-shot action recognition results on BABEL, without using categorical action labels during training, opening up a new avenue for future research.
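For context, text-to-motion retrieval is commonly scored with recall@k over an embedding similarity matrix; the sketch below shows that generic metric (embedding models and dimensions are placeholders, not the paper's).

```python
import torch

def recall_at_k(text_emb, motion_emb, k=5):
    """text_emb, motion_emb: (N, D) L2-normalized embeddings; pair i matches i."""
    sim = text_emb @ motion_emb.T                       # (N, N) cosine similarities
    topk = sim.topk(k, dim=1).indices                   # best k motions per text
    targets = torch.arange(sim.size(0)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

# Toy usage: motions embedded as noisy copies of their paired texts.
N, D = 100, 256
t = torch.nn.functional.normalize(torch.randn(N, D), dim=1)
m = torch.nn.functional.normalize(t + 0.1 * torch.randn(N, D), dim=1)
print(recall_at_k(t, m, k=5))
```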

8. in2IN: Leveraging Individual Information to Generate Human INteractions
Abstract: Generating human-human motion interactions conditioned on textual descriptions is a very useful application in many areas such as robotics, gaming, animation, and the metaverse. Alongside this utility also comes a great difficulty in modeling the high-dimensional inter-personal dynamics. In addition, properly capturing the intra-personal diversity of interactions poses many challenges. Current methods generate interactions with limited diversity of intra-person dynamics due to the limitations of the available datasets and conditioning strategies. To address this, we introduce in2IN, a novel diffusion model for human-human motion generation which is conditioned not only on the textual description of the overall interaction but also on the individual descriptions of the actions performed by each person involved in the interaction. To train this model, we use a large language model to extend the InterHuman dataset with individual descriptions. As a result, in2IN achieves state-of-the-art performance on the InterHuman dataset. Furthermore, in order to increase the intra-personal diversity on existing interaction datasets, we propose DualMDM, a model composition technique that combines the motions generated with in2IN and the motions generated by a single-person motion prior pre-trained on HumanML3D. As a result, DualMDM generates motions with higher individual diversity and improves control over the intra-person dynamics while maintaining inter-personal coherence.
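An illustrative sketch of composing two denoisers at sampling time, with an assumed interface (not the DualMDM code): an interaction model and a single-person motion prior each predict the denoised motion for one person, and the predictions are blended so individual dynamics can be diversified while the interaction model keeps the two people coherent.

```python
import torch

def composed_prediction(interaction_pred, individual_pred, blend=0.3):
    """interaction_pred, individual_pred: (frames, dims) predictions for one person."""
    return (1.0 - blend) * interaction_pred + blend * individual_pred

# Toy usage with random stand-ins for the two models' predictions.
x_interaction = torch.randn(60, 263)   # from the two-person interaction model
x_individual = torch.randn(60, 263)    # from the single-person prior
print(composed_prediction(x_interaction, x_individual).shape)
```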

9. Fake It to Make It: Using Synthetic Data to Remedy the Data Shortage in Joint Multimodal Speech-and-Gesture Synthesis
Abstract: Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for the joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage by simply synthesising additional training material. Specifically, we use unimodal synthesis models trained on large datasets to create multimodal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multimodal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data.
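A schematic sketch of the data-synthesis step described above (all components here are hypothetical stand-ins, not the paper's actual models): pre-trained unimodal text-to-speech and text-to-gesture models turn a text corpus into synthetic parallel (text, audio, gesture) triples, which can then be used to pre-train a joint model before fine-tuning on real data.

```python
import numpy as np

def synthesize_parallel_corpus(texts, tts_model, gesture_model):
    corpus = []
    for text in texts:
        audio = tts_model(text)          # synthetic speech waveform
        motion = gesture_model(text)     # synthetic co-speech gesture sequence
        corpus.append({"text": text, "audio": audio, "motion": motion})
    return corpus

# Dummy unimodal "models" so the sketch runs end to end.
tts_model = lambda t: np.random.randn(16000 * 2)        # 2 s of fake audio
gesture_model = lambda t: np.random.randn(60, 45)       # 60 frames of fake pose
data = synthesize_parallel_corpus(["hello there", "nice to meet you"], tts_model, gesture_model)
print(len(data), data[0]["audio"].shape, data[0]["motion"].shape)
```
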
Abstract Presentations
Towards a GENEA Leaderboard – an Extended, Living Benchmark for Evaluating and Advancing Conversational Motion Synthesis
Inter-X: Towards Versatile Human-Human Interaction Analysis
InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion
Rigplay: Movement database designed for Machine Learning
MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion
ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis
Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
MoMask: Generative Masked Modeling of 3D Human Motions
FlowMDM: Seamless Human Motion Composition with Blended Positional Encodings
Generating Continual Human Motion in Diverse 3D Scenes
Scaling Up Dynamic Human-Scene Interaction Modeling
OmniControl: Control Any Joint at Any Time for Human Motion Generation
NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors
Contact Info
E-mail: humogen2024@gmail.com