Please review each point of feedback carefully, and respond with how you plan to address each one moving forward.

[Overall]
- Lack of Content Distillation: The presentation lacks sufficient distillation of the content. It is essential to distill the core technical concepts and express them in your own words rather than relying heavily on the terminology and expressions found in the paper. This makes me question your depth of understanding of the technique.
- Lack of Visualization and Script Length: The presentation suffers from a lack of visual aids and is burdened by lengthy scripts, which poses challenges for both me and the audience.
- Missing Core Concepts: Understanding the construction of the codebook, the sampling theorem, and the matching technique seems fundamental to this paper. However, your presentation only touches on these topics superficially. Why is that? Please prepare a supplementary PowerPoint to enhance our lab members' understanding.

[Feedback]
F1: The presentation overall feels like you're just reading, with a tendency to break words into segments. For example, pronouncing "respectively" as "respec...tively" makes it hard to follow. Since you're taking the time to record, it would be better to practice reading more fluidly beforehand.
F2: Simply reading the script isn't effective for a good presentation. You're not reading a storybook; you need to provide more in-depth explanations to help the audience understand the content better, particularly on pages 7 to 12. On page 11, you should provide more explanation about the metrics and illustrations below "categorical sampling." You need to go beyond what's in the script and offer additional insights to enhance the audience's understanding. The same goes for page 12. Reading the long script verbatim is challenging for both me and the audience. Instead of copying text directly from the paper, you should distill the content into your own words and present it in bullet points.
F3: The presentation lacks appropriate visual aids. Concepts like "end-to-end manner," "two stage," "discrete," and "continuous" could be explained much more effectively using better visualization materials.
F4: The explanation of the architecture and codebook is too simplistic. You need to explain page 6 more thoroughly by providing a deeper theoretical background.

[Question]
Q1: How is the codebook for motions constructed?
Q2: What is the principle behind codebook sampling? How is it performed?
Q3: What does the visualized codebook on page 22 represent? What does it mean to sample multiple times on page 25? What does color mean?
Q4: How many motions can be distinguished simultaneously?
@황주영-j7y · 3 months ago
Thank you for the interesting seminar. 1) Does the model always produce the same output for the same input? 2) Does the motion also reflect the user's speed of movement? I would like to know whether a time attribute is included.
@포로뤼 · 2 months ago
1. No, different motions can be generated even with the same input. Due to the nature of the 3-point input, various lower-body motions are possible for the same upper-body motion. 2. The current state of the character is used as input, which includes velocity (i.e., it includes temporal attributes). The full-body motion is generated according to the speed at which the 3 points move.
@박주현-123 · 3 months ago
Thank you for your presentation. In slide 6, how is the motion represented as data to be passed to the encoder, and how is the encoder designed? Does it use convolutional layers like image encoders?
@포로뤼 · 2 months ago
Around 10:00 in the video, the input X = (S_i, C_f) is passed to the encoder. Both the encoder and decoder use two fully connected hidden layers of size 1024.
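For a concrete picture, here is a minimal PyTorch sketch of that shape. The input/latent dimensions and the ELU activation are placeholders I chose for illustration; the paper only specifies the two fully connected hidden layers of size 1024. Note that, per the question above, there are no convolutional layers: the motion features are flattened into a single vector.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=1024):
        super().__init__()
        # Two fully connected hidden layers of size 1024, as stated in the paper;
        # the ELU activation is my assumption, not confirmed by the paper.
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# x is the flattened (S_i, C_f) feature vector, so no convolutions are involved.
encoder = MLP(in_dim=512, out_dim=128)   # dimensions are illustrative placeholders
z = encoder(torch.randn(1, 512))
```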
@홍성은-iiixr · 3 months ago
Thank you for your presentation. In the codebook matching part, the MSE function is used to match the codebooks, including Z_x and Z_y. Is there any research on replacing MSE with other functions such as KL divergence or MAE? Thanks~
@포로뤼 · 2 months ago
Since the technique of codebook matching is introduced for the first time in this study, there is no prior research comparing it with other functions. In this paper as well, no experiments were conducted comparing codebook matching with functions other than MSE.
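To make the question concrete: the matching term is a plain MSE between the two latents, so swapping in MAE or a KL term would be a small change. A minimal sketch with hypothetical shapes; the stop-gradient on Z_y is my assumption, not something the paper confirms:

```python
import torch
import torch.nn.functional as F

# Hypothetical latents: z_x from the 3-point-input branch,
# z_y from the full-body-motion branch; batch 32, latent dim 128 (illustrative).
z_x = torch.randn(32, 128, requires_grad=True)
z_y = torch.randn(32, 128)

# MSE pulls z_x toward z_y. Replacing it with MAE (F.l1_loss) or a KL term
# would be a one-line change; per the reply above, this was not explored.
matching_loss = F.mse_loss(z_x, z_y.detach())  # stop-gradient on z_y is an assumption
matching_loss.backward()
```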
@SeungWonSeo-q3s · 3 months ago
Thank you for your presentation. I have a question regarding the continuous space. Why does a continuous latent space allow for incorrect interpolation with unseen input vectors? As far as I know, many existing methods generate motion in a continuous latent space. What differences between motion interpolation and generation cause this issue?
@포로뤼 · 2 months ago
A continuous space can be particularly problematic for the problem addressed by this paper, because with 3-point tracking the same input can correspond to multiple valid outputs. In such a scenario, interpolating in the latent space increases the likelihood of generating incorrect intermediate states between different motions.
@정승재_teclados078 · 3 months ago
Thank you for the presentation. I have two questions. First, what kind of motion-sequence data was used to train the codebook matching? Basic motions like walking or arm swinging seem easy to predict from 3-point input, but I wonder whether the network architecture can learn long sequences or features with diverse variations for more complex tasks. Secondly, I believe that even with different movements, there could be motion sets where the 3-point motion sequence matches to some extent. Especially for the lower body, there will definitely be data where the posture differs even though the 3-point motion sequence is the same. Was this simply left as a limitation, or were experiments conducted on this?
@포로뤼 · 2 months ago
1. According to the authors, they used about 3 hours of motion data involving various locomotion and interaction behaviors, but they didn't specify detailed categories of actions. For complex motions, the upper body is generated to closely follow the given 3-point input, but the lower-body motion may differ from reality. Also, instead of a variable-length motion sequence being used as input, an autoregressive approach is employed: at each frame, a fixed number of past frames is fed in to predict the current frame's motion, so the sequence length doesn't matter. 2. As you mentioned, even if the 3-point input is the same, there can be many cases where the actual pose differs, leading to significant discrepancies between the real lower body and the generated lower body. This limitation arises because the 3-point input lacks sufficient information about the lower body. To address this issue, there is ongoing research aimed at using additional signals (such as IMU sensors or HMD camera images) to obtain lower-body information.
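To illustrate point 1, here is a minimal sketch of a fixed-window autoregressive rollout. The window size, feature size, and stand-in network are placeholders I chose, not the paper's actual values:

```python
import torch
import torch.nn as nn

WINDOW = 60   # fixed number of past frames per step (illustrative)
FEAT = 32     # per-frame feature size (illustrative)

model = nn.Linear(WINDOW * FEAT + FEAT, FEAT)  # stand-in for the real network
history = [torch.zeros(FEAT) for _ in range(WINDOW)]

def step(three_point_frame):
    # The context is always exactly WINDOW past frames plus the current
    # 3-point frame, so the total clip length never matters.
    context = torch.cat(history + [three_point_frame])
    pose = model(context)          # predicted full-body pose for this frame
    history.pop(0)
    history.append(pose.detach())  # feed the prediction back in (autoregressive)
    return pose

pose = step(torch.randn(FEAT))
```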
@misong-kim · 3 months ago
Thank you for the presentation. In this paper, I am curious whether the model was trained using GT data for the relationship between head and hand inputs and the resulting natural animation. If so, could you explain how this GT data was collected?
@포로뤼 · 2 months ago
Yes, the model was trained using GT data, which consists of standard motion data. This dataset includes about 3 hours of locomotion and interaction behaviors collected by the authors. In this motion data, the transformations of the head and both hand joints are used as the 3-point input.
@LRH_iiixrlab · 3 months ago
I appreciate your insightful presentation. Could you please explain the process of combining predicted future trajectories with additional control signals in more detail? Specifically, how does your model integrate these control signals, such as those from joystick or button inputs, with the predicted motion trajectories to achieve a coherent and responsive character control?
@포로뤼 · 2 months ago
1. When joystick or controller button inputs are received, the root transformation of the character, R_F, is predicted. This is computed similarly to "DeepPhase: Periodic Autoencoders for Learning Motion Phase Manifolds," by smoothly extrapolating the avatar's current root toward the desired target transformation. 2. The predicted R_F is then combined with the previously computed trajectory T_F+ via matrix multiplication.
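A minimal sketch of step 2, assuming 4x4 homogeneous transforms (the exact representation used in the paper may differ):

```python
import torch

# Hypothetical 4x4 homogeneous transforms; names follow the reply above.
R_F = torch.eye(4)                             # predicted future root (world space)
traj_local = [torch.eye(4) for _ in range(6)]  # T_F+ samples, relative to the root

# One matrix multiplication per point expresses the trajectory in world space.
traj_world = [R_F @ t for t in traj_local]
```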
@tjswodud-c6c · 3 months ago
Thank you for your kind and great presentation. I have two questions regarding categorical codebook matching: 1) Is there a particular reason for sampling only one code vector from Z_y during training, as opposed to sampling K codes from X during testing? (If it isn't explained in the paper, please share your thoughts.) 2) Also, is K, the number of codes to sample, a hyperparameter that can be set? I was wondering if there is an optimal value of K that the authors suggest. Thank you.
@포로뤼 · 2 months ago
1. Although it's not explained in the paper, I think sampling K values likely doesn't aid training and only increases training time due to multiple sampling runs. The goal of training is to structure the codebook and train the decoder to accurately reconstruct the output motion, while ensuring that Z_x closely follows Z_y. Since sampling K values doesn't seem to contribute to these goals, it was probably not used. 2. Yes, K is a hyperparameter. The authors found that a K value between 10 and 20 worked best.
@김병민_dm · 3 months ago
Thank you for your seminar. I have a question: in categorical sampling, what is the "most suitable" one, and how is it selected?
@포로뤼 · 2 months ago
The distance between each pair of corresponding joints is calculated across all sampled motion candidates and the current pose, and the motion with the shortest distance is selected as the most suitable one.
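A minimal sketch of the whole sampling-and-selection step, with a stand-in decoder and hypothetical shapes (128 codes, 22 joints); only the K range of 10-20 comes from the paper:

```python
import torch

K = 15                                               # authors found K in [10, 20] works best
logits = torch.randn(128)                            # predicted scores over 128 codes (illustrative)
probs = torch.softmax(logits, dim=0)
idx = torch.multinomial(probs, K, replacement=True)  # sample K code indices

current_pose = torch.randn(22, 3)                    # 22 joints x xyz (illustrative)

def decode(i):
    # Stand-in for the real decoder: maps a code index to a candidate pose.
    torch.manual_seed(int(i))
    return torch.randn(22, 3)

candidates = [decode(i) for i in idx]

# "Most suitable" = the candidate with the smallest summed per-joint
# distance to the current pose.
dists = torch.stack([(c - current_pose).norm(dim=-1).sum() for c in candidates])
best = candidates[int(dists.argmin())]
```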
@critbear · 3 months ago
The original VQ-VAE suffers from codebook collapse. Did this study employ any training strategies to prevent this issue?
@포로뤼 · 2 months ago
If a two-stage approach is used, in the first stage the VQ-VAE model reconstructs full-body motion data and forms the latent space. However, since the 3-point input is not considered in this process, only a portion of the latent space may be predominantly used in the second stage. This can lead to codebook collapse, where only certain codes are used excessively. In contrast, the end-to-end learning approach proposed in this paper lets the model see, during training, both the input and output information used at actual inference time, enabling a more appropriate latent space. While this may not completely prevent codebook collapse, it helps mitigate the issue.
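As a side note, not from the paper: a standard diagnostic for collapse is to count how often each code is selected and compute the usage perplexity. A minimal sketch:

```python
import torch

num_codes = 128
assignments = torch.randint(0, 5, (10_000,))  # a collapsed model: only 5 codes ever used
usage = torch.bincount(assignments, minlength=num_codes).float() / 10_000
nonzero = usage[usage > 0]
perplexity = torch.exp(-(nonzero * nonzero.log()).sum())
print(perplexity)  # close to num_codes = healthy usage; far below = collapse
```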
@한동현Han · 3 months ago
How much of the range of arbitrary motion can be covered? It seems like it could handle the problem of arbitrary upper-body motion, but what about leg motions such as kicking, dancing, and crossing?
@포로뤼 · 2 months ago
I'm not sure exactly how much can be covered. This likely depends on the range of motions in the dataset used during training. As you mentioned, upper-body movements can be generated well. For the lower body, the most likely motion is generated based on the given 3-point motion and the previous pose. Therefore, if the 3-point motion has distinct characteristics, the matching lower body will be generated accurately; if not, there is a higher chance that the generated lower-body motion differs from the actual one.
@노성래99 · 3 months ago
Thank you for the presentation. I have two questions. 1. I'm a little confused about the purpose of the motion-selection process. What is the exact difference between predicting a motion directly and predicting the probability of a motion? To me, choosing the highest logit value among motion categories after the final layer of the model seems like a trivial approach. Is it possible to make predictions with neural networks without modeling probabilities (regardless of domain-specific problems such as ambiguity in the motion-generation process)? 2. Since error accumulation in autoregressive generation is independent of latent-space modeling (whether continuous or discrete), what are the main properties of a discrete latent space that 'mitigate' the distribution-shift problem caused by error accumulation? I think arguing for 'quantization' solely on this basis is not sufficient.
@포로뤼 · 2 months ago
1. A simple feed-forward network, CNN, RNN, or similar network that directly maps a control input X to a motion Y would fall into this category. The models used in the authors' previous works, such as "Phase-Functioned Neural Networks for Character Control (PFNN)" and "Neural State Machine for Character-Scene Interactions (NSM)," are examples of this. 2. In a discrete latent space, inputs are mapped to predefined codebook codes, so even if errors occur, the selected code remains within the codebook. This effectively normalizes the range of possible latent vectors, mitigating the large deviations that can occur in a continuous space.
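To illustrate point 2 with the classic VQ-VAE mechanism (note: this paper samples codes categorically rather than using nearest-neighbour lookup, but the snapping effect is the same idea); all shapes are illustrative:

```python
import torch

codebook = torch.randn(128, 64)  # 128 learned codes of dimension 64 (illustrative)
z = torch.randn(64)              # a latent that has drifted due to accumulated error

# Nearest-neighbour lookup snaps the drifted latent back onto a trained code,
# so the decoder never receives a vector outside the trained set.
idx = torch.cdist(z.unsqueeze(0), codebook).argmin()
z_q = codebook[idx]
```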
@RounLee0927 · 3 months ago
How do they generate the finger motion? Is it just 3-point tracking, or are voxel sensors or trigger values also used?
@포로뤼 · 2 months ago
The detailed method for generating finger movements is not described in the paper, so it is difficult to explain. For interactions such as petting a rabbit, voxel sensors were used to detect nearby colliders and train the model to generate collision-avoiding motions. There was no mention of using trigger values.