[Seminar] How can crowd animation be created driven by a large language model (LLM)?

115 views

강형엽 IIIXR LAB
A day ago

Speaker: Juyeong Hwang(dudyyyy4@khu.ac.kr, github.com/Juy...)
Target paper: Text-Guided Synthesis of Crowd Animation, 2024, ACM SIGGRAPH 2024 Conference Papers
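As background for the discussion below, the pipeline the commenters describe (text prompt → CLIP embedding → diffusion-denoised start/goal map and velocity field → agent simulation) ends with agents being moved along a dense, image-like velocity field. A minimal sketch of that last step, where the function name, the nearest-cell lookup, and all parameters are assumptions of this sketch rather than the paper's actual implementation:

```python
import numpy as np

def advect_agents(positions, velocity_field, dt=0.1, steps=50):
    """Advect agents along a dense 2D velocity field of shape (H, W, 2),
    the image-like representation discussed in the comments below.
    Nearest-cell lookup is used here; the paper may interpolate."""
    H, W, _ = velocity_field.shape
    traj = [positions.copy()]
    for _ in range(steps):
        # Look up the velocity stored in each agent's current grid cell.
        idx = np.clip(positions.astype(int), 0, [H - 1, W - 1])
        v = velocity_field[idx[:, 0], idx[:, 1]]
        positions = positions + dt * v
        traj.append(positions.copy())
    return np.stack(traj)

# Toy field: uniform flow along the column axis.
field = np.zeros((64, 64, 2))
field[..., 1] = 1.0
start = np.array([[32.0, 0.0], [16.0, 0.0]])
paths = advect_agents(start, field)
```

With the uniform unit field above, each agent drifts steadily along the column axis while its row stays fixed.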

Пікірлер: 13
@LRH_iiixrlab · a day ago
Thank you for the presentation. What specific strategies were used to ensure that the diffusion model effectively captures both spatial relationships and semantic meaning from the input text? Given that the diffusion model is central to generating the start and goal points based on the input text, what measures were taken to ensure that it accurately reflects complex spatial relationships (like crossing intersections or entering specific areas) and also maintains a coherent connection to the semantic meaning of the instructions?
@한동현Han · a day ago
Thank you for the presentation. I have two questions. Q1: Why do we have to use a diffusion model? What are the specific advantages of using it, and what are the alternatives? Q2: How can we train the policy to generate the velocity map? The input, output, model architecture, and reward function were not provided in the presentation. Could you briefly explain them again?
@노성래99 · a day ago
Thank you for the presentation. Q1: I believe it is important that the individual velocity vectors are reproduced properly. Denoising does not seem easy, since each vector occupies only a small area of the image. What special methods did the authors use for this? I'm curious about their insights. Q2: I think the start and goal could be specified by an LLM rather than by the Start/Goal diffusion model. What do you think about this? What is the difference between using an LLM and the method presented in the paper?
@박주현-123 · a day ago
Thank you for the presentation. Q1: It appears that the input mainly focuses on describing the crowd's movement flow. Could you explain how the visual aspects of the maps and individual characters are designed? Q2: Is the CLIP model only used for embedding text prompts and not for images? If so, I wonder why they selected the CLIP model for embedding texts, when there are other models designed specifically for text embeddings.
@정승재_teclados078 · a day ago
Thank you for your presentation. Q1. Since the paper generates simulations through various rules and settings due to the lack of real crowd simulation data, I am curious how the authors evaluate whether the simulated data aligns with actual crowd behavior in diverse scenarios. As the simulated data is created according to the authors' intentions, shouldn't the data used for training also be evaluated separately? Q2. The paper sets up the crowd simulation with generic agent groups and their environment. However, there could be groups with distinct characteristics, such as idol fans, journalists, or holiday travelers, as well as groups that wander leisurely instead of following the fastest path. Are there any studies related to these types of crowd behaviors?
@misong-kim · a day ago
Thank you for your presentation. Q1. Is there a specific reason why the CLIP model was used in the paper? I’m curious if the CLIP model is also suitable for describing crowd movements. Q2. I’m wondering if the image-based representation is commonly used in crowd simulation. If there are other methods, what advantages does this approach have compared to them?
@포로뤼 · a day ago
Thank you for the presentation. It seems that velocity maps are generated for each group after splitting the entire sentence into groups. In this case, isn't there a possibility of collisions between the groups? I am curious if there is any specific handling for such cases.
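The inter-group collision concern raised above is often handled with local avoidance on top of the global flow. A toy sketch of one generic approach (a social-force-style pairwise repulsion step), which is not necessarily what the paper does and whose names and constants are all hypothetical:

```python
import numpy as np

def repulsion_step(positions, dt=0.1, radius=1.0, strength=0.5):
    """One step of simple pairwise repulsion between agents: agents
    closer than `radius` push each other apart, with force growing
    as they get closer. A generic local-avoidance sketch only."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # no self-repulsion
    close = dist < radius
    push = np.where(close[..., None], diff / (dist[..., None] ** 2 + 1e-8), 0.0)
    return positions + dt * strength * push.sum(axis=1)

# Two agents that start too close drift apart after one step.
two = np.array([[0.0, 0.0], [0.3, 0.0]])
moved = repulsion_step(two)
```

In a full simulator a step like this would typically be blended with the velocity sampled from the generated field, so the global flow and local avoidance act together.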
@SeungWonSeo-q3s · 3 days ago
I enjoyed the presentation. I have a few questions, so I am leaving a comment here. Q1. How can the Denoised Velocity Field and Denoised Start/Goal Map be applied in a simulation environment? I'm unsure how these transition into the simulation stage. Q2. Is the velocity field typically represented using images in other fields, as it was in this paper? Or are there other representation methods? If so, were any experiments conducted on these alternative methods?
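On Q1, one plausible decoding of a denoised start/goal map into concrete spawn positions is to threshold the map and sample cells proportionally to its intensity. This is a guess at one reasonable decoding, not the paper's stated method; the function and parameter names are hypothetical:

```python
import numpy as np

def sample_starts(goal_map, n_agents, threshold=0.5, rng=None):
    """Turn a denoised start/goal map of shape (H, W) with values
    in [0, 1] into discrete agent spawn cells: keep cells above
    `threshold`, then sample proportionally to their intensity."""
    rng = np.random.default_rng(rng)
    mask = goal_map >= threshold
    cells = np.argwhere(mask)
    if len(cells) == 0:
        raise ValueError("no region above threshold")
    weights = goal_map[mask]
    weights = weights / weights.sum()
    picks = rng.choice(len(cells), size=n_agents, p=weights)
    return cells[picks]

# Toy map with a hot region in the top-left corner.
m = np.zeros((32, 32))
m[:8, :8] = 1.0
spawns = sample_starts(m, n_agents=10, rng=0)
```

All sampled spawn cells then fall inside the hot region, which is the behavior one would want from such a decoding.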
@홍성은-iiixr · a day ago
Thank you for your presentation. I have two questions. 1. At 7:53, why is the last constraint necessary? 2. When generating sentences, are they always generated in the same way? When the user directly provides text input, can unfamiliar word combinations that did not appear during training still be handled well?
@critbear · a day ago
Crowd simulation is such an interesting field. However, it seems difficult to evaluate which study is better. Are the ATD, SSR, and RSR metrics used in the quantitative analysis common evaluation methods in crowd simulation studies? If so, can you explain why each is used in the evaluation?
@RounLee0927 · 17 hours ago
Thank you for your seminar. I have two questions. Q1. Considering crowds in games or VR content, it is possible to interact with individual characters. Can this approach also reflect this real-time interactivity? Q2. In a crowd, aside from just movement, people might engage in various behaviors such as waiting or walking while talking with a friend. Can this approach take such diverse actions into account?
@tjswodud-c6c · 6 days ago
Thank you for your presentation. I have two questions: 1. In the figure on p.17, I don't clearly understand how the starting points of the different agent groups are represented: do the green and red squares correspond to the starting points? 2. On the same page, is there any particular reason why the agent groups have to follow the underlined constraints? I would appreciate a more detailed explanation of this. Thank you.
@IIIXRLab · a day ago
[Overall] The use of animations, masking, and highlighting to guide the flow of the presentation and keep the audience engaged has improved significantly compared to previous presentations. However, the slides noticeably lack formality.
[Feedback] F1: Overall, the formatting of the slides lacks professionalism. It is crucial to practice adhering to basic rules of spacing, indentation, and bullet usage.
Slide 4: Inconsistent spacing between text and bullets
Slide 8: Uneven spacing between bullets
Slides 15-18: Inconsistent spacing after bullets
Slide 20: Incorrect spacing after bullets
After slide 23: Ongoing issues with spacing after bullets
These formatting issues can undermine the presentation's credibility and professionalism, so it is important to correct them.
[Question] The current method of feeding "A large group enters from the top left entrance..." into CLIP and then conditioning the diffusion model on it seems problematic; CLIP does not appear well suited to this task. It raises the suspicion that the velocity diffusion model merely uses the text as a prior to determine direction and speed, relying on collision avoidance or velocity-field management for the rest. This makes it seem unlikely that complex trajectories or sequential behaviors can be generated. The system appears to suggest that crowd simulation is controlled via text, but in reality the LLM acts more like a parser, providing simple rules for a few groups whose movements are then simply replayed. Hence, I have the following questions:
Q1: Beyond simple movement patterns, can the system display sequential, distinct behaviors? For example, could a scenario be created where a group circles a fountain twice, then sits on benches, and repeats this behavior?
Q2: Does this paper claim that using an LLM is a contribution of the study? If so, what is that contribution?
Q3: Is there any performance evaluation, or any mention of limitations, of using CLIP in this work?
Q4: Is this paper simply replaying predefined animations rather than handling agents that dynamically interact within a scene? The focus seems to be on exiting rather than on generating complex, continuous behavior for the agents.
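The "LLM acts more like a parser" point above can be made concrete with a toy sketch: reducing a crowd prompt to simple per-group rules with a regular expression. This is purely illustrative of the commenter's claim; the paper uses learned models, not pattern matching, and every name here is hypothetical:

```python
import re

def parse_prompt(prompt):
    """Toy illustration of the 'LLM as parser' view: reduce a crowd
    prompt to (size, entry) rules per group. Hypothetical sketch only;
    the paper does not parse text with regular expressions."""
    pattern = (r"(?P<size>large|small) group enters from the "
               r"(?P<entry>[\w\s]+?) entrance")
    return [m.groupdict() for m in re.finditer(pattern, prompt)]

rules = parse_prompt("A large group enters from the top left entrance.")
```

If the system's text understanding really bottoms out in rules of roughly this shape, then behaviors outside the rule vocabulary (looping, sitting, repeating) would indeed be out of reach, which is what Q1 probes.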