Comments
@LRH_iiixrlab · 1 day ago
Thank you for the presentation. What specific strategies were used to ensure that the diffusion model effectively captures both spatial relationships and semantic meaning from the input text? Given that the diffusion model is central to generating the start and goal points based on the input text, what measures were taken to ensure that it accurately reflects complex spatial relationships (like crossing intersections or entering specific areas) and also maintains a coherent connection to the semantic meaning of the instructions?
@노성래99 · 1 day ago
Thank you for the presentation. Q1: I believe it is important that the individual velocity vectors are rendered properly, but denoising seems difficult because each vector occupies only a small region of the image. What special methods did the authors use to handle this? I'm curious about their insights. Q2: It seems the start and goal could also be specified by an LLM rather than by the Start/Goal Diffusion model. What do you think about this? What is the difference between using an LLM and the method presented in the paper?
@정승재_teclados078 · 1 day ago
Thank you for your presentation. Q1. Since the paper generates simulations through various rules and settings due to the lack of real crowd simulation data, I am curious about how the authors evaluate whether the simulated data aligns with actual crowd behavior in diverse scenarios. As the simulated data itself is created according to the authors' intentions, shouldn't the data used for training also be evaluated separately? Q2. The paper sets the crowd simulation with generic agent groups and their environment. However, there could be groups with distinct characteristics, such as idol fans, journalists, or holiday travelers, as well as random groups that wander leisurely instead of following the fastest path. Are there any studies related to these types of crowd behaviors?
@misong-kim · 1 day ago
Thank you for your presentation. Q1. Is there a specific reason why the CLIP model was used in the paper? I’m curious if the CLIP model is also suitable for describing crowd movements. Q2. I’m wondering if the image-based representation is commonly used in crowd simulation. If there are other methods, what advantages does this approach have compared to them?
@홍성은-iiixr · 1 day ago
Thank you for your presentation. I have two questions. 1. At 7:53, why is the last constraint necessary? 2. When generating sentences, are they generated in the same way? When a user directly provides text input, will unfamiliar word combinations that did not appear in training be handled properly?
@IIIXRLab · 1 day ago
[Overall] The use of animations, masking, and highlighting to guide the flow of the presentation and keep the audience engaged has improved significantly compared to previous presentations. However, the presentation slides noticeably lack formality.

[Feedback]
F1: Overall, the formatting of the slides lacks professionalism. It is crucial to practice adhering to basic rules of spacing, indentation, and bullet usage.
Slide 4: inconsistent spacing between text and bullets
Slide 8: uneven spacing between bullets
Slides 15-18: inconsistent spacing after bullets
Slide 20: incorrect spacing after bullets
After slide 23: ongoing issues with spacing after bullets
These formatting issues can undermine the presentation's credibility and professionalism, so it is important to correct them.

[Question] The current method of feeding "A large group enters from the top left entrance..." into CLIP and then using that embedding with the diffusion model seems problematic; CLIP does not seem well suited for this task. It raises the suspicion that the velocity diffusion model merely uses text as a prior to determine direction and speed, while relying on collision avoidance or velocity-field management for the rest. This makes it seem unlikely that complex trajectories or sequential behaviors can be generated. The system appears to suggest that the crowd simulation is controlled via text, but in reality the LLM acts more like a parser, providing simple rules for a few groups, and those movements are then simply replayed. Hence, I have the following questions:
Q1: Beyond simple movement patterns, can the system display sequential but distinct behaviors? For example, could a scenario be created where a group circles a fountain twice, then sits on benches, and repeats this behavior?
Q2: Does this paper claim that using an LLM is a contribution of the study? If so, what is that contribution?
Q3: Is there any performance evaluation, or any mention of limitations, of using CLIP in this work?
Q4: Is this paper simply replaying predefined animations, rather than handling agents that dynamically interact within a scene? It seems the focus is on exiting, rather than on generating complex, continuous behavior for the agents.
@critbear · 1 day ago
Crowd simulation is such an interesting field. However, it seems difficult to evaluate which study is better. Are ATD, SSR, and RSR, used in the quantitative analysis, common evaluation methods in crowd simulation studies? If so, can you explain why each is used in the evaluation?
@포로뤼 · 1 day ago
Thank you for the presentation. It seems that velocity maps are generated for each group after splitting the entire sentence into groups. In this case, isn't there a possibility of collisions between the groups? I am curious if there is any specific handling for such cases.
@박주현-123 · 1 day ago
Thank you for the presentation. Q1: It appears that the input mainly focuses on describing the crowd's movement flow. Could you explain how the visual aspects of the maps and individual characters are designed? Q2: Is the CLIP model only used for embedding text prompts and not for images? If so, I wonder why they selected the CLIP model for embedding texts, when there exist other models that are designed specifically for text embeddings.
@한동현Han · 1 day ago
Thank you for the presentation. I have two questions. Q1: Why do we have to use a diffusion model? What are the specific advantages of using it, and what are the alternatives? Q2: How can we train the policy to generate the velocity map? The input, output, model architecture, and reward function were not provided in the presentation. Could you briefly explain them again?
@SeungWonSeo-q3s · 3 days ago
The presentation was well done. I have a few questions and am leaving a comment here. Q1. How can the Denoised Velocity Field and Denoised Start/Goal Map be applied in a simulation environment? I'm unsure how these can be transitioned into the simulation stage. Q2. Is the velocity field typically represented using images in other fields, as it was in this paper? Or are there other representation methods? If so, were there any experiments conducted on these alternative methods?
@김병민_dm · 6 days ago
Thank you for your seminar. I have two questions below: Q1. Is there a clear standard for dividing body parts in the part-based method, or is it expressed differently in each paper? If it differs from paper to paper, how do you think performance might vary depending on how the parts are divided? Q2. Aside from performance, are there any other advantages of dividing body parts compared to methods that do not divide them?
@박주현-123 · 6 days ago
Thank you for your presentation. Q1. How are the "parts" of a body given as input to an encoder? Specifically, what is the format of the input? Q2. How are the parts of the body defined? Are they defined manually by the users?
@tjswodud-c6c · 6 days ago
Thank you for your presentation. I have two questions: 1. In the figure on p.17, I don't clearly understand how the starting points of the different agent groups are represented: do the green and red squares correspond to the starting points? 2. On the same page, is there any particular reason why the agent groups have to follow the underlined constraints? I would appreciate a more detailed explanation on this. Thank you.
@노성래99 · 7 days ago
Thank you for the presentation. I have two questions. 1) How does the part coordination block actually work? My understanding is that it combines a transformer architecture with RNN-style autoregressive processing over body parts. 2) I agree with the assumption that part-wise generation of character animation yields better details, but does it still work well with ambiguous input text (e.g., without a detailed description of body parts)?
@RounLee0927 · 7 days ago
I have two questions. Q1. I know data is very important in text-to-motion generation, but is it good at generating motion that isn't in the data? I wonder if the paper takes this into account. Q2. I'm wondering how the motions created by the parts are connected and made into a single motion.
@포로뤼 · 8 days ago
Thank you for your presentation. I have two questions. First, have there been any experiments where the body is divided differently from the six parts you mentioned? I'm curious if the division into six parts is the best approach. Second, it seems that the encoder and decoder are divided based on these six parts. However, I didn't quite understand how information is shared between them. Could you explain this part again and clarify it further?
@misong-kim · 8 days ago
Thank you for your presentation. Q1: In the ParCo model, it is explained that the body is divided into a total of six parts to generate motion. I'm curious why these specific six parts were chosen to divide the body, and what criteria or reasons were behind this decision. Additionally, I'd like to know what trade-offs exist between dividing the body into smaller, more detailed units versus larger units. Q2: When using a method like ParCo that generates motion by dividing the body into parts, I'm wondering whether it's possible to input specific textual descriptions for each part and have that more accurately reflected in the motion. In other words, does this approach help to implement detailed instructions for each body part more effectively?
@LRH_iiixrlab · 8 days ago
Thank you for your seminar. I have one question: What specific techniques or strategies does Parco utilize to avoid overfitting to certain motion patterns or repetitive text descriptions during the training process, especially when dealing with limited or biased datasets?
@황주영-j7y · 8 days ago
Thank you for your seminar. 1. I think using six models will increase the number of model parameters; what do you think? 2. There seems to be no explanation of the loss terms: do you use a cross-entropy loss on the tokens or an MSE loss on the finished motion?
@IIIXRLab · 8 days ago
[Overall] The overall flow and structure were easy to understand. However, the presentation lacked details in the implementation and architecture explanation. It would be more meaningful for the audience if there were more explanations about the types of input and output data and the data flow illustrated in the architecture diagrams. At present, it feels more like reading a blog that briefly introduces a new paper rather than attending a seminar.

[Feedback]
F1: Slide 9 contains too much information, and without proper guidance, it is difficult to understand as the explanations are only verbal. Why not use appropriate animations and annotations to aid comprehension?
F2: For slide 16, it would be better to cut and show only a few parts or use strategies that display the images more prominently. Currently, the images are hard to see, and there is a lot of information in the illustrations, but there isn't enough time taken to explain them thoroughly.
F3: Regarding the "Three main architectures," does this refer to the VAE, Diffusion, and VQ-VAE architectures used in previous research? If so, why does the top title change from "Previous Works" to "What is text-to-motion generation?" The content of the presentation seems confusing.
F4: As I pointed out to other presenters earlier, there is poor control over the amount of information provided on each slide. When there is a lot of information, sufficient explanations and additional guidance and annotations should be provided. There is no consideration of how the audience's gaze should move or the order in which understanding should occur.

[Questions]
Q1: What information did the Parco authors use to render the stick man? Joint information? How did they determine the joint configuration?
Q2: When are "part-aware motion discretization" and "text-driven part coordinating synthesis" used? What inputs and outputs are involved, and what do the numbers in the diagrams represent?
Q3: Where is the "Part-Coordinated Transformer" used? Is it an expanded version of the transformer at the top of the "text-driven part coordinating synthesis"?
Q4: What methods are used to maintain coherence between separately generated motions in Parco? I am curious about the specific training methods, inputs, outputs, and effects.
@critbear · 8 days ago
I am curious about how the features processed for each part come together to form one natural motion. Text and motion do not have a one-to-one correspondence, so a single text can correspond to multiple motions. In that case, couldn't each part end up expressing a different motion?
@SeungWonSeo-q3s · 10 days ago
Thank you for the excellent presentation. Q1. I am curious about how detailed the division of body parts should be. In the paper, it appears that the body is divided into six parts. However, I believe that the leg could be further divided into the upper and lower parts, such as the thigh and calf. While it seems possible to categorize body parts in more detail, I didn’t see any mention of this in the paper. Do you have any thoughts on this, or was it addressed in the research? Q2. I have a question about the MM-Dist metric. Could you explain in more detail how it is measured? I am particularly interested in how this metric evaluates whether the movement of each body part is appropriate when the name of that body part is mentioned in the text. How does this evaluation work?
@tjswodud-c6c · 10 days ago
Thank you for your good presentation. I'm doing research on a similar topic, so your presentation was very helpful for me. I have two questions: 1. In the figures (p.3 and p.5) describing the previous works, it seems like "text" should go into the input of the encoder since it's a text-to-motion task, but in the figures it seems like only "motion sequences" go into the input. Is the figure drawn incorrectly, or is it correct that the motion sequence goes in as input? I would appreciate a little more clarification. 2. In the process of breaking down the text into individual body parts through text-driven part coordinating synthesis, could you please explain in detail how the whole text is broken down into text for each body part? Does it focus on specific words in the input text that describe the body part (e.g. "left hand", "right foot", etc.)? Thank you.
@한동현Han · 11 days ago
Thank you for the presentation. I have two questions: Q1: I don’t understand how the Coordinate Networks work for synchronizing the separated generators. Where is the loss function? I can't understand how this model will be optimized according to the author's design intention. Q2: How is text translated into animation in this architecture? Why do we need to use the Encoder/Decoder architecture from this perspective?
@홍성은-iiixr · 10 days ago
A1. There is no dedicated loss for synchronization; only the reconstruction loss of the VQ-VAE and the cross-entropy loss between text tokens and motion tokens for text-to-motion generation are used in this model. A2. This is a two-stage model. The VQ-VAE is used to tokenize motions, and CLIP is used to embed the text. Motion tokens are then generated in an autoregressive manner.
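To make the two losses in A1 concrete, here is a minimal sketch in the spirit of a two-stage VQ-VAE text-to-motion pipeline. All module names, tensor shapes, and loss weights are illustrative assumptions, not ParCo's actual implementation.

```python
import torch
import torch.nn.functional as F

# Stage 1: VQ-VAE training -- reconstruct the motion; no synchronization loss.
# `codebook` is assumed to be an nn.Embedding holding the discrete motion codes.
def vqvae_loss(motion, encoder, decoder, codebook, beta=0.25):
    z = encoder(motion)                              # (T, D) continuous latents
    dist = torch.cdist(z, codebook.weight)           # distance to every codebook vector
    idx = dist.argmin(dim=-1)                        # discrete motion tokens
    z_q = codebook(idx)                              # quantized latents
    z_q_st = z + (z_q - z).detach()                  # straight-through estimator
    recon = decoder(z_q_st)
    return (F.mse_loss(recon, motion)                # reconstruction term
            + F.mse_loss(z_q, z.detach())            # codebook term
            + beta * F.mse_loss(z, z_q.detach()))    # commitment term

# Stage 2: autoregressive text-to-motion -- cross-entropy on next-token prediction,
# conditioned on a CLIP text embedding.
def text_to_motion_loss(text_emb, motion_tokens, transformer):
    logits = transformer(text_emb, motion_tokens[:-1])   # (T-1, vocab)
    return F.cross_entropy(logits, motion_tokens[1:])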
@정승재_teclados078 · 12 days ago
Thank you for the presentation. I have three questions. Q1. Is the model they are trying to train a skeleton model for motion? If so, is the target of learning the translation and rotation values of the joints? I'm a bit confused as the input of the encoder and output of the decoder are not clearly explained. Q2. In the part-based method, it seems like the parts being trained separately are at a coarse level, such as arms, legs, torso, and head. I don't quite understand how dividing them at this level can achieve finer-grained and more precise motion generation compared to existing methods. Wouldn't defining the parts down to the bone level, as defined by the skeleton, allow for better learning and representation of more detailed and complex motion? Q3. Could you explain any limitations mentioned in this paper, or any limitations you personally perceive in this work?
@홍성은-iiixr · 10 days ago
Thank you for your questions. A1. The model does train a skeleton-based motion model, but not directly on the raw joint values. The translation and rotation values of the joints are transformed into specific representations, including quantities such as the velocities of the part motions. A2. Good question. ParCo itself reflects this direction, moving from the earlier 2-part generation to 6-part generation, but as far as I know there is no research at the bone level. A3. Because it is an autoregressive model, inference is quite slow.
@critbear · 15 days ago
At some point, ChatGPT stopped generating false information, and I was glad to learn that it was thanks to this technology. Does this mean that with RAG, LLMs could eventually create their own wiki? Also, is there little difference in response time even when using RAG?
@SeungWonSeo-q3s · 10 days ago
A1. RAG operates by utilizing existing wikis or searching for information on the internet. Creating its own wiki is a slightly different topic. However, it is possible to store the output generated by RAG or LLM in a separate database and use it as a kind of memory. A2. The difference in response time depends on how large the document set being searched is. Since the documents are all pre-encoded, only simple vector operations consume time. With the development of various optimized retrieval systems, I believe that in many cases, the retrieval speed is relatively faster than the LLM’s generation speed.
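As a small illustration of A2, here is a sketch of retrieval over pre-encoded documents, where query time reduces to one inner product per document plus a top-k selection. The file name and array shapes are placeholders, not from the paper.

```python
import numpy as np

# Hypothetical file produced offline by the document encoder: one row per document.
doc_embs = np.load("doc_embeddings.npy")        # shape (N, d)

def retrieve(query_emb, k=5):
    """Return indices and scores of the k highest-scoring documents."""
    scores = doc_embs @ query_emb               # one inner product per document
    topk = np.argsort(-scores)[:k]              # k documents with the largest scores
    return topk, scores[topk]
```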
@홍성은-iiixr · 15 days ago
Thank you for your presentation. I have two questions. RAG requires a retrieval process, and that process is costly, so I think there are simple questions that do not need RAG at all. Is there any research on classifying such cases? Also, as far as I know, the 'Lost in the Middle' problem often occurs in RAG. How can this problem be solved?
@SeungWonSeo-q3s · 10 days ago
A1. I understand that you are asking whether there is any research on a system that can determine if a given question requires retrieval using RAG. To enable such a determination, a classification model first needs to be able to assess whether RAG-based information retrieval is necessary to answer the question. I would like to introduce research that evaluates the need for dynamic retrieval based on query complexity: Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. A2. One potential solution to the "Lost in the Middle" problem is to assign a rank to the retrieved documents. This could be an effective way to prioritize important information in the response.
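To illustrate the re-ranking idea in A2, here is one possible heuristic that places the highest-ranked passages at the edges of the context window, where the "Lost in the Middle" findings suggest they are least likely to be overlooked. This is a hypothetical mitigation sketch, not taken from the RAG paper.

```python
def reorder_for_prompt(docs_best_first):
    """Given documents sorted best-first, place the strongest ones at the
    beginning and end of the context and push the weakest toward the middle."""
    ordered = [None] * len(docs_best_first)
    left, right = 0, len(docs_best_first) - 1
    for i, doc in enumerate(docs_best_first):
        if i % 2 == 0:
            ordered[left] = doc
            left += 1
        else:
            ordered[right] = doc
            right -= 1
    return ordered
```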
@diturlr · 15 days ago
Thank you for the great presentation. I have two questions. Q1) From a theoretical point of view, referring back to the training dataset or another static database is a form of redundant referencing. Do you think that producing relevant raw data at inference time is an effective breakthrough for model performance? Q2) In generative models, almost any task can be handled through generation (for example, visual question answering for image classification). Do you think that referring to real data will help the majority of deep learning tasks, by integrating generative approaches with RAG?
@SeungWonSeo-q3s · 10 days ago
A1. I agree with the idea of redundant referencing, but I believe it has great value in providing the information that needs to be focused on. Additionally, aside from improving model performance, an RAG system that explicitly shows what information was referenced can enhance the trustworthiness of AI technology from the user's perspective. To return to the main point, generating relevant raw data tends to reduce the stress on the generative model when performing complex tasks, and I believe this characteristic can improve the quality of the generation. A2. In my personal opinion, generative models are often forced to generate various answers from inadequate input conditions. As mentioned in A1, providing additional information through a retrieval system can prevent the model from generating incorrect responses and improve the quality of the output. However, I also believe there is a possibility that it could negatively impact the diversity, which is one of the main strengths of generative models.
@박주현-123 · 15 days ago
Thank you for your presentation. 1. Do popular LLM models such as ChatGPT utilize RAG methods as well? 2. How does the inner product of the two embeddings represent the similarity of the two documents? Is it related to how the model is trained?
@SeungWonSeo-q3s · 10 days ago
A1. I understand that ChatGPT can search for information on its own by using external plugins that allow it to access websites, among other things. A2. The inner product of two embedding vectors is related to the cosine similarity in vector space. A larger inner product indicates that the vectors point in similar directions, meaning there is a higher similarity between the document and the query.
@misong-kim · 15 days ago
Thank you for your presentation. 1. Are there any methods to mitigate the issue of hallucination in RAG models? 2. Is there a problem caused by discrepancies between previous knowledge and the latest information? If so, what strategies are used to address this?
@SeungWonSeo-q3s · 10 days ago
A1. The hallucination problem is more likely to occur when relying on the parametric memory of LLMs, and RAG, which utilizes non-parametric memory from actual information, greatly helps to reduce the likelihood of hallucination. A2. To resolve the discrepancy between previous knowledge and the latest information, RAG can simply update the database or add new information.
@김병민_dm · 15 days ago
Thank you for seminar. I have a question about experiments: Through experiments based on Wikipedia data from 2016 and 2018, I understand that knowledge updates have been shown to work well. However, I am puzzled by why the accuracy of 2016 data would be lower when using a model based on 2018 information. Since Wikipedia is continuously updated and accumulative, I would expect the 2018-based model to handle 2016 data accurately. What could be the possible reasons for this discrepancy?
@SeungWonSeo-q3s · 10 days ago
A. I believe my explanation was lacking in this regard. The 2018 dataset is composed of a list of leaders from 2018, so it does not include the leaders from 2016, which is why these results occurred.
@RounLee0927 · 15 days ago
Thank you for seminar. I have two questions. Q1. How can the text that the RAG model generates reflect the latest information? I'm wondering if it can respond to information that is updated in real time. Q2. I'm wondering how it interacts with the generation model to generate text, as it seems to rely heavily on the performance of the retriever.
@SeungWonSeo-q3s · 10 days ago
A1. There are two main approaches. The first is to directly search for real-time information and update the database. The second is to allow the model to search web pages for up-to-date information. A2. The documents retrieved by the retriever are added to the generator's input prompt. In this way, the retriever and generator interact.
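A minimal sketch of the interaction described in A2, assuming the simple prompt-concatenation style of retrieval augmentation (the original RAG model instead marginalizes over documents inside the model); the prompt format is an assumption for illustration:

```python
def build_prompt(question, retrieved_docs):
    # Each retrieved passage is prepended to the question, so the generator
    # conditions on both the query and the external evidence.
    context = "\n\n".join(f"[Doc {i + 1}] {d}" for i, d in enumerate(retrieved_docs))
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```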
@황주영-j7y · 15 days ago
Thank you for introducing me to an interesting topic. I have a few questions. Q1. Can the top-k documents still be extracted even if the answer to the question is not in the knowledge base? Q2. Are the encoders on page 20 pre-trained? If so, do you know how they were trained?
@SeungWonSeo-q3s · 10 days ago
A1. Even if the relevant document is not in the external knowledge base, you can still extract K documents because it finds the top documents based on similarity. A2. RAG uses a pre-trained bi-encoder from the paper "Dense Passage Retrieval for Open-Domain Question Answering."
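For reference, the DPR bi-encoder mentioned in A2 is available as pretrained checkpoints; below is a minimal usage sketch with the Hugging Face transformers library. The example sentences are made up, and the inner-product score is only illustrative.

```python
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

# Separate encoders for questions and passages (a bi-encoder), as in the DPR paper.
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

q_emb = q_enc(**q_tok("Who wrote Hamlet?", return_tensors="pt")).pooler_output
d_emb = ctx_enc(**ctx_tok("Hamlet is a tragedy by William Shakespeare.",
                          return_tensors="pt")).pooler_output
score = (q_emb @ d_emb.T).item()   # inner-product relevance score
```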
@LRH_iiixrlab · 15 days ago
Thank you for the presentation. It seems that the specific differences between RAG and BART in reducing hallucinations weren’t fully covered in the seminar. Could you explain how RAG differs from BART in this regard? Specifically, how does the use of retrieved external knowledge in RAG contribute to reducing hallucinations?
@SeungWonSeo-q3s · 10 days ago
A1. RAG is a framework that combines a retriever and a generator, whereas BART is a language model composed solely of a generator. A2. As mentioned in A1, RAG has a retriever, allowing it to access real data and find relevant information. Since RAG references actual related information, this characteristic helps reduce the likelihood of generating incorrect answers.
@vvuonghn · 15 days ago
Hi, could you please share your slides?
@정승재_teclados078 · 17 days ago
Thank you for the presentation. I have three questions. 1. I have a question about the retrieving process. When retrieving for a single query, does the model perform an inner product with the encoded vectors of all documents in the knowledge base? Since generalization is important in NLP, the RAG model must be trained on a large number of queries. Wouldn't this significantly increase the training time? 2. I understand that the difference between RAG-Sequence and RAG-Token lies in whether the top-k documents are selected at each iteration during training. However, I'm not clear about what exactly defines an iteration. Does iteration refer to a training iteration where the forward and backward processes are completed for each batch? Or is it the iteration that happens when tokens are generated one by one in an auto-regressive manner while creating an answer? Or does iteration refer to the decoder running multiple times to generate several candidate answers? I initially tried to understand RAG-Sequence and RAG-Token by considering iteration as a general training iteration, but then I couldn't fully grasp the point made on slide 34, where it says that RAG-Sequence faces issues because it doesn’t select documents at each token. Additionally, I am confused about what the term "token" refers to in each slide - does it refer to the tokens in the documents, in the query, or in the answer? 3. On slide 34, it is mentioned that RAG-Sequence encounters issues during inference. What exactly are these issues? In the subsequent example, is the problem related to inferring a missing probability as 0 when calculating the probability of a specific answer y1 conditioned on each document? Why is inferring it as 0 problematic? Isn't the absence of answer y1 in document z2 due to the low probability of that answer in that document?
@SeungWonSeo-q3s · 10 days ago
A1. Encoding each document every time would result in an impractical training time. Therefore, in RAG, the document encoder is fixed. All the data is pre-encoded, and during training, only the inner product of the query's encoded value is computed with the pre-encoded documents. A2. Iteration refers to the process where tokens are generated one by one in an auto-regressive manner, and each iteration references a different set of k documents. RAG-Sequence determines k documents once at the start, and during the auto-regressive process, it doesn’t search for a new set of k documents. Additionally, the term "token" in each slide intentionally refers to the tokens in the answer only. If there’s any confusing part, please let me know, and I’ll provide further clarification. A3. I believe you’ve understood correctly. As you mentioned, RAG-Sequence has cases where probabilities are missing, preventing the generation of a probability distribution needed for sampling. To address this, the missing probabilities are set to 0. It’s not that setting it to 0 is the problem, but rather there is an issue that leads to setting it to 0.
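Following the presenter's description in A2, the two decoding schemes could be sketched roughly as below. `retriever` and `generator.next_token` are hypothetical interfaces, greedy decoding is assumed, and the marginalization over documents used in the actual RAG formulation is omitted for brevity.

```python
def rag_sequence_generate(query, retriever, generator, k=5, max_len=32):
    docs = retriever(query, k)                 # document set retrieved once, up front
    tokens = []
    for _ in range(max_len):
        tokens.append(generator.next_token(query, docs, tokens))
    return tokens

def rag_token_generate(query, retriever, generator, k=5, max_len=32):
    tokens = []
    for _ in range(max_len):
        docs = retriever(query, k)             # a document set referenced at every token step
        tokens.append(generator.next_token(query, docs, tokens))
    return tokens
```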
@tjswodud-c6c · 17 days ago
Thank you for your great presentation. I have two questions. 1. On p.22, I'm not sure I understand how the top-k documents can be selected by taking the inner product of the two encoded vectors; there doesn't seem to be a k parameter in the formula. Can you explain the process of selecting the top-k documents in more mathematical detail? 2. Does the document referenced by the RAG model have to contain only text? I was wondering whether it can also refer to images, videos, etc. Thank you.
@SeungWonSeo-q3s · 10 days ago
A1. Although I mentioned retrieving K documents, what actually happens is selecting the top K documents with the highest scores. In other words, the documents with the highest inner product results between the two vectors are retrieved. A2. It depends on what kind of model is used for encoding the information. If a text encoder model is used, only textual information can be encoded. However, if an image or video model is used, it can reference a wider range of information, including images or videos.
@포로뤼 · 20 days ago
Thank you for the presentation. I have two questions. 1. Are there any issues associated with RAG-token? I’m curious about the advantages and disadvantages between RAG-token and RAG-sequence. 2. Could you introduce some cases where RAG has been applied in the graphics field?
@SeungWonSeo-q3s · 10 days ago
A1. The performance difference between the two methods depends on the task. I think the reason for this performance difference is that the RAG-sequence model references relatively fewer documents, which may explain its relatively higher performance in tasks like Open-domain Question Answering that require specific gold passages. On the other hand, RAG-token is more likely to reference a larger number of documents, which could explain why it achieves relatively higher performance in tasks that require aggregating information from diverse documents. A2. For research applying RAG in the graphics field, I recommend: 3d-gpt: Procedural 3D modeling with large language models.
@한동현Han · 21 days ago
Thank you for the presentation. I have three questions. Q1: What if the knowledge related to the query is not contained in the knowledge base, and the retriever provides incorrect information? I think the retriever might actually hinder the generator in such cases. What do you think about this? Q2: How can we measure how much the generator relies on the retriever's output? Q3: How large should the knowledge base be to generate answers to general questions effectively?
@SeungWonSeo-q3s · 10 days ago
A1. Referring to the content of a related paper, the combination of the retriever and generator has the ability to generate answers that approximate the correct response by referencing a variety of passages, even in the absence of gold passages. If the knowledge needed for the answer is not included in RAG, you can simply add the necessary knowledge to RAG. Whereas additional training would be required if the generator alone lacked the necessary knowledge, with RAG you only need to add the information. If the retriever retrieves incorrect information, the proposed structure cannot resolve this issue. This could be addressed by adding a module to verify whether the information retrieved by the retriever is correct. A2. The first method is to simply remove the retriever and use only the generator. The second method is to check whether the generator attends to the valid information in the retrieved documents. A3. The answer to this question varies significantly depending on the task. For complex tasks, such as solving difficult mathematical problems, extensive mathematical knowledge is required. However, for general tasks that arise in everyday life, the basic common sense provided by the LLM may suffice, and large amounts of external knowledge may not be necessary.
@critbear · 21 days ago
It feels unnatural to create a layout just by looking at the image. Wouldn't it be better to create the layout by also considering the content that will go into it? Or does such research already exist?
@김병민_dm · 22 days ago
Thank you for presentation. I have two questions: I'm curious about specific methods for providing a guide to a generative model. For example, does it automatically recognize an image you input and identify the positions of text and pictures, or is there another method? I understand that input modalities include text, code, and images, but what do "Knowledge" and "Science" mean in this context? Does it refer to giving input related to fields like medicine, or does it mean that it can generate content based on actual chemical formulas provided as input?
@노성래99 · 21 days ago
Thank you for your questions. Q1: The model presented in the paper is a deep learning-based generative model, which automates the processing of various layout features using transformers, eliminating the need for manual processing. The image encoder automatically interprets the image, while the decoder generates the layout, making the entire process largely automatic. Q2: Indeed, RAG-based approaches can be applied to molecular generation and drug discovery, and there is an entire domain dedicated to this area. In terms of knowledge, it is typically addressed from the perspectives of knowledge discovery and data mining. For example, it is possible to predict credit card approval rates based on a person's social media followers/following and the content of their posts.
@SeungWonSeo-q3s · 22 days ago
Thank you for your presentation. I have two questions. Q1. I have a question about the Retriever Module. From what I understand, it seems that BASNet can be used to easily convert images into saliency maps. Since a saliency map represents regions that attract human attention, I think that retrieving based on the saliency map would reduce the chances of retrieving layouts that obscure or distract from the image. However, in the materials you presented, it seems that the retriever module only uses the image itself for retrieval. Is there a specific reason for this? Q2. I'm curious about why the cross-attention values, called feature fuse, are included in the Generator Module. Since f_I and f_L are already inputted, what hidden insights might the authors have had for also concatenating their cross-attention values (feature fuse) as input?
@노성래99 · 21 days ago
Thank you for your questions. Q1: Saliency-based retrieval showed lower performance compared to image-based retrieval in the ablation study. I suspect this is because a saliency map can be considered a subset of the total features of a content image. Using the richer set of features from the full image likely leads to better results, as it captures more comprehensive information about the content. Q2: This approach is actually a conventional methodology in the deep learning field, primarily used for extracting and computing features from multiple perspectives. While it's clear that image features and layout features will be processed through attention in the transformer decoder, introducing cross-attention during intermediate steps and then concatenating the results could have experimentally proven beneficial. This additional step may enhance the model's ability to integrate and leverage different types of features more effectively.
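As an illustration of the feature-fuse idea discussed in A2, here is a minimal sketch of cross-attending image features to retrieved-layout features and concatenating the result. Dimensions, module names, and the concatenation axis are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FeatureFuse(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f_img, f_layout):
        # f_img: (B, N_i, dim) image features; f_layout: (B, N_l, dim) retrieved-layout features.
        fused, _ = self.cross_attn(query=f_img, key=f_layout, value=f_layout)
        # Concatenate originals with the cross-attended features before the generator decoder.
        return torch.cat([f_img, f_layout, fused], dim=1)
```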
@IIIXRLab · 22 days ago
[Overall] Your use of animations and images effectively engages the audience and makes the content more dynamic. The step-by-step flow also helps in maintaining the presentation's clarity. However, the presentation would benefit from several improvements in terms of voice intensity, lack of examples, and detailed discussions.

[Feedback]
F1: Improve audio clarity. The voice in the presentation is too quiet, making it difficult for the audience to follow along. Please consider recording with a higher volume or using better audio equipment to ensure your message is clearly heard.
F2: Refine section headings. The current headings (e.g., Introduction, RAG, Experiment) are too broad and do not sufficiently define the content of each section. Consider creating more granular headings that better capture the specifics of what is being discussed in each part of the presentation.
F3: Provide more examples. Including more examples, particularly of prompts and results, would greatly aid in understanding the performance of the model. For instance, similar to the example on slide 35, consider presenting additional examples that illustrate the RAG training and learning flow.

[Questions]
Q1: In the experiment, the RAG-based model is expected to show improved performance over the standalone BART model due to its use of non-parametric memory as a prior. However, is it fair to highlight the RAG model's relative superiority over the BART model when it relies so heavily on prior knowledge?
Q2: Performance comparison between RAG and BART. From my perspective, the RAG and BART models show comparable performance in tasks like abstractive question answering and fact verification. Could you elaborate on the circumstances under which the RAG model outperforms BART versus merely matching it?
Q3: Difference between RAG-Tok and RAG-Seq. What distinguishes the RAG-Tok model from the RAG-Seq model? Additionally, what factors contribute to RAG-Tok's superior performance?
Q4: There is confusion around terms like "non-parametric knowledge" and "parametric knowledge." If this is meant to differentiate between parametric and non-parametric models, please clarify.
@SeungWonSeo-q3s · 10 days ago
A1. The comparison between the RAG model and the BART model is important to illustrate the limitations of using only a generator and to emphasize the necessity of a retriever. In 2020, methods combining retrievers and generators were not yet proposed, which is why this seemingly unfair comparison was necessary at the time. A2. Abstractive question answering is heavily influenced by the performance of the generation model, which explains the observed tendencies. In 2020, generation models had relatively lower performance and likely couldn't fully utilize the window length, resulting in incomplete use of related documents. Situations where the RAG model outperforms BART include cases requiring specific knowledge, access to the latest information, or problems that cannot be solved with general knowledge. A3. The difference between RAG-Token and RAG-Sequence lies in whether a new set of k documents is retrieved each time a token is generated. RAG-Token retrieves a document set at each iteration. This feature is useful when diverse information must be gathered from various documents, or when the K value is small relative to the length of the given question. In contrast, RAG-Sequence retrieves a document set only once, when generating the first token, which is why it performed relatively well in open-domain question answering tasks, where most problems can be solved with a single gold passage set. I believe there is no definitive answer regarding which method is better; it is crucial to choose appropriately based on the given task. A4. The reason for using the terms "non-parametric memory" and "parametric memory" is that non-parametric memory refers to information stored in external databases, which is human-readable, while parametric memory refers to knowledge learned by the generation model. Terms like non-parametric memory, external databases, and external knowledge are used similarly. In future seminars, I will avoid using unnecessarily complex terminology.
@MisongKim-ki9nh · 22 days ago
Thank you for your presentation. 1. Is the model described in the paper also generating text, or is it focused solely on layout generation? If it is only generating layouts, could you explain why the metric for evaluating the non-flatness of text elements is included in the evaluation? 2. Design preferences can vary greatly depending on individual tastes. While the paper focuses on quantitative evaluation, I'm curious to know if any user studies were conducted to complement these evaluations.
@노성래99 · 21 days ago
Thank you for your questions. Q1: The model does not generate text directly; however, it does generate the category of each layout element (e.g., logo, text, underline). Non-flatness is measured specifically for layout elements predicted to belong to the text category. Q2: The paper did not conduct a user study. Additionally, design preference is a crucial factor when using generative models to create content where layout and planning play significant roles. While the evaluation of whether the model adheres to constraints in conditional generation can partially address this aspect, it is not yet fully comprehensive.
@정승재_teclados078 · 22 days ago
Thank you for your presentation. Q1. Why was the latent-based method chosen among the four methods described on slide 7? The main difference between the other methodologies seems to be at which level the retrieved data is combined. It is unclear why combining at the image level or encoding each retrieved item individually would be less advantageous. Q2. Simpler modules and processes are generally preferable in terms of processing time, model complexity, and model reusability. This paper uses various inputs, constraints, and pre-trained models to achieve high scores across multiple metrics. However, there seem to be too few ablation studies presented, making it unclear what role each module plays. Are there any ablation studies not covered in the presentation? For example, showing the performance difference between a model trained with only the image and retrieved data versus one that also uses a saliency map would help justify the use of the saliency map. Q3. Finally, if the model does not generate what content should be placed inside the layout, what is the meaning or practical use of layout generation? Are there fields that generate both the layout and the content that will fill the layout together?
@노성래99 · 21 days ago
Thank you for your questions. Q1: It's unclear what exactly is meant by combining data at the image level, but from what I know, query-based RAG is typically used when the input and retrieved results can be seamlessly integrated (e.g., input: a short text condition, retrieved results: a detailed document, combined results: the text condition followed by the detailed document). While it might be possible to combine images without using an encoder, I am not aware of this being practiced in the field. The paper does not mention encoding individual data separately. Personally, I speculate that when multiple data points are retrieved, conditioning on the features that encompass these data points might provide a richer representation. Q2: As you mentioned, additional ablation studies were conducted, including an investigation into the fusion of features before passing them through the layout encoder. Some other minor ablation studies are also included in the supplemental material. Q3: Layout-based content generation (input: well-defined layout, output: image) is an area of active research. Additionally, image or image-based content generation conditioned on layout generally yields higher quality results compared to text-based or other forms of conditional generation.
@LRH_iiixrlab · 22 days ago
Thank you for your presentation, but I have some questions. It wasn't very clear to me why using RAG leads to better performance compared to other layout generation models. What specific aspects of RAG do you think contribute to this improvement? If there are points I might have missed during the presentation, please let me know.
@노성래99 · 21 days ago
Thank you for your question. Q1: As mentioned on slide 5 of the presentation, layout generation often involves multiple possible layouts for a given input, and the model may struggle to determine which one to generate. In such cases, a layout retrieved from a database based on the existing content can serve as a guideline for the model, helping it make more informed decisions during the generation process.
@황주영-j7y · 23 days ago
Thank you for the seminar. I have two questions. 1. What happens when the conditional input contradicts the ground truth? 2. Can the layout suitable for images be expanded to symbols (like pictograms) instead of text, or do more factors need to be considered?
@노성래99 · 21 days ago
Thank you for your questions. Q1: During the training process, if there are issues with the dataset or if the conditional input constraints are intentionally made inconsistent with the ground truth (GT) data, the model's performance will degrade. Q2: While it is possible to use a single model to predict detailed content for each layout starting from an input image, a 2-stage approach is more practical. The problem with using a single model is that it would make the data excessively high-dimensional, leading to increased complexity. In a 2-stage approach, the process is divided into two steps: first, generating the layout, and then predicting the detailed content for each layout element. This approach simplifies the task and improves manageability.
@tjswodud-c6c · 23 days ago
Thank you for your good presentation. I have four questions: 1. (On p.11) I was wondering what the authors mean by the token that the layout decoder predicts at each step. As I understand it, this means each layout component (image, text, etc.); am I understanding correctly? 2. It seems that the features that pass through the layout encoder in the Retriever module enter the Generator module as f_L. Is f_L a concatenation of the features of "all" the layouts retrieved from the database, or does it enter the Generator module in some other way? 3. The final output of the model seems to be of the form (c, x, y, w, h); what do c, x, and y stand for respectively? (I assume w and h stand for the width and height of the layout.) 4. I understand that you are working on a study on city generation. City generation and poster layout generation seem like two very different domains; is there any prior work on applying layout generation to city generation? Thank you.
@노성래99 · 21 days ago
Thank you for your questions. Q1: Correct. More precisely, the model autoregressively predicts individual elements of the ground truth (GT) layout data (e.g., c1, x1, y1, …). This means that each element of the layout is generated one after the other in sequence, based on the previously generated elements. Q2: All layouts retrieved by the retriever module are passed through the layout encoder, and then concatenated. This concatenated representation is subsequently fused with the input image latent variable (as described on page 11). Q3: As described on page 2 of the presentation, x and y refer to the coordinates in a 2D space, representing the location of elements. C denotes the category of the layout element (e.g., Logo, text, underline, etc.). Q4: While the characteristics of the two domains are quite different, city generation fundamentally requires a city layout. Directly generating a 3D city remains challenging outside of procedural generation methods. In fact, published city layout generation papers share many similarities with layout generation. While layout generation involves determining the position and shape of rectangular layout elements with specific categories on a standardized rectangular canvas, city layout generation can be seen as an extended problem. It involves handling the positions and shapes of various building forms on a non-standardized canvas.
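A minimal sketch of the autoregressive layout decoding described in A1, where each element is emitted field by field as (c, x, y, w, h); `decoder.predict_next` and the conditioning interface are hypothetical placeholders, not the paper's API.

```python
def generate_layout(decoder, condition, num_elements):
    seq = []        # flat token sequence: c1, x1, y1, w1, h1, c2, ...
    layout = []
    for _ in range(num_elements):
        element = []
        for _field in ("c", "x", "y", "w", "h"):
            # Each value is predicted conditioned on the fused image/retrieved-layout
            # features and on everything generated so far.
            token = decoder.predict_next(condition, seq)
            seq.append(token)
            element.append(token)
        layout.append(tuple(element))   # (category, x, y, width, height)
    return layout
```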
@박주현-123 · 23 days ago
Thank you for your presentation Q1. The model seems to return the bounding boxes within the input that designate where the layouts should be placed. Has there been any works on providing more high-level details for the layouts, such as the fonts or the style of expressions for each bounding box? Q2. Do you find the quantitative evaluation for a poster's layout appropriate? From my perspective, the concept of the "best design" can be subjective and should differ depending on individual preferences.
@노성래99 · 21 days ago
Thank you for your questions. Q1: While some end-to-end frameworks generate layouts using existing layout generation methods and then proceed with content generation based on those layouts, most papers treat layout generation and layout-based content generation as separate tasks. Q2: Evaluating based on the "best design" has the inherent limitation that it can vary significantly due to individual preferences. Among the evaluation metrics mentioned, FID (Fréchet Inception Distance) is relatively agnostic to personal preference, as it quantitatively measures the similarity between the generated images and real images without relying on subjective judgment.