[Seminar] GSGEN: Text to 3D using Gaussian Splatting

159 views

강형엽 IIIXR LAB

1 day ago

Graduate Student Seminar
Speaker: SeungJeh Chung (teclados078@khu.ac.kr)
Topic: GSGEN: Text-to-3D using Gaussian Splatting
2024.07.31

Comments: 27
@IIIXRLab 1 month ago
Please review each point of feedback carefully, and respond with how you plan to address each one moving forward.

[Overall]
- Content Balance: The presentation is intended to focus on GSGEN, yet too much time is spent on preliminary explanations of 3D Gaussian Splatting (3DGS). As a result, the explanation of GSGEN is lacking.
- Need for Presentation Guidance: There are several slides where it's difficult to follow what is being explained. It would be beneficial to incorporate animations or use a pointer to guide the audience through the content.
- Lack of Custom Visualizations: The presentation relies heavily on images from the paper, which makes it challenging to explain and convey complex concepts. Creating your own illustrations to explain certain ideas would be more effective. For example, the current illustration for density control is insufficient for the audience to fully grasp the concept.

[Feedback]
F1: While the effort to explain 3DGS before diving into GSGEN is commendable and helps with audience understanding, it's important to adjust the balance of content. Since GSGEN is the main topic, the time spent on 3DGS should be reduced. Currently, 3DGS explanations extend to 9 minutes and 45 seconds. It would be better to cut this in half and allocate more time to explaining GSGEN.
F2: The combination of images, script, and explanations is difficult to follow. It's necessary to use a pointer to highlight the flow or add animations to guide the audience. For instance, on slide 10, the arrows point in multiple directions with explanations laid out in different colors, making it unclear in what order to read or understand the content.
F3: The presentation assumes that SDS is well understood, but it would not be difficult to provide a brief explanation. Additionally, when SDS is mentioned for the first time, the full name should be provided along with the abbreviation. It's unclear why this was done on the second mention rather than the first.

[Question]
1. Is the entirety of GSGEN's contribution simply achieving text-to-GS using 2D SDS loss, Point-E, and 3D SDS loss?
2. Instead of asking us to look up the Adaptive Density Control section, please summarize the key points and explain them. If you're unable to provide a concise summary, create a supplementary PowerPoint and share it among the lab members.
@정승재_teclados078 1 month ago
[Feedback - Response]
F1. I will provide an additional PPT explaining GSGEN on the lab's shared Notion page.
F2, F3. I will pay more attention to visual elements, such as the use of arrows, and I will also focus on adding more detailed descriptions, including the explanation of abbreviations. Moving forward, I will include brief explanations of terms like SDS to ensure clarity.

[Question - Answer]
1. If you refer to slide 19, you'll see the main contribution and sub-contributions highlighted. The main contribution can be simply summarized as achieving text-to-GS using 2D SDS loss, Point-E, and 3D SDS loss. Beyond that, however, this approach achieves SOTA performance among existing text-to-3D generation methods, particularly in terms of rendering and training time, and it also brings advancements in quality, as detailed in the sub-contributions. Achieving SOTA performance in these respects is considered a significant contribution.
2. To summarize the key points: Adaptive Density Control uses a technique called "compactness-based densification" to enhance the quality and efficiency of the 3D model. It begins by identifying the K-nearest neighbors of each Gaussian using a KD-tree. If the distance between a Gaussian and any of its neighbors is smaller than the sum of their radii, a new Gaussian is inserted between them. This "fills in the gaps" and yields a more complete and accurate representation of the scene's geometry. In addition to densification, the method includes a regularization step: a weight is applied to each Gaussian based on its distance to the center, and Gaussians with low opacity are periodically removed. This prunes unnecessary Gaussians, allowing the model to focus computational resources on the most important areas.
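To make the densification and pruning described above more concrete, here is a rough Python sketch of the idea as summarized in point 2. The helper names, the choice of K, the midpoint insertion, and the opacity threshold are illustrative assumptions on my part, not the paper's exact implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def compactness_densify(positions, radii, k=3):
    """Insert a new Gaussian between close neighbors to fill gaps.
    positions: (N, 3) Gaussian centers; radii: (N,) per-Gaussian radius."""
    tree = cKDTree(positions)
    # k + 1 because the nearest neighbor of every point is the point itself.
    dists, idxs = tree.query(positions, k=k + 1)
    new_centers = []
    for i in range(len(positions)):
        for dist, j in zip(dists[i][1:], idxs[i][1:]):
            if dist < radii[i] + radii[j]:
                # Gap smaller than the sum of radii -> add a Gaussian at the midpoint.
                new_centers.append(0.5 * (positions[i] + positions[j]))
    if new_centers:
        positions = np.vstack([positions, np.asarray(new_centers)])
    return positions

def prune_low_opacity(positions, opacities, threshold=0.05):
    """Regularization step: periodically drop Gaussians with low opacity
    (the threshold value here is a placeholder)."""
    keep = opacities > threshold
    return positions[keep], opacities[keep]
```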
@critbear 1 month ago
When looking at the comparison results with other studies, it seems that GSGEN is particularly affected by lighting conditions. Is there a specific reason for this?
@LRH_iiixrlab 1 month ago
Thank you for your presentation. Could you please elaborate on how GSGEN ensures 3D consistency during the optimization process by utilizing both 2D and 3D diffusion priors? Specifically, I am interested in understanding the role each type of prior plays and how their integration contributes to achieving a coherent and accurate 3D representation based on the input text prompt.
@정승재_teclados078 1 month ago
When using only a 2D prior for 3D generation, the process can become unstable because 3D consistency has to be maintained across multiple views using only that 2D prior. This often leads to the Janus problem. The most common solution is to strengthen 3D consistency by adding or integrating a 3D prior, such as 3D structural information or constraints. GSGEN addresses this by utilizing the Point-E model, a 3D diffusion model, as its 3D constraint.
@노성래99 1 month ago
Thank you for the great presentation. I have two questions.
1. In order to use the 3D SDS loss, a noise-added representation of the scene must be passed to the 3D diffusion model at each step, just like the 2D SDS loss. Then, what is the actual input of the 3D diffusion prior pipeline (is it the Gaussian distribution of the point cloud, or something else)? Moreover, based on the presentation slides, it seems that Point-E was applied only to the 3D point cloud initialization. Wasn't it applied at every step like the 2D diffusion prior? Also, since two types of SDS losses are used, are they computed in parallel at each training step and then summed, or is one SDS loss applied and then the other in an alternating manner?
2. Is point cloud supervision sufficient to solve the Janus problem? The coarse details inherent in the point cloud may be effective against a Janus problem where the output deviates significantly in 3D spatial position, but is it also effective against a texture-induced Janus problem (e.g., can point cloud supervision solve the Janus problem when generating Wilson from the movie Cast Away)?
@정승재_teclados078 1 month ago
1. That's a great question. I acknowledge that the explanation of 3D SDS was quite condensed. The input for 2D SDS is the rendered image, whereas for 3D SDS it is the noisy Gaussian positions. Point-E is not only used for 3D point cloud initialization but also serves as the 3D diffusion model used to compute the 3D SDS. In other words, if the model for the 2D prior is Stable Diffusion, then the model for the 3D prior can be considered Point-E. The losses for both priors are computed as a weighted sum and backpropagated; details regarding the weights and training can be found in the implementation section of the paper.
2. 3D SDS operates with 3D structural constraints, but it's important to consider that it doesn't just impose constraints from the point cloud perspective; it also incorporates the well-trained visual prior of the Point-E model. Reflecting on this aspect should help clarify the answer.
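To illustrate point 1, here is a hedged sketch of how the two SDS terms could be combined at each training step. The wrapper methods `sd_model.sds(...)` and `point_e_model.sds(...)` and the weight values are hypothetical placeholders; the actual weights and schedules are given in the paper's implementation section.

```python
import torch

def total_sds_loss(rendered_image, gaussian_positions, text_emb,
                   sd_model, point_e_model, w_2d=1.0, w_3d=0.1):
    """Both SDS losses are computed in the same step and combined as a
    weighted sum before backpropagation (weights here are placeholders)."""
    loss_2d = sd_model.sds(rendered_image, text_emb)            # 2D prior: rendered image
    loss_3d = point_e_model.sds(gaussian_positions, text_emb)   # 3D prior: noisy Gaussian positions
    return w_2d * loss_2d + w_3d * loss_3d
```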
@misong-kim 1 month ago
Thank you for your good presentation. In the original 3D Gaussian Splatting, the process starts with ellipsoids. Does GSGen proceed with optimization directly in the form of point clouds without converting them into ellipsoids, or is there a conversion process involved? If there is no conversion, what are the advantages or limitations of using point clouds compared to ellipsoids?
@정승재_teclados078 1 month ago
The stages not mentioned in the presentation, aside from the three stages discussed, are carried out in the same way as in the original 3D Gaussian Splatting (3D GS). In other words, the initialization of the 3D Gaussians is carried out in the same manner as in the conventional 3D GS approach.
@tjswodud-c6c 1 month ago
Thank you for your good presentation. I was wondering about the application of Gaussian splatting in GSGEN. What are the main limitations currently faced by the 3D Gaussian splatting used by GSGEN? In particular, are there any problems or performance degradation during density control or text-based point cloud generation? Can you describe any research or technical approaches to overcome these limitations, if any, and what are your plans for the future development of the technology?
@정승재_teclados078 1 month ago
The following is an explanation of the limitations discussed in the GSGEN paper. GSGEN tends to generate unsatisfying results when the provided text prompt contains a complex description of the scene or complicated logic, due to the limited language understanding ability of Point-E and the CLIP text encoder used in Stable Diffusion. Moreover, although incorporating the 3D prior mitigates the Janus problem, it is far from eliminating the potential degenerations, especially when the text prompt is extremely biased in the guidance diffusion models.
@김병민_dm 1 month ago
Thank you for your seminar. I have the following question: why does the Janus problem occur, and how did previous research attempt to solve it?
@정승재_teclados078 1 month ago
In diffusion models, view-inconsistency arises because the model generates images by denoising random noise step-by-step, typically focusing on individual 2D slices or perspectives. Since the process lacks explicit 3D structural constraints or a unified 3D understanding, it can produce inconsistent results across different views. The solution involves adding 3D structural constraints or implementing a multi-view consistency loss to enhance the model's 3D understanding. In GSGEN, this issue was addressed by designing a 3D SDS loss.
@SeungWonSeo-q3s 1 month ago
Thank you for the presentation. I have the following questions: - In 3D GS, are the parameters learned those of each 3D Gaussian distribution, or are they the color and opacity parameters that each Gaussian distribution possesses? I'm confused about what exactly 3D GS learns. - What makes 3D GS GPU-friendly? I understand that traditional learning-based methods are already considered GPU-friendly, so I'm curious about the specific reasons. Is it related to rendering? - There are various types of 3D representations. What are the disadvantages of 3D GS compared to other existing representation methods?
@정승재_teclados078 1 month ago
- You can clearly see this in the stage 4 point-based alpha blending equation. The parameters being learned are the distribution of the 3D Gaussian, the color (SH coefficients), and the opacity. The learned distribution of the 3D Gaussian is later combined with learned opacity to calculate the final opacity. This final opacity is then used in conjunction with the learned color to derive the final color. - Tile-based rasterization makes the process more GPU-friendly. This approach assigns a thread block to each tile, meaning that the calculations for all pixels within a tile can utilize shared memory. If you would like a more detailed explanation of using shared memory, please feel free to contact me directly. - The biggest drawback of 3D Gaussian Splatting (3D GS) is that as the scene becomes more complex and larger, the number of required 3D Gaussians increases exponentially, which can lead to potential memory shortage issues. However, this memory-related drawback of 3D GS has been addressed and mitigated in various advanced papers published later.
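As a concrete illustration of the first answer, here is a minimal sketch (shapes and names are my own assumptions) of the per-Gaussian parameters that are optimized, together with the point-based alpha blending C = sum_i c_i * a_i * prod_{j<i} (1 - a_j) mentioned above.

```python
import torch

# Per-Gaussian learnable parameters in 3D GS (shapes are illustrative assumptions):
N = 100_000
params = {
    "xyz":      torch.zeros(N, 3, requires_grad=True),      # Gaussian centers
    "scaling":  torch.zeros(N, 3, requires_grad=True),      # axis scales of the ellipsoid
    "rotation": torch.zeros(N, 4, requires_grad=True),      # orientation as a quaternion
    "opacity":  torch.zeros(N, 1, requires_grad=True),      # per-Gaussian opacity (pre-activation)
    "sh":       torch.zeros(N, 16, 3, requires_grad=True),  # SH coefficients -> view-dependent color
}

def alpha_blend(colors, alphas):
    """Point-based alpha blending for one pixel: the Gaussians overlapping the
    pixel are sorted front to back, then C = sum_i c_i * a_i * prod_{j<i} (1 - a_j).
    colors: (M, 3), alphas: (M,)."""
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)   # prod_{j<i} (1 - a_j)
    weights = (alphas * transmittance).unsqueeze(-1)            # (M, 1)
    return (colors * weights).sum(dim=0)                        # final pixel color, shape (3,)
```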
@박주현-123 1 month ago
Thank you for your presentation. How is the text "A Corgi" compared with the rendered images to compute the 2D SDS loss? It seems the text prompt would be converted to its embedded form using some sort of a text encoder, but what are its target comparison vectors and how are they generated from the images using diffusion models?
@정승재_teclados078 1 month ago
The text "A Corgi" is first converted into an embedding using a text encoder. In the forward process, noise is gradually added to an image. During the backward process, starting from pure noise, the diffusion model predicts and removes noise to generate an image aligned with the text embedding. The 2D SDS loss is then calculated by comparing the predicted noise (influenced by the text embedding) with the actual noise added during the forward process. The detailed formulas and processes are the same as the SDS loss in stable diffusion, so please refer to that.
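For reference, a minimal sketch of that comparison (the standard SDS formulation, not the exact GSGEN code) could look as follows; `diffusion.add_noise`, `diffusion.predict_noise`, and `weight_fn` are hypothetical wrappers around the forward noising step, the text-conditioned noise prediction, and the usual timestep weighting w(t).

```python
import torch

def sds_loss_2d(diffusion, rendered_image, text_emb, t, weight_fn):
    """Score Distillation Sampling sketch: compare the noise predicted under the
    text condition with the noise actually injected in the forward process."""
    noise = torch.randn_like(rendered_image)                 # noise added in the forward process
    noisy = diffusion.add_noise(rendered_image, noise, t)    # noisy image x_t
    pred = diffusion.predict_noise(noisy, t, text_emb)       # noise predicted given the "A Corgi" embedding
    grad = weight_fn(t) * (pred - noise)                     # SDS gradient: w(t) * (eps_hat - eps)
    # Detach the gradient so it flows only into the rendered image / 3D parameters.
    return (grad.detach() * rendered_image).sum()
```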
@포로뤼 1 month ago
Thank you for the great seminar. I have two questions: - Why is the 3D Gaussian Splatting considered an explicit representation? - How was it possible to reduce the steps from 6 to 3 in GSGEN, compared to the original 3D Gaussian Splatting? I don't know how this reduction was achieved.
@정승재_teclados078 1 month ago
3D Gaussian Splatting is an explicit representation because it stores the scene's structure and appearance directly as Gaussian functions. This allows for real-time rendering, as the pre-defined splats can be immediately accessed and processed, unlike implicit representations that require on-the-fly computation through functions or neural networks. The three stages are modified, not reduced from six to three stages. Please review the video again for clarification.
@한동현Han 1 month ago
It seems like the color schemes/palettes of the generated 3D objects look similar: very bright, neon-like, and a bit greenish. Can you explain why this happens?
@정승재_teclados078 1 month ago
In my opinion, since most diffusion-based 3D generation models aim for photorealistic results, it seems that elements like light and reflection are more emphasized.
@RounLee0927 1 month ago
I have two questions about this seminar. 1. What is the "motion" at 4:14? Does it mean just the camera moving? 2. How can "text-to-3D generation" be evaluated quantitatively? How much has GSGEN improved performance compared to previous studies?
@정승재_teclados078 1 month ago
1. I'm not quite sure where motion appears. If you could be more specific, I'd be happy to answer. 2. Text-to-3D generation typically demonstrates its superiority through visual comparisons with other SOTA models rather than quantitative evaluations. While GSGEN is comparable to SOTA in terms of quality, it is overwhelming in aspects such as real-time rendering and shorter training time.
@황주영-j7y 1 month ago
Thank you for the good seminar. I would like to ask some questions to make sure I understand correctly.
1. About the image at 3:01: in (a), I understood that NeRF feeds the position, theta, and phi of each point into an MLP, which outputs the color and alpha values. Is it correct that the image space is rendered after splatting the object's point cloud?
2. At 4:32 you said each point is "smoothly expanded", but how is the direction of spreading decided? Is it determined by the formula in the picture? Please let me know if I have understood incorrectly.
3. I would also like to know whether, when diffusion is used in the GSGEN paper, a comparison with the GS methodology that includes processes 2, 3, and 4 was conducted.
@정승재_teclados078 1 month ago
1. In the image, the key difference is that NeRF operates as a backward process, where rays are sampled at multiple points in space and a neural network (MLP) predicts color and density at each point. This involves tracing the rays backward through the scene to accumulate the final image. In contrast, 3D Gaussian Splatting is a forward process, where pre-defined Gaussian splats are directly projected into the image space, contributing to the final image without a neural network evaluation at each sampled point. This makes 3D Gaussian Splatting more straightforward and efficient in generating the image.
2. The geometry of a 3D Gaussian can be understood in terms of how it is actually calculated: the scaling matrix determines the extent of the spread, while the rotation matrix determines the direction of the spread. In the source code implementation, the rotation is initially set to the identity matrix, but as optimization progresses, the direction of the spread is adjusted.
3. The three stages are modified, not reduced from six to three stages. Please review the video again for clarification.
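To make answer 2 concrete: in 3D GS the spread of each Gaussian is parameterized as a covariance Sigma = R S S^T R^T, where the scaling matrix S sets how far the Gaussian spreads along each axis and the rotation R (from a quaternion, initialized to the identity and then optimized) sets the direction of the spread. The sketch below is an illustrative reconstruction, not the GSGEN source code.

```python
import torch

def build_covariance(scaling, quaternion):
    """Sigma = R S S^T R^T for one Gaussian.
    scaling: (3,) per-axis scales; quaternion: (4,) as (w, x, y, z)."""
    q = quaternion / quaternion.norm()
    w, x, y, z = q
    R = torch.stack([
        torch.stack([1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)]),
        torch.stack([2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)]),
        torch.stack([2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)]),
    ])
    S = torch.diag(scaling)
    M = R @ S
    return M @ M.T  # symmetric positive semi-definite covariance

# Example: identity rotation with anisotropic scales -> spread only along the coordinate axes.
cov = build_covariance(torch.tensor([0.3, 0.1, 0.1]),
                       torch.tensor([1.0, 0.0, 0.0, 0.0]))
```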
@홍성은-s7r 1 month ago
Thank you for your great presentation. Is there any research on text encoders in this area? I know that CLIP is usually used as the text encoder in much of this research, but there are many VLM foundation models, such as BLIP, that came after the CLIP paper.
@정승재_teclados078 1 month ago
In GSGEN, both the Stable Diffusion and Point-E models are utilized. For the implementation details regarding the structure of each text encoder and which specific text encoder was trained, please refer to the paper for more comprehensive information.