V* - Better than GPT-4V? Iterative Context Refining for Visual Question Answer!

John Tan Chong Min

Is V* really better than GPT-4V at Visual Question Answering (VQA)?
V* is a way to augment the prompt for Visual Question Answering (VQA) so that it contains not just the image and the question, but also a list of target objects, and their positions, that can help answer the question.
This list of target objects is found via a Visual Search Model. The Visual Search Model starts with the full image and tries to find the target object's bounding boxes. If it cannot, it uses a heatmap that matches a contextual cue to the target object to identify the quadrant of the image in which the target object is most likely to be found, and the process then continues within that quadrant until the minimum image size is reached.
This iterative focusing of the image (sketched below) helps to mitigate the lack of positional sensitivity of the Vision Transformer embeddings used for the image encoding of the multimodal Large Language Model (LLM).
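To make the search loop concrete, here is a minimal Python sketch of the iterative quadrant search. The `detect` and `heatmap` callables are hypothetical stand-ins for V*'s localisation model and contextual-cue heatmap (simplified here to per-quadrant scores), and the image is assumed to be a PIL-style object; this is an illustration of the idea, not the authors' implementation.

```python
# Minimal sketch of the iterative visual search loop (not the authors' code).
# `detect` and `heatmap` are hypothetical stand-ins for V*'s localisation model
# and contextual-cue heatmap; the image is assumed to be a PIL.Image-like object.
from typing import Callable, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in original-image coordinates

def visual_search(image, target: str, cue: str,
                  detect: Callable[[object, str], Optional[Box]],
                  heatmap: Callable[[object, str], list],
                  min_size: int = 224) -> Optional[Box]:
    """Iteratively zoom into the most promising quadrant until the target is found."""
    left, top = 0, 0
    width, height = image.size
    while True:
        crop = image.crop((left, top, left + width, top + height))
        box = detect(crop, target)               # try to localise the target directly
        if box is not None:                       # map crop coordinates back to the full image
            return (box[0] + left, box[1] + top, box[2] + left, box[3] + top)
        if min(width, height) // 2 < min_size:    # stop once further crops would be too small
            return None
        # Pick the quadrant the contextual-cue heatmap scores highest (2x2 grid of scores).
        scores = heatmap(crop, cue)
        q = max(range(4), key=lambda i: scores[i // 2][i % 2])
        width, height = width // 2, height // 2
        left += (q % 2) * width
        top += (q // 2) * height
```

The video (59:26) discusses how this behaves like a form of best-first search over sub-images; the sketch above keeps only the greedy quadrant-zooming core.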
In general, this approach of adding relevant context and searching the image by focusing on the right sub-sections is a very powerful one. I also show that incorporating some aspects of the V* method into GPT-4V can improve GPT-4V's performance!
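The prompt augmentation itself can be as simple as appending the found objects and their coordinates to the question before passing it to the multimodal LLM, which is the aspect I incorporate into GPT-4V. A hypothetical sketch; the exact wording and object format here are my assumptions, not the paper's or GPT-4V's required format:

```python
# Hypothetical prompt augmentation: prepend detected objects and their boxes to the VQA question.
def build_vqa_prompt(question: str, objects: list) -> str:
    lines = [f"- {name} at (left={l}, top={t}, right={r}, bottom={b})"
             for name, (l, t, r, b) in objects]
    context = "Objects found in the image:\n" + "\n".join(lines)
    return f"{context}\n\nQuestion: {question}\nAnswer concisely."

# Example usage with made-up detections
print(build_vqa_prompt(
    "What colour is the child's jacket?",
    [("child", (120, 80, 260, 400)), ("jacket", (130, 140, 250, 300))],
))
```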
~~~
References:
V* Github: vstar-seal.github.io/
My Slides: github.com/tanchongmin/Tensor...
Vision Transformer: arxiv.org/abs/2010.11929
CLIP embeddings: arxiv.org/abs/2103.00020
LLaVA: Large Language and Vision Assistant: arxiv.org/abs/2304.08485
GPT-4 Technical Report: arxiv.org/abs/2303.08774
Chain-of-thought: arxiv.org/abs/2201.11903
ReAct framework: arxiv.org/abs/2210.03629
~~~
0:00 Introduction
2:13 Key issue with CLIP embeddings based on ViT
13:31 Background: LLaVA
18:05 Overall Walkthrough of V*
31:12 Visual Search Model
39:52 Visual QA Model
43:33 Iterative Visual Search
45:05 V* Example 1
52:00 V* Example 2
59:26 A form of Best First Search
1:02:42 How to improve V* (Great discussion with Richard)
1:10:46 Putting Everything Together
1:13:38 Comparison: Chain of Thought
1:15:17 Comparison: ReAct Framework
1:16:53 Results
1:22:11 My experiments: Incorporating V* into GPT-4V
1:23:32 V* is actually less generic than GPT-4V
1:24:33 V*'s heuristic-based search using the heat map is similar to human fixation!
1:26:10 My takeaways
1:26:47 Discussion and Conclusion
~~~~
AI and ML enthusiast. Likes to think about the essences behind breakthroughs of AI and explain them in a simple and relatable way. Also, I am an avid game creator.
Discord: / discord
LinkedIn: / chong-min-tan-94652288
Online AI blog: delvingintotech.wordpress.com/
Twitter: / johntanchongmin
Try out my games here: simmer.io/@chongmin

Comments: 3
@johntanchongmin · 6 months ago
Correction: 13:32 For the LLaVA model, they did not use CLIP text embeddings (CLIP's text encoder is actually a GPT-2-style encoder). They just used the Vicuna-7B LLM, with a Multi-Layer Perceptron to project into the same space as the vision tokens. This also means that there could be some projection error for the text encodings.
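For reference, the projector mentioned in this correction is, in LLaVA, a small MLP that aligns the vision tokens with the LLM's embedding space. A minimal PyTorch sketch in the LLaVA-1.5 style; the dimensions are illustrative assumptions, not the exact configuration discussed in the video:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP (LLaVA-1.5 style) mapping CLIP vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_tokens)
```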
@johntanchongmin · 6 months ago
Key timestamps:
2:13 - 13:31 Key issue with CLIP embeddings based on ViT
18:05 - 31:12 Overall Walkthrough of V*
1:02:42 - 1:10:46 How to improve V* (Great discussion with Richard)
1:22:11 - 1:23:32 My experiments: Incorporating V* into GPT-4V
@johntanchongmin · 6 months ago
Correction: The localisation model in the Visual Search Model uses the target object name (not the contextualised cue). Only the heatmap (from the mask decoder of the Segment Anything Model) uses the contextualised cue to find the mask.