Timestamps:
00:38 Demo of financial document
03:00 Multimodal architecture
17:30 Codebase walkthrough
31:35 Evaluating results using LangSmith
39:25 Showcasing results
41:25 Multimodal RAG problems and solutions
@ilianos · 7 months ago
For some reason, it's not showing as chapters in the video (when you scroll).
@TomanswerAi · 11 months ago
Crazy that I was struggling with this problem with a current project on gas boiler manuals. Then this comes out! I had a feeling vision would be the best way to deal with the PDFs of this type. Nice work.
@sylap · 8 months ago
That took my understanding of RAG one step further!
@suvarnadhiraj · 8 months ago
Such awesome content, thanks for putting this out. Looking forward to follow-up videos on this particular use case.
@ilianos · 10 months ago
🎯 Key Takeaways for quick navigation:
00:00 🎙️ *Introduction and Background* — The speaker introduces themselves, sharing their background at LangChain and in applied AI for self-driving cars.
00:41 🖼️ *Importance of Images in Documents* — The significance of images in documents, with an example of a financial report blog containing tables and graphs.
02:04 🔄 *Options for Handling Images in RAG Applications* — Three approaches to incorporating images in RAG applications: multimodal embeddings, summarizing images to text, and retrieving raw images based on summaries.
04:31 🤖 *Pros and Cons of Multimodal Embeddings* — The potential of multimodal embeddings for images in RAG applications, alongside their current limitations.
06:10 📊 *Retrieving Images Based on Summaries* — Retrieving images based on summaries simplifies retrieval while preserving the ability to use the raw images.
08:04 🔍 *Retrieval and Answer Synthesis Example* — Retrieving a table image based on a question and synthesizing the answer with GPT-4V.
10:09 📈 *Complex Question Retrieval and Answer Synthesis* — A more complex retrieval involving historical multiples, showcasing the detailed answer-synthesis process.
13:23 ❌ *Challenges with Current Multimodal Embeddings* — Challenges with current multimodal embeddings, particularly in retrieving complex information such as tables and charts.
14:03 📈 *Preliminary Evaluation Results* — Preliminary evaluation results comparing the approaches, with summarization-based retrieval showing promise.
17:12 🧩 *Data Loading and Document Processing* — The importance of data loading and document processing, using tools like Unstructured to categorize elements into text, tables, and images.
19:00 📄 *Summarization of Text and Tables* — The text and table summarization process using GPT-4V and Unstructured, emphasizing how simple it is to obtain summaries.
20:23 🖼️ *Summarization of Images* — Summarizing images with GPT-4V, converting them into base64-encoded strings, and incorporating the summaries into a vector store.
22:42 🛠️ *Building the Multi-Vector Retriever* — Creating a multi-vector retriever to handle text, tables, and images, enabling efficient retrieval and integration into the RAG pipeline.
23:07 🌐 *Constructing Prompts for Text-Image Integration* — The challenge lies in constructing prompts for mixed text and image inputs; the code processes mixed lists of text documents and images, organizing them carefully in the prompt because text and image inputs are handled differently.
24:29 🖼️ *Partitioning Retrieved Images and Text in the Final Prompt* — Identifying image-encoded strings among the retrieved documents and placing them strategically, partitioning the final prompt to accommodate both modalities.
25:22 📊 *Multimodal Retrieval: Embedding Summaries, Returning Raw Documents* — The retriever searches over summaries but returns the raw documents, handling mixed results containing both images and text.
27:38 🌐 *Multimodal Embeddings vs. Summarization for Image Handling* — The limitations of multimodal embeddings for complex structures like tables in images, versus summarization's effectiveness at capturing nuanced information; a comparison of direct image embedding and summarization for retrieval.
29:13 🔄 *Retaining Raw Information for Answer Synthesis* — Embedding summaries for efficient retrieval while passing the raw image and text data to the LLM, so it has all relevant information for detailed, accurate answers.
31:00 🔍 *Live Demonstration: Tracing Retrieval and Answer Synthesis* — Tracing the chain in LangSmith from the retriever to the LLM, showing the raw documents involved and how the system processes queries into detailed responses.
34:39 🧠 *Considerations and Nuances in Multimodal Retrieval* — Production considerations such as the choice of doc store, the potential to integrate arbitrary doc stores for deployment flexibility, and challenges like optimizing image size and careful text chunking.
39:29 🔄 *Recap of the Entire Workflow* — From document input with images through data loading, summarization, retrieval, and LLM-based answer generation, emphasizing the interplay between components.
46:38 🤔 *Challenges and Complexity in the Retrieval Approach* — The approach is moderate to advanced, not for beginners; optimal configurations vary across document types and require extensive testing.
47:32 🔄 *Retrieval Strategies and Considerations* — Selecting the top relevant chunks for a user query, using larger text chunks with summarization, and balancing diversity and relevance, with particular nuances for images.
49:25 🖼️ *Image Size and Chunk Size Considerations* — Setting guidelines for minimum image size, and the experimental nature of determining the optimal chunk size.
51:55 🌐 *Separate Retrieval for Images and Text in Large Corpora* — A proposal to retrieve images and text separately in large document sets, ensuring diversity across modalities.
53:32 🔄 *Summarizing Text Chunks Before Embedding* — Summarization avoids embedding redundant or irrelevant text and improves retrieval quality by compressing document representations.
01:02:30 🎯 *Applying Metadata Filtering and Routing* — The approach is compatible with existing metadata filtering techniques; metadata can drive routing and agentic behavior, and help infer user intent.
01:05:02 🗂️ *Handling Tables in Documents* — Converting tables into images introduces potential complexity; existing table-extraction methods remain important.
01:08:57 📊 *Evaluation Plans and Data Sets* — Active evaluation work with the LangSmith platform and public datasets, the need for empirical validation, and gathering audience feedback for follow-up videos.
Made with HARPA AI
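The prompt-construction step described around 23:07–24:29 (separating base64-encoded image strings from plain text before building the multimodal prompt) can be sketched without any LangChain dependencies. This is a minimal, hypothetical example, not the notebook's exact code: the function names are invented here, and the magic-number heuristic is one common way to detect image payloads.

```python
import base64
import binascii

def looks_like_base64_image(data: str) -> bool:
    """Heuristic: does the string base64-decode to bytes starting with a
    known image magic number (JPEG or PNG)?"""
    try:
        raw = base64.b64decode(data, validate=True)
    except (binascii.Error, ValueError):
        return False
    return raw.startswith(b"\xff\xd8\xff") or raw.startswith(b"\x89PNG")

def split_docs_for_prompt(docs):
    """Partition retrieved documents into image payloads and text context,
    mirroring the prompt-partitioning step discussed in the video."""
    images, texts = [], []
    for doc in docs:
        if looks_like_base64_image(doc):
            # Shape follows the common vision-model message format where
            # images are passed as data URLs; adjust to your model's API.
            images.append({"type": "image_url",
                           "image_url": {"url": f"data:image/jpeg;base64,{doc}"}})
        else:
            texts.append(doc)
    return images, "\n\n".join(texts)
```

The image parts and the joined text context would then be assembled into a single multimodal message for the vision model.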
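The embed-summaries / return-raw-documents pattern (25:22) boils down to two parallel stores keyed by the same id: summaries get indexed for search, raw documents (text, tables, or base64 images) get handed to the LLM. A toy, dependency-free sketch, with word overlap standing in for embedding similarity; the class name is hypothetical, and the video's actual code uses LangChain's multi-vector retriever with a real vector store and doc store.

```python
import uuid

class SummaryIndexSketch:
    """Toy multi-vector retriever: search over summaries, return raw docs."""

    def __init__(self):
        self.summaries = {}  # doc_id -> summary text (what would be embedded)
        self.docstore = {}   # doc_id -> raw document (e.g. base64 table image)

    def add(self, summary: str, raw_doc: str) -> str:
        doc_id = str(uuid.uuid4())
        self.summaries[doc_id] = summary
        self.docstore[doc_id] = raw_doc
        return doc_id

    def retrieve(self, query: str) -> str:
        # Word-overlap score as a stand-in for cosine similarity
        # over summary embeddings.
        q = set(query.lower().split())
        best_id = max(self.summaries,
                      key=lambda i: len(q & set(self.summaries[i].lower().split())))
        return self.docstore[best_id]
```

The key property this preserves: a question about a table matches the table's text summary, but the LLM receives the raw table image, so no detail is lost to summarization at answer time.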
@jacobschuster5261 · 11 months ago
Thanks so much for this! I was looking into how to make something similar in LangChain, and this couldn't have come at a better time :)
@TomanswerAi · 11 months ago
Hit on an interesting topic and one I’ve been looking at. The point on separate process for image retrieval to guarantee images aren’t pushed out of top retrieved documents makes sense to me in certain circumstances. I’m currently looking at retrieval of info on home boiler manuals and prioritisation of images is important. I’d be interested to see if further work is done on that.
@Charlje · 11 months ago
Incredible video, extremely helpful. Thanks both!
@edmald1978 · 6 months ago
Why is the tables variable empty in the notebook, even though cj.pdf contains tables? It seems this code isn't extracting the tables from the PDF properly.
@joezott · 11 months ago
Implied here is that the document being processed is a PDF. Have you looked at what you would change (or whether you would) for compound file formats such as Word or PowerPoint, where the text, tables, images, etc. are managed separately?

For tables, have you considered using conventional image OCR to extract text from an image, and compared its performance against (or in addition to) a text summary? My thinking is that a summary often won't include all the text in the table, so some of the semantic meaning will be lost.

For graphs (here I am thinking of engineering documents with lots of graphs, sometimes requiring multiple graphs for a single analysis), have you considered using image processing to extract data values and then constructing a summary of each graph that incorporates the data? Also, have you looked at, for example, line graphs with multiple data sets on the same graph / same image?

For line drawings, block diagrams, etc., have you found the approach you presented to be successful?

Finally, I occasionally run across situations where the text near an image carries semantic meaning tied to the image. Have you found methods where this text is not treated separately from the image?
@ilianos · 10 months ago
Following
@edmald1978 · 6 months ago
Does this work with models other than ChatGPT?