I find that for more accurate results on text-heavy images it's easier to use pytesseract instead of LLMs, but for describing an image, LLMs serve well. Hope this helps.
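For anyone who wants to compare the two approaches on their own image, here's a minimal sketch of that idea; the file name, model tag, and prompt are placeholder assumptions, not anything from the video:

```python
# Sketch: pytesseract for exact text extraction, LLaVA (via Ollama) for a description.
# "screenshot.png" and the model tag are placeholders.
from PIL import Image
import pytesseract
import ollama

# OCR is usually more faithful for text-heavy images
text = pytesseract.image_to_string(Image.open("screenshot.png"))
print("OCR text:\n", text)

# A vision LLM is better suited to describing what the image shows
response = ollama.chat(
    model="llava:7b-v1.6",
    messages=[{
        "role": "user",
        "content": "Describe what this image shows.",
        "images": ["screenshot.png"],
    }],
)
print("LLaVA description:\n", response["message"]["content"])
```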
@learndatawithmark 11 months ago
Oh interesting. I noticed that ChatGPT was using pytesseract, so perhaps they aren't even using GPT-4V at all when you ask it to extract text from images! Didn't think of that.
@tiredofeverythingnew 11 months ago
Thanks Mark, great video. Loving the content lately.
@learndatawithmark 11 months ago
Glad you liked it! Let me know if there are any other topics you'd like me to cover.
@JoeBurnett 11 months ago
That arrow wasn’t pointing to the left as 1.6 indicated….
@learndatawithmark 11 months ago
Hah, good catch! Dunno how I missed that :D
@utayasurian419 4 months ago
I have a single image, but I would like to run multiple prompts against that single image, which should already have been processed by LLaVA-NeXT. The goal is to stop LLaVA-NeXT from embedding the same image over and over again. Any suggestions?
@rajavemula3223 4 months ago
Hey, I need help. I am new to VLMs and I want a model that I can ask questions about, and get descriptions of, a live cam feed. Which model would be a good fit for me?
@thesilentcitadel 4 months ago
Hi Mark, thanks for the video and also for sharing the code via your blog page. It occurred to me that I didn't quite understand how the source image is being manipulated under the hood when it is used for inference. For example, in the white-arrow-on-a-blue-wall example, the image you used seems to be 1000x667, but the supported resolutions for the model are given as three aspect ratios: up to 672x672, 336x1344, or 1344x336. The code you used doesn't specify which aspect ratio the image was, so I'm interested to understand this further. I wonder whether a single image, such as the one with the code screenshot, would be better recognised if the source image size and the compatible input image size were taken into account, i.e. by splitting the input image up or some such.
@learndatawithmark 4 months ago
This is a really good question and I'm not sure of the answer. It doesn't seem like the Ollama library does anything to resize or truncate the images (at least as far as I can see): github.com/ollama/ollama-python/blob/1ec88ed99466d6c232445714e4eabe99db0e166c/ollama/_client.py#L916
They talk about resizing images in the paper as part of the process, so perhaps that's how they handle large images: static.hliu.cc/files/llava/improved_llava.pdf
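If you want to take the sizing out of Ollama's hands, one option is to downscale the image yourself before sending it so the longest side fits within 672 pixels. This is just a sketch of that idea, not the code from the video; the model tag, file names, and the 672px limit are my assumptions:

```python
# Sketch (untested): pre-shrink an image so its longest side fits within 672px
# before sending it to LLaVA via Ollama. File names and model tag are placeholders.
from PIL import Image
import ollama

def resize_for_llava(src_path: str, dst_path: str, max_side: int = 672) -> str:
    img = Image.open(src_path)
    img.thumbnail((max_side, max_side))  # shrinks in place, keeping aspect ratio
    img.save(dst_path)
    return dst_path

resized = resize_for_llava("arrow_on_wall.jpg", "arrow_on_wall_672.jpg")
response = ollama.chat(
    model="llava:7b-v1.6",
    messages=[{
        "role": "user",
        "content": "Which direction is the arrow pointing?",
        "images": [resized],
    }],
)
print(response["message"]["content"])
```

Whether that actually improves recognition compared to letting the model/server handle resizing is worth testing on the code-screenshot image too.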
@geoffreygordonashbrook1683 11 months ago
What size version of this model were you using? Have you compared variants such as BakLLaVA? What did you need to do to get OpenAI to work? If you could show how to run various LLaVA models from Hugging Face, e.g. from TheBloke, that would be swell. Many thanks for all the helpful videos and insights!
@learndatawithmark 11 months ago
I think it was this one - 7b-v1.6. There are 13b and 34b versions too, but I haven't tried those yet. They've also got a bunch of others, some based on Mistral/Vicuna (ollama.ai/library/llava/tags). Not sure how different those ones would be - I need to give them a try! I did actually have a LLaVA 1.5 vs BakLLaVA example in progress, but then stopped when I saw that there was a new LLaVA model out. I'll have to get back to it. Re: OpenAI - when I asked it to extract the text, it kept throwing an exception when using pytesseract. So then I asked it what the code was doing (which it got right) and then I asked it to extract the code. And somehow that combination worked?!
@josefsteiner8616 11 months ago
@@learndatawithmark I tried it with the 34b and I think it wasn't bad. I only had a screenshot of the diagram from the YouTube video, so the quality wasn't great. Maybe you can try it with the original image. This is the answer: " The image you've provided appears to be a diagram illustrating the concept of data transformation and processing from a simple structure to a more complex one. On the left side, there is a basic representation with two boxes connected by an arrow, labeled "From this ..." This could represent data in its most raw or unstructured form, where information may not be processed or integrated into any system yet. On the right side, we see a more sophisticated diagram representing a network or a set of interconnected systems. There are multiple boxes connected with lines indicating relationships or data flow. Each box is labeled with various terms such as "Node A," "Node B," "Process," and "Service," which suggest that this represents a complex system where data goes through various processes and services before it reaches its final form. The arrow from the left to the right with the label "... to this!" implies that the data moves from a simple state on the left to a more structured or processed state on the right, possibly within a larger network of systems or as part of a workflow processing system. This could be used in educational materials to explain concepts such as data integration, data flow in complex systems, or the transformation process in information technology infrastructure. "
@PoGGiE06 9 months ago
Great video, thanks.
@troedsangberg 9 months ago
Comparing 7b models to ChatGPT is of course slightly misleading. I'm getting satisfactory results from 13b (fits in my GPU) and am quite happy using it for image captioning specifically.
@learndatawithmark 9 months ago
Oh interesting. I tried 13b and 34b and was getting pretty similar results to what I showed in the video. Now you've got me wondering why I wasn't seeing better captions!
@Openfunnel 11 months ago
Hey, great content! In your experience, how big is the performance difference between the Ollama version of the model (compressed) and its original version?
@learndatawithmark 11 months ago
I tried the last few examples that didn't work well on the 7b model with the 13b and 34b models and I didn't see any better results. My impression is that this model is good with photos but struggles with other types of image.
@efexzium 9 months ago
How can we use LLaVA to control the mouse?
@learndatawithmark 9 months ago
To control the mouse? I don't think it can do that - why do you want it to do that?
@sovth_senter 8 months ago
My thought is to overlay a grid on a screenshot of your desktop (write a function to take a screenshot and apply the grid when you send a prompt or whatever), then ask LLaVA to respond with the grid location closest to the spot you want to click. Clean that response and send it to pyautogui to move the mouse to the correct spot. Prompt engineer to taste.
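A rough, untested sketch of that flow; the 100px cell size, model tag, prompt wording, and the "Submit button" target are all placeholder assumptions:

```python
# Untested sketch: overlay a labelled grid on a screenshot, ask LLaVA for the
# nearest cell, then move the mouse there with pyautogui.
import re
import ollama
import pyautogui
from PIL import ImageDraw

CELL = 100  # grid cell size in pixels (assumption)

def grid_screenshot(path="grid.png"):
    img = pyautogui.screenshot()  # returns a PIL image
    draw = ImageDraw.Draw(img)
    for x in range(0, img.width, CELL):
        draw.line([(x, 0), (x, img.height)], fill="red")
    for y in range(0, img.height, CELL):
        draw.line([(0, y), (img.width, y)], fill="red")
    for x in range(0, img.width, CELL):  # label each cell as "col,row"
        for y in range(0, img.height, CELL):
            draw.text((x + 2, y + 2), f"{x // CELL},{y // CELL}", fill="red")
    img.save(path)
    return path

response = ollama.chat(
    model="llava:7b-v1.6",
    messages=[{
        "role": "user",
        "content": "Reply with the grid label 'col,row' closest to the Submit button.",
        "images": [grid_screenshot()],
    }],
)

# Clean the response and move the mouse to the centre of that cell
match = re.search(r"(\d+)\s*,\s*(\d+)", response["message"]["content"])
if match:
    col, row = int(match.group(1)), int(match.group(2))
    pyautogui.moveTo(col * CELL + CELL // 2, row * CELL + CELL // 2)
```

Note that on high-DPI screens the screenshot may be larger than the logical screen size, so the pixel-to-mouse mapping may need scaling.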
@bennguyen1313 11 months ago
I have PDF files of handwritten data that I'd like to OCR, perform calculations on, and finally edit or append the PDF with the results. I like the idea of using a Custom GPT, but only GPT-4 Plus subscribers could use it. So I'd prefer a standalone browser or desktop solution that anyone can drag and drop a file into. However, I'm not sure if the ChatGPT-4 API assistant has all the Vision / AI PDF plugin support. If using Ollama, would anyone who wants to use my application also need to install the 20 GB Ollama?
@learndatawithmark 11 months ago
You'd need to host an LLM somewhere if you want to create an application that other people can use. Unless you're having them run the app locally, I think it'd be better to use one of the LLM hosting services. Maybe something like replicate? replicate.com/yorickvp/llava-13b/api
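For what it's worth, calling that hosted LLaVA model from Python looks roughly like this. It's a hedged sketch: the input field names and whether you need to pin a version hash should be checked against the model's API page, and it assumes a REPLICATE_API_TOKEN environment variable is set:

```python
# Hedged sketch of calling the hosted LLaVA model on Replicate; check the model's
# API page for the exact input names and current version hash.
import replicate

output = replicate.run(
    "yorickvp/llava-13b",  # may need ":<version-hash>" appended
    input={
        "image": open("diagram.png", "rb"),
        "prompt": "Describe this image.",
    },
)
print("".join(output))  # the model streams tokens, so join them
```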
@AndresSolar-y3g 9 months ago
thx. very helpful. subscribed.
@Ravi-sh5il 8 months ago
Hi, I am getting this: >>> /load llava:v1.6 Loading model 'llava:v1.6' Error: model 'llava:v1.6' not found
@learndatawithmark 8 months ago
Did you pull it down to your machine? See ollama.com/library/llava
@Ravi-sh5il 8 months ago
@@learndatawithmark Thanks, I forgot to pull :)
@Ravi-sh5il 8 months ago
How do I load the 23.7 GB llava-v1.6-34b.Q5_K_S.gguf? I currently have the 4.7 GB one. Can you please help me with this, brother? Thanks in advance!
@learndatawithmark 8 months ago
Are you getting an error?
@Ravi-sh5il 8 months ago
@@learndatawithmark I am unable to figure out how to load the 23 GB file into Ollama. Please help - give me the command that can pull the 23.7 GB one.
@Ravi-sh5il 8 months ago
@@learndatawithmark Actually I don't know how to load the 23.7 GB LLaVA in Ollama.
@annwang5530 8 months ago
Hi, can LLaVA be integrated with Groq?
@learndatawithmark 8 months ago
I don't think Groq has any of the multi-modal models available at the moment. But there are a bunch of GPU-as-a-service providers that keep popping up, so it should be possible to deploy it to one of them. One I played with a couple of weeks ago is Beam, and now I kinda wanna see if I can deploy LLaVA there :D kzbin.info/www/bejne/jYqZnaKAa6mMeKM
@annwang5530 8 months ago
@@learndatawithmark thanks man
@tapos999 8 months ago
Which Mac was it running on?
@learndatawithmark 8 months ago
2022 Mac M1 with 64 GB RAM
@GaneshEswar 5 months ago
Amazing
@KayleenFrankenberry 4 months ago
574 Tromp Corners
@xiaofeixue7001 8 months ago
Is this the paid version of ChatGPT?
@learndatawithmark 8 months ago
Yes that's GPT-4. I don't think you can upload images to GPT-3.5?
@varadhamjyoshnadevi1545 11 months ago
Hi, I have a few questions about OpenAI. Could you please share your email address?