I find that for more accurate results on text-heavy images it's easier to use pytesseract instead of LLMs, but for describing an image, LLMs serve well. Hope this helps.
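For anyone who wants to compare the two approaches on their own image, here's a minimal sketch of that idea; the file name, model tag, and prompt are placeholder assumptions, not anything from the video:

```python
# Sketch: pytesseract for exact text extraction, LLaVA (via Ollama) for a description.
# "screenshot.png" and the model tag are placeholders.
from PIL import Image
import pytesseract
import ollama

# OCR is usually more faithful for text-heavy images
text = pytesseract.image_to_string(Image.open("screenshot.png"))
print("OCR text:\n", text)

# A vision LLM is better suited to describing what the image shows
response = ollama.chat(
    model="llava:7b-v1.6",
    messages=[{
        "role": "user",
        "content": "Describe what this image shows.",
        "images": ["screenshot.png"],
    }],
)
print("LLaVA description:\n", response["message"]["content"])
```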
@learndatawithmark 11 months ago
Oh interesting. I noticed that ChatGPT was using pytesseract, so perhaps they aren't even using GPT-4V at all when you ask it to extract text from images! Didn't think of that.
@tiredofeverythingnew 11 months ago
Thanks Mark, great video. Loving the content lately.
@learndatawithmark 11 months ago
Glad you liked it! Let me know if there are any other topics you'd like me to cover.
@JoeBurnett 11 months ago
That arrow wasn’t pointing to the left as 1.6 indicated….
@learndatawithmark 11 months ago
Hah, good catch! Dunno how I missed that :D
@utayasurian419 4 months ago
I have a single image, but I would like to run multiple prompts against that single image, which should already have been processed by LLaVA-NeXT. The goal is to stop LLaVA-NeXT from embedding the same image over and over again. Any suggestions?
@rajavemula3223 4 months ago
Hey, I need help. I am new to VLMs and I want a model that I can ask questions about, and get descriptions of, a live cam feed. Which model would be a good fit for me?
@thesilentcitadel 4 months ago
Hi Mark, thanks for the video and also for sharing the code via your blog page. It occurred to me that I didn't quite understand how the source image is being manipulated under the hood when it is used for inference. For example, in the white-arrow-on-a-blue-wall example, the image you used seems to be 1000x667, but the supported resolutions for the model are given as three aspect ratios: up to 672x672, 336x1344, or 1344x336. The code you used doesn't specify which aspect ratio the image was, so I'm interested to understand this further. I wonder whether a single image, such as the one with the code screenshot, would be better recognised if the source image size and the compatible input image size were taken into account, i.e. by splitting the input image up or some such.
@learndatawithmark 4 months ago
This is a really good question and I'm not sure of the answer. It doesn't seem like the Ollama library does anything to resize or truncate the images (at least as far as I can see): github.com/ollama/ollama-python/blob/1ec88ed99466d6c232445714e4eabe99db0e166c/ollama/_client.py#L916
They talk about resizing images in the paper as part of the process, so perhaps that's how they handle large images: static.hliu.cc/files/llava/improved_llava.pdf
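If you want to take the sizing out of Ollama's hands, one option is to downscale the image yourself before sending it so the longest side fits within 672 pixels. This is just a sketch of that idea, not the code from the video; the model tag, file names, and the 672px limit are my assumptions:

```python
# Sketch (untested): pre-shrink an image so its longest side fits within 672px
# before sending it to LLaVA via Ollama. File names and model tag are placeholders.
from PIL import Image
import ollama

def resize_for_llava(src_path: str, dst_path: str, max_side: int = 672) -> str:
    img = Image.open(src_path)
    img.thumbnail((max_side, max_side))  # shrinks in place, keeping aspect ratio
    img.save(dst_path)
    return dst_path

resized = resize_for_llava("arrow_on_wall.jpg", "arrow_on_wall_672.jpg")
response = ollama.chat(
    model="llava:7b-v1.6",
    messages=[{
        "role": "user",
        "content": "Which direction is the arrow pointing?",
        "images": [resized],
    }],
)
print(response["message"]["content"])
```

Whether that actually improves recognition compared to letting the model/server handle resizing is worth testing on the code-screenshot image too.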
@geoffreygordonashbrook1683 11 months ago
What size version of this model were you using? Have you compared variants such as BakLLaVA? What did you need to do to get OpenAI to work? If you could show how to run various LLaVA models from Hugging Face, e.g. from TheBloke, that would be swell. Many thanks for all the helpful videos and insights!
@learndatawithmark 11 months ago
I think it was this one - 7b-v1.6. There are 13b and 34b versions too, but I haven't tried those yet. They've also got a bunch of others, some based on Mistral/Vicuna (ollama.ai/library/llava/tags). Not sure how different those ones would be - I need to give them a try! I did actually have a LLaVA 1.5 vs BakLLaVA example in progress, but then stopped when I saw that there was a new LLaVA model out. I'll have to get back to it. Re: OpenAI - when I asked it to extract the text, it kept throwing an exception when using pytesseract. So then I asked it what the code was doing (which it got right) and then I asked it to extract the code. And somehow that combination worked?!
@josefsteiner8616 11 months ago
@@learndatawithmark I tried it with the 34b and I think it wasn't bad. I only had a screenshot of the diagram from the YouTube video, so the quality wasn't great. Maybe you can try it with the original image. This is the answer: " The image you've provided appears to be a diagram illustrating the concept of data transformation and processing from a simple structure to a more complex one. On the left side, there is a basic representation with two boxes connected by an arrow, labeled "From this ..." This could represent data in its most raw or unstructured form, where information may not be processed or integrated into any system yet. On the right side, we see a more sophisticated diagram representing a network or a set of interconnected systems. There are multiple boxes connected with lines indicating relationships or data flow. Each box is labeled with various terms such as "Node A," "Node B," "Process," and "Service," which suggest that this represents a complex system where data goes through various processes and services before it reaches its final form. The arrow from the left to the right with the label "... to this!" implies that the data moves from a simple state on the left to a more structured or processed state on the right, possibly within a larger network of systems or as part of a workflow processing system. This could be used in educational materials to explain concepts such as data integration, data flow in complex systems, or the transformation process in information technology infrastructure. "
@PoGGiE06 9 months ago
Great video, thanks.
@troedsangberg 9 months ago
Comparing 7b models to ChatGPT is of course slightly misleading. I'm getting satisfactory results from 13b (fits in my GPU) and am quite happy using it for image captioning specifically.
@learndatawithmark 9 months ago
Oh interesting. I tried 13b and 34b and was getting pretty similar results to what I showed in the video. Now you've got me wondering why I wasn't seeing better captions!
@Openfunnel 11 months ago
Hey, great content! In your experience, how big is the performance difference between the Ollama version of the model (compressed) and its original version?
@learndatawithmark 11 months ago
I tried the last few examples that didn't work well on the 7b model with the 13b and 34b models and I didn't see any better results. My impression is that this model is good with photos but struggles with other types of image.
@efexzium 9 months ago
How can we use LLaVA to control the mouse?
@learndatawithmark 9 months ago
To control the mouse? I don't think it can do that - why do you want it to do that?
@sovth_senter 8 months ago
My thought is to overlay a grid on a screenshot of your desktop (write a function to take a screenshot and apply the grid when you send a prompt or whatever), then ask LLaVA to respond with the grid location closest to the spot you want to click. Clean that response and send it to pyautogui to move the mouse to the correct spot. Prompt engineer to taste.
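A rough, untested sketch of that flow; the 100px cell size, model tag, prompt wording, and the "Submit button" target are all placeholder assumptions:

```python
# Untested sketch: overlay a labelled grid on a screenshot, ask LLaVA for the
# nearest cell, then move the mouse there with pyautogui.
import re
import ollama
import pyautogui
from PIL import ImageDraw

CELL = 100  # grid cell size in pixels (assumption)

def grid_screenshot(path="grid.png"):
    img = pyautogui.screenshot()  # returns a PIL image
    draw = ImageDraw.Draw(img)
    for x in range(0, img.width, CELL):
        draw.line([(x, 0), (x, img.height)], fill="red")
    for y in range(0, img.height, CELL):
        draw.line([(0, y), (img.width, y)], fill="red")
    for x in range(0, img.width, CELL):  # label each cell as "col,row"
        for y in range(0, img.height, CELL):
            draw.text((x + 2, y + 2), f"{x // CELL},{y // CELL}", fill="red")
    img.save(path)
    return path

response = ollama.chat(
    model="llava:7b-v1.6",
    messages=[{
        "role": "user",
        "content": "Reply with the grid label 'col,row' closest to the Submit button.",
        "images": [grid_screenshot()],
    }],
)

# Clean the response and move the mouse to the centre of that cell
match = re.search(r"(\d+)\s*,\s*(\d+)", response["message"]["content"])
if match:
    col, row = int(match.group(1)), int(match.group(2))
    pyautogui.moveTo(col * CELL + CELL // 2, row * CELL + CELL // 2)
```

Note that on high-DPI screens the screenshot may be larger than the logical screen size, so the pixel-to-mouse mapping may need scaling.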
@bennguyen1313 11 months ago
I have PDF files of handwritten data that I'd like to OCR, perform calculations on, and finally edit or append the PDF with the results. I like the idea of using a Custom GPT, but only GPT-4 Plus subscribers could use it. So I'd prefer a standalone browser or desktop solution that anyone can drag and drop a file into. However, I'm not sure if the ChatGPT-4 API assistant has all the Vision / AI PDF plugin support. If using Ollama, would anyone who wants to use my application also need to install the 20 GB Ollama?
@learndatawithmark 11 months ago
You'd need to host an LLM somewhere if you want to create an application that other people can use. Unless you're having them run the app locally, I think it'd be better to use one of the LLM hosting services. Maybe something like replicate? replicate.com/yorickvp/llava-13b/api
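For what it's worth, calling that hosted LLaVA model from Python looks roughly like this. It's a hedged sketch: the input field names and whether you need to pin a version hash should be checked against the model's API page, and it assumes a REPLICATE_API_TOKEN environment variable is set:

```python
# Hedged sketch of calling the hosted LLaVA model on Replicate; check the model's
# API page for the exact input names and current version hash.
import replicate

output = replicate.run(
    "yorickvp/llava-13b",  # may need ":<version-hash>" appended
    input={
        "image": open("diagram.png", "rb"),
        "prompt": "Describe this image.",
    },
)
print("".join(output))  # the model streams tokens, so join them
```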
@AndresSolar-y3g 9 months ago
thx. very helpful. subscribed.
@Ravi-sh5il 8 months ago
Hi, I am getting this: >>> /load llava:v1.6 Loading model 'llava:v1.6' Error: model 'llava:v1.6' not found
@learndatawithmark 8 months ago
Did you pull it down to your machine? See ollama.com/library/llava
@Ravi-sh5il 8 months ago
@@learndatawithmark Thanks, I forgot to pull :)
@Ravi-sh5il 8 months ago
How do I load the 23.7 GB llava-v1.6-34b.Q5_K_S.gguf? I currently have the 4.7 GB one. Can you please help me with this, brother? Thanks in advance!
@learndatawithmark 8 months ago
Are you getting an error?
@Ravi-sh5il 8 months ago
@@learndatawithmark I am unable to figure out how to load the 23 GB file into Ollama. Please help - give me the command that can pull the 23.7 GB one.
@Ravi-sh5il 8 months ago
@@learndatawithmark Actually I don't know how to load the 23.7 GB LLaVA in Ollama.
@annwang5530 8 months ago
Hi, can LLaVA be integrated with Groq?
@learndatawithmark 8 months ago
I don't think Groq has any of the multi-modal models available at the moment. But there are a bunch of GPU-as-a-service providers that keep popping up, so it should be possible to deploy it to one of them. One I played with a couple of weeks ago is Beam, and now I kinda wanna see if I can deploy LLaVA there :D kzbin.info/www/bejne/jYqZnaKAa6mMeKM
@annwang5530 8 months ago
@@learndatawithmark thanks man
@tapos999 8 months ago
Which Mac was it running on?
@learndatawithmark 8 months ago
2022 Mac M1 with 64 GB RAM
@GaneshEswar 5 months ago
Amazing
@KayleenFrankenberry 4 months ago
574 Tromp Corners
@xiaofeixue7001 8 months ago
Is this the paid version of ChatGPT?
@learndatawithmark 8 months ago
Yes that's GPT-4. I don't think you can upload images to GPT-3.5?
@varadhamjyoshnadevi1545 11 months ago
Hi, I have a few questions about OpenAI. Could you please share your email address?