AI Vision Models Take a Peek Again!

10,591 views

Matt Williams

1 day ago

Comments: 119
@Pregidth · 23 days ago
As far as I can see, the model works really well in Open WebUI. It gives me quite accurate answers, and I would like to thank the Ollama team for setting this up. Really cool!
@MM-24 · 1 day ago
Are you using 11b? What hardware?
@ihaveacutenose · 22 days ago
Matt Williams is the best of the Matts! 😆
@CoolWolf69 · 22 days ago
I just downloaded and tried this model myself (running Open WebUI in Dockge, with Ollama in a separate LXC container on Proxmox and a 20GB Nvidia RTX 4000 Ada passed through). I was blown away by how accurately the pictures were recognized! Even the numbers shown on my electricity meter's display were identified correctly. I am really impressed, especially by the correct guess of my son's age in some pictures, where the model suggested he was 10-12 years old ... him being indeed 11. WOW!
@mbottambotta · 23 days ago
Thanks Matt for your clear explanations. It's hard to separate the wheat from the chaff on YouTube, but with your help, that's what I manage to do.
@BORCHLEO · 18 days ago
This is an awesome tutorial, Matt! You are definitely instilling solid foundations in these videos and I love it! Keep up the amazing work! Have a great week!
@Theausomecaleb · 2 days ago
Glad I found your channel. Most AI channels have no information or just contain outright lies. You, sir, earned a subscribe. Thanks for the quality work.
@AwakeNotWoke · 23 days ago
I tried it this morning (the smaller version) and was quite surprised. The image was a cartoon caricature of a blind man on a galloping horse (a saying my father-in-law uses). The description it gave was surprisingly good. It missed some of the details and misread the blind man's facial expression, i.e. it saw fear as having a great time. Your videos are extremely helpful. I appreciate the effort.
@fabriai · 23 days ago
Good video. Thanks Matt. By the way, you are my favorite Matt, but don't let the others know. You're the one with the best videos and the best taste in shirts. Now seriously, your content is truly technical and educational. Thank you.
@sammcj2000 · 23 days ago
Love the shirt Matt! It's fun 😊
@technovangelist · 23 days ago
Its boldness almost reminds me of the shirt Theo asked his sister (played by Lisa Bonet) to make for him on The Cosby Show decades ago. Though that shirt ended up being a disaster.
@technovangelist · 22 days ago
Huh?
@Alex-os5co · 12 days ago
Thanks so much Matt. Just a note: I didn't have any luck dragging the photo into the terminal when not using the local machine. I also have an M1 Mac with 64GB RAM, and there it worked perfectly. However, when trying it on my LLM server (2 x 3090) it didn't work, with both machines running the same Ollama version (Linux on the LLM server) and model.
@therajram · 23 days ago
Love your videos and course, which I am currently following. Keep it up. ❤️ from London.
@technovangelist · 23 days ago
And as you can tell from my accent, I was born in London... Kingsbury to be exact. And no, most can't tell. But it was nice, for the 10 years I lived in Amsterdam, to have an EU passport (pre-Brexit).
@therajram · 23 days ago
Well, if you're ever in London, let me buy you a warm beer.
@technovangelist · 23 days ago
There will definitely be some visits. I want my daughter to meet more of her extended family.
@INVICTUSSOLIS · 21 days ago
That means the Ollama nodes in ComfyUI will be fantastic to use with this vision model. Downloading it now to play with it.
@autoboto · 22 days ago
The biggest issue using the 11B is getting info about where words or other shapes appear in the image file, in particular the x,y coordinates, or better yet the width and height of each word block. So far an OCR/OpenCV solution gives all this info in a second. Describing the image as a whole seems to be where the vision model works best, with or without words present.
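[Editor's note] For anyone wanting those word coordinates, the classic route this comment alludes to is Tesseract: pytesseract's `image_to_data` returns a TSV with one row per detected word, including position and size. A minimal sketch, where the TSV parser is plain Python and the pytesseract call (shown commented) assumes the Tesseract binary and Pillow are installed:

```python
def word_boxes(tsv: str):
    """Parse the TSV from pytesseract.image_to_data into word bounding boxes.

    Each returned dict has the word text plus x, y, width, height in pixels.
    The parser is header-driven, so it works with the full Tesseract column set.
    """
    lines = tsv.strip().splitlines()
    header = lines[0].split("\t")
    rows = [dict(zip(header, line.split("\t"))) for line in lines[1:]]
    return [
        {"text": r["text"],
         "x": int(r["left"]), "y": int(r["top"]),
         "w": int(r["width"]), "h": int(r["height"])}
        for r in rows
        if r.get("text", "").strip()  # skip empty / structural rows
    ]

# Typical use, assuming pytesseract and Pillow are installed:
# from PIL import Image
# import pytesseract
# tsv = pytesseract.image_to_data(Image.open("invoice.png"))
# boxes = word_boxes(tsv)
```

This gives exactly the x,y plus width/height per word block the comment asks for, in well under a second on a typical page.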
@remedyreport · 23 days ago
Top-notch production as always! 🎉
@TheColonelJJ · 22 days ago
I love to feed an image to an LLM and ask it to create a text-to-image prompt from what it sees.
@samibilal · 23 days ago
I would like a vision model to be able to assist in grading handwritten exam questions. (This is a thing in my country; I'm not sure if it continues in the West.) A good assistance function would be the capability to "average" handwritten answers to a particular knowledge or essay question. So not just OCR'ing answers, but comparing all the answers in some form against the perfect answer, perhaps, and seeing which sub-element of the answer was right. A question related to the social sciences, such as literature, would be particularly difficult, but useful.
@tyanite1 · 22 days ago
@samibilal It is still a thing in the U.S. as well. This is a fantastic idea. I bet a fine-tuned model for grading could do it.
@Derick99 · 22 days ago
I am going to try it soon, but I've been trying to write a Tesseract OCR script and failing due to borders that are too close to letters and slight variations in where the text falls. I can give it to ChatGPT and it reads it perfectly, but mine is too inconsistent to actually use! What would you suggest, or how would you personally build a Tesseract OCR program that can read invoices and detect their contents and company names, given a consistently structured group of invoices from different companies?
@technovangelist · 22 days ago
I haven't worked with Tesseract or any other open source OCR, but I have heard lots of folks have issues with it.
@Derick99 · 22 days ago
@technovangelist Yeah, it's a bugger to make reliable. Printer paper quality, ink, and any wrinkles or slight misplacement throw the boundaries off, and it can catch half of things, causing weird symbols. Any suggestions?
@andrepaes3908 · 23 days ago
Thanks Matt for the great video! I also tried the vision model on Ollama and got results close to yours. Can you clarify the context size? You said previously that all Ollama models are capped at 2k tokens, except embedding models, which are capped at 8k. What about this vision model? I saw in the modelfile that its context size is 128k. Is it capped at 2k or does it use 128k? Thanks in advance for replying!
@technovangelist · 23 days ago
Doh! I just made that video and forgot... so yeah, it will be 2k until I change it. It's not capped, because it's easy to change.
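[Editor's note] For readers following this thread, the usual way to lift that 2k default is a small custom Modelfile. A sketch, where the 8192 value and the derived model name are illustrative choices, not anything stated in the video:

```
# Modelfile (values here are examples)
FROM llama3.2-vision
PARAMETER num_ctx 8192
```

Build it with `ollama create llama3.2-vision-8k -f Modelfile` and run the new tag as you would any other model.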
@joeburkeson8946 · 19 days ago
The 11b model runs fine inside Open WebUI on a 12GB RTX 3060. I will be exploring everything from ancient symbols to modern art in an open discussion format. Thanks Matt.
@kirilgenov7437 · 9 days ago
Hi Matt! Great video! Do you think AI vision models can be used to annotate image data, like providing box coordinates for object detection on a given image?
@VincentDeLaCroix-p2z · 22 days ago
Why is it that when I try to build a website with Ollama, I can't get it to generate an entire PHP page, but ChatGPT and Claude can?
@t.farias9336 · 23 days ago
Is there currently any vision-capable Ollama model suitable for ordinary domestic PCs?
@technovangelist · 23 days ago
I just showed one. As long as it has a decent GPU you are set. An Apple Silicon Mac is perfect too. I would say both are very ordinary machines these days.
@Sailing_Antrice · 23 days ago
I would love to try Ollama with Llama3.2:11B-Vision-Instruct, and the 90B too, but the Ollama code won't compile properly on my Nvidia Orin AGX 64GB; it won't detect the CUDA device during the compile and installation process. If anyone has a fix, I would love to know.
@imorganmarshall · 23 days ago
I look forward to trying this. I've been getting great results with LLaVA already.
@AmrAbdeen · 23 days ago
Is there a local model that you consider "good" for vision?
@technovangelist · 23 days ago
Good? This one. It seemed to work well for most things.
@AmrAbdeen · 23 days ago
@technovangelist Thanks for answering.
@beratyilmaz7951 · 23 days ago
Sometimes it repeats the same output again and again (11b). I set the repeat penalty but it didn't work. Do you have any other suggestions?
@technovangelist · 23 days ago
Did you update the modelfile? Use model weights from HF? Those are behaviors I expect from a model with a bad prompt. How much VRAM do you have?
@augmentos · 22 days ago
Enjoyed this. Subbed a while back and will turn the bell on, at least until I hear you call Llama open source.
@Computer-v5e · 18 days ago
Beautiful
@MeinDeutschkurs · 23 days ago
I'm confused, Matt, about how to use it correctly. The goal is to iteratively ask questions about the image. What are the best practices? The documentation implies submitting the path to the image; other models need a base64 string. I cannot find working examples. (Python library)
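[Editor's note] A sketch of what iterative questions can look like with the `ollama` Python package, which accepts a file path or raw bytes in a message's `images` list and handles the base64 encoding itself. The helper names and the model tag are illustrative, not from the video, and the `ask` call assumes a local Ollama server is running:

```python
def user_turn(question, image_path=None):
    """Build one chat turn; the ollama library base64-encodes images itself."""
    turn = {"role": "user", "content": question}
    if image_path is not None:
        turn["images"] = [image_path]  # a file path or raw bytes both work
    return turn

def ask(messages, question, image_path=None):
    """Append a user turn, call the model, and keep history for follow-ups."""
    import ollama  # pip install ollama; needs a running Ollama server
    messages.append(user_turn(question, image_path))
    reply = ollama.chat(model="llama3.2-vision", messages=messages)["message"]
    messages.append(dict(reply))
    return reply["content"]

# history = []
# print(ask(history, "What is in this picture?", "photo.jpg"))
# print(ask(history, "What color was the car you mentioned?"))  # follow-up
```

Keeping the growing `messages` list is what makes the follow-up questions work: the model sees the earlier image and its own prior answer on every call.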
@luizgustavs · 23 days ago
Very nice video. I was waiting for Llama Vision on Ollama since the release of the models. I would love to see support for Pixtral as well.
@radhat_ · 23 days ago
I have a question. I have an M1 MacBook with 16GB of RAM... is even the smallest model too heavy for the machine, or am I doing something wrong?
@miloldr · 23 days ago
I have a similar setup. I can run 7-8b models, but at quite slow speed, so I think 11b vision would be too much. I suggest trying the Groq playground if it's only for testing.
@technovangelist · 23 days ago
It's not hard to try, but I would imagine it will be difficult.
@miloldr · 23 days ago
I tried Llama 3.2 90b vision on Groq (an API provider) and 99% of the time it refused to describe the image for "security purposes" or some such, and the other 1% of the time it got the image completely wrong.
@technovangelist · 23 days ago
I don't see why folks are excited about Groq. It's fast but fails to work on so many things.
@miloldr · 23 days ago
@technovangelist I don't understand what you mean; it's just a fast API for Llama models and a few others. Though I did see some difference from other hosts in following instructions.
@technovangelist · 23 days ago
What I mean is that the fail rate for Groq is pretty high. I hit limits on it all the time. I tend not to waste my time with it.
@miloldr · 23 days ago
@technovangelist The base models have big limits, but the instruct or other modified models have higher ones. Though Groq made those limits smaller, from 20k to 8k for 70b versatile. But honestly I barely ever hit even that limit, and I'm slowly switching to SambaNova (mostly for 405b and speed).
@mariozulmin · 22 days ago
@technovangelist Same for me; I'd rather pay some for an actually working API, and there are good, very well working ones out there. My environment cannot handle 70b/90b, so... Thanks for this video; I always like your style and demos. Subbed! Cheers from AT 🙋‍♂️
@neoprints1325 · 22 days ago
We want to use this model to control a robot arm. Can you guide us?
@IvarDaigon · 23 days ago
You need to get Molmo running in Ollama; it is severely underrated and probably one of the best models for automation tasks.
@r3kRaP · 22 days ago
@IvarDaigon Agreed, that pointing feature is OP.
@raviramanathan5565 · 23 days ago
Thanks for the update. I have a question on the best way to run local AI models in low-resource settings. Is the new Mac mini base model good for these? Mx Pro/Max machines are expensive. Will a PC with an xx90 do? Pointers would be helpful, with some sort of budgeting; there is much confusion around Apple MLX, Ollama, etc.
@technovangelist · 23 days ago
A PC with a good Nvidia card will go faster, but will be more expensive than the comparable Mac AND use a lot more power. A Mac will be a good machine for you for 10-15 years; the shortest life span of any Mac I have owned was 8 years. I would recommend getting at least 32GB RAM on the Mac and at least a 1TB disk.
@s.chandrasekhar8290 · 23 days ago
¡Gracias!
@technovangelist · 23 days ago
That is so nice. Thank you so much!!
@RomuloMagalhaesAutoTOPO · 23 days ago
😀👍THANK YOU MATT.
@mbarsot · 23 days ago
My use case is professional: we have a bunch of procedures that we try to use with RAG to answer questions. The problem is there are many screenshots, so the test is finding the right prompt to get the vision model to create a text description of what is in each image.
@kepenge · 23 days ago
@Matt, loved your explanation and the tests you conducted... do you think that Ollama would in future try MLX? If so, do you think it would increase performance?
@technovangelist · 23 days ago
I don't know. LM Studio added it, and it is now marginally faster on a limited set of models. It may be a lot of work for not a lot of benefit. But it's hard to know what the team is going to do; I have now been gone longer than I was part of the team after we pivoted to building Ollama.
@pokemonmaster2541 · 23 days ago
Sir, your video was awesome and very informative. I need a suggestion from you. I have tested Ollama's new release with the Llama 3.2 Vision 11B model, but it's not working on my GPU. I tested some other models between 11B and 16B on the same device, and every model except Llama 3.2 Vision utilizes the GPU; the Llama Vision model only runs on the CPU. Could you suggest any way to run this model on the GPU, like the others? I'm using an NVIDIA 3050 with 4GB VRAM, updated drivers, and the latest version of Kali Linux.
@technovangelist · 23 days ago
That's an easy one to solve. You need a better GPU with more memory.
@sMadaras · 23 days ago
Do you drink at the end of the videos to draw attention to the importance of hydration? You are absolutely right; I support your mission :) Cheers!
@technovangelist · 23 days ago
I did it in a few videos at the beginning and just kept doing it. Some folks really like it and comment when I leave it out.
@AlfredNutile · 23 days ago
I wanted to see what you think about this technically. I was doing a proof of concept around a RAG system and research documents that had a lot of charts, images, and tables. I decided to convert the PDF pages to images and then ask the LLM to pull all the text out and to summarize the details of any chart or graph, adding that to the output. This seemed to do better than the LangChain API we had been using to extract the data. For the second part, I asked the LLM to return JSON representing the chunking of the data, breaking it up not by word count but by meaning, to see if that would do better than chunking by size with overlap. Anyway, it went well on this small POC; I'm just curious about your thoughts on this type of process. I know pricing is higher for vision, but this is just a couple of hundred documents without a ton of changes over time.
@technovangelist · 23 days ago
I would save them as images and use a good traditional OCR tool for most of it, and use the model to try to interpret the charts. But there are often many ways to interpret a chart, so that may be a challenge.
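[Editor's note] The chunk-by-meaning step described above hinges on the model actually returning well-formed JSON. A sketch of the validation side, where the `{"topic", "text"}` chunk schema is an assumption chosen for illustration, not anything from this thread:

```python
import json

def parse_semantic_chunks(raw: str):
    """Validate an LLM response expected to be a JSON list of chunks.

    Assumed schema: [{"topic": "...", "text": "..."}, ...].
    Models often wrap JSON in markdown fences, so strip those first.
    """
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        # Drop an optional leading language tag like "json\n"
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else cleaned
    chunks = json.loads(cleaned)
    if not isinstance(chunks, list):
        raise ValueError("expected a JSON list of chunks")
    for c in chunks:
        if not (isinstance(c, dict) and isinstance(c.get("topic"), str)
                and isinstance(c.get("text"), str)):
            raise ValueError(f"bad chunk: {c!r}")
    return chunks
```

Failing loudly on malformed output like this makes it easy to retry the extraction prompt for the occasional page where the model drifts off-schema.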
@hhljr · 23 days ago
Is there reason to expect that a Mac with more than 64GB of RAM would perform better with the larger-parameter models? I'm thinking of upgrading next year.
@simonosterloh1800 · 23 days ago
Faster? No, inference speed is mainly determined by GPU speed. Higher quality? Possibly. More RAM allows you to choose larger models or less quantized versions.
@technovangelist · 23 days ago
I think so. The M4 is the first one with performance significantly better than the M1, so I would love to try it.
@technovangelist · 23 days ago
I don't think I agree with the higher quality statement. If that were true, then llama3.1 70b would always be better than 7 or 8b, and that is often not the case. But since there is a big gain in performance on the M4, it should be better. The M1 maxes out at 64GB, so more RAM implies getting a different system.
@hhljr · 15 days ago
@simonosterloh1800 GPU speed, OK. How about the number of GPU cores? I have an M1 Studio Ultra with 64GB. Presumably the M4 Ultra will have more GPU cores.
@technovangelist · 15 days ago
Tests I have seen show that the M4 Max is nearly twice as fast as an M1 Max with AI models in Ollama.
@fabriziocasula · 23 days ago
With my Mac with 16GB of RAM, it is impossible to use this model 😞
@wardehaj · 23 days ago
Thanks Matt for another great video! I would like to be able to use Llama 3.2 Vision with Open Interpreter, instead of ChatGPT, to control my computer.
@sanjuburkule · 22 days ago
I want to identify a vehicle in one image and tell whether that same vehicle showed up in another video stream, and when it was last seen. (With the same tech, I want to check where my TV remote was last seen.)
@JeromeBoivin-tx7fm · 23 days ago
Hi Matt, I'd like to use a local RAG to find pictures of my kids and other people in my personal photo library. I previously spent a lot of time identifying faces and labeling them in digiKam, so the bounding boxes and face names are stored in the Exif metadata. Unfortunately, Ollama still completely ignores the image metadata and generates an anonymous description of my photos... I would then need an external library to load this data, but I don't know how I'd tell who is where in the picture.
@marcoslealdev · 18 days ago
The LLaVA model did this a long time ago. You should refer to it and make a comparison in the future.
@technovangelist · 18 days ago
But llama3.2 is just so much better.
@sanjuburkule · 22 days ago
I want to check whether shoplifting is happening or not. This needs fast comparison of many images per second from the video. Will this work? Is there an easier way to infer shoplifting?
@syirrus · 23 days ago
Thank you for the video. I have the same MacBook, and I will try out that 90b parameter model. Like you said, it might just be too large to run, considering you have to run the OS too.
@emaayan · 23 days ago
I'd love to see if I can give it an image of a floor map along with icons of access points spread across it, give it a scale in pixels per meter, and see if it can say what the density of those APs is on the map (not sure if anything can do that).
@bweagle · 23 days ago
I tried using it to tag some photos to make them easy to search. It didn't seem to do very well at that task.
@emmanuelgoldstein3682 · 23 days ago
Sorry for being so off-topic, but where did you get that shirt? It's freakin' epic.
@technovangelist · 23 days ago
Link in the description. Amazon.
@mightyboessu · 23 days ago
What am I trying to do? Something very simple: describing, tagging, and grouping my own pictures without uploading them to a cloud like Google. Basically a Raspberry Pi would be enough, as it doesn't matter how long it takes to go through all the pictures...
@technovangelist · 23 days ago
Nice
@MontyCraig · 23 days ago
Works great for me
@AntonioCorrenti-b4e · 23 days ago
I tested it yesterday, and unfortunately the model doesn't support tools 😞
@nedkelly3610 · 23 days ago
Awesome. I wonder how this model would go with "computer use" locally: faster and much cheaper than Claude.
@miloldr · 23 days ago
From what I saw of Claude computer use, it specifies exact pixel locations, and I don't think it's so easy for a model to find the exact pixels to click a button.
@chizzlemo3094 · 23 days ago
Great shirts too; not sure llama3.2 can handle those, lol.
@sitedev · 23 days ago
I think your shirt dropped acid.
@technovangelist · 23 days ago
It's a bright one. The other ones I would wear outside in the real world, but this one is ... special.
@MathewRenfro · 23 days ago
@0:45 you forgot Matt from MattVidPro (I am also called Mat, but I don't regularly produce AI content).
@jsward17 · 22 days ago
I have a real estate use case that needs to identify problems with a house from a distance.
@woolfel · 23 days ago
You just need 128GB of memory instead :) You're welcome. Joking aside, a fully loaded M4 Max might get better results. Once people get their hands on them, I'm sure we will see people try.
@mylastore · 18 days ago
It works with my web UI.
@NLPprompter · 21 days ago
LOL, should I change my name to MattNLP?? LOL
@thenoblerot · 22 days ago
Matt Williams > Berman > Wolfe. Those other guys... I swear they have no idea what they're talking about. 😅
@thenextension9160 · 22 days ago
Personally, I don't think it got the meme photo right at all. It looked as though it assumed or hallucinated that people were taking an aggressive approach to flag planting, versus the corporate panel that showed one worker digging one deep hole, which would be appropriate for flag planting. It's obvious to a human that the joke is that startups generate more output because everyone contributes, versus corporations, which have a lot of administrative overhead that does not add to the output.
@technovangelist · 21 days ago
I think that comment means you haven't worked at many startups.
@tokyohouseparty · 23 days ago
He doesn't even answer the comments.
@technovangelist · 23 days ago
One of the other Matts? I answer almost all of the questions that are questions. Which Matt doesn't?
@Joooooooooooosh · 23 days ago
My dude, the sweeping hand gestures are getting out of control.
@technovangelist · 23 days ago
If you don't like how I naturally talk, it was nice of you to stop by, and I will miss you, but I won't change who I am for YouTube.
@Joooooooooooosh · 23 days ago
@technovangelist Jesus, dude 😂
@technovangelist · 23 days ago
I often misread comments....
@tibssy1982 · 23 days ago
Cool, but unfortunately the model doesn't load for me. I got this error:

Error: llama runner process has terminated: GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed

My Ollama version (installed via pacman): ollama 0.4.0-1.1