How Microsoft gets AI to Click the Right Buttons!

  9,110 views

Sam Witteveen

1 day ago

Comments
@EnricoGolfettoMasella 1 month ago
I feel your voice now is a good AI generation 👌🏼
@drowningpenguin1588 1 month ago
Automated feature testing! This will be useful for testing UIs during development.
@xthesayuri5756 1 month ago
The problem is that it's not really open source since it uses Ultralytics YOLO. It also has an AGPL-3 license, which makes it basically unusable for commercial use.
@samwitteveenai 1 month ago
Good point! I didn't realize that when I was actually recording the video.
@alphamindset9634 1 month ago
What would happen if you tried using it for commercial use, and how would they know?
@thenoblerot 1 month ago
I'm kinda surprised nobody has tried to shoehorn Gemini into a computer use pipeline. Gemini is actually really good at creating pixel level accurate arbitrary bounding boxes from an image. Giving Gemini a screenshot with even a simple prompt like: "Create labels and bounding boxes for clickable user interface elements. Use the format: id#, element label, element type, bounding box" actually works remarkably well, and has for months! With more engineered prompting or tuning it seems like it could be valuable in a computer use workflow. Thanks, Sam!
@samwitteveenai 1 month ago
Yes, Gemini can do a lot of these things. My guess is it’s worth a try, but fine-tuning for this would make it work even better.
@justtiredthings 1 month ago
Wait, what? Gemini can edit images?
@thenoblerot 1 month ago
@justtiredthings No, not edit. It can give x,y bounding box coordinates for objects, like: dog (120, 100),(240, 200). In short, it can give coordinates of objects/elements in a picture like YOLO would, but for arbitrary elements. You can play around with it for free in Google AI Studio.
@thenoblerot 1 month ago
@justtiredthings No, not edit. Just identify the x,y pixel coordinates like YOLO would.
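As a rough illustration of the idea in this thread, here is a minimal Python sketch that parses replies in the `label (x1, y1),(x2, y2)` shape quoted above and turns each box into a click point. The names `Box` and `parse_boxes` are hypothetical, and the sketch assumes the model really does reply in that text shape with pixel coordinates. An actual model may use a different layout or coordinate convention, so treat the regex as a starting point only.

```python
import re
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """One labelled bounding box (hypothetical helper type)."""
    label: str
    x1: int
    y1: int
    x2: int
    y2: int

    def center(self) -> Tuple[int, int]:
        # Midpoint of the box: a reasonable place to click.
        return ((self.x1 + self.x2) // 2, (self.y1 + self.y2) // 2)


# Matches text such as: dog (120, 100),(240, 200)
# Assumed format, taken from the example in the comment above.
_BOX_RE = re.compile(
    r"(?P<label>[^()\n]+?)\s*"
    r"\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*,\s*"
    r"\(\s*(\d+)\s*,\s*(\d+)\s*\)"
)


def parse_boxes(text: str) -> List[Box]:
    """Extract every 'label (x1, y1),(x2, y2)' entry from a model reply."""
    boxes = []
    for match in _BOX_RE.finditer(text):
        x1, y1, x2, y2 = (int(match.group(i)) for i in range(2, 6))
        # Normalise so (x1, y1) is always the top-left corner.
        boxes.append(
            Box(match.group("label").strip(),
                min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
        )
    return boxes


if __name__ == "__main__":
    for box in parse_boxes("dog (120, 100),(240, 200)"):
        print(box.label, box.center())  # dog (180, 150)
```

The click point could then be handed to whatever automation layer drives the mouse; that part is outside this sketch.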
@waneyvin 1 month ago
Thanks a lot! I've been looking for this solution for a long time 👍👍👍
@midnightmoves7976 1 month ago
Very cool. Thank you for the information.
@novantha1 1 month ago
You know, there’s all this talk of dynamic computer use through vision, and it’s an interesting field, but I feel as though there’s surely a much simpler solution. In HTML, when embedding images, it’s customary to provide a textual description for screen readers used by the visually impaired, or as an alternative display option, e.g. for bad internet or for users whose environment might not support image display (for some reason). It seems to me that some sort of open standard for labelling icons and program elements in a similar way, allowing natural-language interaction with programs and operating systems, is a much simpler way to enable computer use.
@MrMoonsilver 1 month ago
Microservice video +1
@ZLibrary-k6r 1 month ago
Great going
@RaitisPetrovs-nb9kz 1 month ago
The use cases are many: pretty much everything where APIs are not available.
@alphamindset9634 1 month ago
I didn’t get your comment?
@fabriciot4166 1 month ago
Thanks for the contribution 👍! A simple but useful application: code/software repositories could include a .txt file with a prompt for an LLM, and use this system (computer use) to install the libraries and requirements (which can sometimes be a headache even for software developers) so the user can test the code or software, with the agent indicating how to proceed (for example, installing a web UI or Forge to test Stable Diffusion). Each piece of software could ship its own instructions to guide the user through basic use, etc. The possibilities are endless.
@GNARGNARHEAD 1 month ago
can't wait till I can have my own AI play video games for me, I just can't click that much anymore, would love to be able to indirectly interact with games like Bannerlord
@samwitteveenai 1 month ago
lol Life becomes an Idle Game
@maxray1796 1 month ago
I would try to use it for RPA.
@samwitteveenai 1 month ago
Certainly these kinds of models and systems are making it much easier to make custom RPA applications
@alphamindset9634 1 month ago
What’s RPA?
@maxray1796 1 month ago
@alphamindset9634 Robotic Process Automation. RPA is like using software robots to handle the boring, repetitive stuff on a computer (things like moving data between apps or filling out forms) so people don’t have to.
@oshodikolapo2159 1 month ago
Is there a way to use this via API? Just like chatgpt and the likes?
@samwitteveenai 1 month ago
Not currently, you would need to basically put it on a service yourself and create the API endpoint.
@oshodikolapo2159 1 month ago
We can use it via gradio_client
@alphamindset9634 1 month ago
@samwitteveenai what do you mean by this?
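To make the reply about putting it on a service yourself more concrete, here is a minimal sketch using only Python's standard library. `parse_screenshot` is a hypothetical placeholder for whatever code actually runs the model, and the `/parse` route, the port, and the JSON shape are illustrative assumptions, not part of the project.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List, Tuple


def parse_screenshot(image_bytes: bytes) -> List[dict]:
    """Hypothetical placeholder: call the real screen-parsing model here."""
    return []  # the stub finds no elements


def build_response(path: str, image_bytes: bytes) -> Tuple[int, bytes]:
    """Pure request logic: map a path and request body to (status, JSON body)."""
    if path != "/parse":  # assumed route name
        return 404, json.dumps({"error": "unknown endpoint"}).encode("utf-8")
    elements = parse_screenshot(image_bytes)
    return 200, json.dumps({"elements": elements}).encode("utf-8")


class ParseHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the raw screenshot bytes sent by the client.
        length = int(self.headers.get("Content-Length", 0))
        status, body = build_response(self.path, self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Serve on localhost only; the port is an arbitrary choice.
    HTTPServer(("127.0.0.1", 8000), ParseHandler).serve_forever()
```

A client would POST the screenshot bytes to `/parse` and read back the JSON list of elements. Keeping the logic in `build_response` makes it easy to check without starting a server.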
@WillJohnston-wg9ew 1 month ago
I'm working on a tool in support of therapists leveraging AI. I'd like to create agents which can perform research in the background based on client sessions and return data and translate into insights.
@amandamate9117 1 month ago
do it
@ibrahimhalouane8130 1 month ago
Nice, I'd like to see how this could work on edge devices and/or with tiny models like MiniCPM.
@sitedev 1 month ago
How long before this is redundant? Surely it would be easier for Microsoft/Apple to add API-like functionality to their OS and train custom models to integrate with them.
@samwitteveenai 1 month ago
So I do think you will find the OS teams adding APIs for these things (they have some already in the form of accessibility APIs), but I don't think they're going to make those APIs necessarily public-facing. They'll probably just keep them for internal use.
@Saif-G1 1 month ago
Yeah. Make an agent that can control our phones, and please, for Android.
@samwitteveenai 1 month ago
The big challenge here is actually running these kinds of models on a mobile phone at the moment, but I agree it would be pretty cool.
@vassovas 1 month ago
This must be what Microsoft's Power Automate is built on, surely? Been a thing for a while?
@samwitteveenai 1 month ago
All the companies have been working on these kinds of things for a while, so very well could be.