How Microsoft gets AI to Click the Right Buttons!

9,110 views

Sam Witteveen

1 day ago

Comments

@EnricoGolfettoMasella 1 month ago

I feel your voice now is a good AI generation 👌🏼

@drowningpenguin1588 1 month ago

Automated feature testing! This will be useful for testing UIs during development.

@xthesayuri5756 1 month ago

The problem is that it's not really open source, since it uses Ultralytics YOLO, which has an AGPL-3.0 license that makes it basically unusable for commercial use.

@samwitteveenai 1 month ago

Good point! I didn't realize that when I was actually recording the video.

@alphamindset9634 1 month ago

What would happen if you tried using it commercially, and how would they know?

@thenoblerot 1 month ago

I'm kinda surprised nobody has tried to shoehorn Gemini into a computer use pipeline. Gemini is actually really good at creating pixel level accurate arbitrary bounding boxes from an image. Giving Gemini a screenshot with even a simple prompt like: "Create labels and bounding boxes for clickable user interface elements. Use the format: id#, element label, element type, bounding box" actually works remarkably well, and has for months! With more engineered prompting or tuning it seems like it could be valuable in a computer use workflow. Thanks, Sam!
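To make the output of a prompt like that usable in a pipeline, the returned rows need to be parsed into structured data. Here is a minimal sketch in Python. The comment only names the four fields, so the exact row shape is an assumption: one element per line, with the box written as `(x1, y1),(x2, y2)`. The `UIElement` class and `parse_elements` function are hypothetical names for illustration, not part of any Gemini API.

```python
import re
from dataclasses import dataclass

# Assumed row shape (not confirmed by the comment):
#   "3, Submit, button, (120, 100),(240, 200)"
ROW = re.compile(
    r"^\s*(\d+)\s*,\s*(.+?)\s*,\s*([^,]+?)\s*,\s*"
    r"\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*,\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*$"
)


@dataclass
class UIElement:
    id: int
    label: str
    kind: str  # element type, e.g. "button"
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def parse_elements(text: str) -> list[UIElement]:
    """Parse model output line by line, skipping lines that do not match."""
    elements = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # chatter such as "Here are the elements:" is ignored
        elem_id, label, kind, x1, y1, x2, y2 = match.groups()
        elements.append(
            UIElement(int(elem_id), label, kind, (int(x1), int(y1), int(x2), int(y2)))
        )
    return elements
```

Skipping unmatched lines rather than raising is a deliberate choice here, since model responses often wrap the list in extra text.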

@samwitteveenai 1 month ago

Yes, Gemini can do a lot of these things. My guess is it's worth a try, but fine-tuning for this would make it work even better.

@justtiredthings 1 month ago

Wait, what? Gemini can edit images?

@thenoblerot 1 month ago

@@justtiredthings No, not edit. It can give x,y bounding box coordinates for objects, like: dog (120, 100),(240, 200). In short, it can give coordinates of objects/elements in a picture like YOLO would, but for arbitrary elements. You can play around with it for free in Google AI Studio.

@thenoblerot 1 month ago

@@justtiredthings No, not edit. Just identify the x,y pixel coordinates like YOLO would.
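For computer use, a bounding box like the `dog (120, 100),(240, 200)` example in this thread usually has to be turned into a single point to click. A rough sketch of that step, assuming the two pairs are the top-left and bottom-right corners in pixels (the thread does not say which corners they are, and `click_point` is a hypothetical helper name):

```python
import re

# Matches one "(x, y)" coordinate pair.
PAIR = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def click_point(box_text: str) -> tuple[int, int]:
    """Return the center of a box written as 'label (x1, y1),(x2, y2)'."""
    corners = PAIR.findall(box_text)
    if len(corners) != 2:
        raise ValueError(f"expected two coordinate pairs, got {len(corners)}")
    (x1, y1), (x2, y2) = [(int(x), int(y)) for x, y in corners]
    # Integer midpoint of the two corners.
    return (x1 + x2) // 2, (y1 + y2) // 2


print(click_point("dog (120, 100),(240, 200)"))  # (180, 150)
```

Clicking the center is the simplest choice; it works for either corner ordering because the midpoint is the same.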

@waneyvin 1 month ago

Thanks a lot! I've been looking for this solution for a long time 👍👍👍

@midnightmoves7976 1 month ago

Very cool. Thank you for the information.

@novantha1 1 month ago

You know, there’s all this talk of dynamic computer use through vision, and it’s an interesting field, but I feel as though there’s surely a much simpler solution. In HTML, when embedding images, it’s customary to provide a textual description for screen readers used by the visually impaired, or as a fallback for, e.g., bad internet or environments that don't support image display (for some reason). It seems to me that some sort of open standard for labelling icons and program elements in a similar way, allowing natural language interaction with programs and operating systems, would be a much simpler way to enable computer use.

@MrMoonsilver 1 month ago

Microservice video +1

@ZLibrary-k6r 1 month ago

Great going

@RaitisPetrovs-nb9kz 1 month ago

There are so many use cases: pretty much everything where APIs are not available.

@alphamindset9634 1 month ago

I didn’t get your comment?

@fabriciot4166 1 month ago

Thanks for the contribution 👍! A simple but useful application: code/software repositories could include a .txt file with a prompt for an LLM, and use this system (computer use) to install the libraries and requirements (which can sometimes be a headache even for software developers) so that the code or software can be tested, with the system showing the user how to proceed (for example, installing Web UI or Forge to try Stable Diffusion or something similar). Each piece of software could ship its own instructions to guide the user through basic use, etc. The possibilities are endless.

@GNARGNARHEAD 1 month ago

can't wait till I can have my own AI play video games for me, I just can't click that much anymore, would love to be able to indirectly interact with games like Bannerlord

@samwitteveenai 1 month ago

lol Life becomes an Idle Game

@maxray1796 1 month ago

I would try to use it for RPA.

@samwitteveenai 1 month ago

Certainly these kinds of models and systems are making it much easier to make custom RPA applications

@alphamindset9634 1 month ago

What’s RPA?

@maxray1796 1 month ago

@@alphamindset9634 Robotic Process Automation. RPA is like using software robots to handle the boring, repetitive stuff on a computer (things like moving data between apps or filling out forms) so people don’t have to.

@oshodikolapo2159 1 month ago

Is there a way to use this via an API, just like ChatGPT and the like?

@samwitteveenai 1 month ago

Not currently, you would need to basically put it on a service yourself and create the API endpoint.
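As a rough illustration of "put it on a service yourself", here is a minimal sketch using only Python's standard library. Everything specific is an assumption: the `/parse` path, the JSON shape with an `image` field, and `detect_elements`, which is a hypothetical placeholder where a locally running screen parser would be called. It is not the project's own API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def detect_elements(image_b64: str) -> list[dict]:
    """Hypothetical placeholder: call your locally running screen parser here."""
    return []


def handle_parse(body: bytes) -> tuple[int, dict]:
    """Validate the request body and return (HTTP status, JSON-able result)."""
    try:
        image_b64 = json.loads(body)["image"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected JSON with an 'image' field"}
    return 200, {"elements": detect_elements(image_b64)}


class ParseHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/parse":  # assumed endpoint path
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, result = handle_parse(self.rfile.read(length))
        data = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Run the endpoint until interrupted."""
    HTTPServer((host, port), ParseHandler).serve_forever()
```

Calling `serve()` starts the endpoint on localhost. Keeping the request logic in `handle_parse` means it can be checked without starting a server.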

@oshodikolapo2159 1 month ago

We can use it via gradio_client.

@alphamindset9634 1 month ago

@@samwitteveenai What do you mean by this?

@WillJohnston-wg9ew 1 month ago

I'm working on a tool in support of therapists leveraging AI. I'd like to create agents which can perform research in the background based on client sessions and return data and translate into insights.

@amandamate9117 1 month ago

do it

@ibrahimhalouane8130 1 month ago

Nice, I'd like to see how this could work on edge devices and/or with tiny models like MiniCPM.

@sitedev 1 month ago

How long before this is redundant? Surely it would be easier for Microsoft/Apple to add API-like functionality to their OS and train custom models to integrate with them.

@samwitteveenai 1 month ago

I do think you will find the OS teams adding APIs for these things (they have some already in the form of accessibility APIs), but I don't think they're necessarily going to make those APIs public-facing. They'll probably just keep them for internal use.