Automated feature testing! This will be useful for testing UIs during development.
@xthesayuri5756 • 1 month ago
The problem is that it's not really open source: since it uses Ultralytics YOLO, it also carries the AGPL-3.0 license, which makes it basically unusable for commercial use.
@samwitteveenai • 1 month ago
Good point! I didn't realize that when I was recording the video.
@alphamindset9634 • 1 month ago
What would happen if you tried using it for commercial use, and how would they know?
@thenoblerot • 1 month ago
I'm kind of surprised nobody has tried to shoehorn Gemini into a computer-use pipeline. Gemini is actually really good at creating pixel-level accurate, arbitrary bounding boxes from an image. Giving Gemini a screenshot with even a simple prompt like "Create labels and bounding boxes for clickable user interface elements. Use the format: id#, element label, element type, bounding box" actually works remarkably well, and has for months! With more engineered prompting or tuning, it seems like it could be valuable in a computer-use workflow. Thanks, Sam!
@samwitteveenai • 1 month ago
Yes, Gemini can do a lot of these things. My guess is it's worth a try, but fine-tuning for this would make it work even better.
@justtiredthings • 1 month ago
Wait, what? Gemini can edit images?
@thenoblerot • 1 month ago
@justtiredthings No, not edit. It can give x,y bounding box coordinates for objects, like: `dog (120, 100),(240, 200)`. In short, it can give coordinates of objects/elements in a picture like YOLO would, but for arbitrary elements. You can play around with it for free in Google AI Studio.
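A minimal sketch of how plain-text box lines like these could be parsed downstream; the line format and regex are assumptions based on the `dog (120, 100),(240, 200)` example, not a documented Gemini output schema:

```python
import re

# Assumed line format: "<label> (x1, y1),(x2, y2)", one object per line.
LINE_RE = re.compile(
    r"^(?P<label>.+?)\s*\((?P<x1>\d+),\s*(?P<y1>\d+)\),\s*\((?P<x2>\d+),\s*(?P<y2>\d+)\)$"
)

def parse_boxes(text):
    """Turn a plain-text box list into (label, (x1, y1, x2, y2)) tuples."""
    boxes = []
    for line in text.strip().splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            boxes.append((
                m.group("label"),
                tuple(int(m.group(k)) for k in ("x1", "y1", "x2", "y2")),
            ))
    return boxes

if __name__ == "__main__":
    sample = "dog (120, 100),(240, 200)\nsubmit button (10, 20),(80, 50)"
    print(parse_boxes(sample))
```

In practice you would prompt the model to stick to one fixed format and skip any lines that fail to parse, since free-form model output can drift.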
@waneyvin • 1 month ago
Thanks a lot! I'd been looking for this solution for a long time 👍👍👍
@midnightmoves7976 • 1 month ago
Very cool. Thank you for the information.
@novantha1 • 1 month ago
You know, there's all this talk of dynamic computer use through vision, and it's an interesting field, but I feel as though there's surely a much simpler solution. In HTML, when embedding images, it's customary to provide a textual description for use with screen readers for the visually impaired, or as an alternative display option for, e.g., bad internet, or for users whose environment might not support image display (for some reason). It seems to me that some sort of open standard for labelling icons and elements of programs in a similar way, allowing natural language interaction with programs and operating systems, would be a much simpler way to enable computer use.
@MrMoonsilver • 1 month ago
Microservice video +1
@ZLibrary-k6r • 1 month ago
Great going
@RaitisPetrovs-nb9kz • 1 month ago
The use cases are many: pretty much anything where an API is not available.
@alphamindset9634 • 1 month ago
I didn't get your comment.
@fabriciot4166 • 1 month ago
Thanks for the contribution 👍! A simple but useful application: code/software repositories could include a .txt with a prompt for an LLM, and this system (computer use) could handle the installation of libraries and requirements (which can sometimes be a headache even for people in software development) so you can test some code or software, showing the user how to proceed (for example, installing WebUI or Forge to test Stable Diffusion, or something else). Each piece of software could include its own instructions to guide the user through basic use, etc. The possibilities are endless.
@GNARGNARHEAD • 1 month ago
Can't wait until I can have my own AI play video games for me. I just can't click that much anymore; I'd love to be able to interact indirectly with games like Bannerlord.
@samwitteveenai • 1 month ago
lol Life becomes an Idle Game
@maxray1796 • 1 month ago
I would try to use it for RPA.
@samwitteveenai • 1 month ago
Certainly, these kinds of models and systems are making it much easier to build custom RPA applications.
@alphamindset9634 • 1 month ago
What’s RPA?
@maxray1796 • 1 month ago
@alphamindset9634 Robotic Process Automation. RPA is like using software robots to handle the boring, repetitive stuff on a computer (things like moving data between apps or filling out forms) so people don't have to.
@oshodikolapo2159 • 1 month ago
Is there a way to use this via API, just like ChatGPT and the like?
@samwitteveenai • 1 month ago
Not currently; you would basically need to host it as a service yourself and create the API endpoint.
@oshodikolapo2159 • 1 month ago
We can use it via gradio_client.
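A minimal sketch of calling a hosted Gradio demo with `gradio_client` (Gradio's official Python client). The Space id, parameter names, and `api_name` below are placeholders, not the real endpoint of this project; check the app's "Use via API" page for the actual signature.

```python
def build_request(image_path: str, instruction: str) -> dict:
    """Assemble keyword arguments for the Gradio endpoint. The parameter
    names here are hypothetical placeholders."""
    return {"image": image_path, "instruction": instruction}

if __name__ == "__main__":
    # Imported here so the sketch can be read without the package installed.
    from gradio_client import Client

    # Replace with the actual Space id or URL of the hosted demo.
    client = Client("your-username/your-space")
    result = client.predict(
        **build_request("screenshot.png", "Label the clickable UI elements"),
        api_name="/predict",
    )
    print(result)
```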
@alphamindset9634 • 1 month ago
@samwitteveenai What do you mean by this?
@WillJohnston-wg9ew • 1 month ago
I'm working on a tool to support therapists leveraging AI. I'd like to create agents that can perform research in the background based on client sessions, return data, and translate it into insights.
@amandamate9117 • 1 month ago
do it
@ibrahimhalouane8130 • 1 month ago
Nice. I'd love to see how this could work on edge devices and/or with tiny models like MiniCPM.
@sitedev • 1 month ago
How long before this is redundant? Surely it would be easier for Microsoft/Apple to add API-like functionality to their OSes and train custom models to integrate with them.
@samwitteveenai • 1 month ago
I do think you will find the OS teams adding APIs for these things (they have some already in the form of accessibility APIs), but I don't think they're necessarily going to make those APIs public-facing. They'll probably just keep them for internal use.
@Saif-G1 • 1 month ago
Yeah, make an agent that can control our phones, please, and for Android.
@samwitteveenai • 1 month ago
The big challenge here is actually running these kinds of models on a mobile phone at the moment, but I agree it would be pretty cool.
@vassovas • 1 month ago
Surely this must be what Microsoft's Power Automate is built on? It's been a thing for a while, hasn't it?
@samwitteveenai • 1 month ago
All the companies have been working on these kinds of things for a while, so it very well could be.