Turning an image into an open book is amazing. What we have today with language models is the natural progression of what started with RNNs and continued with transformers. Over time, things improved: better architectures, scaling laws, larger datasets, and now we have these sophisticated language models. It’s a gradual evolution. But for images, it’s a completely different story. This isn’t an extension of conventional image processing techniques like classification, object detection, or segmentation. It’s something entirely new. The process essentially transforms an image into text, enabling us to dig in, ask questions, or extract information natively, using the same model that processes text and audio. Everything becomes a text sequence. What’s fascinating is that it bypasses all the classical image processing methods: no need for specialized data preparation, binarization, or other traditional steps. It’s a totally different solution to the problem, redefining how we process and understand images. This shift is what truly amazes me: it’s not just an improvement, but a fundamental change.
@engineerprompt 20 days ago
I agree! Having worked with CNNs in the early days, making a cat vs. dog classifier without handwritten features felt like magic. This is a whole new level. A single model that can understand different modalities unlocks applications that were not possible before.
@Joethegamer5 19 days ago
I tested it with 6 languages with the Talk to Gemini feature. It can seamlessly switch between languages and although the accent in some languages is not perfect, it works insanely well!
@DearGeorge3 20 days ago
Very useful!
@AbhishekMane-s9q 17 days ago
Can you make a video on how to use the Gemini 2.0 API key for our own text-to-speech and speech-to-text conversation?
@JordanC-f5j 20 days ago
Would be nice to see how it compares to Sonnet 3.5. Gemini seems to score higher on various benchmarks but it'd be nice to see real problem solving in different fields and how closely it follows the instructions.
@engineerprompt 20 days ago
working on it :)
@tollington9414 20 days ago
Very interesting
@_abdul 20 days ago
They Cooked, And this time it's Tasty.
@CollinParan 20 days ago
Multimodal V2LMs are the way
@engineerprompt 20 days ago
agree
@MrAhsan99 20 days ago
Is it just me, or has this guy's voice changed?
@SingularityReacts505 20 days ago
Just tried ChatGPT with vision. It's so much better than this garbage it's not even funny. OpenAI stays winning.
@sillybilly346 20 days ago
Bot deployed by OpenAI?
@SingularityReacts505 20 days ago
@sillybilly346 No, it's just advanced voice mode with vision. It's f****** insane.