Dammit, stop telling everybody about Jina, my secret weapon. Just stop, it's my advantage. Everybody ignore it, it's horrible, I swear.
@devlearnllm 4 months ago
TOO LATE
@jarad4621 4 months ago
@@devlearnllm 😉. This was one of the most valuable vids I've seen in the past few weeks in terms of content: the top 3 specialized scrapers I searched hard for, all covered together in one good video. Nice. Add a good, cheap open-source LLM like Llama 3 and it = $$$ if you know how. Data is valuable, and things that were not possible or affordable for most people before are now. I can do stuff for $12 now that some would pay thousands for. It's a wonderful new world! I just finished something awesome with Python, Jina, and Llama 3 on OpenRouter in 2 days that's gonna double my revenue or more, and I don't even know how to code lol, thanks GPT. Jina does have a paid API key on the API page btw: 1M tokens free, which worked out to 580 pages or so. But the pricing is so low it's insane: 500M tokens, or about 280,000 pages, for $10. That destroys Firecrawl's pricing (Firecrawl is also good and has its place, but it's much more costly). I think Scrapegraph uses an LLM to parse, so it's gonna be expensive on tokens, right, sending raw websites to LLMs? I've asked them like you did. I only wish Jina showed menus and internal links, then it would be perfect. Those hold valuable data themselves and point to more valuable pages to visit, like pricing. I'll ask if there is a way, but I guess I can add something cheap to the workflow for that. Any suggestions? Probably some Python library, I'll ask Perplexity lol. I'm actually new to the tech side, but I see the business value as a marketer, so I'm learning as fast as I can! It's the new gold rush. Great video, subbed, looking forward to more. Cheers
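On the "something cheap to add to the workflow" question: one possible sketch, using only the Python standard library, collects the internal links from a page's raw HTML so that pages like pricing can be queued for a later visit. The names `LinkCollector` and `internal_links` and the sample HTML are made up for illustration; this is not something Jina provides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(html, base_url):
    """Return sorted absolute URLs that live on the same host as base_url."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    found = set()
    for href in parser.hrefs:
        # Resolve relative links and drop the #fragment part.
        absolute = urljoin(base_url, href).split("#")[0]
        if urlparse(absolute).netloc == host:
            found.add(absolute)
    return sorted(found)


# Hypothetical page fragment, for illustration only.
SAMPLE_HTML = """
<nav><a href="/pricing">Pricing</a> <a href="/docs#intro">Docs</a></nav>
<p><a href="https://other.example.org/">Elsewhere</a></p>
"""

links = internal_links(SAMPLE_HTML, "https://example.com/")
print(links)
```

The resulting list could then be filtered for words like "pricing" before sending each URL to whichever scraper you use.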
@TheBrighamhall 4 months ago
@@devlearnllm thought you said Jira and was so confused..
@devlearnllm 4 months ago
@@TheBrighamhall imagine lol
@ShinyTechThings 4 months ago
🤣🤣🤣🤣
@alonsoalarconaguilar7113 4 months ago
The YouTube algorithm is just insanely good at what it does. This is exactly the content I needed, and I think I have found what I want to dedicate my life to as a professional. Thank you for the video, I will buy your course as soon as I have saved up the money.
@florianhonicke5448 2 months ago
@LLMs for Devs. I'm from Jina AI. Cool that you are using our reader app. I like seeing the exact use cases people have for it. Very interesting.
@devlearnllm 2 months ago
Big fan of Jina.
@jeffc1736 28 days ago
@@devlearnllm hi times are tough. can I borrow 10000k? I need rent money and lost my job as a retail worker at Dicks sporting goods in dallas.
@kylelau1329 4 months ago
Thank you for introducing all the latest technology for web scraping!
@matten_zero 4 months ago
The reader API tip is so clutch. Thank You!
@Breaking_Bold 20 days ago
I like this format of video...background has a large monitor...Nice video
@NikhilSwamiExperimental 3 months ago
Bro dropping bomb content. Meanwhile, I made a comment analyzer for highly detailed videos that have 100+ comments, since I didn't have time to go through them all. Man, sometimes you don't need to build an Iron Man suit to do simple stuff.
@devlearnllm 3 months ago
Printing this comment out and putting on my wall
@aarushsaboo1194 3 months ago
Bro, did you build a comment analyzer for all YouTube videos in which all you need to do is post a YouTube link? That's a nice project!
@NikhilSwamiExperimental 20 days ago
@@aarushsaboo1194 it's impossible to read thousands of comments, bro, and time is money.
@juroo 18 days ago
pure gold, thanks man!
@ariG23498 3 months ago
How did I not get your content sooner? Love it!
@MrPkmonster 1 month ago
Thank you so much for the presentation. Just in time with the latest scraping technology
@devlearnllm 1 month ago
You bet!
@antoniuskonovalov 4 months ago
Just started wondering about web scraping and here you are. Thank you.
@roberthuff3122 4 months ago
🎯 Key Takeaways for quick navigation:
00:00 *🚀 Introduction to web scraping for LLMs in 2024*
- Overview of startups pivoting to web scraping.
- Mention of Mendable and its Firecrawl tool for scraping the web using large language models.
02:06 *🔍 Scraping competitors' pricing pages*
- The process of scraping competitors' pricing for market research.
- Introduction to tools used for scraping: Jina AI, Mendable, and Scrapegraph-ai.
03:01 *🧠 Understanding tiktoken and its application*
- Explanation of tokenization and encoding in web scraping.
- Discussion on the cost implications based on tokenization.
05:17 *🛠️ Setting up scrapers with Beautiful Soup and other tools*
- Description of different scraping tools and their setup.
- Comparisons among Beautiful Soup, Jina AI, and Mendable based on ease of use and output.
07:32 *📊 Running scrapers and analyzing outputs*
- Execution of web scraping and evaluation of the output from different tools.
- Analysis of readability and format of the scraped data.
09:37 *💰 Cost comparison and effectiveness of scraping tools*
- Comparison of costs associated with various scraping tools.
- Evaluation of which tool provides the most value for money.
12:53 *🤖 Extracting pricing information using OpenAI*
- Utilization of OpenAI for extracting specific data points.
- Challenges and strategies in obtaining clean and useful information.
17:20 *🌐 Overview of Scrapegraph for advanced web scraping*
- Introduction to Scrapegraph as an open-source project.
- Examples of complex data extraction and its accuracy.
Made with HARPA AI
@thethree60five 4 months ago
...The best in-browser AI automation system.
@thedoctor5478 4 months ago
This Jina thing is cool. The BeautifulSoup scraper is obviously not a solution. Most web pages (especially articles, media, etc.) have schema.org ld+json metadata ready to be extracted, though. There are some good Python libs for getting the metadata. There are many scraping APIs, and most of them are not worth the cost IMO. phantomjscloud is probably one exception, depending on volume. Otherwise, one must find a good proxy provider and send a bunch of fancy HTTP headers to bypass anti-bot, like you said. Blackhatworld is a great resource for proxies and all manner of other accounts. The whole scraping thing is a giant rabbit hole. Jina is for sure keeping all that data. It's not a bad plan, actually. I think I may do the same.
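To illustrate the ld+json point above, here is a minimal sketch that pulls `application/ld+json` blocks out of raw HTML with only the Python standard library. The class and function names and the sample page are invented for the example; dedicated metadata libraries like the ones the comment alludes to will handle more edge cases.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed JSON from <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_jsonld(html):
    """Return a list with one parsed object per valid ld+json block."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.items


# Hypothetical page, for illustration only.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}
</script>
<script>console.log("ignored");</script>
</head><body><p>Body text</p></body></html>
"""

items = extract_jsonld(SAMPLE_HTML)
print(items)
```

When a page ships this metadata, fields such as the headline or price may be available without sending any page text to an LLM at all.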
@forgotmyoldSN 3 months ago
Thanks for adding a new project to my to do list!
@uwepleban3784 4 months ago
The transcript at 1:39 states that you are using large sandwich models. This must be a brand new type of model - mouth watering indeed. 😂
@devlearnllm 4 months ago
Heck yeah 🥪
@ishaquealidad8361 1 month ago
I replayed 1:39 three times. Is he saying 'large sandwich model'? 🤣
@nickk6575 4 months ago
Great video! The open source tool looks great! As an aside, I use instructor and pydantic classes to get the LLMs to provide the JSON as I expect it. In my limited experience, DSPy wasn't as explicit as I wanted.
@devlearnllm 4 months ago
Good idea
@jarad4621 4 months ago
Are you using those two libraries with the Agency Swarm agentic framework? It uses them as well to ensure performance/quality. If not, it might be something you'd be interested in: a proper production-capable agentic framework. That, with its automation and decision-making capability, plus Jina + LLMs = profit for so many use cases.
@catchychazz 4 months ago
Are you referring to DSPy assertions?
@chanliah1918 2 months ago
Great demo, thank you!
@devlearnllm 2 months ago
My pleasure!
@Ckoraybingol 1 month ago
Great intro and work flow. Thanks a lot.
@devlearnllm 1 month ago
Much appreciated!
@supriyosarkar1806 3 months ago
I feel really sad that you publicly talked about Jina. I used to feel special knowing very few people are aware of it lol
@devlearnllm 3 months ago
my badd
@kuhltime 4 months ago
Came at the perfect time. Very good video. Thx 😊
@devlearnllm 4 months ago
If anyone’s having issues viewing the notebook on GitHub, it’s GitHub’s fault. Feel free to clone it (the code is there, GH just couldn’t display it recently: stackoverflow.com/questions/78501731/error-nbformat-when-uploading-to-github-from-google-colab)
@shuaiwang4092 2 months ago
Such valuable video content! Many thanks for sharing~~
@robelbelay4065 29 days ago
Great stuff man, thanks a lot!
@devlearnllm 28 days ago
Cheers!
@terrytan1827 3 months ago
16:05 Worth trying out GPT-4, I find it more accurate at following instructions.
@BlueBearOne 4 months ago
Thank you. I'll be "away" for a while while I conquer the...I mean save the world!
@markt4565 3 months ago
keep up the good work! - this is an awesome presentation!
@devlearnllm 3 months ago
TY
@frasonfrancis9698 2 months ago
I don’t know how effective this will be in the long run, especially given Cloudflare's security update to block AI web scraping agents.
@planplay5921 4 months ago
But the first problem that all crawls need to face is how to avoid being blocked.
@PracticalAI_ 3 months ago
there are ways, maybe I will do a video about that ... but that is a dark art :)
@planplay5921 3 months ago
@@PracticalAI_ I'm really looking forward to it!😊
@Van-Helssen 3 months ago
Rotate proxies and randomize your queries, dude. Easy task.
@PracticalAI_ 3 months ago
@@Van-Helssen lol it's not 2014, proxies are recognised by most providers, and they will immediately invalidate the user (if you are scraping while logged in). There are other ways, using regular IPs
@Van-Helssen 3 months ago
@@PracticalAI_ *residential proxies as you would probably know….
@shivam_in 4 months ago
If I'm going to scrape millions of pages regularly, no way in hell would AI come anywhere close in accuracy and efficiency to a plain HTTP request or browser load and Jsoup parsing.
@TranKiet-pj9mw 3 months ago
YouTube really knows what I am looking for :V With Python, crawling a website with an LLM is simple, just a few lines of code. 8 years ago I used a Python tool to do the same thing with much more effort. Right now I'm trying to mix data from websites/databases with a knowledge map to get an overview, so I can find the shortest path through it. That takes less time than reading an entire book in the field: I can focus on just a few topics and still get the result. Nah, but you introduced the method with LLMs. Thanks.
@devlearnllm 3 months ago
Awesome. Thanks for sharing
@jetlime08 3 months ago
Is the LLM community really not aware of 40 year old Natural Language Pre-processing methods developed for data mining and NLP?
@erickcampos50 3 months ago
Could you explain it better? I can't see how to connect what you said with this subject
@Bluesourboy 3 months ago
I don't know if the community is aware that this has been a problem to solve for quite some time.
@moafro6524 3 months ago
Underrated. Glad I found this.
@st.3m906 4 months ago
Amazing video, thank you
@BobKane-g6x 4 months ago
GPT-4o can do this now. Just tested and it's awesome.
@ronaldokun 3 months ago
Thank you!!!!!!
@AtharvDharmadhikari-vc9fk 4 months ago
I used Scrapegraph-ai and was also stuck on getting the cost, but then I got it by making some changes inside the scrapegraphai library. Internally the library uses LangChain and LangSmith, so it was already calculating the cost.
@devlearnllm 4 months ago
That's awesome. How do you get it to work with LangSmith?
@khemchay 3 months ago
Jina love it...
@antronx7 1 month ago
Would be cool to make an AI website scraper that strips away all the JavaScript bloat from a webpage and converts it into a lightweight basic HTML page while preserving functionality. Would be great as a proxy service to make modern web pages load fast on slow phones with poor data connections. The modern web is way too bloated. I sometimes manually archive a page by deleting all the JavaScript in Notepad++ and modifying image embed links to point to locally saved .png files. That takes a long time, but I can reduce a 5MB page down to 200kB and save that. Would be nice to have a smart automated tool to do that in seconds.
@nzt29 3 months ago
Haven’t watched it fully yet, but I’m really curious to see how it handles the looming threat of model collapse. edit: Yeah it didn’t talk about it. It’s going to be hellish when the internet becomes increasingly flooded with LLM output
@artmadiar 4 months ago
Great presentation! I'm surprised that Jina AI's free scraper doesn't require an API key?!! I guess it might be shut down soon for public access
@jarad4621 4 months ago
There is a paid version that's worth it. Check the API page: the key at the bottom auto-generates a unique one somehow. You get 1M tokens free, then it's $10 for 500M tokens, which is like 280k pages. That's insanely low and basically free anyway. Crazy valuable tool.
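The arithmetic behind those figures can be checked in a few lines. The numbers below are the ones quoted in this thread ($10 for 500M tokens, roughly 280k pages, 1M free tokens), not official pricing, and the per-page figure is only an average implied by them.

```python
# Figures as quoted in the comments above; treat them as assumptions.
PRICE_USD = 10.0
TOKENS_PER_PACK = 500_000_000
PAGES_PER_PACK = 280_000
FREE_TOKENS = 1_000_000

# Average page size implied by "500M tokens is about 280k pages".
tokens_per_page = TOKENS_PER_PACK / PAGES_PER_PACK

# Cost of one average page at the quoted price.
usd_per_page = PRICE_USD / PAGES_PER_PACK

# How many average pages the free allowance would cover.
free_pages = FREE_TOKENS / tokens_per_page

print(f"{tokens_per_page:.0f} tokens per page")
print(f"${usd_per_page:.6f} per page")
print(f"{free_pages:.0f} pages on the free allowance")
```

That works out to roughly 1,800 tokens and a few thousandths of a cent per page, with the free allowance covering a bit under 600 pages, in the same ballpark as the "580 pages or so" mentioned earlier in the thread.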
@artmadiar 4 months ago
@@jarad4621 oh wow! it's amazing! thanks for clarification
@stevefox7469 4 months ago
How do these tools cope with Cloudflare operating on the target site, which attempts to block scraping?
@svenvanwier7196 4 months ago
Can't stop the bots. I know about SeleniumBase for Python..... takes some research but... hey
@sitedev 4 months ago
Gold!
@RenkoGSL 4 months ago
lol that's awesome!
@theadaloguy 2 months ago
Great video, thanks. Is there a way to provide our own scraped data (so we can make sure we use a good stealth scraper and get all the content), and then the LLM analyses it like this?
@devlearnllm 2 months ago
Yeah, you can always just build an LLM chain to just extract data. You can find the example in the Google Colab I provided.
@You.Got.Lucky_ 2 months ago
This video was really helpful for people like me looking for web scraping tools. Though I wonder if Jina AI is really free. Is there any challenge in using it for a larger number of links? Does it have a rate limit on hitting URLs with the prefix? Any clarification on this is appreciated. : )
@devlearnllm 2 months ago
No hard limits as far as I know. Free for now (I think this is intentional), but definitely will change in the future.
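For anyone wondering what "hitting URLs with the prefix" looks like in code, here is a minimal sketch using only the standard library. It assumes the Reader endpoint is reached by prepending `https://r.jina.ai/` to the target URL and that an optional API key is sent as a Bearer token. The helper names are made up, and limits or pricing may have changed since these comments were written.

```python
import urllib.request

# Assumption: the Reader service is used by prefixing the target URL.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url):
    """Build the Reader URL for a target page."""
    return READER_PREFIX + target_url


def fetch_markdown(target_url, api_key=None, timeout=30):
    """Fetch the LLM-friendly text for a page (performs a network call)."""
    request = urllib.request.Request(reader_url(target_url))
    if api_key:
        # Assumption: paid keys are passed as a Bearer token.
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


print(reader_url("https://example.com/pricing"))
# To actually fetch a page (needs network access):
# text = fetch_markdown("https://example.com/pricing")
```

If you plan to run this over many links, adding a small delay between calls and handling HTTP errors is a sensible precaution, since rate limits can be introduced at any time.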
@augmentos 4 months ago
Can anyone speak to the architecture or other tools to prevent detection using Beautiful Soup, as he mentioned? What would be the best process to avoid detection, and with what tools? I wish you had elaborated there, considering it’s the subject of the video in large part.
@prashantbhardwaj6322 3 months ago
Can you please fix the camera? I was already feeling dizzy within 60 seconds due to the constant camera movement!
@devlearnllm 3 months ago
Working on it. Just need to find the setting in DJI Pocket 3 to slow down the tracking speed
@jakobkristensen2390 1 month ago
I'm curious how you handle pages where the content exceeds the token window
@devlearnllm 1 month ago
I'm sure Firecrawl or Jina would have a rolling context window for extraction. It's an easy thing to implement.
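As a rough idea of what a rolling window could look like if you implement it yourself: the sketch below splits a token list into overlapping chunks so each one fits the model's limit. This is not how Firecrawl or Jina do it internally (that is not stated here), and it uses whitespace-separated words as a stand-in for real tokens; swap in a real tokenizer for accurate counts.

```python
def rolling_chunks(tokens, window, overlap):
    """Split tokens into chunks of at most `window` items.

    Consecutive chunks share `overlap` items so that content cut at a
    boundary still appears whole in one of the two neighbouring chunks.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last chunk already reaches the end
    return chunks


# Words stand in for tokens here; this is an approximation.
words = "one two three four five six seven".split()
for chunk in rolling_chunks(words, window=3, overlap=1):
    print(" ".join(chunk))
```

Each chunk would then be sent to the LLM separately, and the extracted fields merged afterwards, for example by keeping the first non-empty value per field.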
@danielcave9606 3 months ago
How well does Jina do with bigger sites with anti-bot protection?
@SonGoku-pc7jl 2 months ago
Thanks, but what is the difference, or which is better: Jina Reader or Scrapegraph-ai?
@stanTrX 4 months ago
What are the good, easy-to-use tools with LangChain? An LLM is not very useful without such tools; it doesn't even know today's date.
@PaulFidika 3 months ago
"The entire internet hates him for this one simple trick"
@devlearnllm 3 months ago
9/10 prompt engineers recommend this
@bastabey2652 3 months ago
These scraping tools are impressive... but they are not ready for scraping full websites with 100s of webpages.. unfortunately, there is still significant room for manual scraping..
@thingX1x 3 months ago
Using Jina now hehe. Does anyone know if you can get better results from Amazon?
@dhineshprabakaran1786 1 month ago
Hi, I'm trying to scrape web data from my org's docs, which are accessible only within a VPN. I get: Failed to goto 'docs url'. Can you help me with this?
@NickaGillis 18 days ago
Can Jina handle sites with lazy load? Looking at dealership websites
@switch8291 1 month ago
You haven't updated us on how much Scrapegraph-ai takes in comparison
@devlearnllm 1 month ago
Ah shoot I forgot about that.
@lomash_irl 2 months ago
I guess Selenium is still the choice for JavaScript-heavy websites... any tips on this?
@eyoo369 3 months ago
Jina is almost perfect.. too bad it's not smart enough to scrape content from "accordions" where you first click to make the content visible. I feel a smart AI scraper should be able to grab that text and determine based on CSS class that it's probably valuable text.. just hidden at the time
@devlearnllm 3 months ago
That's too bad. What's the alternative?
@denisblack9897 4 months ago
Damn, bro, get ready for heavy lifting) Baldness is coming. Been there, you’ll look much much better!
@devlearnllm 4 months ago
Lmao thanks brother
@zaid6527 1 month ago
I don't know if my question is stupid, but can we take snapshots of a website and use OCR and LLMs to scrape the useful info, instead of sending requests to that website? It would look more human and also use fewer requests.
@devlearnllm 1 month ago
Yeah you can probably do that!
@zaid6527 1 month ago
@@devlearnllm thanks 🤝
@nve-c5d 2 months ago
So what did you find out about Scrapegraph-ai's performance and token usage?
@jonathanpark873 1 month ago
I wonder if you would update it to be able to use gpt-4o-mini as it's much cheaper
@devlearnllm 1 month ago
yep
@marthasamuel 4 months ago
Would these work for a dynamic website?
@GeoffY2020 4 months ago
I tried to read or download Web_scraping_for_LLM_in_2024.ipynb but it's not readable. Can you replace it?
@GeoffY2020 4 months ago
ok i can read it in colab
@MMABeijing 4 months ago
That's basic stuff. I feel like it's 2023, and I was late to the party too
@JohnMcclaned 3 months ago
such an inefficient and unreliable way to scrape the web
@ryana2952 4 months ago
Fix your camera, that's annoying AF
@devlearnllm 4 months ago
Sounds like you don’t like the swiveling on it
@rwz 3 months ago
Please do not move the camera all the time
@haganlife 3 months ago
Definitely loosen up the tracking to center. OSBTail?
@devlearnllm 3 months ago
It's actually built into the DJI Pocket 3 camera. I just had it for a few weeks. Just need to find the settings for it.
@forrest714 3 months ago
@@devlearnllm change the follow speed to slow instead of fast.
@PineState77 3 months ago
What’s the best way to get in touch?
@devlearnllm 3 months ago
Details in the video’s description
@thisiswill 1 month ago
The motion-tracking is a bit distracting.
@kungfooman 3 months ago
"how to block these fuckin idiots AWS servers to protect your website" next
@PedroIvo-iz5sv 4 months ago
Does it work in Portuguese?
@flor.7797 4 months ago
none of these seem better than Trafilatura?
@flor.7797 4 months ago
scrapegraph looks cool though
@devlearnllm 4 months ago
@@flor.7797 How's your experience using Trafilatura? I haven't tried that yet
@flor.7797 4 months ago
@@devlearnllm I’m more into main content extraction and boilerplate removal. There isn’t one size fits all unfortunately
@jarg7 4 months ago
broken link to github
@devlearnllm 4 months ago
Yeah there’s something weird with GitHub not displaying the notebook right. The link is the same.
@kevinlukejr.8996 3 months ago
Firecrawl is too expensive
@mrRambleGamble 3 months ago
The camera moves too much
@devlearnllm 3 months ago
its the worst
@mrRambleGamble 3 months ago
@@devlearnllm Aside from that, great video.
@pcebro 3 months ago
You should definitely wear pants.
@chetanesque158 3 months ago
Interesting! Although I was distracted by your attire... Seriously, I was not born 30 years ago, man, but can we dress a bit better for a presentation?!
@devlearnllm 3 months ago
Lol what's wrong with my wardrobe
@chetanesque158 3 months ago
@@devlearnllm Hi! Sorry, but think about it. You are doing everything right, then why dress up like that? Why not better?