Web Scraping for LLM in 2024: Jina AI Reader API, Mendable Firecrawl, and Crawl4AI and More

Views: 24,664

Prompt Engineering


Comments: 43
@engineerprompt 8 months ago
If you want to build robust RAG applications based on your own datasets, this is for you: prompt-s-site.thinkific.com/courses/rag
@lorenzo.padoan 6 months ago
Thanks for mentioning ScrapeGraphAI. I'm one of the co-founders. We have implemented new features, such as a code generator for scraping that minimizes the number of LLM calls on sites whose pages share a common structure. We are also preparing something big related to KG, so stay tuned :))))
@unclecode 8 months ago
Thanks for mentioning Crawl4AI! I'm adding some new features, such as extracting all media tags (video, image, audio), Breadth-First Search (BFS) crawling, and more. My aim is to generate quality data without relying on large language models (LLMs). I think firing up GPUs to run a model with billions of parameters just to crawl data from a page is a bit over the top. Developers can use LLMs themselves once they have the right raw data from web sources.
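The two features named above, media-tag extraction and BFS crawling, can be illustrated with a minimal standard-library sketch. This is not Crawl4AI's implementation or API; the names `PageParser`, `bfs_crawl`, and the `fetch_html` callback are hypothetical and exist only for illustration. The sketch assumes the caller supplies a function that returns the HTML for a URL.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urljoin

# Tags whose "src" attribute points at a media file.
MEDIA_TAGS = {"img", "video", "audio", "source"}


class PageParser(HTMLParser):
    """Collects outgoing links and media URLs from one HTML page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []
        self.media: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attr_map["href"]))
        elif tag in MEDIA_TAGS and attr_map.get("src"):
            self.media.append(urljoin(self.base_url, attr_map["src"]))


def bfs_crawl(
    start_url: str,
    fetch_html: Callable[[str], str],  # hypothetical: returns HTML for a URL
    max_pages: int = 10,
) -> Dict[str, List[str]]:
    """Visit pages breadth-first and return the media URLs found on each."""
    seen = {start_url}
    queue = deque([start_url])
    media_by_page: Dict[str, List[str]] = {}
    while queue and len(media_by_page) < max_pages:
        url = queue.popleft()  # FIFO order is what makes this breadth-first
        parser = PageParser(url)
        parser.feed(fetch_html(url))
        media_by_page[url] = parser.media
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return media_by_page
```

No LLM is involved at any point: the parser only reads tags and attributes, which matches the stated goal of producing raw data first and leaving any LLM processing to the developer.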
@engineerprompt 7 months ago
Crawl4AI is shaping up pretty nicely. I will do a deep dive on it.
@mjacfardk 8 months ago
Yes PLEASE, do videos on Crawl4AI and ScrapeGraphAI, and thank you for everything you do and for your time 🙏
@engineerprompt 8 months ago
Yes, it's on my list.
@TimTruth 8 months ago
I just use Selenium WebDriver and JavaScript or jQuery to interact with pages and get the parts I want. If a site uses Cloudflare or other bot blocking, you can run JS in the browser console, use the copy command, and then paste the result into a txt file.
@d.d.z. 8 months ago
Is there a learning path you can recommend? I'm generating reports from a website using Python and am looking for an alternative. Thanks in advance.
@ahassan7270 8 months ago
Thank you so much for sharing this valuable information. It is absolutely helpful.
@engineerprompt 7 months ago
Glad it was helpful!
@GetzAI 8 months ago
Great review. Please do a review on ScrapeGraphAI. Maybe a comparison to Uncle Code's Crawl4AI? I like Crawl4AI and hope UC incorporates PDF options.
@engineerprompt 8 months ago
thanks, yes, both of them are on my TODO list.
@beemerrox 7 months ago
Nice comparison! Please continue work on scraping for AI applications. Hot topic!
@engineerprompt 7 months ago
thanks, will do
@sethhavens1574 6 months ago
Super handy, thanks 🙏
@jarad4621 8 months ago
For the Jina Reader API, a key is free for the first 1 million tokens, which was 570 sites for me. After that you pay 10 for 500 million tokens, which is worth about 250k sites. That is totally insane, so just pay the tiny amount for much better rate limits.
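The estimate above can be checked with a few lines of arithmetic. This is only a sketch that takes the comment's own figures at face value (they are not verified against current Jina pricing) and assumes the average page size stays the same on the paid tier. Under that assumption the paid allowance works out to 285,000 sites, slightly more than the rough 250k quoted.

```python
# Figures quoted in the comment above (not verified against current pricing).
FREE_TOKENS = 1_000_000
SITES_ON_FREE_TIER = 570
PAID_TOKENS = 500_000_000
PAID_PRICE = 10  # currency unit is not specified in the comment

# Average tokens consumed per site on the free tier.
tokens_per_site = FREE_TOKENS / SITES_ON_FREE_TIER

# Sites covered by the paid allowance at the same average page size.
sites_on_paid_tier = PAID_TOKENS * SITES_ON_FREE_TIER // FREE_TOKENS

# Price per 1,000 sites, in the same unspecified currency unit.
price_per_1000_sites = PAID_PRICE / sites_on_paid_tier * 1000

print(f"{tokens_per_site:.0f} tokens per site on average")
print(f"{sites_on_paid_tier:,} sites for the paid allowance")
print(f"{price_per_1000_sites:.3f} per 1,000 sites")
```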
@j4cks0n94 7 months ago
Scrapegraph is pretty amazing, highly recommended
@MeinDeutschkurs 8 months ago
Crawl4AI sounds perfect!
@SeeFoodDie 8 months ago
Thank you. It would be great if you could dive deeper into ScrapeGraph, specifically the knowledge graph feature.
@engineerprompt 8 months ago
thanks, will look into it.
@AJ-lg4zr 5 months ago
Can you make a detailed video on ScrapeGraphAI? It's kinda buggy right now for me.
@ahassan7270 8 months ago
Thank you so much for sharing this valuable information. It is absolutely helpful. But as far as Jina AI is concerned, is it possible to specify in the code the number of pages that I want to scrape? Sometimes the PDF file has more than 500 pages.
@engineerprompt 8 months ago
I am not sure. Their API seems to be very simple, and I haven't noticed any customization options yet.
@ai-whisperer 8 months ago
brilliant 🙌🙌
@engineerprompt 8 months ago
thanks :)
@bardaiart 7 months ago
Thanks a lot! :)
@JPy90 8 months ago
great thx!
@stefleur 8 months ago
Probably a silly question, but in what way is this whole complicated process better than a simple copy and paste from the URL?
@engineerprompt 8 months ago
There are a couple of reasons. 1. Even if you were to just copy and paste, you would not preserve the structure in most cases; there will be tables, images, etc., which will mess up the formatting. 2. Even if copy and paste gave you perfect results, you can't scale that to hundreds or tens of thousands of webpages. With these automated tools, you only need to provide a list of URLs and they will parse them at scale.
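The "provide a list of URLs" workflow can be sketched as follows for the Jina Reader API, which is called by prefixing the target URL with `https://r.jina.ai/`. This is a minimal sketch, not the code from the video's notebook: the helper names `reader_url`, `fetch_text`, and `scrape_all` are hypothetical, the User-Agent string is a placeholder, and the sketch assumes the unauthenticated endpoint with no rate-limit handling.

```python
from typing import Callable, Dict, Iterable
from urllib.request import Request, urlopen

READER_PREFIX = "https://r.jina.ai/"


def reader_url(target: str) -> str:
    """Build the Reader URL by prefixing the page URL."""
    return READER_PREFIX + target


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download a URL and return the body as text (LLM-friendly Markdown)."""
    request = Request(url, headers={"User-Agent": "example-batch-scraper"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_text,
) -> Dict[str, str]:
    """Fetch every page through the Reader, keeping going when one fails."""
    results: Dict[str, str] = {}
    for url in urls:
        try:
            results[url] = fetch(reader_url(url))
        except OSError as exc:  # URLError and timeouts are OSError subclasses
            results[url] = f"ERROR: {exc}"
    return results
```

Calling `scrape_all(["https://example.com", "https://example.org"])` returns a dictionary keyed by the original URLs. Passing a different `fetch` function makes it possible to add an API key header or to test the loop without network access.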
@hypnoz7871 2 months ago
Good luck copy-pasting and cleaning millions of pages to feed an LLM. Also good luck with manual updating :)
@chuckcarlson7940 6 months ago
Do any of these solutions work on sites you have to log in to? You can give them a url, but if the site requires you to log in, you will not be able to scrape further.
@engineerprompt 6 months ago
Good question, I am not sure. You might have to add authentication to these yourself.
@chuckcarlson7940 6 months ago
@engineerprompt If any of these solutions are Chromium-based, then one could load the page, go through the authentication process, and select the page to be scraped. Then invoke the scraping tool.
@planetgamecommunity817 8 months ago
I need these materials very much. Can you share the code and API, brother??
@engineerprompt 8 months ago
link to the notebook is in the video description.
@planetgamecommunity817 8 months ago
@engineerprompt thanks, this is crucial... all the best to you, dude
@john_blues 8 months ago
The android in the thumbnail looks like he's DJing. Like he's ready to drop a sick beat...NOW!
@ppp3812 8 months ago
Are there any scrapers available for LinkedIn and Instagram?
@engineerprompt 8 months ago
I am not aware of any.
@thesimplicitylifestyle 8 months ago
We must create order from the messiness! 😎🤖
@engineerprompt 7 months ago
Agree :)