The Ultimate AI Website Scraping Guide

9,312 views

Mervin Praison

1 day ago

Dive into the world of web scraping and data extraction with Crawl4AI! In this tutorial, we'll explore how to use this open-source, LLM-friendly tool to automate web crawling, extract valuable data, and integrate it with AI agents. Perfect for developers and AI enthusiasts looking to streamline their data collection process! 🚀
Benefits:
Learn how to set up and use Crawl4AI for web scraping.
Understand the difference between manual and automated data extraction.
Step-by-step guide on converting unstructured data into structured JSON format.
Integration of Crawl4AI with AI agents for advanced data analysis.
Complete workflow demonstration with practical coding examples.
🔗 Useful Links:
Patreon: / mervinpraison
Ko-fi: ko-fi.com/mervinpraison
Discord: / discord
Twitter / X : / mervinpraison
Sponsor a Video or Do a Demo of Your Product: mer.vin/contact/
Code: mer.vin/2024/06/crawl4ai-and-...
Call to Action:
🔔 Subscribe for more AI and tech tutorials!
👍 Like this video to help others discover this amazing tool.
💬 Comment below with your thoughts and any questions you have!
Stay tuned for more videos on AI tools and automation! 🚀
Timestamps:
0:00 - Introduction to Crawl4AI 🌟
0:36 - Benefits of Using Crawl4AI 🛠️
1:03 - Manual vs. Automated Crawling 🆚
1:58 - Installation Steps 📥
2:45 - Basic Web Crawling Example 🌐
3:20 - Converting Unstructured Data to Structured Data 📊
4:54 - Integrating Crawl4AI with AI Agents 🤖
6:00 - Creating and Using Tools.py 📝
6:37 - Running and Analysing the Complete Workflow 🧩
7:04 - Detailed Report and Conclusion 📃

Comments: 33
@Techonsapevole 2 days ago
Impressive, it's the first time I've seen something with agents that is actually useful. Thanks
@unclecode 4 days ago
Thank you, dear Mervin. I really appreciate your review of my library. Honestly, there's no way I could explain my library and its integration with another cool library in less than 20 minutes. Yet, you managed to do it in just 7-8 minutes. That's your incredible superpower. I'm trying to learn from the way you summarize and explain things 😆. Great job. By the way, I'm happy to see it engaging with your PraisonAI library and look forward to more collaboration. Kudos.
@YTber1 2 days ago
Hey, while installing this I got the following error. Do you know why it happens and how I can solve it?
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\99559\\AppData\\Local\\Programs\\Python\\Python311\\python.exe', '-m', 'pip', 'install', 'spacy', '--no-deps']' returned non-zero exit status 1.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for crawl4ai
Failed to build crawl4ai
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (crawl4ai)
@unclecode 2 days ago
@YTber1 Can you open an issue about this in the repo?
@YTber1 2 days ago
@unclecode ok!
@godwinspeaks 4 days ago
It would be nice to see what can be accomplished without spending money on Groq, OpenAI, or Anthropic.
@john_blues 1 day ago
Interesting. Web crawlers, even automated ones, are old technology. The addition of the LLM is powerful: it allows a more focused semantic search. An app like this also lowers the barrier to use, as good web crawlers used to take a bit of technical knowledge.
@yazanrisheh5127 4 days ago
Hi Mervin. Can you do a video on building a RAG Streamlit application for multiple document types like PDF, CSV, etc.?
@john_blues 1 day ago
You were moving pretty fast. How was PraisonAI using the web scraper tool you created?
@danielimonikhe1053 3 days ago
Hi, can open-source LLMs be used, or just OpenAI?
@farexBaby-ur8ns 4 days ago
I think Selenium testing could leverage this, or perhaps I'm not in the know about the new AI way of app testing.
@gnosisdg8497 4 days ago
What about using Ollama with any LLM, or any local LLM whatsoever?
@theunknown2090 4 days ago
Yes, if you get the prompts and the steps and then run them, you can use any LLM. It only depends on how capable the LLM is.
@unclecode 4 days ago
Simply pass "ollama/MODEL_NAME" as the model name and "no-token" as the token; it works well.
@godwinspeaks 4 days ago
wow
@Derick99 3 days ago
Are there privacy concerns here, or since it's all publicly accessible, is it fair use?
@rezashah22 4 days ago
Thanks Mervin. Can we crawl an entire site? For example, for an e-commerce site, can we get product information such as title, description, price, and image_url? Can we tell Crawl4AI to follow the links?
@unclecode 4 days ago
Yes, the crawl result is an object that contains website metadata, external/internal links, media elements, markdown version, extracted content based on extraction strategies, and more. Please refer to the documentation for additional details.
@rezashah22 3 days ago
@unclecode Thanks for the reply. Do you have examples of how to scrape an entire site?
@vijayakaja5001 3 days ago
@rezashah22 Did you find any examples?
@unclecode 2 days ago
@rezashah22 Already added to the backlog to include in our examples folder; it should be done in a week or two.
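Until an official example lands, the "follow the links" idea in this thread can be sketched generically. The code below is a minimal breadth-first crawl under stated assumptions, not Crawl4AI's API: `crawl_site`, `internal_links`, and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever actually retrieves a page (a Crawl4AI call, an HTTP client, and so on). It uses only the Python standard library and an in-memory fake site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(base_url, html):
    """Return absolute same-host links found in html, with fragments removed."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def crawl_site(start_url, fetch, max_pages=50):
    """Breadth-first crawl. `fetch(url)` (hypothetical) returns HTML or None."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page missing or not retrievable; skip it
        pages[url] = html
        for link in internal_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# An in-memory "site" stands in for real requests (assumption for the demo).
FAKE_SITE = {
    "https://shop.example/": '<a href="/p/1">One</a> <a href="/p/2#reviews">Two</a>',
    "https://shop.example/p/1": '<a href="/">Home</a> <a href="https://other.example/">Ad</a>',
    "https://shop.example/p/2": '<a href="/p/1">One</a>',
}
pages = crawl_site("https://shop.example/", FAKE_SITE.get)
print(sorted(pages))
```

The `seen` set prevents revisiting pages, and `max_pages` caps the crawl so a large site cannot run away. Per-page extraction (title, price, image_url) would happen where `pages[url]` is stored.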
@darkreader01 4 days ago
How can we crawl websites that need authentication? Can we add cookies?
@SprintDock 4 days ago
You could take the raw HTML from a logged-in page.
@unclecode 4 days ago
Crawl4AI offers several hooks, one positioned right before retrieving the URL. Here, you have access to the Selenium instance to implement your custom code for different purposes. The documentation contains an illustration of this functionality. It's worth noting that you can execute JS code on the page in case an interaction is required to access the data you need.
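The hook mechanism itself is described in the Crawl4AI documentation and is not reproduced here. As a generic illustration of the cookie question, the sketch below attaches an existing session cookie to a plain HTTP request using only the Python standard library. The function name, URL, and cookie values are hypothetical, and no request is actually sent.

```python
from http.cookies import SimpleCookie
from urllib.request import Request


def build_authenticated_request(url, cookies):
    """Return a Request carrying `cookies` (a dict) in its Cookie header."""
    jar = SimpleCookie()
    for name, value in cookies.items():
        jar[name] = value
    # Join the cookies as "name=value; name2=value2", as browsers send them.
    header = "; ".join(f"{m.key}={m.coded_value}" for m in jar.values())
    return Request(url, headers={"Cookie": header, "User-Agent": "example-crawler/0.1"})


# Hypothetical session cookie copied from a logged-in browser session.
request = build_authenticated_request(
    "https://intranet.example/reports",
    {"sessionid": "abc123"},
)
print(request.get_header("Cookie"))
```

Passing `request` to `urllib.request.urlopen` would send it. With a browser-driven crawler, the equivalent step is setting the same cookies on the browser session before the page loads.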
@puneetxaxa1777 4 days ago
How good is it for crawling 2,000 web pages across 5 different websites?
@unclecode 4 days ago
We are conducting a stress test to make sure it holds up. It is currently performing well, and we are implementing a scheduling/queue feature for bulk crawling, which will be available soon.
@williamwong8424 4 days ago
Can this be done on intranet pages?
@unclecode 4 days ago
Yes, any URL (local, LAN, or WAN) that is accessible from your machine should be OK.
@tonywhite4476 5 days ago
What's the token cost?
@MervinPraison 5 days ago
It depends on the list of URLs you provide and the amount of content in each URL. You can also integrate Groq or other LLM providers.
@MeinDeutschkurs 4 days ago
I have to test the crawler. I'm interested in how it handles robots.txt and the robots meta tag, and how it copes with smart bot detection.
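How Crawl4AI itself treats robots.txt is not answered in this thread. For anyone who wants to enforce it in their own pipeline before handing URLs to a crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name, user agent string, and rules are assumptions for illustration, and the rules are parsed from a string so nothing is fetched.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent):
    """Return a function telling whether `user_agent` may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


# Example rules (assumed); a real crawler would download /robots.txt first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

allowed = make_robots_checker(ROBOTS_TXT, "example-crawler")
print(allowed("https://example.com/products"))
print(allowed("https://example.com/private/report"))
```

This only covers robots.txt. The robots meta tag lives inside each page's HTML and would have to be checked after the page is fetched.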
@envoy9b9 4 days ago
@MervinPraison Can it be used with Ollama?