Web Scraping NBA Stats With Python: Data Project [Part 1 of 3]

  40,958 views

Dataquest

1 day ago

This is part 1 of a 3-part series where we predict which NBA player will win MVP.
In this part, we'll download the NBA data we need using web scraping. To do the scraping, we'll use Python with the selenium, beautifulsoup, pandas, and requests libraries. We'll download the files using requests and selenium, then parse them with beautifulsoup and load them into pandas DataFrames.
By the end, we'll have CSV files that we can then merge and use to make predictions in parts 2 and 3 of this series.
The code that we write in this video can be found here - github.com/dataquestio/projec... .
Chapters:
00:00 - Introduction
01:10 - A look at the pages we'll scrape
04:22 - Downloading MVP votes with requests
09:26 - Parsing the votes table with beautifulsoup
17:00 - Combining MVP votes with pandas
19:25 - Downloading player stats
22:41 - Using selenium to scrape a JavaScript page
28:44 - Parsing the stats with beautifulsoup
32:11 - Combining player stats with pandas
33:38 - Downloading team data
37:29 - Parsing the team data with beautifulsoup
40:44 - Combining team stats with pandas
42:33 - Next steps with this project
Disclaimer: The website we'll be scraping from, Basketball Reference, allows web scraping as long as the scraping doesn't harm site performance. Not all websites allow scraping, so make sure to check the site terms before doing any scraping.
---------------------------------
Join 1M+ Dataquest learners today!
Master data skills and change your life.
Sign up for free: bit.ly/3O8MDef
#PythonTutorial #WebScraping #Python #Dataquest #Importing #Data

Comments: 132
@vikasparuchuri · 1 year ago
Hi everyone! You can find all of the code here - github.com/dataquestio/project-walkthroughs/blob/master/mvp/web_scraping.ipynb . If the scraping doesn't get results, try setting a longer sleep timeout so you don't get rate limited by the server. 10-15s should work.
@ScottRachelson777 · 1 year ago
How do you set the sleep timeout?
@JAswoosh · 11 months ago
@@ScottRachelson777 I think it is time.sleep(15) *that is 15 seconds of sleep between requests to the website*
@tmjz7327 · 7 months ago
@@ScottRachelson777 Call time.sleep(10) at each iteration when you're looping over years.
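Vik's sleep-timeout suggestion can be sketched as follows. The URL template and year range are assumptions based on the video, and the mvp/ folder is assumed to exist next to the notebook; adjust both to your own setup.

```python
import time

import requests

# Assumed URL pattern from the video; each year has its own awards page.
URL_TEMPLATE = "https://www.basketball-reference.com/awards/awards_{}.html"

def download_mvp_pages(years, pause=15):
    """Download each awards page, sleeping `pause` seconds between
    requests so the server doesn't rate-limit us."""
    for year in years:
        url = URL_TEMPLATE.format(year)
        data = requests.get(url)
        with open("mvp/{}.html".format(year), "w+", encoding="utf-8") as f:
            f.write(data.text)
        time.sleep(pause)  # 10-15 seconds is usually enough

# download_mvp_pages(range(1991, 2022))
```

The sleep goes after every request, not just once at the top of the loop, so the pause applies between each pair of downloads.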
@dogden95 · 1 year ago
The content from you guys is outstanding! Thanks for everything!
@AngelProceeds · 2 years ago
Incredibly helpful! Thank you so much!
@Dataquestio · 2 years ago
Thanks, Angel!
@chaimookie935 · 1 year ago
AMAZING! Thank you so much for this.
2 years ago
Nice tutorial Thank you!
@irenenafula8694 · 1 year ago
When viewing the list of years, you can use print(years, end=" ") to print them horizontally so that you do not have to scroll.
@cwilliams2654 · 2 years ago
Awesome! You guys should add a selenium tutorial to the API and web scraping course.
@Dataquestio · 2 years ago
Great idea! I'll pass that on to our content team.
@mikedigregorio509 · 2 months ago
Nice video!🔥
@fabianaandrade560 · 2 years ago
Thank you for this wonderful class 🤩
@kurtji8170 · 1 year ago
Hello, Vik! Thank you for your content! I am wondering if you could post some instructions on how to set sleep timeout for this specific case? I am having this issue and I saw many people in the comments with it too. Many thanks!
@thebinarybin · 6 months ago
I realize this is only a one-year-old video, but even before today I have encountered so many problems, from uninstalled modules etc. This video is only helpful if you ensure the proper installs for the tutorial, and you have to retrace so many steps to get there. I am bummed out you couldn't even provide proper dependencies before you start coding.
@ofgmora · 1 year ago
Thank you for sharing such an amazing video. It helped me a lot. I have a question: how can I get the player ID? It is not visible in the table, so when scraping the table we only get the player's name, but when inspecting the code there is a variable for the player ID.
@basiliogoncalves8956 · 2 years ago
Great video, thanks a lot!! I just noticed one thing: when using open() I need to add encoding="utf-8", such as: with open(filename, 'r', encoding="utf-8") as f: page = f.read() ...
@Dataquestio · 2 years ago
Thanks, Basilio! I think this might be specific to your system - some systems (like mine) default to "utf-8", so you don't need to add it. It looks like your system might default to a different encoding. Thanks for the tip!
@basiliogoncalves8956 · 2 years ago
@@Dataquestio I thought it was something similar for sure! Just thought of adding here in case someone has a similar issue 😊 Again thanks a lot for the amazing video 👍
@jyotgill2963 · 2 years ago
This was helpful, thanks Basilio!
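Basilio's fix can be sketched as a small round trip; the sample HTML is made up for illustration. On systems whose default encoding is not UTF-8 (commonly cp1252 on Windows), both the write and the read need an explicit encoding="utf-8".

```python
from bs4 import BeautifulSoup

# Sample content with a character cp1252 can't represent, like accented
# player names on basketball-reference.
html = "<table id='mvp'><tr><td>Dončić</td></tr></table>"

# Write and read back with an explicit encoding so the round trip works
# regardless of the platform's default.
with open("demo.html", "w+", encoding="utf-8") as f:
    f.write(html)

with open("demo.html", "r", encoding="utf-8") as f:
    page = f.read()

soup = BeautifulSoup(page, "html.parser")
```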
@taxtr4535 · 1 year ago
You da GOAT no cap
@calebappiagyei9555 · 1 year ago
Is there a reason why a table is only able to be created when using the mvp or roy ids but not any of the other awards? I am attempting to do this project with the Most Improved Player results, but it seems it is not able to identify the mip table, even though the mvp and roy tables worked just fine.
@allinhiphop9962 · 4 months ago
I noticed a part of the code where I had to add: with open("mvp/{}.html".format(year), "w+", encoding='utf-8') as f: . Thanks a lot for the great work.
@EmeritoMontilla · 1 year ago
It is more a programming class than a data analytics one.
@aradbeneliezer7129 · 1 year ago
I am getting an error that says No such file or directory: 'mvp/1991.html' at the start of the scraping process. What can I do? I have a folder named mvp in the same directory as the notebook. I am working in Chrome.
@JAswoosh · 1 year ago
Can anyone tell me why .format and .get aren't blue (working) in Jupyter Notebook?
@_Kysa_ · 2 years ago
Hello, at the part where you specify id="mvp", what happens when the id is not there? Is there another way, just with the class or something similar? Thanks, and great video!
@Dataquestio · 2 years ago
Hi there - yes, you can select elements based on class, or based on position within the page. More info is here - stackoverflow.com/questions/24801548/how-to-use-css-selectors-to-retrieve-specific-links-lying-in-some-class-using-be .
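The fallback strategies from this reply, sketched on hypothetical markup (the class names below are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical page where the tables have no id, only a class.
html = """
<div class="stats">
  <table class="sortable"><tr><td>A</td></tr></table>
  <table class="sortable"><tr><td>B</td></tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find("table", class_="sortable")   # first match by class
matches = soup.select("div.stats table.sortable")  # CSS selector: all matches
second = matches[1]                                # pick by position instead of id
```

select() takes the same CSS selectors the linked StackOverflow answer describes, so position within the page can substitute for a missing id.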
@duloo97 · 11 months ago
But you could just, for example, copy the link of that web page into Power BI, import a specific table from the page, then even modify it in Power Query and export it somewhere as an Excel file, or CSV, or whatever...
@Monkeyfist2021 · 2 years ago
Really useful video! I am using the web scraping component within my Master's, together with the machine learning code we have been taught in class (TensorFlow). The chromedriver required me to download Chrome version 99.0.4488.51 and worked well. I needed to add in this bit of code to get it to work (only required in the player section): tHeads = soup.findAll('tr', class_="thead") ; for tHead in tHeads: tHead.decompose() . Clear, concise & aesthetically pleasing video! Thank you for your help mate, from Brisbane, Australia! 🇦🇺
@Dataquestio · 2 years ago
Thanks a lot for sharing the code!
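Monkeyfist's snippet, restated on sample markup: find_all matches every repeated header row, and decompose() removes each one in place (soup.findAll is the older alias for find_all).

```python
from bs4 import BeautifulSoup

# Sample table with header rows repeated mid-table, as on the player pages.
html = """
<table>
  <tr class="thead"><th>Player</th></tr>
  <tr><td>Jordan</td></tr>
  <tr class="thead"><th>Player</th></tr>
  <tr><td>Barkley</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

for thead in soup.find_all("tr", class_="thead"):
    thead.decompose()  # remove each repeated header row in place
```

A plain soup.find(...).decompose() only removes the first match, which is why some header rows survived in the video.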
@alexandercardoza23 · 1 year ago
Wanted to try this with Baseball Reference, but instead of the table being html, it's shtml. Any way to get around that?
@oscarhurtado7107 · 1 year ago
How can I tell which parts of a website are being rendered by executing JavaScript, using the inspection tools, rather than by downloading the HTML first and comparing them?
@DilzOnlineHD · 1 year ago
Lovely video. My issue is that when I try to extract the HTML, my Jupyter Notebook is stuck on "loading" for the table I need, which is frustrating. It will load two tables on a page, but any more and it won't do it. If anyone has a solution, that would be great.
@justinburney7125 · 2 years ago
At the part where you specify id=‘mvp’, what do you do when the id includes randomly assigned characters (id=‘stats_cd051869_summary’) and a loop is required first to know each game’s unique id, like ‘cd051869’?
@Dataquestio · 2 years ago
Hi Justin - you can handle this in a couple of ways:
1. If the same random sequence is re-used across the page, you can just read the string and append it to all the ids for the page.
2. You can do a fuzzy match on id, and just look for "stats_". This will ignore the random sequence. Pandas will let you do this with read_html and the match parameter.
3. You can use the position of the tag to extract it instead of the id. So you can look for a p tag inside a div tag, for example.
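Option 2 (a fuzzy match on the id) can be sketched with a regex in BeautifulSoup; the ids below are invented to mimic the random game hashes. pandas.read_html offers a similar match parameter that filters tables by their text content instead.

```python
import re

from bs4 import BeautifulSoup

# Invented ids mimicking "stats_<random-hash>_summary".
html = """
<table id="stats_cd051869_summary"><tr><td>Game 1</td></tr></table>
<table id="stats_9a1b2c3d_summary"><tr><td>Game 2</td></tr></table>
<table id="unrelated"><tr><td>skip me</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Match ids fuzzily so the random hash in the middle is ignored.
tables = soup.find_all("table", id=re.compile(r"^stats_.*_summary$"))
```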
@jasoningersoll9346 · 2 years ago
Great video! If I wanted to separate data by teams rather than by years how would I do that?
@Dataquestio · 2 years ago
Hi Jason - you mean after scraping the data? You can use the groupby method (pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) to group by the "Tm" or "Team" column (depending on the DataFrame). This will separate out each team.
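Jason's question, sketched with a toy DataFrame (the column names mirror the scraped stats, but the numbers are made up):

```python
import pandas as pd

# Toy stats table; "Tm" is the team column as in the scraped player stats.
stats = pd.DataFrame({
    "Player": ["Jordan", "Pippen", "Barkley"],
    "Tm": ["CHI", "CHI", "PHO"],
    "PTS": [32.6, 18.6, 25.6],
})

team_sizes = stats.groupby("Tm").size()         # players per team
team_means = stats.groupby("Tm")["PTS"].mean()  # average points per team

# Iterating gives one sub-DataFrame per team:
per_team = {team: group for team, group in stats.groupby("Tm")}
```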
@9FM303 · 5 months ago
Now it is not necessary to join the East and West tables, since there is the Expanded Standings table.
@abdeljalil-ahmed · 1 year ago
I have a problem removing unimportant elements from a table: soup.find("tr", class_="over_header").decompose() raises AttributeError: 'NoneType' object has no attribute 'decompose'. Any guidance on solving this problem?
@tmjz7327 · 1 year ago
as far as I know, there are two possible issues. Firstly, it is possible that there is no page associated with one of the values in your range from the beginning (for example, in 2004 the NHL was locked out so there was no Hart trophy vote, hence no data for 2005). Or, you re-ran the cells too many times, and downloaded the html pages too many times, so you got rate limited and hence it is trying to decompose data that you have not actually downloaded.
@witnessesofpastsins7832 · 1 year ago
Hi Vik! I ran into a problem and get a 'UnicodeEncodeError' when I try to write the html to the files, and I was wondering if there is a workaround? I started off exactly how you have the web_scraping.ipynb file, but for some reason my 3rd cell is not running properly. Any help would be awesome!
@witnessesofpastsins7832 · 1 year ago
I used Basilio Goncalves' solution below but now I run into VSCode saying that there is no file or directory called 'mvp/{}.html'
@Dataquestio · 1 year ago
Hi there - did you format the string to add in the year? It should look like this - "mvp/{}.html".format(year)
@ScottRachelson777 · 1 year ago
How come Jupyter lab won't let me copy the URL from the Basketball Reference website and then paste it into a Jupyter lab cell? It's easy to do in Jupyter Notebook, but it doesn't work in Jupyter Lab.
@JAswoosh · 1 year ago
Why is my .format not blue or registering??
@gabrielazeredo8677 · 1 year ago
Hi Vik, I'm having a problem in this part (16:36) with decompose():
page = f.read()
soup = BeautifulSoup(page, 'html.parser')
----> soup.find('tr', class_='over_header').decompose()
mvp_table = soup.find(id='mvp')
mvp = pd.read_html(str(mvp_table))
AttributeError: 'NoneType' object has no attribute 'decompose'
What should I do? Thank you.
@jenskaiser4423 · 1 year ago
Take a closer look at your html files in the "mvp" folder. Probably some of them are not there because you were banned. This can happen especially for the most recent years.
@crispineda4630 · 1 year ago
@@jenskaiser4423 So if we get banned, what do we do? That's currently my issue. I saw a few other comments just saying to wait it out or try a VPN. When I used the VPN I got a 404 error.
@kennethcolombe5579 · 1 year ago
@@crispineda4630 @Gabriel Azeredo have you managed to resolve this issue? It has been the same for me
@kennethcolombe5579 · 1 year ago
@gabrielazeredo8677
@crispineda4630 · 1 year ago
@@kennethcolombe5579 I don't know what happened, but I stopped trying to request for a few days, meaning I didn't do anything. I deleted everything from the mvp folder and tried again with no VPN. It worked, up until 2021-22. No matter how many times I tried, it always stopped me from getting the data at 2021-2022; I still can't figure out why. I was able to go through the whole project using 1991 to 2020. I'm wondering if I can make a separate request to scrape just 2021 and integrate that last year so my predictions will be a bit more accurate.
@rahulnarayan6878 · 16 days ago
It is taking a lot of time to load the pages from 1991 to 2021 into html files on our local computer, and it shows a timeout error. What should I do?
@auyeungstephen2878 · 1 year ago
I have a problem with encoding. When I type:
with open("mvp/1991.html") as f: page = f.read()
soup = BeautifulSoup(page, 'html.parser', encoding='utf-8')
soup.find('tr', class_="over_header").decompose()
it shows UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 247773: character maps to
How can I solve this problem?
@joshbull2064 · 1 year ago
You need to fix the following: open the file for reading with an explicit encoding, e.g. with open("mvp/1991.html", encoding="utf-8") as f: page = f.read() . The encoding argument belongs on open(), not on BeautifulSoup.
@sheaturner3229 · 1 year ago
I wanna learn how to do this for NBA first basket picks for FanDuel.
@alessandrodimattia8946 · 1 year ago
Hey Vik! First of all, thanks for the amazing opportunity you are giving me to learn data science for free. Your videos are so clear and accessible, I cannot fully express my gratitude. Second, does anybody know a website I can use to replicate this project for the Football Premier League? Thanks a lot to everybody!
@Dataquestio · 1 year ago
Hi Alessandro - this video (kzbin.info/www/bejne/ZprVnnd4jLGlmdE) explains how to predict the Premier league. There's also a previous video in the series that explains how to scrape the data.
@kennethcolombe5579 · 1 year ago
Alessandro, did you figure this out?
@alessandrodimattia8946 · 1 year ago
@@kennethcolombe5579 Hey Kenneth, hope everything's good! Unfortunately no, I could not find time to work on this project. It would be super interesting, however, to try to predict the best 11 players based on their market values. Maybe in the future I will consider doing that.
@csowm5je · 2 years ago
Thank you for this course. The teams still have some 'tr thead' rows; I think it is removing only one. I did this: tHeads = soup.findAll('tr', class_="thead") ; for tHead in tHeads: tHead.decompose() . Please confirm if this is correct. Not sure if it matters, but I was trying to match the number of rows and thought I was not doing it properly. There are 27 teams each year, so it should be 837 (27 x 31), not 1033 (41:17). Edit: I see that this is fixed later. Thanks.
@Dataquestio · 2 years ago
Yes, the way you did it is great! I should have done it this way originally, but since I forgot to use findAll, I ended up cleaning the extra rows in the next part instead.
@user-dt8ei1wj2x · 1 year ago
Thanks for this course! But I still have a problem in the section on using selenium to scrape the JavaScript page. When I run html = driver.page_source it raises an error: WebDriverException: Message: unknown error: unexpected command response (Session info: chrome=103.0.5060.134). I don't know what it actually means, and it can't scrape all the years' information. Sorry to bother you!
@Dataquestio · 1 year ago
Hi there - selenium can be a bit flaky. I would recommend using playwright if you're having issues. There is a short tutorial here - kzbin.info/www/bejne/iXuaqaGeiLGqn5I
@user-dt8ei1wj2x · 1 year ago
@@Dataquestio Wow! Thank you! I'll try it later, hope it can work! You are so nice! 😆
@wangqinjing8336 · 2 years ago
Hi there! I've been struggling with the selenium part of the web scraping. I copied the path of my chromedriver.exe file but it still says that the file is not found. I'm using chrome and windows. Any feedback would be highly appreciated!
@Dataquestio · 2 years ago
Hi Wang - selenium can be a bit finicky to install. I'll be posting a video today or tomorrow that uses playwright, which is a newer library that is easier to install than selenium. If you're having trouble installing selenium, I'd recommend that instead. -Vik
@pavanrao8258 · 1 year ago
@@Dataquestio hi which video uses playwright?
@Aquafina780 · 1 year ago
Hello! When I try to run the initial data extraction, it returns 'int' object is not iterable for the "for ... in" line pertaining to the years. How should I correct this?
@Dataquestio · 1 year ago
This usually happens when you run a list comprehension or a for loop over an integer. You need to actually run them over a list. So I'd double check that the variable you're iterating over has the value you expect.
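A minimal before/after illustrating the reply: the loop target must be a list or range, not a bare int.

```python
years = 1991
# for year in years:  # TypeError: 'int' object is not iterable

years = list(range(1991, 2022))  # fix: a sequence of years, 1991-2021
pages = ["mvp/{}.html".format(year) for year in years]
```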
@vt-fc6gq · 2 years ago
Hello! Thank you for this project. I am using Kaggle and I can't find a solution for the driver (26:36). The error message has two pieces of information. At the beginning: "DeprecationWarning: executable_path has been deprecated, please pass in a Service object". At the end: "executable needs to be in PATH." Any idea how to solve this issue? Thank you :)
@Dataquestio · 2 years ago
Hi! There are two issues here. This StackOverflow answer should help you with the warning - stackoverflow.com/a/69892580 . You need to initialize the Service object and pass it in instead of specifying "executable_path". As for the error, you might also want to ensure that the chrome driver exists at the path you pass in, and can be executed. It looks like the driver can't be found.
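The Selenium 4 pattern from that StackOverflow answer, sketched; the chromedriver path is a placeholder for wherever you saved the driver, and this only runs in an environment with Chrome and selenium installed.

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Pass a Service object instead of the deprecated executable_path argument.
service = Service("/path/to/chromedriver")  # placeholder path
driver = webdriver.Chrome(service=service)

driver.get("https://www.basketball-reference.com/leagues/NBA_2021_per_game.html")
html = driver.page_source  # rendered HTML, after JavaScript has run
driver.quit()
```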
@arjanpatell · 1 year ago
When I try to web scrape from Basketball Reference I get a 404 error saying the webpage isn't found. I also tried it with Baseball Reference and Pro Football Reference, getting the same error. Is this happening because I made too many requests?
@Dataquestio · 1 year ago
Hi Arjan - that's strange that you get a 404 error. Usually too many requests results in a 429 error. A 404 error indicates that the page can't be found. This could happen if your local network blocks those sites (maybe a university network).
@garymichalske2274 · 2 years ago
When I try the code regarding year = 1991 and then try data.text, (around 7:07 in the video) I get an error saying "NameError: name 'data' is not defined. I launched Jupyter lab from within Anaconda. Any idea how to get past this?
@Dataquestio · 2 years ago
Hi Gary - did you run the previous code cell first? You have to run `url = url_start.format(year)` and `data = requests.get(url)` before you can run `data.text`. Otherwise, nothing will be assigned to the `data` variable.
@garymichalske2274 · 2 years ago
@@Dataquestio That's what it was. I guess I should have reviewed a Jupyter Notebook tutorial first. Thanks!
@YotsubaBestGirl · 11 months ago
How do I soup.find a table without an id on it? I tried the class and it did not work. Thank you for the video!!
@jaredhutchinson4629 · 1 year ago
When I attempted to decompose() an unwanted row (I'm doing a slightly different page on the website, so in this instance it is class_="thead"), I got the error AttributeError: 'NoneType' object has no attribute 'decompose'. I was told that I need to write code similar to the following: thead = soup.find('tr', class_="thead") ; if thead is not None: thead.decompose() ; else: ... Am I doing things wrong? Anything I can do to go about this?
@Dataquestio · 1 year ago
This would happen if there is no `tr` element with class `thead` on the page. I would double-check the html to make sure you have the right tag and class names.
@jaredhutchinson4629 · 1 year ago
@@Dataquestio Is there a way to decompose a specific row? For instance data-row="25"?
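A defensive version of the decompose step, run on sample markup where the row is missing: soup.find() returns None when nothing matches, which is exactly the AttributeError this thread keeps hitting. Jared's follow-up (removing one specific row) can use an attribute filter.

```python
from bs4 import BeautifulSoup

# Sample table with no over_header row, as in a failed or partial download.
html = "<table><tr data-row='25'><td>row 25</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

row = soup.find("tr", class_="over_header")
if row is not None:
    row.decompose()  # only remove the row if it actually exists

# Selecting one specific row by attribute, e.g. data-row="25":
specific = soup.find("tr", attrs={"data-row": "25"})
```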
@OfficialEricGao · 1 year ago
My html files are being written and saved, but when I click on each file, there's nothing in it. Is anyone else having this problem?
@JAswoosh · 1 year ago
Hey Eric, I haven't got to that part yet. Do you know why the .format or .get functions are not turning blue or working?
@OfficialEricGao · 1 year ago
@@JAswoosh What's your code?
@austinsacks5907 · 1 year ago
What do you do if you get rate limited when trying this?
@Dataquestio · 1 year ago
You can try setting a longer sleep between scrapes. A lot of people have reported success that way. I've seen 10-15s work.
@roianthoni9917 · 1 year ago
@14:00 How could one find multiple ids on an html site?
@Dataquestio · 1 year ago
Hi Roi - an id is unique across the page (or at least is supposed to be, although not all sites are coded properly). So if an element on a page has a certain id, then no other element will have that id. -Vik
@jacklegnon8439 · 1 year ago
What do I do if I get banned by the site?
@bjorncalbes7604 · 2 years ago
what IDE are you using?
@Dataquestio · 2 years ago
I'm using JupyterLab - jupyterlab.readthedocs.io/en/stable/ .
@dandysixties · 2 years ago
Hi! I'm not able to run this part of the code: driver = webdriver.Chrome(executable_path="/Users/[user]/chromedriver") gives me back PermissionError: [WinError 5] Acceso denegado (access denied).
@Dataquestio · 2 years ago
Hi! This could be for a few reason (it's hard to diagnose from just an error message). - Your path could be incorrect. Did you verify that the chromedriver exec is at the path you specified? - You might not have the executable permission on the driver. You can add this with `chmod +x` (see askubuntu.com/questions/443789/what-does-chmod-x-filename-do-and-how-do-i-use-it) - On mac, you may have to remove the file from mac quarantine because it isn't signed (superuser.com/questions/28384/what-should-i-do-about-com-apple-quarantine) . Note that you do this at your own risk, since it is unsigned! If all else fails, try double clicking on the chromedriver to execute it manually - that may pop up a permission prompt or tell you why the file won't run.
@dandysixties · 2 years ago
@@Dataquestio First of all, thanks for the complete response and for taking the time! I'm running on Windows, and the program is in the right directory. I'll check with "chmod +x", hope it solves it!
@dukaloncic · 1 year ago
what happens when you run into a 429 error from Sports Reference?
@calebappiagyei9555 · 1 year ago
You have to wait an hour, because it thinks you are a bot trying to access the site. Try not to run that portion of the notebook too frequently. I also decided to limit it to 20 years instead of 30, since the site does not like it when you try to access more than 20 things at once.
@ScottRachelson777 · 1 year ago
I followed your code exactly and I got this error:
AttributeError Traceback (most recent call last)
soup = BeautifulSoup(page, 'html.parser')
----> soup.find('tr', class_="over_header").decompose()
AttributeError: 'NoneType' object has no attribute 'decompose'
@aniketsingh3570 · 3 months ago
Did you find a solution for this?
@dylanalexander2407 · 4 days ago
@@aniketsingh3570 I'm having this error too. After investigating the downloaded html files, it looks like this video was made before they installed a traffic blocker on their page to discourage web scraping. It could be solved by adding a timer.
@floppitommi123 · 2 years ago
I can't believe this has only 8k views, I'm very sad now
@Dataquestio · 1 year ago
I'd love to get more views also, Tommy :) -Vik
@AR-hp2jl · 1 year ago
I got banned from Basketball Reference. What did I do wrong, and how can I avoid this in the future?
@joshbull2064 · 1 year ago
You just need to wait, as it is a temporary ban. Alternatively, you can run a VPN to generate a new IP address to get around yours being temp banned.
@AR-hp2jl · 1 year ago
@@joshbull2064 Thanks Josh, appreciate the feedback. Which VPN do you recommend?
@joshbull2064 · 1 year ago
Using NordVPN is a good way to change IP addresses when you are getting temporarily banned while scraping. Another way to bypass a ban is the time.sleep() method: calling it after each scrape with a randomly generated number of seconds means the website won't be so inclined to ban you, since you are sending requests less frequently and at random intervals. Hope this helps.
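Josh's randomized-delay idea as a small helper; the bounds are arbitrary examples, not values the site documents.

```python
import random
import time

def polite_pause(low=8, high=20):
    """Sleep a random whole number of seconds in [low, high] so the
    request pattern looks less bot-like, and return the delay used."""
    delay = random.randint(low, high)
    time.sleep(delay)
    return delay

# In the scraping loop: requests.get(url) ... then polite_pause()
```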
@user-ej5sb8gk8s · 10 months ago
"disambiguate" lol
@DayOneCricket · 5 months ago
Again jumps straight into JupyterLab, with no setup etc.
@johndiba1321 · 2 years ago
It appears my original comment has been hidden as spam. Respond again here if you don't mind. 🙏🏼🙏🏼🙏🏼
@Dataquestio · 2 years ago
Hi John - I unfortunately can't see your full original comment, but I think you were asking about a solution for this. You can find the full solution code here - github.com/dataquestio/project-walkthroughs/blob/master/mvp/web_scraping.ipynb
@johndiba1321 · 2 years ago
@@Dataquestio Thank you! Even when copying and pasting your portion of code I get the same error. The error points to the f.write(data.text) line of code, a Unicode error that I can't figure out:
~\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u010d' in position 181691: character maps to
@Dataquestio · 2 years ago
@@johndiba1321 Do you mind sharing the error message and the lines of code before/after the error?
@johndiba1321 · 2 years ago
@@Dataquestio Gladly:
UnicodeEncodeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_17596/1422360153.py in
8 with open("mvp/{}.html".format(year), "w+") as f:
----> 9 f.write(data.text)
~\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u010d' in position 181691: character maps to
@Dataquestio · 2 years ago
John, this looks like it could be a unicode encoding issue on your machine. I'm assuming that you're using Python 3. In that case, whenever you're opening a file (for writing or reading), specify the encoding as utf-8. So you'd replace open("mvp/{}.html".format(year), "w+") with open("mvp/{}.html".format(year), "w+", encoding="utf-8") . When you read the file in, you should also specify the encoding parameter.
@javigallego243 · 2 years ago
Hello, I am getting the following error when trying to define the "driver" (27:12): WebDriverException: Message: Can not connect to the Service /opt/google/chrome/google-chrome. Any idea what the solution could be?
@Dataquestio · 2 years ago
Hi Javi - what path did you pass into the webdriver.Chrome class? Do you have a chromedriver executable at that path? If not, you'll need to download the driver from here - chromedriver.chromium.org/downloads . Second, do you have Google Chrome installed? This won't work if you don't have it installed. You might want to ensure that Chrome is available at /opt/google/chrome/google-chrome. There is also a geckodriver (github.com/mozilla/geckodriver/releases/) you can use if you only have Firefox installed.
@infamousprince88 · 2 years ago
Getting an error that says No such file or directory: 'mvp/1991.html' in the "import requests" cell. Then I was able to make a folder, and the error changed from that to 'charmap' codec can't encode character.
@Dataquestio · 2 years ago
Joshua, this looks like it could be a unicode encoding issue on your machine. I'm assuming that you're using Python 3. In that case, whenever you're opening a file (for writing or reading), specify the encoding as utf-8. For example, you'd replace open("mvp/{}.html".format(year), "w+") with open("mvp/{}.html".format(year), "w+", encoding="utf-8") . When you read the file in again, you should also specify the encoding parameter.
@infamousprince88 · 2 years ago
@@Dataquestio thank you thank you!
@infamousprince88 · 2 years ago
@@Dataquestio ran into another issue with the player_stats_url code and when I tried to do the loop to get all the player data. It is mentioning: "No such file or directory: 'player/2021.html' for the first part and 'No such file or directory: 'player/1991.html' for the loop.
@thiagoduarte7207 · 2 years ago
@@infamousprince88 you have to create the 'player' folder
@infamousprince88 · 2 years ago
@@thiagoduarte7207 thank you!