Web Scraping NBA Games With Python [Full Walkthrough W/Code]

29,467 views

Dataquest

1 day ago

We'll learn how to scrape NBA box scores with Python and combine them into a pandas DataFrame that you can use for machine learning or data analysis.
We'll download standings data and box scores using Playwright, parse them with BeautifulSoup, and clean them up using pandas. By the end, we'll have a DataFrame with NBA stats from several seasons.
You can see the full code here - github.com/dataquestio/projec... .
Chapters
00:00 - Introduction
01:07 - Scraping box scores
01:55 - Scraping NBA standings with Playwright
23:00 - Parsing NBA standings with BeautifulSoup
28:56 - Parsing box score links with BeautifulSoup
34:27 - Downloading box scores with Playwright
40:16 - Parsing box scores with BeautifulSoup
48:32 - Reading the line score with pandas
52:53 - Reading stats tables with pandas
55:42 - Getting stats ready for machine learning
1:15:49 - Wrap-up and next steps
---------------------------------
Join 1M+ Dataquest learners today!
Master data skills and change your life.
Sign up for free: bit.ly/3O8MDef

Comments: 82
@birasafabrice 1 year ago
a new sub is gained, thank you for this tutorial!
@ling6701 1 year ago
beautiful project, thank you.
@joaothomazlemos13 1 year ago
Hello! I want to do a personal project for my portfolio (I'm trying to get my first data scientist job) with an NBA theme, since I recently became a big fan, and I wanted to build it from zero, which means scraping. The thing is, scraping was the one thing I had zero knowledge of. I found this video, which is absolute pure gold. I'm on Windows, so I had to use sync mode and change a few things. It's working! I also tried to personalize a few things and commented the whole code. I'd love to get in touch with you for some insights from here on, so the project is not a copy of yours, per se. Thank you for the video; this kind of knowledge is much needed! Cheers from Brazil.
@Dataquestio 1 year ago
Glad it helped you!
@tenienteale 1 year ago
hello, thanks for the tutorial!
@mkoller 1 year ago
Nice tutorial! I’m still waiting for the data to download. Impressive if you actually did the entire project with Jupyter Notebooks. I had the Windows Playwright issue everyone is talking about, so I used Pycharm. Ran out of memory so I had to run from the command line. Curious. Why did you make an opp column? You had rows of the same data without it, no?
@shukkkursabzaliev1730 1 year ago
Hey! Thanks for the amazing tutorial. I can't understand one thing. We are preparing all these features for the ML model to train on; however, if we want to predict future games, these features won't be available. So what will be the inputs for the potential trained model?
@kennethcolombe5579 1 year ago
If you are coming from this with some knowledge about basketball, the "standings" mentioned are not actually standings, but the game schedule for that month. It was throwing me off a bit when continuously referenced...not sure it bothers anyone else but thought it worth mentioning.
@Slicneil1 8 months ago
Thanks for the video, I follow all your work. The issue I am having is a continuous timeout error when trying to scrape the data. Any ideas to get around it?
@FlisB 1 year ago
Nice tutorial. I am just curious what the purpose of opening a browser with Playwright is. Why not just use the requests library to get the HTML?
@sohanverma3255 1 year ago
cool
@tomkmb4120 1 year ago
Just noticed your responses below, I'll have to try the code as a Pycharm file and see how I get on
@meechmiliyan8965 1 year ago
Awesome stuff!! I am looking to parse box scores for player data. I would like to get player stats AVG and ideally get AVG for Opponent Defensive stats . Could you suggest next steps?
@FlisB 1 year ago
You want average stats for players? I want to do something similar. Well I want to get moving averages of players, so that I will predict their points scored in the next games.
@SharpeLocks 1 month ago
@@FlisB did you ever figure out how to do this?
@cap_smok3r 1 year ago
Hello ! Thanks for the great tutorial. I am an NBA fan and data nerd myself, and was wondering why you did not make use of the 'nba_api' to get the most up to date data of game ? And if someone does use it, is there a way to build a ticker to predict the win probability of your favorite team(s) next game in an ongoing season ?? Thanks again for the great content !
@AS-rg9ly 11 months ago
The NBA api might only be for private use. I know the NFL api is.
@cap_smok3r 11 months ago
@@AS-rg9ly it is not. I have tried it myself.
@nishchay89 1 year ago
Hi!! I have one query. Why did we take the max of each stat? What is the purpose behind it?
@garymichalske2274 1 year ago
Thanks for the detailed explanation. Since I'm on Windows, I couldn't use Jupyter to run the code, so I've been trying your first option of using a Python IDE (I'm using PyCharm). I imported "from playwright.sync_api import sync_playwright" and eliminated the "async" and "await" keywords throughout the code. I was able to get all the standings pages (after a few timeouts) and was getting excited with the success! But I'm having issues with the box score pages. The code starts with the April 2016 standings file and successfully saves three of the box score files, but starts timing out on the fourth one and eventually throws this error: "UnicodeEncodeError: 'charmap' codec can't encode character '\u010d' in position 38876: character maps to <undefined>". When it does, the related .html file is blank. Firefox seems to work a little better than Chrome, as it doesn't time out as often. Any idea how to get this to work?
@nemanjatamindzija58 1 year ago
Add UTF-8 encoding to the line `with open(save_path, "w+", encoding="utf-8") as f: f.write(html)` at the end of the scrape_game function. Hope it helps.
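A minimal sketch of the encoding fix described in this reply. `save_html`, `load_html`, and the sample file name are hypothetical helpers for illustration, not the tutorial's own functions. On Windows, `open` typically falls back to a legacy code page such as cp1252 when no `encoding` is given, which is why a character like `\u010d` cannot be written.

```python
import os
import tempfile


def save_html(html, save_path):
    """Write scraped HTML to disk as UTF-8, whatever the platform default is."""
    # Without encoding="utf-8", Windows typically uses a legacy code page
    # (for example cp1252), which cannot encode characters such as "\u010d".
    with open(save_path, "w+", encoding="utf-8") as f:
        f.write(html)


def load_html(save_path):
    """Read the file back with the same encoding it was written in."""
    with open(save_path, encoding="utf-8") as f:
        return f.read()


# Hypothetical sample: a player name containing "\u010d" round-trips cleanly.
sample = "<td>Luka Don\u010di\u0107</td>"
path = os.path.join(tempfile.mkdtemp(), "example_box_score.html")
save_html(sample, path)
print(load_html(path) == sample)
```

Reading the files back later should use the same `encoding="utf-8"` argument, otherwise the parsing step can hit the mirror-image decoding error.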
@edsonarthurzancheta3052 1 year ago
@@nemanjatamindzija58 Thank you, i was having the same problem
@andresjacome3315 11 months ago
Hi! Can you please share the final code that you have for this project? I have the same problem on Windows, @garymichalske2274.
@tomphillips5513 6 months ago
Hello! I am doing a similar project but for the NFL. Firstly, is it okay to scrape data from the football reference website, their T&C's are rather unclear. Also, Unlike the basketball reference website where you have to iterate through the months to get all the games, you do not have to do this on the football reference website, therefore, I am wondering how I would have to amend that part of the code. Any help would be very much appreciated as this is for my Final Year Project (Dissertation) at university. Have a great day one and all.
@ScottRachelson777 1 year ago
How is Playwright different from BeautifulSoup which also grabs HTML from website pages?
@herreramoralesjoseroman1504 1 year ago
Help!.. I couldn't instantiate the browser in the "get_html" function, I already changed p.firefox.launch() to p.chromium.launch()... is it necessary to execute any previous command to install the browsers for the library "playwright"..?
@Dataquestio 1 year ago
I showed it in the video - you need to run `playwright install` in the command line, or `!playwright install` in jupyter notebook to install the browsers.
@peperecabarren4536 1 year ago
Hello boss, is it normal to run the scrape season a few times to gather all the data if some of them timeout? Thank you for your time.
@nicksteele6578 1 year ago
I had this issue. I upped the request retries and the timeout time, and I have all the data now.
@peperecabarren4536 1 year ago
@@nicksteele6578 thanks homie
@kirillprokhodtsev6249 4 months ago
Hi! I'm trying to get the 'line score' table, but without success: the table is not found. The Selenium approach isn't suitable here because it takes a lot of time, and I'm afraid my laptop will blow up. Has anyone run into the same problem and solved it?
@jordankasowski284 1 year ago
How do you get past the cookie wall. I can't download the proper HTML because of cookies
@thebinarybin 23 days ago
Never could get Chromium to work. I looked everywhere to find a solution for a very long time. So I ended up using Firefox as well. Does anyone have a solution to the chromium issue to direct to me. I really want to figure that out. Great job with the video! Very intuitive. Wish more content was on the regular.
@OBBBB17 4 months ago
I'm trying to scrape with playwright but PlaywrightTimeout isn't working and I keep getting invalid syntax. Cell In[5], line 13 except PlaywrightTimeout: SyntaxError: invalid syntax
@jerryli2276 22 days ago
Hi, during the parsing part, when I run the code till if len(games) % 100 == 0: print(f"{len(games)} / {len(box_scores)}"), it keeps telling me the error: html5lib not installed, even if I have installed it myself. Could you help me with it?
@f1ip_br 1 year ago
Trying on both Jupyter and PyCharm and getting the same error in the parse_data part. When running it, it throws a ValueError at `line_score = read_line_score(soup)` (line 6 of the box_scores loop), tracing back to line 2 of `def read_line_score(soup):`, which is `line_score = pd.read_html(str(soup), attrs = {'id': 'line_score'})[0]`. The ending message is "ValueError: No tables found". I have checked and double-checked the code, including running the version on GitHub, but there's no way to get it to work. Any ideas? Thank you for an excellent tutorial.
@jerryli2276 22 days ago
got the same problem going on. Have you figured it out, bro?
@f1ip_br 21 days ago
@@jerryli2276 Not really. Started doing some different stuff to learn python, forgot about this project, never went back.
@unclexphil7874 1 year ago
i’m having trouble installing playwright can anyone help?
@joeguerby 1 year ago
Hello, thanks for this amazing tutorial. Has anyone else had an error while installing Playwright? I get the "playwright is not recognized as an internal or external command" message both on the command line and in Jupyter notebook. Can anybody help me, please?
@Dataquestio 1 year ago
You would need to run `pip install playwright` in the command line, or `%pip install playwright` in Jupyter. (remove the `, that's just to show which part is the command).
@joeguerby 1 year ago
@@Dataquestio I did it, but the '!playwright install' failed and I don't know why (you said that we must also run this in Jupyter or on the command line). This is the error message I got: 'playwright' is not recognized as an internal or external command, operable program or batch file. Another request: can you add the current season's results to the CSV files available in the project files, please?
@julesdrums6167 1 year ago
Trying to re-run just the get_data.ipynb in Jupyter Lab on my local machine. I have changed p.chromium.launch() to p.firefox.launch() in get_html() and am still getting the "Timeout error on {url}" when I run `for season in SEASONS: await scrape_season(season)`. Any tips?
@julesdrums6167 1 year ago
Update: couple of nifty tricks to get this part to work. Change `retries` in get_html to at least 5, and you will probably still run into the timeout issue which causes either BeautifulSoup or f.write(html) to error out, so what you need to do is keep running the code over and over again, and keep an eye on the standings directory. As it populates with each season's month's htmls, modify the seasons variables to exclude those years (e.g. change it from SEASONS = list(range(2016,2024)) to SEASONS = list(range(2017,2024)) and keep iterating up that lower bound as needed).
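Rather than editing `SEASONS` by hand between runs, a re-run can skip whatever is already on disk. The sketch below is one way to do that under stated assumptions: `is_downloaded` and `pending_urls` are hypothetical helpers, the example.com URLs are placeholders, and a zero-byte file is treated as a failed download because blank .html files are reported elsewhere in these comments.

```python
import os
import tempfile


def is_downloaded(save_path):
    """True if the file exists and is non-empty; blank files count as failures."""
    return os.path.exists(save_path) and os.path.getsize(save_path) > 0


def pending_urls(urls, directory):
    """Return (url, save_path) pairs that still need to be downloaded."""
    pending = []
    for url in urls:
        # Name the local file after the last path segment of the URL.
        save_path = os.path.join(directory, url.split("/")[-1])
        if not is_downloaded(save_path):
            pending.append((url, save_path))
    return pending


# Hypothetical demo: one finished file, one blank file, one missing file.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "a.html"), "w", encoding="utf-8") as f:
    f.write("<html></html>")
open(os.path.join(demo_dir, "b.html"), "w").close()  # simulates a blank download

demo_urls = [f"https://example.com/boxscores/{name}.html" for name in "abc"]
for url, save_path in pending_urls(demo_urls, demo_dir):
    print("still needed:", url)
```

With a check like this in the download loop, the whole scrape can simply be run again after a timeout, and only the missing or blank pages are requested.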
@dimz130588 1 year ago
same here
@nicksteele6578 1 year ago
I can't get all the data to scrape any suggestions??!!
@kinetiksports 2 months ago
I need help!! Can I email you with an error message I keep getting?
@Jollyjoky 1 month ago
Hi, I keep getting a NotImplementedError when trying to do this project on Windows: """Create subprocess transport.""" --> 524 raise NotImplementedError. Could someone help me? How do I correct it? I'm running on Windows and VS Code.
@hybridinc1035 1 year ago
This whole thing is not working. Tried like a thousand times, kept getting the same error. I can send screenshot if possible
@coconutnut21 6 months ago
code gives this result "NotImplementedError: Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings..."
@Laochuang1 5 months ago
I have the same problem, did you manage to solve it?
@ryanbiancavilla9921 5 months ago
Also getting this error
@zakyvids6566 1 year ago
Please make a python crashcourse
@jomagabarsa 3 months ago
I tried to replicate it but it didn't work for some reason. When I call the get_html function I get a not implemented error but it doesn't say anything. Nice tutorial though
@optimist4472 1 year ago
the program shows NotImplementedError after executing "html = await get_html....." and I have done every step as you have shown
@Dataquestio 1 year ago
It looks like there is an issue with Playwright and Jupyter on certain versions of Windows/Python (see issue at github.com/scrapy-plugins/scrapy-playwright/issues/7 ). Your options:
* Put the code into a regular `.py` file and run it as a Python script (not in Jupyter notebook) (easiest)
* Install Windows Subsystem for Linux (WSL) and run Jupyter notebook using WSL
* Try to upgrade your version of Python/Jupyter and see if that works
@davidichoho5788 1 year ago
@@Dataquestio Please, I'm having the same issue and I'm using Windows.
@marcgold424 1 year ago
using windows 10, VSC, i get: html = await get_html(url, "#content .filter") SyntaxError: 'await' outside function. we cant make html a global variable or put it in the function huh? can we use something else besides playwright? 😞
@Dataquestio 1 year ago
If you write your code in a regular python file (no Jupyter notebook), then you can use the Playwright sync api (instead of the async api). You'll have to replace the import with `from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout`. Then you'll need to remove all the `async` and `await` keywords in the code, and write `with sync_playwright() as p:` inside the `get_html` function. This will remove the need for async entirely. But it won't work with Jupyter notebook, only with a regular python file.
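The conversion described in this reply could look roughly like the sketch below. The parameter names and the "Timeout error" message come from tracebacks and comments on this page; the default values `sleep=5` and `retries=3` and the explicit `browser.close()` are assumptions, and the import sits inside the function only so that the file still loads where Playwright is not installed.

```python
import time


def get_html(url, selector, sleep=5, retries=3):
    """Return the inner HTML of `selector` at `url`, or None if every try times out."""
    # Imported here so the module loads even without Playwright installed;
    # a top-level import works just as well in a real script.
    from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

    html = None
    for i in range(1, retries + 1):
        time.sleep(sleep * i)  # wait longer before each successive attempt
        try:
            # Sync API: plain `with`, and no `async` or `await` anywhere.
            with sync_playwright() as p:
                browser = p.firefox.launch()
                page = browser.new_page()
                page.goto(url)
                html = page.inner_html(selector)
                browser.close()
        except PlaywrightTimeout:
            print(f"Timeout error on {url}")
            continue
        else:
            break  # success, so stop retrying
    return html


# Example call from a regular .py file (`url` defined elsewhere in the script):
# html = get_html(url, "#content .filter")
```

Every caller changes the same way: `html = await get_html(...)` becomes `html = get_html(...)`, and the functions that call it lose their `async` keyword.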
@Migzee34 1 year ago
@@Dataquestio I will try this when I get home later. A few questions, if you see this: are you using Windows? Also, would you recommend I just substitute Selenium or, as you said, run it as a Python script? Thanks for the content, love the channel.
@AkachiIsGod 1 year ago
SyntaxError: 'await' outside function
@pirrisynho 1 year ago
same here, I scraped with requests and it works...
@mkzzzzzzzzzz1 1 year ago
28:50 How can you run await outside of a function? I don't really use jupyter. I tried something like z = [await scrape_season(x) for x in SEASONS], scrape_season(z) but neither worked. Any help appreciated
@Dataquestio 1 year ago
You can use await inside Jupyter notebook since everything in Jupyter is already running inside an async event loop. I would recommend stripping out async if you're writing a regular Python script outside of Jupyter. You'll use the Playwright sync api (instead of the async api). You'll have to replace the import of playwright with `from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout`. Then you'll need to remove all the `async` and `await` keywords in the code, and write `with sync_playwright() as p:` inside the `get_html` function. This will remove the need for async entirely. But it won't work with Jupyter notebook, only with a regular python file.
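An alternative that this reply does not mention: a regular script can keep the coroutines and start its own event loop with `asyncio.run`, which is the standard-library way to call `await` code from synchronous code. In the sketch below, `scrape_season` is a hypothetical stub standing in for the real coroutine, and the season range is the one quoted in another comment on this page.

```python
import asyncio

SEASONS = list(range(2016, 2024))  # season range quoted in the comments


async def scrape_season(season):
    """Hypothetical stub; the real coroutine would download that season's pages."""
    await asyncio.sleep(0)  # placeholder for the awaited Playwright calls
    return season


async def main():
    # `await` is legal here because main() is itself a coroutine.
    results = []
    for season in SEASONS:
        results.append(await scrape_season(season))
    return results


if __name__ == "__main__":
    # asyncio.run creates the event loop that Jupyter would otherwise provide.
    print(asyncio.run(main()))
```

This only applies to a plain `.py` file. Inside Jupyter a loop is already running, so `asyncio.run` raises a RuntimeError there and a bare `await` is the right form.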
@keithravid5235 1 year ago
@@Dataquestio I tried this and got: " Error: It looks like you are using Playwright Sync API inside the asyncio loop. Please use the Async API instead. " As far as I can tell I'm not using an asyncio loop anymore since I made the changes you mentioned.
@Dataquestio 1 year ago
If you're running in Jupyter, you need to use async (like in the video). If you're writing a regular .py file and running from the command line, then you can use the sync api like I mentioned in the comment above. You wouldn't get the error that you shared if you're running a regular python script (create a `x.py` file, run using `python x.py` from the command line). Jupyter by default wraps code in an asyncio loop. So anything you run in Jupyter is already running async!
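To find out which situation a piece of code is in, the standard library can report whether an event loop is already running. This is a small diagnostic sketch; `in_event_loop` and `check_inside` are hypothetical names, not part of Playwright or the tutorial code.

```python
import asyncio


def in_event_loop():
    """Return True when called from code that an asyncio loop is running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running: this is the plain `python x.py` situation,
        # where the Playwright sync API can be used.
        return False
    # A loop is running (as in a Jupyter cell), so the async API is needed.
    return True


async def check_inside():
    """Same check, made from inside a coroutine."""
    return in_event_loop()


print("running inside a loop:", in_event_loop())
print("inside asyncio.run:", asyncio.run(check_inside()))
```

Calling `in_event_loop()` at the top of a notebook cell versus at the top of a script shows the difference described in this reply.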
@mkzzzzzzzzzz1 1 year ago
@@keithravid5235 forwarding the message because he replied directly to me so you won't see it. check above/below.
@jonathanschild4092 1 year ago
I keep getting the below after running the for season in SEASONS loop. I'm writing it in regular python script, vice Jupyter in case that's a factor. playwright._impl._api_types.Error: NS_ERROR_UNKNOWN_HOST
@Laochuang1 6 months ago
did you solve it?
@manjunathreddy5566 1 year ago
Task exception was never retrieved
future:
Traceback (most recent call last):
  File "C:\Users\manjunatha.reddy\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\_impl\_connection.py", line 224, in run
    await self._transport.connect()
  File "C:\Users\manjunatha.reddy\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\_impl\_transport.py", line 133, in connect
    raise exc
  File "C:\Users\manjunatha.reddy\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\_impl\_transport.py", line 121, in connect
    self._proc = await asyncio.create_subprocess_exec(
  File "C:\Users\manjunatha.reddy\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec
    transport, protocol = await loop.subprocess_exec(
  File "C:\Users\manjunatha.reddy\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec
    transport = await self._make_subprocess_transport(
  File "C:\Users\manjunatha.reddy\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
    raise NotImplementedError
NotImplementedError

Kindly help me with the above error.
@user-vu3pf8fe8b 1 year ago
I also got the same error. What's wrong? Task exception was never retrieved future: Traceback (most recent call last): File "C:\Users\JU HEE RYONG\anaconda3\lib\site-packages\playwright\_impl\_connection.py", line 224, in run await self._transport.connect() File "C:\Users\JU HEE RYONG\anaconda3\lib\site-packages\playwright\_impl\_transport.py", line 133, in connect raise exc File "C:\Users\JU HEE RYONG\anaconda3\lib\site-packages\playwright\_impl\_transport.py", line 121, in connect self._proc = await asyncio.create_subprocess_exec( File "C:\Users\JU HEE RYONG\anaconda3\lib\asyncio\subprocess.py", line 236, in create_subprocess_exec transport, protocol = await loop.subprocess_exec( File "C:\Users\JU HEE RYONG\anaconda3\lib\asyncio\base_events.py", line 1630, in subprocess_exec transport = await self._make_subprocess_transport( File "C:\Users\JU HEE RYONG\anaconda3\lib\asyncio\base_events.py", line 491, in _make_subprocess_transport raise NotImplementedError NotImplementedError --------------------------------------------------------------------------- NotImplementedError Traceback (most recent call last) in ----> 1 html = await get_html(url, "#content .filter") in get_html(url, selector, sleep, retries) 4 time.sleep(sleep * i) 5 try: ----> 6 async with async_playwright() as p: 7 browser = await p.firefox.launch() 8 page = await browser.new_page() ~\anaconda3\lib\site-packages\playwright\async_api\_context_manager.py in __aenter__(self) 44 if not playwright_future.done(): 45 playwright_future.cancel() ---> 46 playwright = AsyncPlaywright(next(iter(done)).result()) 47 playwright.stop = self.__aexit__ # type: ignore 48 return playwright ~\anaconda3\lib\site-packages\playwright\_impl\_connection.py in run(self) 222 self.playwright_future.set_result(await self._root_object.initialize()) 223 --> 224 await self._transport.connect() 225 self._init_task = self._loop.create_task(init()) 226 await self._transport.run() 
~\anaconda3\lib\site-packages\playwright\_impl\_transport.py in connect(self) 131 except Exception as exc: 132 self.on_error_future.set_exception(exc) --> 133 raise exc 134 135 self._output = self._proc.stdin ~\anaconda3\lib\site-packages\playwright\_impl\_transport.py in connect(self) 119 env.setdefault("PLAYWRIGHT_BROWSERS_PATH", "0") 120 --> 121 self._proc = await asyncio.create_subprocess_exec( 122 str(self._driver_executable), 123 "run-driver", ~\anaconda3\lib\asyncio\subprocess.py in create_subprocess_exec(program, stdin, stdout, stderr, loop, limit, *args, **kwds) 234 protocol_factory = lambda: SubprocessStreamProtocol(limit=limit, 235 loop=loop) --> 236 transport, protocol = await loop.subprocess_exec( 237 protocol_factory, 238 program, *args, ~\anaconda3\lib\asyncio\base_events.py in subprocess_exec(self, protocol_factory, program, stdin, stdout, stderr, universal_newlines, shell, bufsize, encoding, errors, text, *args, **kwargs) 1628 debug_log = f'execute program {program!r}' 1629 self._log_subprocess(debug_log, stdin, stdout, stderr) -> 1630 transport = await self._make_subprocess_transport( 1631 protocol, popen_args, False, stdin, stdout, stderr, 1632 bufsize, **kwargs) ~\anaconda3\lib\asyncio\base_events.py in _make_subprocess_transport(self, protocol, args, shell, stdin, stdout, stderr, bufsize, extra, **kwargs) 489 extra=None, **kwargs): 490 """Create subprocess transport.""" --> 491 raise NotImplementedError 492 493 def _write_to_self(self): NotImplementedError:
@NotOnVerg 1 year ago
@@user-vu3pf8fe8b Same problem here
@md.shafaatjamilrokon8587 1 year ago
did you solve it?
@ScottRachelson777 1 year ago
@@user-vu3pf8fe8b Yep, I got the same error. This is why programming can be so frustrating.
@Digital-Light 1 year ago
the same problem (