
Scraping HTML tables into Pandas with read_html

  7,094 views

Python and Pandas with Reuven Lerner


2 years ago

There are numerous ways to "scrape" sites into Python. One particularly powerful way is the read_html function in Pandas. In this video, I show you how you can use it to read data in and then manipulate it for your own needs.
For free, weekly articles about Python and software engineering, check out my Better developers newsletter at BetterDevelope...
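A minimal sketch of the technique the video describes, using a small inline HTML string as a stand-in for a real page (modern pandas wants literal HTML wrapped in StringIO):

```python
from io import StringIO

import pandas as pd

# A small inline table stands in for a real web page
html = """
<table>
  <tr><th>country</th><th>population</th></tr>
  <tr><td>Canada</td><td>38000000</td></tr>
  <tr><td>France</td><td>68000000</td></tr>
</table>
"""

# read_html returns a LIST of DataFrames, one per <table> it finds
dfs = pd.read_html(StringIO(html))
df = dfs[0]
print(df)
```

Note that read_html always returns a list, even when the page contains a single table, so you typically index into the result before working with it.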

Comments: 24
@dqnnny · 2 years ago
Super useful, subbed!
@ReuvenLerner · 2 years ago
Delighted that it helped!
@huangabigail6747 · 9 days ago
Such a great tutorial, very useful. But when I use this code, I get an error, "UnicodeDecodeError: 'utf-8'…" How do I solve it?
@ReuvenLerner · 6 days ago
I'm so glad to hear it helped! If you're getting a "Unicode decode" error, it's usually because you got values that cannot be interpreted as legitimate UTF-8 Unicode values. Maybe it's binary data, maybe it's mistakenly labeled as UTF-8 and isn't, and maybe it's in a different encoding. You might need to use some lower-level tools, retrieving the page with `requests`, and then checking which encoding you can/should use to decode it.
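The try-then-fallback decoding pattern described above can be sketched without any network at all; here, bytes encoded as Latin-1 stand in for a mislabeled page (with `requests`, the real attributes to inspect are `response.encoding` and `response.apparent_encoding`):

```python
# Simulate bytes from a page that was served as Latin-1 but labeled UTF-8
raw = "Café crème".encode("latin-1")

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    # The bytes aren't valid UTF-8 -- fall back to another likely encoding
    text = raw.decode("latin-1")

print(text)
```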
@jessicabrock3220 · 2 years ago
Very helpful
@1994siddhu · 2 years ago
Hello Mr. Lerner, this is Siddharth. I am using read_html in my Python code to read HTML files, and it seems quite powerful. For HTML files smaller than about 2 MB, the command runs in a few seconds. But for larger files of 5 MB or more, read_html takes me about half an hour. Could you please suggest how to read large HTML files more quickly?
@ReuvenLerner · 2 years ago
You'll probably want to retrieve the HTML in the background using "requests", and then have pandas read the data without going through the network. My gut feeling is that this will cut down on the time -- but it could be that the HTML is complex, and that it'll take time and memory no matter what.
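The split described above might look like this sketch, where fetching and parsing are separate steps (the URL is hypothetical, and the parsing half works on any HTML string, including one read from a local file):

```python
from io import StringIO

import pandas as pd


def fetch_html(url):
    # Network step, kept separate from parsing
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_tables(html_text):
    # pandas now parses a plain string -- no network inside read_html
    return pd.read_html(StringIO(html_text))


# The parsing half works the same on any HTML string:
sample = "<table><tr><th>a</th></tr><tr><td>1</td></tr></table>"
tables = parse_tables(sample)
print(tables[0])
```

Separating the two steps also makes it easy to time each one, so you can tell whether the network or the parsing is the bottleneck.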
@1994siddhu · 2 years ago
@ReuvenLerner Okay, thanks for the suggestion. I shall try that and let you know what I get.
@1994siddhu · 2 years ago
@ReuvenLerner Hello Mr. Lerner, I tried using requests, but it didn't work. I think that's because I'm not really "web" scraping; I'm reading an HTML file that is stored on my own system. In the read_html call, I tried the match argument to get only the tables I want and filter out the rest, but that took a similar amount of time, maybe because it still reads through the whole HTML file to find the tables that match my given option. Is there a way to give multiple match options in one read_html call? And yes, I agree: my HTML file does not have the same number and kind of columns throughout, which is why it takes so long. As a last resort, do you think I could run my code across multiple processors, where each processor handles one match input, and then join all the data from all the procs at the end? Could that work? Thank you, Siddharth
@ReuvenLerner · 2 years ago
@1994siddhu read_html isn't meant to do all of the complex things you're asking of it. If you have that much data, or that unknown an HTML layout, then you might have to use something like Beautiful Soup to download and parse the HTML, and then hand Pandas a more traditional data structure. read_html is really meant for relatively simple pages with clear and obvious table layouts. The moment you have a different number of columns, you're kind of sunk. As for multiprocessing, I'm not aware of a way to use it here, except to parse multiple sites or files in parallel. Sorry I can't be of more help!
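A rough sketch of the Beautiful Soup route described above, with a hypothetical table id and column names; when you hand pandas a plain list of lists, rows of uneven length are simply padded with NaN:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table with rows of uneven length
html = """
<table id="specs">
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td><td>extra</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("table", id="specs").find_all("tr"):
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

# Hand pandas a plain list of lists; shorter rows are padded with NaN
df = pd.DataFrame(rows, columns=["name", "value", "note"])
print(df)
```

Because you control the row extraction yourself, irregular layouts that defeat read_html become manageable: you decide how to normalize each row before pandas ever sees it.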
@kevindonovan3911 · 1 year ago
Help... I'd love some guidance. I'm using a few lines to scrape data (tables) from web pages, but starting this month I get NaN instead of the numbers:

from selenium import webdriver
browser = start_firefox(URL, headless=True)
html = browser.page_source
arrays = pd.read_html(html)
for i in arrays:
    print(i)

I love the simplicity of this and use it on several web pages to get stock data. But now I'm only getting the column headings, no data? Any advice would be greatly appreciated. Kevin
@ReuvenLerner · 1 year ago
Sorry, I don't really know much about Selenium.
@suwenhao9864 · 8 months ago
cool video!
@ReuvenLerner · 8 months ago
Glad you enjoyed it!
@carlosfranchy878 · 1 year ago
Hello! First of all, nice video! I'm working on a project where it's very useful to use pd.read_html. The problem I have is that some of the data are PNGs, like the flag in your example. Is there any way to convert these PNGs into arrays? Ty!
@ReuvenLerner · 1 year ago
Not in Pandas, so far as I know. Sounds like you would need some sort of OCR system to turn the graphic into text, but I don't know much about such things, I'm afraid.
@carlosfranchy878 · 1 year ago
@ReuvenLerner Okay, ty for the answer anyway!
@mandarraut9565 · 2 years ago
Hi Reuven, this was helpful, thank you! But I need some more help. For example, I have a set of links from the same website, and I'm trying to get the HTML tables (specification tables). The issue is that I'm saving a separate file for each product, so if I have 20 links, I'm saving 20 different Excel files. What I want is to save all of the HTML tables into one Excel file. Since we're scraping specification tables, most of the time they'll have the same headers with different values. So whenever we scrape a table, its values should be appended under the matching header, and if we find a new header, it should be added to the headers with its values underneath. Please help me with this; I'm unable to do it.
@ReuvenLerner · 2 years ago
According to the documentation for the to_excel method (pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html), you can save to multiple sheets in an Excel document. I've never done it myself, but there's an example toward the end of the documentation that shows you how to do that. I hope this helps!
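If the goal is one combined table rather than one sheet per product, pandas.concat aligns frames by column name: shared headers stack, and a header that appears in only one table becomes a new column with NaN elsewhere, which matches the append-by-header behavior described in the question. A sketch with made-up spec tables:

```python
import pandas as pd

# Made-up specification tables from two products
t1 = pd.DataFrame({"weight": [10], "height": [20]})
t2 = pd.DataFrame({"weight": [12], "color": ["red"]})

# concat aligns on column names; cells missing from a table become NaN
combined = pd.concat([t1, t2], ignore_index=True)
print(combined)

# A single workbook could then be written with, e.g.:
# combined.to_excel("all_specs.xlsx", index=False)  # needs openpyxl
```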
@mandarraut9565 · 2 years ago
@ReuvenLerner Sure, I will try. Thanks for the help.
@pramishprakash · 1 year ago
cool
@ReuvenLerner · 1 year ago
Glad you enjoyed it!
@ye-ym5jo · 2 years ago
Thanks a lot, sir. I tried before with the predetermined link from my online course, and it always said "key error," but when I tried with another URL, it worked. How could this happen?
@ReuvenLerner · 2 years ago
Sorry, but I'm not sure.