Extracting data from PDF files using Python

Views: 45,707

YUNIKARN

Days ago

Comments: 72

@ktmt100 2 years ago

Fantastic! My boss is a YouTuber.

@YUNIKARN 2 years ago

YouTube is very hard work! I am trying to get better :-). We are always looking for presenters on the channel :-)

@seungholee8552 2 years ago

Very useful video, thank you!

@YUNIKARN 2 years ago

Thanks, Seungho. Python is the way!

@michaelobrist4716 2 years ago

Hi y'all! Thank you very much for this video. I've tried for hours to write a script that does exactly what you explain here. I had almost given up, but then my YouTube algorithm brought me here, to the most comprehensive pypdf search string tutorial I've seen so far. However, I keep running into this freaking "TypeError: a bytes-like object is required, not 'dict'", which seems to be a thing with pypdf2 and python3. I've already researched this topic for quite a while and just couldn't solve it. Since this video is relatively new, maybe there's hope that you or somebody else in here knows what to do? Thanks anyway, great tutorials!

@YUNIKARN 2 years ago

Hi Michael, thanks for your comment. This is much appreciated! The issue you encounter can occur for many reasons. (1) I suggest working with virtual environments to ensure version control. (2) Different characters (e.g., Chinese) need to be replaced with their HTML counterparts. I had such issues as I tend to work on China. (3) If nothing works, you might need to move to pdfminer. I hope that helps. I am working on another video focused on PDF files. Best wishes, Gerhard

@YUNIKARN 1 year ago

I did a new video on PyPDF2 changes and how to address them using virtual environments. Thanks for your comment! kzbin.info/www/bejne/aWbTY2BtaceLhLM

@michaelobrist4716 1 year ago

@YUNIKARN Thanks for the video! Highly appreciated!

@seanredmond9212 1 year ago

this is a helpful video. thank you :)

@YUNIKARN 1 year ago

Glad it was helpful!

@kibtiachowdhury6011 2 years ago

Hi. I want to extract only the paragraphs and titles, without any tables or figures, from multiple PDF files. How can I solve this?

@YUNIKARN 2 years ago

If your PDFs are academic papers, the easiest approach is to use Google Scholar (API) and obtain titles and abstracts. Then you don't need to handle PDFs, which is faster. Otherwise you have to think about how to identify titles from PDF files, which is harder. You can get in touch (see channel pages) if you need help. We do projects and bespoke training.

@agustincsn 1 year ago

Fantastic tutorial, thanks. I wonder how to do it if we want to search for multiple terms and, at the end, make a table (csv) out of the results? Thanks

@YUNIKARN 1 year ago

I am glad that you enjoyed the tutorial. I have another video, which shows how to find the most common words in PDF files (see link). Yes, you can modify my code to look for several words and store the results in lists or other data structures. These lists can be exported into csv or Excel files (or many other formats). We can guide you if you need support. You can find our email address on the Channel pages. May the Power be with you! kzbin.info/www/bejne/aaSTXod9gcd1aq8

@mirof1169 2 years ago

Hi there, thanks for the great video. Is there any way we can pick up the words/terms that occur the most? Instead of searching for a word, ask Python to show us, say, the top 10 or 20 words that repeat the most?

@YUNIKARN 2 years ago

I am glad that you find this video useful. Finding the most common words is a nice problem. One has to remove stopwords (e.g., and, the, in) to get meaningful results. I am working on a video to address your question properly. I plan to upload this video on 4th July 2022 at 10am GMT. You can get updates on my channel and my Facebook page (link on the channel). Python is the way!
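A minimal sketch of the stopword idea in the reply above, assuming the page text has already been extracted into a string. The function name top_words and the tiny stopword set are illustrative assumptions, not the code from the video; a real stopword list would be much longer.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption, far from complete).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def top_words(text, n=10, stopwords=STOPWORDS):
    """Return the n most frequent words in text, ignoring stopwords."""
    # Lowercase the text and keep alphabetic runs (apostrophes allowed inside).
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


sample = "The director and the board met. The director signed the report in the office."
print(top_words(sample, n=3))
```

For a whole document, the page texts could be joined into one string before calling top_words.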

@mirof1169 2 years ago

@YUNIKARN Thank you, sir. I appreciate your work.

@rivaltersilva9216 2 years ago

Excellent class. But how could I find words and select the entire sentence containing them? Walter from Brazil

@YUNIKARN 2 years ago

Thanks for your comment! In principle, one could use the split method in Python and a list comprehension. For instance: page = "Hallo World yet again. I can see you. Find the word." then use: result = [sentence + '.' for sentence in page.split('.') if 'word' in sentence]. This might be a nice problem for another video. Best wishes, Gerhard
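A slightly extended sketch of the split idea in the reply above. It swaps in a regular expression so that sentences ending in "!" or "?" are handled and the search term is matched as a whole word, ignoring case. The function name sentences_containing is hypothetical, and the sentence splitting is a rough heuristic that will mis-split abbreviations such as "e.g.".

```python
import re


def sentences_containing(text, term):
    """Return the sentences in text that contain term as a whole word."""
    # Split after ., ! or ? when followed by whitespace (rough heuristic).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


page = "Hallo World yet again. I can see you. Find the word."
print(sentences_containing(page, "word"))
```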

@juhaszat 1 year ago

Superb content Michael! Could you please remove the ")" from the GitHub repo link?

@YUNIKARN 1 year ago

Updated - hope it works

@umamaheswararaom7909 2 years ago

How can I convert data from different tables in a scanned-image PDF into an Excel/CSV file?

@YUNIKARN 2 years ago

Thanks for your question. Converting tables in PDF files into Excel files can be tricky. This requires another video. You can get updates on my channel and my Facebook page (link on the channel). Python is the way!

@walkwithus6536 1 year ago

@YUNIKARN Yeah, please make a video as soon as possible

@YUNIKARN 1 year ago

@walkwithus6536 you can always drop us a line (email see Channel pages) if you need a tailor-made solution

@SuperPaulofeitosa 1 year ago

Excellent video, congratulations. Is it possible to search for several words in the same line? Example: From: Paulo Feitosa Sent: quinta-feira, 1 de dezembro de 2022 17:48. I have a PDF with many occurrences of the words From and Sent; I want to search for them and also get the matching line from the PDF doc.

@YUNIKARN 1 year ago

That is possible. I have done another video on PDF files, which looks at related problems: kzbin.info/www/bejne/aaSTXod9gcd1aq8 - just get in touch if you need help (email on channel pages or www.yunikarn.com). May the Power be with you!

@YUNIKARN 1 year ago

My new Company Valuation course is out! Limited offer for USD 9.99 (expires in four days): www.udemy.com/course/company-valuation-a-guide-for-analysts-investors-and-ceos/?couponCode=FEA4E8F50C8E011B61F2

@catesconsultinggroupllc937 1 year ago

Greetings, great video tutorial. I have a question: I was able to search for a string of words using this code without any modifications. What I would like to do is return something based on the search words. For example, if I'm searching for the date something occurred, there is typically a preceding string: "Date of Service" should have a date following it, e.g. "Date of Service" 01/05/2019, and I want to return the date: 01/05/2019. Two changes would seem to be needed: how do I return the date, given that it is not the search term itself, and, since it is not a string, would we need to change the str anywhere in the code?

@YUNIKARN 1 year ago

I worked on a somewhat related problem. The task was to explore words in their context. The challenge is to ensure that all dates are captured even if the date formats change. A two-step approach is usually best. 1. Get the whole sentence that contains your search term (careful with page breaks). 2. Use an algorithm to filter dates. Drop us a line (see channel pages) if you want a chat. May the Force be with you!
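One way to sketch the "filter dates" step from the reply above is a regular expression anchored on the label. This is only an illustration under stated assumptions: the function name date_after is hypothetical, and the pattern covers only numeric dates such as 01/05/2019 or 1-5-19, not every format a real document might use.

```python
import re

# Numeric dates such as 01/05/2019 or 1-5-19 (assumed formats only).
DATE_PATTERN = r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"


def date_after(label, text):
    """Return the first date that directly follows label in text, or None."""
    # Allow optional quotes, colons and whitespace between label and date.
    pattern = re.compile(
        re.escape(label) + r"[\s:\"']*(" + DATE_PATTERN + r")",
        re.IGNORECASE,
    )
    match = pattern.search(text)
    return match.group(1) if match else None


page = 'Patient: J. Doe. "Date of Service" 01/05/2019. Billed: 02/01/2019.'
print(date_after("Date of Service", page))
```

The extracted page text is already a str, so the match comes back as a string. It could be converted with datetime.strptime once it is known whether the documents use day/month or month/day order.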

@gard8995 2 years ago

Hi. Thanks for a very helpful tutorial. Would it be possible to search for several strings at the same time and get an output something along these lines: "Word A was found X times on pages x, y, z", "Word B was found X times on pages x, y, z", and so on? Also, on top of that, could one run this script on several PDF files at the same time to get an output along these lines: "Word A was found X times on pages x, y, z in document1", "Word A was found X times on pages x, y, z in document2", "Word B was found X times on pages x, y, z in document10"? I'm a Python newbie, so apologies in advance if my questions are stupid.

@YUNIKARN 2 years ago

Thanks! Yes, this can be achieved using loops (list of strings) and you can also loop through all pdf files in a folder. This would be a nice exercise.
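The loop over a list of strings mentioned in the reply above might look like the sketch below. It assumes the text of each page has already been extracted into a list of strings; the function name find_terms and the sample pages are made up for illustration.

```python
import re


def find_terms(pages, terms):
    """Map each term to (total hits, list of 1-based page numbers).

    pages is a list of strings, one per PDF page (already extracted).
    """
    results = {}
    for term in terms:
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        total, found_on = 0, []
        for number, text in enumerate(pages, start=1):
            hits = len(pattern.findall(text))
            if hits:
                total += hits
                found_on.append(number)
        results[term] = (total, found_on)
    return results


# Stand-in page texts; with real files these would come from the PDF reader.
pages = ["The director resigned.", "No match here.", "Director and director."]
for term, (total, found_on) in find_terms(pages, ["director", "board"]).items():
    print(f"{term} was found {total} times on pages {found_on}")
```

To cover several documents, the same function could be called once per file and the file name added to each printed line.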

@tedmac8984 2 years ago

Sir, thanks for the great service. Can you help me? I want to extract the data for each word from a PDF into Excel.

@YUNIKARN 2 years ago

I am glad you find it useful. I have done a related video, which identifies the most common words in a PDF document and returns a list. Lists can be exported as Excel files. (Link: kzbin.info/www/bejne/aaSTXod9gcd1aq8). In short, what you are looking for can be done. If you need further help, we do consulting projects and develop bespoke training. Our contact details are on the channel page.

@michaelmraz2707 1 year ago

Then how do you put that "Director, 31 times" result into an output table? I am trying to extract specific data from PDFs; for example, it would extract all rent expenses from a Financial Statement and tabulate the numbers into an output table. Any ideas?

@YUNIKARN 1 year ago

I have done another video kzbin.info/www/bejne/aaSTXod9gcd1aq8 on PDF files, where the search result is organised in a list. From Python lists (or other types), it is easy to construct tables (e.g., convert to Pandas DataFrame and export as csv or Excel file/table). However, if your PDF input refers to tables, you will need to modify your approach. The camelot library might be a useful starting point. Please get in touch if you want to discuss this problem in more detail. You can book consultations online www.yunikarn.com or drop us an email (see channel pages). May the Power be with you!
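A small sketch of the list-to-table step described in the reply above. The reply suggests a Pandas DataFrame; to stay dependency-free this example swaps in the standard-library csv module instead. The function name results_to_csv, the column names and the sample rows are all illustrative assumptions.

```python
import csv
import io


def results_to_csv(rows, header=("term", "page", "count")):
    """Serialise an iterable of row tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up search results: (term, page number, hits on that page).
rows = [("Director", 1, 4), ("Director", 7, 2), ("rent", 12, 1)]
csv_text = results_to_csv(rows)
print(csv_text)
```

Writing csv_text to a file with a .csv extension gives something Excel can open directly.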

@hariprasad-ch2qc 10 months ago

Can we identify a table in the PDF and represent the same in a tabular format?

@YUNIKARN 10 months ago

If your PDF input refers to tables, you will need to modify your approach. The camelot library might be a useful starting point.

@academysolution8074 1 year ago

Is it possible to extract only the text that is in a red font from a PDF, by using the font?

@YUNIKARN 1 year ago

As far as I know, PyPDF2 does not implement this. pdfplumber is able to extract font colour.

@alvin3428 2 years ago

Hey! Thank you so much for such a wonderful video. I have a question: what if we have different purchase orders in different formats? How can we get the specific information out of them using Python? I am doing a college year project and am unable to proceed.

@YUNIKARN 2 years ago

Hi Alvin, thanks for your comment! I need more details to answer your question: (1) what do you mean by different formats (file type or text/tables)? (2) I need a minimum working example to understand the structure of the files. Email or DM on Twitter/Facebook might be easier. Python is the Way!

@alvin3428 2 years ago

@YUNIKARN Hey! Thank you for responding. So, the purchase orders are all PDFs. By different formats I mean that the incoming purchase orders use different templates, which makes it difficult to extract certain data each time and load it into Excel. I am looking for something which could extract PO no., Quantity, Price etc. from these PDF files (the fields could be located anywhere, since we have varying templates and not a standard one). Please help, I really want to pull off this project and make something useful.

@YUNIKARN 2 years ago

@alvin3428 Hi Alvin, can you email a sample pdf file? Details are on the channel page. If the data is unstructured (e.g., not in a table), it might be hard to do. Best wishes, Gerhard

@feliciak3483 1 year ago

Hi, this video is super helpful for understanding the process, thank you! However, when I run the code, I keep getting this exception: "PyPDF2.errors.DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead." So I changed PdfFileReader to PdfReader in the code and then it said: "PyPDF2.errors.DeprecationError: reader.getNumPages is deprecated and was removed in PyPDF2 3.0.0. Use len(reader.pages) instead." I'm a little confused about how to change the code from here, or what exactly to change to len(reader.pages), because substituting it into the existing code didn't work. Do you have any suggestions? Did PyPDF2 change?

@YUNIKARN 1 year ago

PyPDF2 and Python packages in general keep changing. In addition, their dependencies might change. This is the main reason why I use virtual environments (version control). There are two approaches: 1. Install an older version of PyPDF2 using pip (ideally use a virtual environment). 2. Read the documentation and update your code. Visit us on www.yunikarn.com or drop us an email if you need help. May the Power be with you!
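For the second approach in the reply above (updating the code), the sketch below shows the PyPDF2 3.x names that replace the removed ones quoted in the error messages. It is not the code from the video: the function names read_pages and pages_with_term and the file name report.pdf are placeholders chosen for illustration.

```python
def read_pages(path):
    """Return the text of each page of the PDF at path (PyPDF2 >= 3.0)."""
    # Imported lazily so the helper below also works without PyPDF2 installed.
    from PyPDF2 import PdfReader  # replaces PdfFileReader

    reader = PdfReader(path)
    # len(reader.pages) replaces reader.getNumPages(),
    # reader.pages[i] replaces reader.getPage(i),
    # page.extract_text() replaces page.extractText().
    return [page.extract_text() or "" for page in reader.pages]


def pages_with_term(page_texts, term):
    """Return 1-based page numbers whose text contains term (case-insensitive)."""
    term = term.lower()
    return [n for n, text in enumerate(page_texts, start=1) if term in text.lower()]


# Usage with a real file (the file name is a placeholder):
# print(pages_with_term(read_pages("report.pdf"), "Director"))
print(pages_with_term(["Board of Directors", "Appendix"], "director"))
```

For the first approach, installing with pip install "PyPDF2<3.0" inside a virtual environment keeps the old method names available.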

@saeedewu129 1 year ago

Hi. Thnx for your video. Is it possible to extract multiple search terms from multiple pdf files at a time?

@YUNIKARN 1 year ago

Multiple search terms could be arranged in a list and you can loop through it. You might prefer your output arranged differently (e.g., dictionary, Excel file etc.). I have done another video that reads PDFs and outputs the most common words. You might find that helpful. Finally, Python can go through several PDF files. There are many ways to do it. An easy option is to store all files in the same folder and then go through the folder in a loop. May the Power be with you!
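The "same folder" option from the reply above can be sketched with pathlib. The function name pdf_files is hypothetical, and the demonstration uses empty placeholder files in a temporary folder rather than real PDFs.

```python
import tempfile
from pathlib import Path


def pdf_files(folder):
    """Return the PDF files directly inside folder, sorted by name."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")


# Demonstration with a temporary folder and empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        (Path(folder) / name).touch()
    names = [p.name for p in pdf_files(folder)]
    print(names)
```

Inside the loop over pdf_files(folder), each path would be handed to whatever function reads and searches a single PDF.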

@saeedewu129 1 year ago

@YUNIKARN Many thnx for ur reply. Will work on that. Is there any way to communicate with you to get tips or advice when I try to do it by myself and face a problem?

@YUNIKARN 1 year ago

@saeedewu129 You can find our email on the channel page, or visit www.yunikarn.com

@saeedewu129 1 year ago

@YUNIKARN okay. many thnx

@walkwithus6536 1 year ago

How can I extract tables from PDF files into Excel?

@YUNIKARN 1 year ago

I have found a few videos by other creators that cover this topic (mostly for financial accounting). I might do a video on it in future - but my production pipeline is full for the next 4-5 weeks

@harishbollineni2588 2 years ago

How do I install pip for a virtual environment?

@YUNIKARN 2 years ago

Hi Harish, if you use Anaconda, you need to work with the conda environment for updates. If you run Python directly, you can install pip as follows: on Windows, download get-pip.py (do a Google search). This needs to be on the same path as your Python installation. Then change the directory into that folder. Use cmd (command prompt) and type python get-pip.py. Finally, check the installation using pip -V. Python is the Way!

@yck3810 2 years ago

Hi, may I know what Python version you are using in this video? I am using version 3.8; however, I am not sure why, but I think the extractText() function seems to be obsolete.

@YUNIKARN 2 years ago

Thanks for your comment! Based on the documentation (pypi.org/project/PyPDF2/) the latest version of PyPDF2 should work fine with Python 3.8 and higher. For this video, I used Python 3.7.9 in my virtual environment and PyPDF2 version 1.27.1. One has to note that the extractText method has its limitations depending on the type of PDF file. I should do another video on it. Best wishes, Gerhard

@yck3810 2 years ago

@YUNIKARN Hi Gerhard, first of all, thank you for your prompt response. Yes, I should have corrected my statement. The extractText() function is not obsolete. However, it doesn't work well with all types of PDF: in my case, some of the PDF files work well, but some don't (I still have no idea how to tell which types of PDF are supported and which are not). Anyway, thanks again for the documentation link provided. Keep up the good work. 👍

@YUNIKARN 2 years ago

@yck3810 Yes, sadly the extractText() method has limitations. I will do a few more videos on fun with PDFs using Python. Best wishes, Gerhard

@walkwithus6536 1 year ago

The GitHub link is not working

@YUNIKARN 1 year ago

I tested the link github.com/GerhardKling/DataWrangling/tree/main/DataExtractionPDF in the description. It seems to work fine for me. Drop me a line (see Channel page for email) if you are having trouble, and I can send you the files by email. May the Power be with you!

@picklenickil 1 year ago

TLDR : langchain

@YUNIKARN 1 year ago

Langchain rules

@valmirrastelyjunior9400 9 months ago

@YUNIKARN 9 months ago

All right - have a great 2024!

@Baka_Oppai 1 year ago

pypdf2 is just a mess of errors

@YUNIKARN 1 year ago

Yes, it is messy ... 🫠

@umamaheswararaom7909 2 years ago

How can I convert tables in a scanned-image PDF into an Excel/CSV file?

@YUNIKARN 2 years ago

@umamaheswararaom7909 2 years ago

@YUNIKARN A scanned-image PDF needs OCR extraction, which is not required for a normal PDF. Or is it the same way for both?

@YUNIKARN 2 years ago

@umamaheswararaom7909 For scanned images, OCR is the way to go. If the table is part of a pdf file, other methods might work as well. I will cover these aspects in future videos