Extract tabular data from PDF with Python

Extract tabular data from PDF with Python - Tabula, Camelot, PyPDF2

Views: 128,818


Comments: 166

@softhints 5 years ago

The notebook link: github.com/softhints/python/blob/master/notebooks/Python%20Extract%20Table%20from%20PDF.ipynb
Tabula - 1:50
Camelot - 7:48
PyPDF2 - 9:07

@matheusrodrigues-kf6pj 3 years ago

Thank you for showing us tabula! Really helpful!

@softhints 3 years ago

Glad it was helpful! Cheers!

@amiramorsli2265 1 year ago

How can I delete the header and footer from PDF pages using the PyPDF2 library in Python? Thank you!

@softhints 1 year ago

It depends on the PDF file, but you can check this one: pypdf2.readthedocs.io/en/latest/user/extract-text.html
def visitor_body(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if y > 50 and y < 720:
        parts.append(text)
Cheers!
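The visitor idea in the reply above can be sketched as a complete function. This is a minimal sketch, assuming PyPDF2 3.x (where `PdfReader` and `page.extract_text(visitor_text=...)` are available). The helper names `make_body_visitor` and `extract_body_text` are hypothetical, and the 50/720 limits are just the values from the reply; they will need adjusting to the page size and margins of a given PDF.

```python
def make_body_visitor(parts, y_min=50, y_max=720):
    """Return a visitor that keeps only text whose y position lies strictly between the limits."""
    def visitor_body(text, cm, tm, font_dict, font_size):
        y = tm[5]  # vertical position taken from the text matrix
        if y_min < y < y_max:
            parts.append(text)
    return visitor_body


def extract_body_text(pdf_path, y_min=50, y_max=720):
    """Extract text from every page, skipping the header and footer bands (hypothetical helper)."""
    from PyPDF2 import PdfReader  # imported lazily so the filter can be used without PyPDF2

    parts = []
    visitor = make_body_visitor(parts, y_min, y_max)
    for page in PdfReader(pdf_path).pages:
        page.extract_text(visitor_text=visitor)
    return "".join(parts)
```

Calling `extract_body_text("report.pdf")` (the file name is a placeholder) would return the text with anything above y=720 or below y=50 dropped.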

@amiramorsli2265 1 year ago

@softhints Thanks :)

@umamaheswararaom7909 2 years ago

How do I extract tables from a scanned-image PDF? What's the best library for OCR extraction, and how do I label the data in such documents?

@softhints 2 years ago

It depends on the PDF files and the data being extracted. Is it financial data, commerce, etc.?

@paulmeloramos4858 8 months ago

Good video. So that you don't suffer with installing libraries, I recommend using Colab; you will avoid the problems you can run into with Jupyter.

@softhints 7 months ago

Thank you very much, my friend.

@Ndofi 4 years ago

Great one, and thanks. I find tabula very practical.

@Ndofi 4 years ago

Why am I receiving this error: No module named 'tabulate', even after I have installed tabula-py?

@softhints 4 years ago

@Ndofi Are you running the same Python version? Can you check the installed packages with pip freeze?
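A note on the error above: `tabulate` is a separate package from `tabula-py` (which provides the `tabula` module), so installing one does not install the other. A small check like the following, using only the standard library, shows which modules the current interpreter can actually see. The function name `missing_modules` is made up for this sketch.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found by the running interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # An empty list means both modules are importable from this Python process.
    print(missing_modules(["tabula", "tabulate"]))
```

If `tabulate` is reported as missing, installing it with `pip install tabulate` in the same environment as the notebook kernel should resolve the import error.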

@sourabhgadre9953 4 years ago

JavaNotFoundError: `java` command is not found from this Python process. Please ensure Java is installed and PATH is set for `java`. Can someone please help with this error, which occurs when I try to import a PDF?

@softhints 4 years ago

Do you have Java on your machine? If not, you can check:
blog.softhints.com/ubuntu-18-check-install-java-jdk/
blog.softhints.com/how-to-install-oracle-jdk-ubuntu-18-in-2020/

@GururajSapkal 1 year ago

@softhints The above links are not valid anymore. Can you suggest an alternative?

@softhints 1 year ago

@GururajSapkal Hey, the links should be:
softhints.com/how-to-install-oracle-jdk-ubuntu-18-in-2020/
softhints.com/ubuntu-18-check-install-java-jdk/

@GururajSapkal 1 year ago

@softhints Thanks for the prompt reply!
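Since tabula-py runs a Java program under the hood, the error above means the Python process cannot find a `java` executable on its PATH. A quick way to check from the same notebook or script, using only the standard library (the function names are illustrative):

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the `java` executable on PATH, or None if it is not found."""
    return shutil.which("java")


def java_version():
    """Return the output of `java -version`, or None when Java is not on PATH."""
    java = find_java()
    if java is None:
        return None
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    # `java -version` normally writes to stderr, so check that stream first.
    return (result.stderr or result.stdout).strip()
```

If `find_java()` returns `None` inside Jupyter but `java -version` works in a terminal, the notebook kernel is probably running with a different PATH than the shell.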

@WoW_Chillies 2 years ago

How do I get the area parameters? Please guide.

@softhints 2 years ago

I didn't find a good solution for automating the area parameters. Cheers :)

@kalairajm8199 3 years ago

Bro, read_pdf is not defined, please help.

@Al-Ahdal 3 years ago

I used tabula and successfully read the PDF, but the output is not coming as a DataFrame. Could you please help?

@softhints 3 years ago

Hi, what do you get as an output?

@Al-Ahdal 3 years ago

@softhints Something else, not a DataFrame. It may be an object, because when I copy-pasted it into Excel it all came in one column; the data isn't parsed, and I guess a regex is required.

@softhints 3 years ago

@Al-Ahdal Can you paste the result of type(df) for the resulting variable?

@MuhammadUsman-ix6jo 1 year ago

How to extract a table from an unstructured PDF file?

@DeepChamuah 4 years ago

I have imported the 'food calories list' PDF, but I'm unable to see it as a DataFrame. The type() function returns list. Any idea?

@softhints 4 years ago

Can you show the list and your code, please?

@oOReflexiveOo 4 years ago

@softhints Me too. When I use type() the result is a list with length 1. Please help.

@softhints 4 years ago

@oOReflexiveOo Can you share your code, please? Can you check what is in your list with df[0]? For me the code is working fine. Greetings

@chibuzorahumaraeze418 1 year ago

My df is not behaving like it should. It is parsed as a list instead.

@softhints 1 year ago

Do you mean that your data is stored in a single column as a list of lists? If so, you can check this link: datascientyst.com/normalize-json-dict-new-columns-pandas/ If you mean that the data is extracted as a list of DataFrames, you can access them by index: [0], etc. Cheers

@chibuzorahumaraeze418 1 year ago

What I mean is that when I extract it, it shows, but it doesn't seem to be a pandas DataFrame. For example, it is not recognising my columns, and the way it displays the data is just wrong. Hence, it doesn't let me call dropna() like you did in your video. It raises an AttributeError: 'list' object has no attribute 'dropna'.

@softhints 1 year ago

@chibuzorahumaraeze418 Did you try to access the elements of this list by index? What is the result of result[0]?
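Several commenters hit the same thing: as far as I know, newer tabula-py versions return a list of DataFrames from `read_pdf` by default, so methods such as `dropna()` have to be called on an element of that list, not on the list itself. The sketch below is plain Python and works for either return type; `as_table_list` and `pick_table` are hypothetical helper names, not part of tabula-py.

```python
def as_table_list(result):
    """Wrap a single table in a list; return a list (or tuple) of tables as a list."""
    return list(result) if isinstance(result, (list, tuple)) else [result]


def pick_table(result, index=0):
    """Return one table from whatever read_pdf produced."""
    tables = as_table_list(result)
    if not tables:
        raise ValueError("no tables were extracted")
    return tables[index]


# Intended use (not executed here, requires tabula-py and Java):
#   tables = read_pdf("Food Calories List.pdf", pages=1)
#   df = pick_table(tables)      # a pandas DataFrame
#   df = df.dropna()
```

When a page yields several tables, `pick_table(tables, index=1)` and so on select the others; `len(as_table_list(tables))` tells you how many there are.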

@SatvikSrivastava-js6gm 7 months ago

Hi, tabula keeps crashing in my Jupyter notebook: "The kernel appears to have died. It will restart automatically." Has anyone else faced this problem?

@priyankajain9859 3 years ago

Is there any way to extract only the HEADET of a table?

@softhints 3 years ago

Do you mean the header of the table?

@priyankajain9859 3 years ago

@softhints Yes, sorry for the typing mistake. Also, apart from this topic, do you know any algorithm for graph detection?

@softhints 3 years ago

@priyankajain9859 If you extract the table as a DataFrame, you can get only the header with: df.columns
For graphs you can check:
pypi.org/project/python-graph/
pypi.org/project/cydets/
pypi.org/project/graph-theory/2020.1.14.58965/

@priyankajain9859 3 years ago

@softhints Thank you.

@ashu60071 4 years ago

I tried extracting a table from a PDF and all I am getting is NaN values. Why?

@softhints 4 years ago

Hi, what is the table that you are trying to extract? Is anything extracted besides the NaN values?

@ashu60071 4 years ago

@softhints Where shall I send you the PDF?

@ashu60071 4 years ago

Can you help me automate a character captcha, please?

@softhints 4 years ago

@ashu60071 I don't have experience with captcha. You can check: pypi.org/project/captcha/ The email is in the About section.

@JM-fr9bc 2 years ago

Hi, what do you do if your table spans multiple pages?

@softhints 2 years ago

I added a few tips in the comments below. In general it depends on the case.

@ukaszpawlak4854 5 years ago

Thank you for the tutorial.

@softhints 5 years ago

Glad to hear that. I'm planning several similar tutorials related to web data and APIs.

@softhints 2 years ago

*Update 2022* For complex tables with merged cells and bad formatting please try: datascientyst.com/extract-table-from-pdf-with-python-pandas/

@MrPalak01 5 years ago

Fantastic tutorial. How do I extract the same table when it spans multiple pages? How do I tell that Table 1 has ended and Table 2 has started?

@softhints 5 years ago

Hi and thanks. I guess the answer will depend on the data and tables that you have. For example, you can try to distinguish headers from values by some property. In this example it would be: energy content.
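One way to read the suggestion above: if every page repeats the same header row, the pages can be stitched together by keeping the first header and dropping later copies of it. The sketch below works on plain lists of rows and makes that assumption explicit; it is only one possible approach, and `stitch_pages` is a hypothetical name. A header row that differs from the first one would be a hint that a new table has started, which this sketch does not handle.

```python
def stitch_pages(pages):
    """Merge per-page row lists into (header, body), dropping repeated header rows.

    Assumes the first row of the first page is the header and that later pages
    repeat exactly the same header row.
    """
    header = None
    body = []
    for rows in pages:
        for row in rows:
            if header is None:
                header = row          # first row seen becomes the header
            elif row == header:
                continue              # repeated header on a following page
            else:
                body.append(row)
    return header, body
```

With DataFrames, the equivalent idea would be to concatenate the per-page frames (for example with `pandas.concat`) after removing rows equal to the column names.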

@CuriousMindCenter 1 year ago

Does tabula require that the PDF be tagged?

@marioustxexcel6375 2 years ago

Thank you so much. Did you compare with pdftools from R? I normally use PyPDF2, but sometimes the scripts are cumbersome to troubleshoot for complex tables in which the layout might change within the same document.

@softhints 2 years ago

No, I didn't. Maybe in the future I will. Thank you for the idea. Cheers :)

@rehanadgrt 7 months ago

Facing the same issue. How did you handle it?

@ScoutKnows 5 years ago

Hi, can you help with this one?
from tabula import wrapper
from tabulate import import tabulate
df = read_pdf("C:/Users/Othmane/Desktop/acs800.pdf")
Output: . . AttributeError: 'list' object has no attribute 'read'

@softhints 5 years ago

Your import is wrong. It should be:
from tabula import read_pdf
from tabulate import tabulate

@manfyegoh 5 years ago

You imported it wrongly; you should use: from tabula import read_pdf

@ajithkumar-ho9xm 4 years ago

Is it possible to change a particular image and content in the PDF?

@softhints 4 years ago

It depends on the PDF file and version. Is it stored as text or as a single image?

@Anonymouscrow-g9m 3 months ago

I have multiple tables in a single PDF page.

@goutamghosh1514 5 years ago

Thanks for this video. But Camelot is not working in an AWS Lambda function. Can you help me out if you have any knowledge?

@softhints 5 years ago

To be honest, I don't have experience with Camelot and AWS Lambda. Is there an error message, or what is happening? Is it possible to debug and check where the problem is, or work with logs?

@goutamghosh1514 5 years ago

@softhints Thanks for your update. It is showing "make sure Ghostscript to be installed", but this dependency is already in the AWS Lambda layer.

@softhints 5 years ago

I was trying to find more information on the problem but wasn't able to. Have you made any progress on it?

@pixere1360 5 years ago

Can we do the same thing with Python OCR (pytesseract)? If possible, can you handle both tabular data and text data, like invoices and bills, etc.?

@softhints 5 years ago

Yes, I think that it's possible to combine both. I can do a video about this in the future.

@ankan8399 5 years ago

@softhints Can you please upload this tutorial as soon as possible?

@vivekasthana12345 5 years ago

Thank you for such a good explanation. :) I am working on something similar, but the tables in the PDF are in image format (not tabular). Can you please suggest any blog or video where I can get some help? Currently I am trying to work with pytesseract, but it seems there are a lot of dependencies I need to install and it's not straightforward. Thanks

@softhints 5 years ago

Hi, I can try to solve your problem if you share more details with me. You can contact me by Facebook, for example: facebook.com/Softhints/. I have an article about extracting text from images and how you can optimize the OCR: blog.softhints.com/python-extract-text-from-image-or-pdf/

@mathpix2143 4 years ago

You can use Mathpix Snip to digitize images of tables into TSV to paste into any spreadsheet! Here's a link: mathpix.com

@aiworksvelocityit4227 5 years ago

I have been able to output as JSON, but how do you output as a CSV file?

@softhints 5 years ago

You can use this: df.to_csv()

@crazybauns 2 years ago

I can't make tabula work. It says the file path is incorrect and the file doesn't exist, but the path is correct and the file does exist. Any ideas?

@softhints 2 years ago

Can you try in a virtual environment? What does it say if you try: pip show tabula-py
softhints.com/how-to-check-package-version-in-python/
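For "file does not exist" errors like the one above, it can help to resolve the path from inside Python before handing it to tabula, because a notebook's working directory is often not the folder you expect. A minimal standard-library sketch; `resolve_pdf` is a hypothetical helper name.

```python
from pathlib import Path


def resolve_pdf(path):
    """Return the absolute path of an existing file, or raise with a helpful message."""
    pdf = Path(path).expanduser().resolve()
    if not pdf.is_file():
        raise FileNotFoundError(
            f"{pdf} not found (current working directory: {Path.cwd()})"
        )
    return str(pdf)
```

The returned absolute path can then be passed on, e.g. `read_pdf(resolve_pdf("Food Calories List.pdf"), pages=1)`; if the helper raises, the message shows where Python was actually looking.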

@aiworksvelocityit4227 5 years ago

Hello, can someone please give me guidance on how to get the area? Can I provide more than one area? And what is 'guess', as shown in the tutorial? Thank you.

@softhints 5 years ago

You can have a look here: stackoverflow.com/questions/45457054/tabula-extract-tables-by-area-coordinates Unfortunately, I did some tests and it wasn't working as expected in the past, or I did something wrong. Maybe you can share an example (if possible) and I can do some tests. Cheers

@aiworksvelocityit4227 5 years ago

Yes, it does work. You have to use the measure tool in Adobe Acrobat DC, measure your object (e.g. a table), and put the values in the code in the format y1, x1, y2, x2. Hope this makes sense and is helpful.
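To make the y1, x1, y2, x2 tip concrete: tabula-py documents `area` as (top, left, bottom, right), measured in PDF points by default. If the measuring tool reports inches (an assumption here; check which unit your tool is set to), the values need converting at 72 points per inch. `area_from_inches` is a hypothetical helper for that conversion.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def area_from_inches(top, left, bottom, right):
    """Convert distances measured from the top-left page corner (in inches) to points."""
    return [round(value * POINTS_PER_INCH, 3) for value in (top, left, bottom, right)]
```

For example, `area_from_inches(3.75, 0.18, 10.98, 13.35)` gives a list that can be passed as `area=...` to `read_pdf` together with `guess=False`, so that tabula uses exactly that rectangle.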

@aiworksvelocityit4227 5 years ago

Hello, I am using the tabula method shown in your video, but how do I make it use the lattice method rather than stream? What is the code for it and where is it placed? Thank you.

@softhints 5 years ago

I think that you can do it in this way:
df = read_pdf("./tmp/pdf/Food Calories List.pdf", encoding = 'ISO-8859-1', stream=True, area = [269.875, 12.75, 790.5, 961], pages = 4, guess = False, pandas_options={'header':None})
or
df = read_pdf("./tmp/pdf/Food Calories List.pdf", encoding = 'ISO-8859-1', lattice=True, area = [269.875, 12.75, 790.5, 961], pages = 4, guess = False, pandas_options={'header':None})

@aiworksvelocityit4227 5 years ago

@softhints Thank you, and how do I get the area?

@aiworksvelocityit4227 5 years ago

Hi, the output from the localhost Tabula UI and the output from my Tabula code (from your tutorial) are different. When I put my PDF through the Tabula software UI, the output is perfect, but when it goes through my code, the output is incorrect. Both use the same extraction technique (lattice), so I am not sure where I am going wrong. I have pasted my code below, and I would appreciate it if you could guide me further.
df = read_pdf('filename.pdf', pages="1", output_format="csv", encoding = 'ISO-8859-1', lattice=True, area = [280.022,35.328,467.447,564.878], guess = False, pandas_options={'header':None})

@aiworksvelocityit4227 5 years ago

I think I have worked out that I need multiple_tables in the code, but then when I go to create a CSV or JSON file, it shows the errors "AttributeError: 'list' object has no attribute 'to_csv'" and "TypeError: Object of type 'DataFrame' is not JSON serializable" respectively. Any ideas how to proceed from here? I have looked on Stack Overflow, but it does not provide solutions to my problem.

@softhints 5 years ago

@aiworksvelocityit4227 Can you provide example data from your DataFrame, for example your first 5 records, with: df.head().to_json() or, in case of error: df.head().values

@aiworksvelocityit4227 5 years ago

@softhints I tried the code you gave me and it gave this error: "'list' object has no attribute 'head'". I am not sure where to go from here. I have found this link www.pydoc.io/pypi/tabula-py-0.9.0/autoapi/wrapper/index.html but I am not sure how to use it for the code, as I am still a beginner. Can we be in touch via email? It would be easier to send screenshots and share the necessary files. Thank you.

@softhints 5 years ago

@aiworksvelocityit4227 Can you print the df object and share it? It seems that you don't have a DataFrame but a list. You can create a DataFrame with: pd.DataFrame(data=d) More here: pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

@aiworksvelocityit4227 5 years ago

@softhints Hi, how do I share it with you? YouTube comments do not allow sharing images. I tried your code and it prints the DataFrame, but that's without multiple_tables=True in the code, and when I save it as a CSV file, the output formatting is incorrect. When I put multiple_tables=True in the code, the DataFrame prints, but it is a very small 1x1 table with no data, and I cannot save it as a CSV file either, as it reports issues with the list (same error as before). What do I do about this? Is there some way I can share images and get help? I appreciate your time helping me out. Thanks
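The two errors in this thread have the same cause: with `multiple_tables=True`, `read_pdf` returns a list of DataFrames, and a list has no `to_csv` or `head`. One hedged way around it is to write each table in the list to its own file. The helper below only assumes that every item has a pandas-style `to_csv(path, index=False)` method; `save_tables` and the file-name pattern are made up for this sketch.

```python
def save_tables(tables, prefix="table"):
    """Write each table to '<prefix>_<n>.csv' and return the list of file names."""
    paths = []
    for number, table in enumerate(tables, start=1):
        path = f"{prefix}_{number}.csv"
        table.to_csv(path, index=False)  # DataFrame.to_csv in the pandas case
        paths.append(path)
    return paths
```

For JSON, calling `to_json()` on each DataFrame avoids the "not JSON serializable" error, which typically appears when a DataFrame is passed to `json.dumps` directly.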

@jihadbourassi8341 4 years ago

Thank you for the tutorial. Can it work on scanned PDF files?

@softhints 4 years ago

It should work, but it depends on the case. I had some problems with scanned PDFs for invoices.

@yasminekarray4530 3 years ago

@softhints Do you have another solution for invoice images?

@raghvendra87 5 years ago

Hi. Thanks for this. Really helpful. Does it work for all languages, for example tables that contain Japanese text?

@softhints 5 years ago

Yes, normally it should work with different encodings. You can specify the one you need with:
df = read_pdf("./tmp/pdf/Food Calories List.pdf", encoding = 'ISO-8859-1')
You can also check this (there is an example for wiki and Chinese):
kzbin.info/www/bejne/hYmkkI16ZsyFbKM
github.com/softhints/python/blob/master/notebooks/Scrape%20wiki%20tables%20with%20pandas%20and%20python.ipynb

@txreal2 5 years ago

How can I specify a page range using Tabula? Thanks for sharing.

@softhints 5 years ago

Nice question! You can specify the page range using a string (pages 1 to 3):
df = read_pdf("./tmp/pdf/Food Calories List.pdf", pages='1-3')
result in: 69, 5
You can build the string from parameters in this way:
pages = (str(1) + '-' + str(3))
df = read_pdf("./tmp/pdf/Food Calories List.pdf", pages=pages)
Another possible option is to pass a list of pages:
df = read_pdf("./tmp/pdf/Food Calories List.pdf", pages=[1,2,3])
To generate the pages as a list you can do:
pages = list(range(1, 4))
df = read_pdf("./tmp/pdf/Food Calories List.pdf", pages=pages)
because range is exclusive at the end.
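The two forms from the reply above can be wrapped in small helpers so the off-by-one in `range` is handled in one place. This is just a convenience sketch; the function names are invented, and both return values match what the reply passes to `pages=`.

```python
def page_range_string(first, last):
    """Build a 'first-last' page string, e.g. '1-3' for pages 1 to 3."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return f"{first}-{last}"


def page_range_list(first, last):
    """Build an inclusive list of page numbers; range() itself excludes its end value."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first, last + 1))
```

Either result can be used directly, for example `read_pdf(path, pages=page_range_string(1, 3))`.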

@softhints 5 years ago

This is the source of the answer: pypi.org/project/tabula-py/

@txreal2 5 years ago

@softhints Thanks! I appreciate your GitHub and other links. If you don't mind: to save the DataFrame as CSV in Jupyter Notebook, would I change to output_format="csv"? What's your experience with the tabula app on Windows 10 vs tabula-py; which gives better table output for more complex PDFs like the McKinsey one above? The app gave me a better-organized table than the Python method, but I only tried one type of table. I found "A recent update of tabula-py" by Aki Ariga, Feb 17, 2019. Would this help with your formatting issues above? blog.chezo.uno/a-recent-update-of-tabula-py-a923d2ab667b Keep up the good work.

@softhints 5 years ago

@txreal2 Thanks a lot for the info, I'll check it. I don't have much experience with the tabula app, but I can check it and test the McKinsey PDF after this update. About your question on saving the DataFrame as CSV in Jupyter Notebook: you can save the DataFrame as CSV with:
df.to_csv(file_name, sep='\t')
and then download the file with:
from IPython.display import FileLink, FileLinks
FileLinks('.')  # lists all downloadable files on server
More info on downloading here: blog.softhints.com/python-jupyter-save-download-file/

@txreal2 5 years ago

@softhints Hi Ivan, I appreciate the quick reply and the extra info. This should help me get an A in my Basic Python college class :) Hope you can use a small donation.

@pranjalgupta9427 3 years ago

Thanks ❤

@001Debjeet 4 years ago

I am getting HTTP Error 404: Not Found when I read the PDF directly from the given link. I have already installed all the packages; the others are working, but this one is not and throws an error.

@softhints 4 years ago

Maybe there is some redirection or anti-bot protection on the page. Can you check these options? Another thing: you can try the same script on a different machine. The last resort would be to check this: github.com/tabulapdf/tabula/issues/521

@PallatiCharan 5 years ago

How to extract tabular data from scanned table images?

@softhints 5 years ago

You can check this video about extraction with improved OCR: kzbin.info/www/bejne/pKOpkIWdnZ1rpNE

@taneryilmaz6171 4 years ago

Thank you for this tutorial. I wonder, can we extract a mathematical graph from a PDF to Excel data automatically? Thank you in advance.

@softhints 4 years ago

I guess it depends on the PDF format and the graph itself. Do you have an example? Cheers

@spamtiu1292 5 years ago

Can I use this in an Android app?

@softhints 5 years ago

This is a very interesting question. To be honest, I'm not sure about this. From a technical point of view you can write such an application in Java or Python, and both can work with Android, but I'll try to do a test in the future and let you know. If you do the test before me, please share the results, or any errors related to it. Thanks

@mathpix2143 4 years ago

Mathpix has an Android app that can do this, you can see for yourself here: play.google.com/store/apps/details?id=com.mathpix.snip

@Nimitz_oceo 4 years ago

Hi, first I want to thank you for the wonderful tutorial. I have a similar problem, except I'm dealing with financial statements. I would like to be able to extract the information in the form of a dictionary and write it to a CSV file. Can you help with how to implement this particular solution? Thanks in advance.

@softhints 4 years ago

Once you have a DataFrame (in this case df) you can save it as:
* CSV, with df.to_csv - pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
* JSON, with df.to_json - pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html
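For the "dictionary first, then CSV" workflow asked about above, here is a small sketch using only the standard library. It assumes the extracted rows are already a list of dictionaries, which is what `df.to_dict("records")` returns for a pandas DataFrame. The function name and the sample field names are illustrative, not taken from the video.

```python
import csv
import io


def records_to_csv(records, fieldnames=None):
    """Serialize a list of dictionaries to CSV text, using the first record's keys as columns."""
    if fieldnames is None:
        fieldnames = list(records[0]) if records else []
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

The returned text can be written to disk with a normal `open(path, "w", newline="")` call; when a DataFrame is already available, `df.to_csv(path)` from the reply above does the same job in one step.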

@aiworksvelocityit4227 5 years ago

Sir/Madam, you have been so helpful with this. I have got the data, but do you know how to put the extracted data obtained by this model into a SQL database?

@softhints 5 years ago

You can check these videos and the comments below them:
kzbin.info/www/bejne/jZO6YaV-eL1li7c
kzbin.info/www/bejne/noa7eIStibiZg9U
or this article: blog.softhints.com/python-3-convert-dictionary-to-sql-insert/

@aiworksvelocityit4227 5 years ago

@softhints Will the above tutorials work with SQL Server? I do not have MySQL, so I need the tutorials to work with SQL Server only. Thank you for your time.

@softhints 5 years ago

@aiworksvelocityit4227 Yes, the generated SQL code can be loaded into SQL Server, Oracle, or any other database.
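As an illustration of loading extracted rows into a database, the sketch below uses the standard-library `sqlite3` module as a stand-in; it is not SQL Server, and the linked tutorials may take a different approach. The same parameterized-insert pattern should carry over to other DB-API drivers, though the connection setup and placeholder style depend on the driver. `insert_rows` and the table and column names are hypothetical, and identifiers are assumed to come from trusted input.

```python
import sqlite3


def insert_rows(conn, table, rows):
    """Create the table if needed and insert a list of same-keyed dictionaries."""
    columns = list(rows[0])
    column_sql = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_sql})')
    conn.executemany(
        f'INSERT INTO "{table}" ({column_sql}) VALUES ({placeholders})',
        [tuple(row[name] for name in columns) for row in rows],
    )
    conn.commit()
```

With a DataFrame, `rows` could come from `df.to_dict("records")`; pandas also offers `DataFrame.to_sql` when a suitable database connection is available.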

@alejandrogg8633 5 years ago

@softhints Wow, you're a monster... When do you sleep if you are replying to all your video comments? Anyway, thank you for the great video content; it was very nicely put and effectively explained.

@softhints 5 years ago

@alejandrogg8633 Thanks :) I'm trying to do my best. In general I try to sleep 8 hours when possible, but this is not always possible :) Now I'm reading an interesting book: Deep Work www.amazon.com/Deep-Work-Focused-Success-Distracted/dp/1455586692 Actually, more like listening to it, which I hope helps me change my habits for the better :) Cheers

@JM-fr9bc 3 years ago

Thank you for a great video. Is there a way to extract a specific table from a PDF that contains many?

@softhints 3 years ago

I think it depends on the PDF format, the pages, and the table:
- is it on a specific page?
- is it in a specific area?
You can try a combination of both: stackoverflow.com/questions/45457054/tabula-extract-tables-by-area-coordinates or search for a given string. Cheers

@hayathbasha4519 3 years ago

Hi, I have a large PDF that Camelot takes a lot of time to read. Is it possible to read one page at a time? Thanks

@softhints 3 years ago

You can set the pages with: camelot.read_pdf('your.pdf', pages='1,4-10,20-end')

@myanch200 5 years ago

Are you from Bulgaria?

@softhints 5 years ago

Yes :)

@AmitSharma-po1zb 5 years ago

Hi. If we need to extract a table from a PDF document only when the page contains a keyword, how do we do it?

@softhints 5 years ago

You can use something like:
from urllib.request import urlopen
lines = urlopen(link).read().decode().splitlines()
for line in lines:
    if "keyword" in line:
        print(line)
or, more advanced: tutorialedge.net/python/calculating-keyword-density-python/
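For PDFs specifically, a slightly different sketch of the same idea is to find the page numbers whose text contains the keyword first, and then extract tables from those pages only. The helper below is pure Python and takes the page texts as input; how you obtain them is up to you (one option, assuming PyPDF2 3.x, is `[page.extract_text() for page in PdfReader(path).pages]`). The name `pages_with_keyword` is made up for this example.

```python
def pages_with_keyword(page_texts, keyword):
    """Return 1-based page numbers whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in (text or "").lower()  # treat missing text as empty
    ]
```

The resulting list can be passed as `pages=` to tabula's `read_pdf`, as long as it is not empty.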

@angeloj.willems4362 5 years ago

CalledProcessError: Command '['java', '-Djava.awt.headless=true', '-Dfile.encoding=UTF8', '-jar', '/anaconda3/lib/python3.7/site-packages/tabula/tabula-1.0.3-jar-with-dependencies.jar', '--pages', '1', '--guess', 'Annual report 2014.pdf']' returned non-zero exit status 1.

@softhints 5 years ago

This seems like a problem related to the file or the file path. Are you using Windows? Is the file in the same folder as the notebook?

@zeeshanhabib3181 5 years ago

Hey, it's useful, but I cannot get the source code to start. Can you help me with how to start it?

@softhints 5 years ago

What is the problem that you have? You have all the steps described in the notebook: github.com/softhints/python/blob/master/notebooks/Python%20Extract%20Table%20from%20PDF.ipynb In order to run the notebook you need to start the Jupyter server with: jupyter notebook
Then upload the notebook and run the cells.

@hematogen50g 3 years ago

I can read docs myself.

@zeeshanhabib3181 5 years ago

Can you tell me the steps, please?

@softhints 5 years ago

@beibarsiran9318 4 years ago

Why don't you post the same thing in Russian? The material is top-notch.

@softhints 4 years ago

Thank you! I don't speak Russian very well. I can understand a bit, but it's difficult to speak or write.

@nabilahhannani2326 5 years ago

Hello sir, thanks for sharing. Can I send you some questions about this via email? Thank you.

@softhints 5 years ago

Sure, buddy.

@nabilahhannani2326 5 years ago

@softhints Thank you, what is your email? :)

@softhints 5 years ago

@nabilahhannani2326 You can find it on this page: kzbin.info/door/g5rvP_D735oSBatdcH5ZFAabout?view_as=subscriber under Details, "For business inquiries: View email address".

@nabilahhannani2326 5 years ago

@softhints Thank you :) I already sent my email.

@softhints 5 years ago

@nabilahhannani2326 I haven't received an email from you. Anyway, you can also ask here: facebook.com/Softhints/

@devpriyashivani7400 5 years ago

Very blurry video.

@softhints 5 years ago

At what resolution are you watching it?

@devpriyashivani1855 5 years ago

@softhints Standard laptop screen.

@devpriyashivani1855 5 years ago

@softhints However, I got the solution from the GitHub link provided.

@devpriyashivani1855 5 years ago

@softhints I need some more help: two columns are getting merged while reading the file. Is there any solution for it?

@softhints 5 years ago

@devpriyashivani1855 Which two columns are merged? Can you give the line of code and the result (at least the DataFrame columns and one row)? Thanks