Extract tabular data from PDF with Camelot Using Python

50,077 views

Comments: 56

@frankdu7364 5 years ago

Hi guys, it seems this video is gaining some traction. If you'd like to support this channel, please consider watching my other tutorials as well: frankdu.co/youtube. Thank you so much.

@artdoneus 4 years ago

By far the most useful and clear-cut video that I've seen on this topic. Thank you for your efforts!

@jonathanfriz4410 3 years ago

Hi, very good video. I don't remember if you mentioned this: Camelot won't work with image-based PDFs, only with text-based PDFs (so a PDF that comes from scanned paper won't work). It also extracts only the tables, not the surrounding text. On OSX, with a text-based PDF you can very likely open it in Quick Look and just copy and paste; that works in a bunch of cases. For image-based PDFs I try easyocr and pdf2image.

@AmitKumar-dt7sz 2 years ago

Extremely helpful video. Thanks for sharing

@asishraz6173 4 years ago

Very helpful video, I must say. Thank you for sharing it with us. I just wanted to ask: does the Camelot package not work on scanned images or scanned PDFs? Please let me know if you know a solution. I have tried many approaches but have not been able to extract table data from a scanned image or PDF.

@Airsoftcan737 5 years ago

Would it be possible to extract only specific tables? For example, you have several PDFs and you want to extract the one table that has the information you need. Thanks.

@madhurisree1687 4 years ago

Hi, I want to extract an invoice PDF file to CSV or Excel. How can I do that? Please reply. Thank you.

@Htyagi1998 2 years ago

You can use layout ml

@sathwikameenabad9789 4 years ago

read_pdf() is not working for me. Can you please help me with that? The error is: "Please make sure that Ghostscript is installed". I installed Ghostscript and also added it to the path. Please help me with this.

@jorgemayorga7600 3 years ago

I'm having the exact same issue. Did you find a solution?

@satyamgupta1105 5 years ago

It only parses PDFs whose tables have separation lines. Is there any other library that can parse tables in PDFs that have no separation lines?

@alaue 4 years ago

Thank you, this video helped me a lot.

@torrentinocom 4 years ago

Hi! How can I also get the titles of the tables, which actually lie outside the table (above it, on the left side)?

@khanabbas4608 4 years ago

Sir, for ghostscript, do I need to download both GNU and Artifex, or just one? Many thanks!

@tlrlutz 4 years ago

I am following the instructions provided by Camelot. When I check the version of Ghostscript (gswin64c.exe -version) on my command line, my PC says "This app can't run on your PC. To find a version for your PC, check with the software publisher", and then the command prompt says "Access is denied". Any solutions?

@mmgwengi 4 years ago

Can you extract a specific table from a page that has multiple tables?

@sadeksaci1247 1 year ago

How do I process a PDF file with multiple pages, please?

@hayathbasha4519 3 years ago

Hi, I have a large PDF that Camelot takes a long time to read. Is it possible to read one page at a time?

@DRocksRecords 4 years ago

Thank you very much

@akshayakmahanand3632 4 years ago

I have a PDF with multiple tables in it. I am using the "for table in tables" syntax but getting the error "IndexError: list index out of range".

@ayushi896 5 years ago

Hi, how can we read tables that have no borders or lines defined? Any ideas?

@AltafKhan-pm3lk 4 years ago

Did you get any answers or solutions for this?

@ananthsireesh 4 years ago

Camelot has two flavors. By default it uses "lattice", which works for tables whose cells are separated by lines, but you can also use the "stream" flavor, which works for tables whose cells are separated by whitespace. You can refer to the documentation.
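
A minimal sketch of the two flavors described in this comment, assuming Camelot is installed. The helper name read_tables_with_fallback and its reader parameter are made up for illustration; camelot.read_pdf and its pages and flavor arguments are the library's own. It tries "lattice" first and falls back to "stream" when no tables are found:

```python
def read_tables_with_fallback(path, pages="1", reader=None):
    """Try Camelot's "lattice" flavor first, then fall back to "stream".

    `reader` is a hypothetical hook used only for illustration and testing;
    by default it is camelot.read_pdf.
    """
    if reader is None:
        import camelot  # imported lazily so the helper loads without Camelot
        reader = camelot.read_pdf

    for flavor in ("lattice", "stream"):
        tables = reader(path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            # Report which flavor produced a result along with the tables.
            return flavor, tables
    return None, []


# Example call (needs a real, text-based PDF; "report.pdf" is a placeholder):
# flavor, tables = read_tables_with_fallback("report.pdf", pages="1")
# print(flavor, tables[0].df.head())
```

Whether "stream" gives usable output depends heavily on the PDF layout, so the result is worth inspecting before exporting it.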

@ashu60071 3 years ago

I am trying to extract a table from a PDF as you showed, but the contents are not coming through. Only the structure of the table is extracted, not its contents.

@lidory98 3 years ago

How do I get rid of the first row of indexes?

@MikeAkinyemi 5 years ago

Hi, when I run the program, I get the error RuntimeError('Please make sure that Ghostscript is installed'). I am sure Ghostscript is installed. I use Windows 10.

@mikequest4620 5 years ago

Set the path of Ghostscript.

@sreigurushyam 5 years ago

Hi, can I get the table title as well? If yes, what should I do to get it?

@frankdu7364 5 years ago

Hi, thanks for your question! It seems Camelot won't be very handy for such a job. Camelot excels at extracting pure tabular data, but it looks like you want to extract text from the content. Maybe the Python module PyPDF2 is something you're looking for? Let me know. Thanks. Frank

@artoke84 4 years ago

Hi, is it strictly necessary to install the Pandas library, or is Camelot enough?

@frankdu7364 4 years ago

Hi David, Pandas should be installed as a dependency when you install Camelot.

@hayathbasha4519 3 years ago

Hi, I have a table that starts on page 1 and ends on page 2. Page 1 includes the header and rows; page 2 contains only rows. In that case, how do I extract the page 2 data using Camelot?

@luckysunda9623 4 years ago

Hi, thanks for the video. I am getting no tables for the PDFs I want :(

@billbarron8666 4 years ago

Same here, have you been able to fix this?

@luckysunda9623 4 years ago

@billbarron8666 No. The tables were really complicated in my case; even ABBYY is not able to do a good job there.

@billbarron8666 4 years ago

@luckysunda9623 You need camelotpro.

@ayush_shaz 5 years ago

It's only reading the first page of the PDF. What should I do?

@saurabhrawat5999 5 years ago

Yes, I am also facing the same problem. It's only reading the first page of the PDF. Any suggestions?

@saurabhrawat5999 5 years ago

Try this: pages='1,2' or pages='all'. It worked for me.
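
A small sketch of the pages argument suggested in this comment, assuming Camelot is installed. The helpers page_spec and read_pages (and the reader parameter) are hypothetical names for illustration; pages itself is a real camelot.read_pdf argument that takes a comma-separated string such as '1,2' or the value 'all':

```python
def page_spec(pages):
    """Build Camelot's `pages` string from "all" or 1-based page numbers."""
    if pages == "all":
        return "all"
    return ",".join(str(page) for page in pages)


def read_pages(path, pages="all", reader=None):
    """Read tables from the given pages; `reader` defaults to camelot.read_pdf."""
    if reader is None:
        import camelot  # imported lazily so page_spec works without Camelot
        reader = camelot.read_pdf
    return reader(path, pages=page_spec(pages))


# Example calls ("report.pdf" is a placeholder file name):
# tables = read_pages("report.pdf", pages=[1, 2])   # same as pages='1,2'
# tables = read_pages("report.pdf", pages="all")    # every page
```

Without a pages argument, read_pdf only looks at page 1, which is why only the first page was being read.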

@HemantKumar-iy7dn 5 years ago

When we export all the tables, it creates multiple CSV files. I want one file with merged indexes. Any suggestions?

@jessicalee5175 4 years ago

Hi, Would you have a recommendation if I'm trying to extract a PDF file like a bank statement to CSV or Excel?

@frankdu7364 4 years ago

Hi Jessica, thanks for your comment. So Camelot didn't work out for you? A general approach could be:
1. Use another PDF parser, such as PyPDF2, to extract the raw text.
2. If your text follows a certain pattern, you might be able to parse the raw text line by line (you can do some filtering as well, of course).
3. Write the parsed text to Excel or CSV. There is a plethora of tools you can use: the Python csv module, Pandas, openpyxl, etc.
The challenge here is the PDF parsing part. If you don't mind sharing the file, I can have a look and try to release a new tutorial based on your case. Let me know. Frank
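
The parse-and-export steps of this approach can be sketched with the standard library alone. This is only an illustration under stated assumptions: the raw text is taken as already extracted (for example with PyPDF2), and the line layout "date, description, amount" along with the names ROW_PATTERN, parse_statement_lines and rows_to_csv are invented for the example, not taken from any real bank statement:

```python
import csv
import io
import re

# Hypothetical line layout: "2020-01-05  Coffee Shop  -4.50"
ROW_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+(-?[\d,]+\.\d{2})$")


def parse_statement_lines(raw_text):
    """Keep only lines that match the assumed pattern; skip everything else."""
    rows = []
    for line in raw_text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            date, description, amount = match.groups()
            # Drop thousands separators so the amount parses as a number later.
            rows.append([date, description, amount.replace(",", "")])
    return rows


def rows_to_csv(rows):
    """Serialize the parsed rows to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["date", "description", "amount"])
    writer.writerows(rows)
    return buffer.getvalue()


sample = "Statement\n2020-01-05  Coffee Shop  -4.50\n2020-01-06 Salary 1,200.00\n"
print(rows_to_csv(parse_statement_lines(sample)))
```

A real statement will need its own pattern, and lines that do not match are silently dropped here, so it is worth counting skipped lines when adapting this.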

@jessicalee5175 4 years ago

@frankdu7364 Hi Frank! Thanks so much for replying. The files are mostly client files. I can try to create my own PDF that is similar. Would you have an email address I can send it to?

@frankdu7364 4 years ago

@jessicalee5175 Yes, Jessica. Just send it to robot80053906@gmail.com. I will have a look and create a tutorial about it. Let me know here when you have sent it. Best

@berlusconitripurba2475 4 years ago

@jessicalee5175 Hello Jes. Thank you for asking about this. I have a similar case to yours. Would you mind brainstorming about it? #BankStatement

@DRocksRecords 4 years ago

@frankdu7364 This is a hilarious email address, I love it.