Extract text from Any PDF File (even scanned ones) using OCR pytesseract in 3 SIMPLE STEPS!

23,102 views

Tech With Zoum

Comments: 53
@swetharamshetty109 1 year ago
Hi! Thanks a lot for the extraction. I want to convert a scanned PDF to an editable Word doc. In the above video the accuracy is only 97%.
@techwithzoum 1 year ago
Hi Swetha, you're welcome! Can you please elaborate on your question?
@sarasa971 1 year ago
How do I add another language in the code? Thank you for the great explanation 👏🏼
@dyzy2203 2 years ago
Thanks a lot. The code works smoothly. Nice. Can you find and extract a table from a scanned PDF and save it into a dataframe? Thanks.
@amanrohada9008 2 years ago
Did you find something to extract a table from a scanned PDF?
@harshvardhanmishra1256 3 months ago
Did either of you find that? If so, please help me out with this. I am close to my deadline and have to complete the task.
@kenvinmq 1 year ago
Thank you bro, I’ll try that out
@techwithzoum 1 year ago
You are welcome!
@zsuzsannakristof2117 1 year ago
Hi, can you modify the code so that the new file, in addition to the text, keeps the original page settings and structure of the original PDF? That is, the text should be in the same place as it was in the original PDF.
@techwithzoum 1 year ago
Hi Zsuzsanna, I am not sure I understand your request. Can you please elaborate for better assistance?
@omuskaikar-gs1cs 1 year ago
There is the OCRmyPDF library. With its --force-ocr option, it retains the original format of the PDF.
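A minimal sketch of the OCRmyPDF suggestion above, assuming the ocrmypdf command-line tool is installed and on PATH. The helper names ocrmypdf_command and make_searchable are hypothetical, and the output is a searchable PDF rather than a Word document.

```python
import subprocess


def ocrmypdf_command(src_pdf, dst_pdf, force_ocr=True):
    """Build the OCRmyPDF command line (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if force_ocr:
        # --force-ocr rasterizes every page and runs OCR on it,
        # even if the page already contains text.
        cmd.append("--force-ocr")
    return cmd + [str(src_pdf), str(dst_pdf)]


def make_searchable(src_pdf, dst_pdf):
    """Write a searchable copy of src_pdf; raises CalledProcessError on failure."""
    subprocess.run(ocrmypdf_command(src_pdf, dst_pdf), check=True)
```

Calling make_searchable("scan.pdf", "scan_ocr.pdf") would keep the page images where they were and add a text layer on top, which may be close to what the question asks for.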
@sjohn-777 6 months ago
Thank you!
@techwithzoum 6 months ago
You're welcome!
@RunRonaldRun 1 year ago
Works great, thank you so much.
@techwithzoum 1 year ago
You're very welcome, Charl!
@kibtiachowdhury6011 2 years ago
Thanks a lot. The code works. I want to get paragraphs and titles without any tables or figures. How can I solve this?
@easylife891 1 year ago
Fantastic work!
@techwithzoum 1 year ago
Thank you!
@davisengelis272 1 year ago
Thanks a lot!
@techwithzoum 1 year ago
You're very welcome, Davis!
@hrishishetty9322 3 years ago
Thank you so much for the help!
@techwithzoum 3 years ago
You're welcome! Do not hesitate to suggest ideas for videos!
@cherlynang2965 1 year ago
Does this work on a folder with multiple PDF files?
@techwithzoum 1 year ago
Yes, it does, Cherlynang.
@chepkoechfancy7553 3 years ago
Can this code work with a PDF given as a URL? If so, kindly share the lines of code to handle that.
@ravimakwana5290 1 year ago
Sir, can you make a video on extracting the paragraph under a given title from a PDF?
@techwithzoum 1 year ago
Sure, Ravi! I will explore that!
@sivachaitanya6330 9 months ago
Which version is used in this video? When I run the code, I get a poppler path error, plus Tesseract installation and PATH setting errors on my PC...
@jeyapauldavid5596 5 months ago
"Unable to get page count. Is poppler installed and in PATH?" This error keeps coming up.
@techwithzoum 5 months ago
This may be because your system cannot find the Poppler binaries. Here is how to set it up on a Windows machine:
1. Download the Poppler package from this website: poppler.freedesktop.org/
2. Unzip it in the C:\Program Files (x86) folder.
3. Store the path of its bin folder in a variable named as follows: poppler_path = r"C:\Program Files (x86)\poppler-24.02.0\bin"
I hope this helps.
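As a rough sketch of how the poppler_path variable from the steps above might be used: pdf2image's convert_from_path accepts a poppler_path argument, and the resulting page images can go to pytesseract.image_to_string. The helper names poppler_kwargs and pdf_to_text are hypothetical, and the folder name is copied from the reply, so it will differ with the Poppler version you unzip.

```python
import shutil
from pathlib import Path

# Path copied from the reply above; adjust it to the folder you actually unzipped.
POPPLER_PATH = r"C:\Program Files (x86)\poppler-24.02.0\bin"


def poppler_kwargs(poppler_path=POPPLER_PATH):
    """Build the extra keyword arguments for pdf2image (hypothetical helper)."""
    if poppler_path and Path(poppler_path).is_dir():
        # Point pdf2image at the unzipped bin folder explicitly.
        return {"poppler_path": str(poppler_path)}
    if shutil.which("pdfinfo"):
        # Poppler is already on PATH, so no extra argument is needed.
        return {}
    raise FileNotFoundError(
        "Poppler not found: fix poppler_path or add its bin folder to PATH"
    )


def pdf_to_text(pdf_file):
    """OCR every page of one PDF and return the combined text."""
    # Imported here so the path check above works before the packages are installed.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_file, **poppler_kwargs())
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```

If the folder does not exist and Poppler is not on PATH, the helper fails early with a readable message instead of the "Unable to get page count" error.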
@mohammednisar1458 3 years ago
PDFPageCountError: Unable to get page count.I/O Error: Couldn't open file 'C:\Users\Naseer\Desktop\OCR-main\data\First Cry Image.pdf': No error.
@avinashkrishna8695 1 year ago
I'm getting an error: "Output exceeds the size limit. Open the full output data in a text editor"
@techwithzoum 1 year ago
Hi Avinash, can you tell me which line the error occurs on?
@jardanijonovich1951 1 year ago
Hi, I came across your video after multiple failed attempts at converting my file. Can I somehow ignore the headers and footers? Also, I have bullet points in my documents and some of them continue on the next page; how do I take care of that? Thanks in advance!!
@shainialakumbura5829 3 years ago
PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH? Why am I getting this error?
@kiranvanukuri9382 3 years ago
Are you loading many PDFs at a time?
@mallikarjunyadav591 2 years ago
I am getting the same error.
@emanuelcalderon2912 6 months ago
Run brew install poppler on a Mac, or install Poppler some other way on Windows.
@QorQar 1 year ago
Could you give an example of how to use the code, where to put it, and how to run it?
@avbendre 1 year ago
Failed to activate VS environment: Could not find C:\Program Files (x86)\Microsoft Visual Studio\Installer\vswhere.exe. Is there any solution to the above error? Please tell me.
@techwithzoum 1 year ago
Can you please refer to this discussion on Stack Overflow? It might be similar to what you are facing: stackoverflow.com/questions/54305638/how-to-find-vswhere-exe-path
@avbendre 1 year ago
@techwithzoum Thank you, the error was resolved when I added the paths of poppler and pytesseract to the system variables and installed pytesseract.exe.
@techwithzoum 1 year ago
@avbendre Congratulations!
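The fix reported in this thread came down to the external programs being reachable on PATH. A small sketch for checking that before running the OCR code, assuming the usual executable names (tesseract for pytesseract, and pdfinfo and pdftoppm from Poppler for pdf2image); find_tools and report are hypothetical helper names.

```python
import shutil

# Executables the pipeline shells out to: pytesseract calls tesseract,
# pdf2image calls pdfinfo and pdftoppm (both ship with Poppler).
REQUIRED_TOOLS = ("tesseract", "pdfinfo", "pdftoppm")


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


def report(found):
    """Format the result of find_tools as one line per tool."""
    lines = []
    for name, path in found.items():
        lines.append(f"{name}: {path}" if path else f"{name}: NOT FOUND on PATH")
    return "\n".join(lines)


if __name__ == "__main__":
    print(report(find_tools()))
```

If tesseract is installed but not on PATH, pytesseract also lets you set pytesseract.pytesseract.tesseract_cmd to the full path of the executable.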
@kiranvanukuri9382 3 years ago
Sir, super, but one question: how do I extract text from a group of many PDFs?
@KulranjanSingh 2 years ago
Use os.walk() or glob.glob().
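A minimal sketch of the glob.glob() suggestion, assuming the video's approach of converting pages with pdf2image and reading them with pytesseract. The helpers find_pdfs and ocr_folder are hypothetical names, and writing one .txt file per PDF is just one possible way to store the results.

```python
import glob
import os


def find_pdfs(folder, recursive=False):
    """Return the paths of the .pdf files in folder, sorted (hypothetical helper)."""
    parts = [folder, "**", "*.pdf"] if recursive else [folder, "*.pdf"]
    return sorted(glob.glob(os.path.join(*parts), recursive=recursive))


def ocr_folder(folder, out_dir, recursive=False):
    """OCR every PDF in folder and write one .txt file per PDF into out_dir."""
    # Imported here so find_pdfs can be used without the OCR packages installed.
    from pdf2image import convert_from_path
    import pytesseract

    os.makedirs(out_dir, exist_ok=True)
    for pdf in find_pdfs(folder, recursive):
        pages = convert_from_path(pdf)
        text = "\n".join(pytesseract.image_to_string(page) for page in pages)
        # Files with the same name in different subfolders would overwrite each other.
        stem = os.path.splitext(os.path.basename(pdf))[0]
        with open(os.path.join(out_dir, stem + ".txt"), "w", encoding="utf-8") as f:
            f.write(text)
```

Processing one PDF at a time, as the loop does, also keeps only one document's page images in memory at once.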
@vishalgarg8423 5 months ago
PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
@techwithzoum 5 months ago
This may be because your system cannot find the Poppler binaries. Here is how to set it up on a Windows machine:
1. Download the Poppler package from this website: poppler.freedesktop.org/
2. Unzip it in the C:\Program Files (x86) folder.
3. Store the path of its bin folder in a variable named as follows: poppler_path = r"C:\Program Files (x86)\poppler-24.02.0\bin"
I hope this helps.
@TiriAlain 3 years ago
It's useful, but my PC crashes from running out of memory or from the CPU temperature getting too high. ^^
@mohammednisar1458 3 years ago
I am getting this error