George from Tulsa here responding to Allison’s request for a show contribution to reduce her load this Thanksgiving week.
Years ago we paid a bank service company to microfilm file cabinets full of irreplaceable paper – some now 120 years old. The company then scanned the microfilm to image-only PDFs it delivered to us on optical disks.
The working set of PDFs currently resides on our Synology NAS where they’re in a folder structure organized by an indexed table of contents.
I’m now engaged in a project to run the gigabytes of image-only PDFs through Optical Character Recognition. This will enable searching for documents across the network by searching for text within documents, searching within open documents, and copying and pasting text and data tables to new documents and spreadsheets.
Since I’m mostly using Linux, specifically Linux Mint Cinnamon, I’m going to briefly describe here how that works in Mint and put the more difficult technical “stuff”, and all my links, at the bottom of today’s Shownotes. I’ll also talk about Mac options for the same process.
To begin, it’s necessary to download two applications from the Mint Software Center:
Tesseract is an OCR “engine” originally developed by Hewlett-Packard, with development sponsored by Google starting in 2006. Tesseract is really fast, taking advantage of all 8 cores of my Ryzen 7 processor. It’s also surprisingly accurate even on less-than-optimal scans of old paper. Many languages are available, but I’ve only installed English.
Tesseract is not user-facing. It must be invoked by another program. For us, that’s “OCRMYPDF” which is started by commands in the Linux terminal.
Running terminal commands can be scary. No worry here as all we’re doing is duplicating the original PDF to a new file with OCR without making any changes to the original. The command is one brief line you’ll be able to copy and paste from these shownotes where you’ll also find step-by-step instructions.
Processing is so fast I’m using the Linux Application PDF Arranger to merge related PDFs. Think monthly financial statements consolidated into searchable annual documents hundreds of pages long. That works great for what I’m doing. PDF Arranger will also split long documents into shorter chunks if that works better for you.
What if you want to OCR a file with a lot of text that’s saved as, for example, a JPG? Simply print the JPG to PDF and you’re good to go.
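If you'd rather skip the print-to-PDF step, OCRMYPDF can also take a single image file as its input. Here's a minimal sketch of that alternative, assuming ocrmypdf is installed. The function name ocr_image, the DPI variable, and the 300 dpi default are my own illustrative choices, so adjust the resolution to match your scan.

```shell
#!/bin/sh
# Sketch: OCR a single image (JPG, PNG, ...) straight into a searchable PDF.
# ocr_image and DPI are hypothetical names chosen for this example.

ocr_image() (
    img="$1"
    # Default output name: same as the image, with the extension swapped for .pdf
    out="${2:-${img%.*}.pdf}"
    # --image-dpi tells ocrmypdf what resolution to assume when the image
    # does not record one. 300 is an assumed default, not a measured value.
    ocrmypdf --image-dpi "${DPI:-300}" "$img" "$out"
)

# Example call (hypothetical file name):
#   ocr_image receipt.jpg
```

The image itself is left alone; the result is written to a new PDF beside it.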
Okular is a Linux file viewer that has some editing and annotation capabilities. What I find invaluable is its Table Tool which extracts tabular data that can be pasted into spreadsheets for analysis.
One other application to mention. gImageReader, available on Windows and Linux, uses the Tesseract engine for granular OCR and editing of blocks of text. It does not embed the text within a PDF but saves it as a separate TXT file. Down in the Shownotes, there’s a neat video link demonstrating it being used to simultaneously OCR text in Korean and English while the user interactively corrects errors.
It’s of course possible to OCR digital documents on a Mac.
For a small number of documents, if you have a ScanSnap which comes with the limited version of ABBYY FineReader, the easiest solution is to print the PDF to paper then re-scan with OCR enabled. That won’t work for me because of the gigabytes I need to process and the forest all that printing would kill.
If you’re geeky and love playing with computers, you might be able to get Tesseract and OCRMYPDF to run on a Mac using MacPorts or Homebrew.
The full Mac versions of ABBYY FineReader, a $69 annual subscription, and Adobe Acrobat PRO, $30 a month or $240 annually, do retroactive OCR. I had the, HA!, perpetual version of Adobe Acrobat PRO 8 and found its OCR results required significant manual correction. Perhaps Acrobat is much better now. Both offer free trials.
Amazon Software Downloads offers what appears to be a perpetual license for ABBYY’s 2015 version. But from reviews, I suspect it isn’t compatible with current versions of macOS.
UPDF Googled up as another Mac and iOS option. Brief research revealed it is a product of the Chinese company Superace and its privacy statement makes clear that if you’re using its hallmark AI features your content will be uploaded to Superace’s servers.
Speaking of privacy policies, ABBYY’s, Adobe’s, and UPDF’s are all opaque and confusing, and I’m a lawyer. I’m pretty sure all are at the least monitoring when, where, how, and on what computer their software is used. Do read and understand their settings, privacy policies, and End User License Agreements, especially if you’re processing confidential documents.
Privacy is a reason you might want to try a Linux system of your own that can run open source applications which don’t phone home.
Cost is another reason. There’s a new generation of nano-sized Linux systems with useful specs that begin as low as $130. Compare that cost to Acrobat or ABBYY. Or to Parallels, the $99 a year virtualization application that will run Windows and Linux on Macs. And boy, does Parallels phone home.
I’m wrapping up my audio here, but if you’re interested in instructions and links, check out this Episode’s Shownotes at Podfeet.com
Steps to OCR using OCRMYPDF with Tesseract
The OCRMYPDF command in TEXT that can be copied and pasted into Terminal:
ocrmypdf --output-type pdf 1.pdf 2.pdf
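For a whole folder tree, the one-line command above can be wrapped in a loop. What follows is a minimal sketch rather than the exact script used for this project. The function name ocr_tree and the folder paths in the usage comment are made up for illustration, and it assumes ocrmypdf is already installed. It mirrors the source folders into a separate output folder, so the originals are only ever read.

```shell
#!/bin/sh
# Sketch: OCR every PDF under a source folder into a mirrored output folder.
# ocr_tree is a hypothetical helper name. Keep the output folder OUTSIDE the
# source folder, or the next run would pick up the OCRed copies as new input.

ocr_tree() (
    src="${1%/}"     # source folder, trailing slash removed
    dest="${2%/}"    # output folder, trailing slash removed

    # Note: reading names line by line breaks on file names that contain
    # a newline, which should be rare for scanned archives.
    find "$src" -type f -iname '*.pdf' | while IFS= read -r pdf; do
        rel="${pdf#"$src"/}"          # path relative to the source folder
        out="$dest/$rel"

        # Skip files already processed, so an interrupted run can resume.
        [ -e "$out" ] && continue

        mkdir -p "$(dirname "$out")"

        # Same command as the single-file example; stdin is redirected so
        # ocrmypdf cannot swallow the list of remaining file names.
        ocrmypdf --output-type pdf "$pdf" "$out" </dev/null ||
            echo "FAILED: $pdf" >&2
    done
)

# Example call (hypothetical paths):
#   ocr_tree /volume1/scans /volume1/scans-ocr
```

Failures are reported on the terminal and the loop moves on to the next file, so one bad scan doesn't stop the whole batch.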
Privacy Policies:
- ABBYY: pdf.abbyy.com/…
- Adobe: www.adobe.com/…
- UPDF: updf.com/…
- Parallels: www.alludo.com/…
ABBYY and Adobe Acrobat Pro Trials:
Linux OCR Software
- Tesseract: en.wikipedia.org/…
- ocrmypdf: github.com/…
PDFArranger
Okular – The Universal Document Viewer
gImageReader – Linux and Windows
- github.com/…
- YouTube video showing gImageReader in action: www.youtube.com/…
Run Linux software on a Mac?
“The Vault of Useless Backups,” where in Nosillacast #295 on January 16, 2011 I first discussed the paper I’m now processing to OCR. “If there’s something you absolutely, positively have to keep, paper will outlive computers.” Subtext: proprietary computer gear and software will let you down when you need it most.
Maybe an Inexpensive NUC is All the Computer You Need
Overview of current Mini-PCs:
The Kamrui AK1 Plus is a dirt cheap mini PC with a 15-watt Intel Processor N95 quad-core chip and list prices starting as low as $180 (although the AK1 Plus is currently on sale for as little as $126).
Sounds like you need DEVONthink Pro, which will create a copy of your PDFs with a text layer using ABBYY FineReader for Mac. It’s not a cheap option, but it will also give you the tools to organise your PDFs and fantastic controls to search them. Well worth checking out: it has a very generous trial period and there’s even a free Take Control Book for DEVONthink 3.
George mentions ABBYY and emphasizes that he was looking for a free and open source method to do this. For the occasional user, these solutions are far too expensive; I like George’s answer much better.
Yes, I use ocrmypdf to process PDF files on my Mac. Using a program is easier, but I have collected the needed terminal commands to process documents. I saved these commands in Notes.
You can use Tesseract directly. For example, to get the text from a picture you can use the command:
tesseract ~/Desktop/Screenprint.png ~/Desktop/textfile -l nld
This takes the text from a picture and saves it as normal (ASCII) text. Nowadays this is built into macOS.
Cool, Frank. I tried that command and it didn’t work, and tesseract annoyingly has no man pages. But I did find out through some searching that -l nld means to use the Dutch language! I don’t have any languages loaded. I stripped that off and got a nice text file in my native language. Fun stuff!
Sorry, I should have left the language part off. You can find some Tesseract documentation here:
https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html
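One way to avoid the language mix-up described in these comments is to ask Tesseract which languages are installed before passing -l. This is only a sketch: the function name tess_text is hypothetical, and it assumes a tesseract build that supports the --list-langs option.

```shell
#!/bin/sh
# Sketch: run tesseract on an image, using the requested language only if
# that language pack is installed. tess_text is a hypothetical helper name.

tess_text() (
    img="$1"         # input image
    outbase="$2"     # output name without extension; tesseract appends .txt
    lang="${3:-}"    # optional language code such as nld or eng

    # --list-langs prints one installed language code per line after a header.
    # grep -x matches the whole line so "eng" does not match "enm".
    if [ -n "$lang" ] && tesseract --list-langs 2>&1 | grep -qx "$lang"; then
        tesseract "$img" "$outbase" -l "$lang"
    else
        if [ -n "$lang" ]; then
            echo "Language '$lang' is not installed; using tesseract's default." >&2
        fi
        tesseract "$img" "$outbase"
    fi
)

# Example call, reusing the file names from the comment above:
#   tess_text ~/Desktop/Screenprint.png ~/Desktop/textfile nld
```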