Optical Character Recognition: Why?
Graduate school is marked by a tremendous amount of reading. The vast majority of this reading seems to be in the form of journal articles or book chapters which, thankfully, are often available electronically. (If they aren't, I often take the time to scan them myself.) I end up reading most of these on my tablet, where I want to highlight text and otherwise annotate them. Sometimes, however, one comes across a PDF whose text cannot be selected and therefore cannot be highlighted. The solution is to run optical character recognition (OCR) software on the file. While many modern scanners automatically perform OCR as part of the scanning process, I still come across enough scanned documents without selectable text to warrant this post (see Figure 1).
There is considerable variety among the OCR solutions available. MakeUseOf gives its recommendations for three free OCR solutions, but all of them result in the PDF's text being stored in a separate text document. This is useful if getting access to the raw text is the goal, but it is not sufficient for my purposes: I want the OCR'd text to be stored in the original PDF file in such a way that the text in the original file can be selected and highlighted. There are no doubt commercially available tools to accomplish this task, but I prefer free (and open source) tools whenever possible. Enter WatchOCR.
Many Linux programs follow the Unix philosophy: "Write programs that do one thing and do it well." For our OCR problem, this translates into different tools for each component of the OCR process (e.g. one tool for decomposing the PDFs into images and another tool for finding text in images). While perusing Slashdot, I found WatchOCR: a project that combines some of the free, OCR-related tools available for Linux and packages them together as a single, workable solution. Ostensibly, the program would be installed on a Linux computer and create two folders: you put PDFs to OCR in one and the program places the OCR'd documents in the other after a short time.
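The two-folder workflow described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not WatchOCR's actual implementation: the function names `ocr_pdf` and `process_once` and the folder names `in_dir` and `out_dir` are hypothetical, and the OCR step is replaced by a plain file copy so that the folder handling can be run on its own.

```python
import shutil
import tempfile
from pathlib import Path


def ocr_pdf(src: Path, dst: Path) -> None:
    """Hypothetical stand-in for the real OCR step.

    A real pipeline would split the PDF into page images, run an OCR
    tool on each image, and write a PDF with a selectable text layer.
    Here the file is only copied so the folder logic can be exercised.
    """
    shutil.copyfile(src, dst)


def process_once(in_dir: Path, out_dir: Path) -> list:
    """Process every PDF currently in in_dir; return the names handled."""
    handled = []
    for src in sorted(in_dir.glob("*.pdf")):
        ocr_pdf(src, out_dir / src.name)
        src.unlink()  # remove the input so it is not processed twice
        handled.append(src.name)
    return handled


# Demonstration with temporary folders. A long-running watcher would call
# process_once() repeatedly, sleeping for a while between passes.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp) / "in"
    out_dir = Path(tmp) / "out"
    in_dir.mkdir()
    out_dir.mkdir()
    (in_dir / "article.pdf").write_bytes(b"%PDF-1.4 placeholder")
    demo_result = process_once(in_dir, out_dir)
    demo_output_exists = (out_dir / "article.pdf").exists()
    demo_input_remaining = list(in_dir.glob("*.pdf"))
```

Deleting the input after processing is an assumption made here to keep the loop simple; whether WatchOCR itself removes or archives its inputs is not something this sketch claims to know.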
The WatchOCR system is available as a .deb package for Debian-based systems (e.g. Ubuntu), but I've never been able to get it installed because of dependency issues. Thankfully, the WatchOCR team provides a Knoppix-based live CD with all of the dependency issues resolved. Just boot it up and it's ready to OCR documents, with a neat-looking, browser-based interface to control the options, as in Figure 2 above. While one could ostensibly use the live CD on a physical computer attached to a network to OCR all of the documents for an office, for a single user it makes more sense to install WatchOCR to a virtual machine and use VirtualBox Shared Folders to manage PDFs.
Using WatchOCR in a virtual machine
Using WatchOCR in this way is relatively straightforward. Creating the virtual machine is pretty much the same as with every other Linux distribution. I gave WatchOCR 512 MiB of RAM, an 8 GiB dynamically-allocated disk, and 1 CPU core with no execution cap (100%), and this seems to be sufficient for the OCR process. Giving it less may result in it locking up, because the process is rather resource-intensive, but I also don't know where the point of diminishing returns is. One can install WatchOCR to the disk from the Preferences -> KNOPPIX HD Install menu item.
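For anyone who would rather script the VM creation than click through the VirtualBox interface, the settings above roughly correspond to the VBoxManage commands assembled below. This is a sketch under assumptions: the VM name, disk file name, and the generic "Linux" OS type are placeholders, option spellings differ between VirtualBox versions (older releases use createhd where newer ones use createmedium), and attaching the disk and the live CD image to a storage controller is left out. The code only builds and prints the commands; it does not run them.

```python
import shlex

VM_NAME = "WatchOCR"        # hypothetical VM name
DISK_FILE = "WatchOCR.vdi"  # hypothetical disk image file


def vm_setup_commands(name: str, disk: str) -> list:
    """Return VBoxManage argument lists matching the settings in the text."""
    return [
        # Create and register an empty VM (OS type is an assumption).
        ["VBoxManage", "createvm", "--name", name,
         "--ostype", "Linux", "--register"],
        # 512 MiB of RAM, 1 CPU core, execution cap at 100% (no cap).
        ["VBoxManage", "modifyvm", name, "--memory", "512",
         "--cpus", "1", "--cpuexecutioncap", "100"],
        # 8 GiB disk (size is given in MB); "Standard" is dynamically allocated.
        ["VBoxManage", "createmedium", "disk", "--filename", disk,
         "--size", "8192", "--variant", "Standard"],
    ]


for argv in vm_setup_commands(VM_NAME, DISK_FILE):
    print(shlex.join(argv))
```

Printing the commands first makes it easy to check them against the documentation for the installed VirtualBox version before running anything.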
Once WatchOCR is installed to disk, install the Guest Additions and set up a shared folder as usual. (For what it's worth, I use a Windows host.) Unfortunately, it doesn't seem to be particularly easy to get WatchOCR to auto-mount the shared folder, and, as the version of Knoppix used in the live CD is somewhat old, documentation is lacking (particularly with regard to using init.d and/or rc.local). The way I finally got it to auto-mount the shared folder was by creating a file named mountvboxfs.desktop in the autostart folder (though the actual filename isn't tremendously important). The file contains the following text:
Encoding=UTF-8
Type=Application
Name=sudo mount
Comment=mounts Clearinghouse
Exec=sudo mount -t vboxsf VMClearinghouse /home/knoppix/clearinghouse/
StartupNotify=false
Terminal=false
Hidden=false
The key line is the Exec statement, as this is the command that is actually run. Now, I'm no security expert, but scripting a command to be run with sudo doesn't seem like the smartest idea. However, considering the nature of the system and the fact that it is a virtual machine, this gets the job done and doesn't bother me too much. The Name statement is what is added to the Preferences -> Desktop Session Settings list of Automatically Started Applications. (Putting the appropriate .desktop file in the autostart folder is how entries are added to this menu.)
Also of note is that VMClearinghouse is the Folder Name in VirtualBox, and clearinghouse is the name of the folder I created as the mount point for the shared folder. In the shared folder, I created scanin and scanout folders corresponding to the ones that WatchOCR monitors for files, and then just updated the interface to monitor the new folders. Now, any PDF that needs to be OCR'd gets placed in the scanin folder (accessible to the host operating system and all other virtual machines I have through the Shared Folders feature) and the new file is available in scanout in a few minutes. Because the footprint of the WatchOCR VM is (relatively) small, I just leave it running all the time to make the process even more convenient.
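To make the relationship between the VirtualBox Folder Name, the mount point, and the Exec line explicit, here is a small Python sketch that generates the autostart file's contents from those two values. The function name `build_autostart_entry` is hypothetical. The `[Desktop Entry]` group header is not part of the text shown earlier; it is included because the freedesktop Desktop Entry format expects it at the top of the file.

```python
def build_autostart_entry(share_name: str, mount_point: str) -> str:
    """Return .desktop file text that mounts a VirtualBox share at login."""
    lines = [
        "[Desktop Entry]",  # group header expected by the .desktop format
        "Encoding=UTF-8",
        "Type=Application",
        "Name=sudo mount",  # label shown in the autostart list
        "Comment=mounts Clearinghouse",
        # share_name is the Folder Name set in VirtualBox;
        # mount_point is an existing directory inside the guest.
        f"Exec=sudo mount -t vboxsf {share_name} {mount_point}",
        "StartupNotify=false",
        "Terminal=false",
        "Hidden=false",
    ]
    return "\n".join(lines) + "\n"


entry = build_autostart_entry("VMClearinghouse", "/home/knoppix/clearinghouse/")
print(entry)
```

Changing either the share name or the mount point then only requires regenerating the file, rather than editing the Exec line by hand.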
A note on the output quality
The OCR performed by WatchOCR is... hit or miss. I just want to be able to select (and highlight) text in PDF files, and WatchOCR works well for this purpose. Actually searching or copying text from the OCR'd PDF files can be a nightmare, though. If the document is exceptionally clear with no angle to the text and Jupiter is in the Sixth House, you may just get lucky. Below are two typical examples of the quality.
Figure 3 shows that the document can be highlighted, but when the text is copied this is the result:
However, once viewed against the backdrop of an alternative-"constructivist"~
perspecti.ve on·how learning talces place, what "doing" mathem.
atics CaJ,1 mean, and what these imply for mathematics instruction, Hendry's
The text copied from Figure 4 is even worse:
2009w! ithp eoplwe hoh avenk'te ptu p;a re" paradigmbse hind"
Patto2n0, 08p,,2 69!a; ndh avfeo ro vehr alaf centuir ys,e ems,
notr eada ndloer ngagethde l inguistitcu rn,t hec ulturatlu rn,
The above examples are not to say that WatchOCR doesn't get it nearly perfect at times. It does, but its inconsistent output quality is a very real shortcoming of which users need to be aware. I think the reason for the poor performance is that WatchOCR is based on some outdated versions of the underlying PDF manipulation tools. Unfortunately, because of how precariously balanced the dependencies seem, simply updating its components seems difficult at best.
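One rough way to put a number on "hit or miss" is to compare the copied text against what the page actually says. The sketch below uses Python's standard difflib module to compute a similarity ratio between 0 and 1. The OCR string is taken from the Figure 3 sample above; the reference string is my guess at the intended wording, not something checked against the original article, and the function name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(ocr_text: str, reference: str) -> float:
    """Return a 0..1 character-level similarity ratio (1.0 means identical)."""
    return SequenceMatcher(None, ocr_text, reference).ratio()


# Text as copied from the OCR'd PDF (part of the Figure 3 sample).
ocr_sample = "perspecti.ve on·how learning talces place"
# Assumed intended wording; a guess, not verified against the source.
reference_guess = "perspective on how learning takes place"

score = similarity(ocr_sample, reference_guess)
print(f"similarity: {score:.2f}")
```

A score like this only means something when the reference text is trustworthy, so it is more useful for comparing OCR settings on a page that has been transcribed by hand than for judging arbitrary documents.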
Of course, the best option for scanning documents with OCR is arguably to use scanning software that automatically does this. My cheap scanner came with software that does OCR well, and I imagine that as time goes on we'll run into fewer documents that need to be scanned and even fewer that have been scanned without OCR having been performed.
Update - December 8, 2014
I've been told that the WatchOCR website no longer works, and so the files from that project can no longer be downloaded from it. To keep them accessible, I've decided to host them here. I don't have a lot of disk space or bandwidth (this website is a small hobby and I want to keep costs low), so I would appreciate it if you could pass these files on to colleagues and friends using USB drives and the like where possible. I've only ever had luck using the ISO file, but the Debian software packages the WatchOCR team released are also here.