Linux

Working with PDF files efficiently: WatchOCR

Optical Character Recognition: Why?

Graduate school is marked by a tremendous amount of reading. The vast majority of this reading seems to be in the form of journal articles or book chapters which - thankfully - are often available electronically. (If they aren't, I often take the time to scan them myself.) I end up reading most of these on my tablet, where I want to highlight text and otherwise annotate them. Sometimes, however, one comes across a PDF whose text cannot be selected - and therefore cannot be highlighted. The solution for this is to run optical character recognition (OCR) software on the file. While many modern scanners automatically perform OCR as part of the scanning process, I still come across enough scanned documents without selectable text to warrant this post (see Figure 1).

An example of selection in a document without OCR.
Figure 1. Come on, Adobe. You know that's not what I wanted.

There is considerable variety among the OCR solutions available. MakeUseOf gives its recommendations for three free OCR solutions, but all of them result in the PDF's text being stored in a separate text document. This is useful if getting access to the raw text is the goal, but it is not sufficient for my purposes: I want the OCR'd text to be stored in the original PDF file in such a way that the text in the original file can be selected and highlighted. There are no doubt commercially available tools to accomplish this task, but I prefer free (and open source) tools whenever possible. Enter WatchOCR.


On a return to blogging after a hiatus

With the winter holiday I returned to my lazy, non-blogging habits. A New Year's resolution did little to change the situation. I suppose one just jumps in, though. I'll try to keep up with things more this semester. Really.

Plans for this semester

I'm currently taking a seminar on statistics education and an introductory course on qualitative methods. While the former is clearly my area of interest, the latter is proving to be more enjoyable than I had anticipated. One of the books for the course is Crotty's The Foundations of Social Research: Meaning and Perspective in the Research Process, which is a bit more abstract than I was expecting, focusing on epistemologies and theoretical perspectives. It is a refreshing change, and I'm currently working my way through Feyerabend's Against Method after having my views on post-positivism challenged. (Those views seemed to be most aligned with Popper before this academic year.) Other plans include a trip to San Diego for LOCUS-related things and In-N-Out Burger, insha'Allah.

Dealing with Protected/Secured PDFs

Occasionally I'll come across a PDF that is Protected/Secured (it says 'SECURED' in the title bar of Adobe Reader). These files are rather annoying to deal with. I've been using Mendeley to organize the articles/books I've read, and I copy each abstract into the software so that it can be searched. Alas, one journal whose articles I often read secures every single PDF so that copying cannot be done. Really frustrating.

Thankfully, this "secured" state does not require a password to open the file. From what I gather, the file carries permission flags that disable certain features, and Adobe Reader, upon finding them, respects the file's instructions. Not all software respects those instructions, and the readers that don't allow copying without issue. Two such readers are Evince (part of GNOME) and Okular (part of KDE). Both are open source, and both at least have options for disabling the DRM on the files. Both are exceedingly common on Linux and are also available on Windows and many other platforms; if you're just looking for a quick download on Windows, Evince might be better. Either way, problem solved.
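
For the curious, here is a rough sketch of what those "disable certain features" flags look like. As far as I understand the PDF specification, the permissions are stored as a single signed 32-bit integer (the /P entry), where each bit switches one feature on or off. The bit positions below are taken from the specification's permissions table; the names decode_permissions and PERMISSION_BITS are my own, and the value -3900 is just an illustrative number, not one read from any particular journal's files.

```python
# Permission flags as listed in the PDF specification (ISO 32000-1).
# Bit positions are 1-based, counted from the least significant bit.
# The dictionary and function names here are hypothetical, for illustration.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p_value):
    """Return a dict mapping each permission name to True (allowed) or False."""
    # /P is a signed 32-bit integer; mask it so the raw bits can be read.
    raw = p_value & 0xFFFFFFFF
    return {
        name: bool(raw & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }


if __name__ == "__main__":
    # Illustrative value only: printing allowed, copying denied.
    for name, allowed in decode_permissions(-3900).items():
        print(f"{name:26} {'allowed' if allowed else 'denied'}")
```

A reader that honors the flags checks the "copy" bit before letting you select and copy text; a reader that ignores them (or is configured to) simply never looks.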

