Extracting text from PDFs without pdftotext

For a recent project, I had to extract the text out of a PDF so that I could save it into a database table.

Normally, I would have used the popular pdftotext program, but it wasn’t available in the particular environment I was working in. I contacted support and they advised that the XPDF package has several X Windows dependencies, which is why they had not installed it. Fair enough.

I poked around a bit and found Apache’s PDFBox library. I downloaded the package and looked at the examples. Sure enough, there was a program called “ExtractText” that did exactly what I wanted.

Using ExtractText is similar to pdftotext – just pass in the PDF file and the text comes back. Awesome.
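To make the "pass in the PDF file and the text comes back" step concrete, here is a minimal sketch of calling a packaged ExtractText JAR from Java and capturing its output as a string, ready to be written to a database column. A few things here are assumptions rather than facts about my build: the JAR name `extracttext.jar` is hypothetical, the `-console` flag is taken from PDFBox's stock ExtractText tool (a rebuilt copy may behave differently), and the output is assumed to be UTF-8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ExtractTextRunner {

    // Builds the command line used to launch the JAR.
    // Assumption: the JAR accepts "-console" (as PDFBox's stock ExtractText
    // does) to print the extracted text to stdout instead of a file.
    static List<String> buildCommand(String jarPath, String pdfPath) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-console");
        cmd.add(pdfPath);
        return cmd;
    }

    // Runs the JAR and returns everything it wrote to stdout.
    static String extract(String jarPath, String pdfPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(jarPath, pdfPath));
        // Let error messages pass through to our own stderr so the
        // child process cannot block on a full, unread error pipe.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();

        String text;
        try (InputStream in = process.getInputStream()) {
            // Assumption: the tool emits UTF-8; adjust if your build differs.
            text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("ExtractText exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: ExtractTextRunner <jar> <pdf>");
            return;
        }
        System.out.print(extract(args[0], args[1]));
    }
}
```

Run it as `java ExtractTextRunner extracttext.jar report.pdf` (both file names are placeholders). If you are already inside a Java application, calling the PDFBox classes directly avoids the extra process; the wrapper above is mainly useful when you want to treat the JAR as a drop-in stand-in for pdftotext.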

Anyway, hats off to Ben Litchfield, who wrote the ExtractText example. I rebuilt the ExtractText.java file as a standalone project and packaged it as a JAR.

I’ve attached the JAR and Eclipse project if anyone wants a copy of either.

The JAR

The Eclipse Project