For a recent project, I had to extract the text out of a PDF so that I could save it into a database table.
Normally, I would have used the popular pdftotext program, but it wasn’t available in the environment I was working in. I contacted support, and they explained that the Xpdf package has several X Window System dependencies, which is why they had not installed it. Fair enough.
I poked around a bit and found Apache’s PDFBox library. I downloaded the package and looked at the examples. Sure enough, there was a program called “ExtractText” that did exactly what I wanted.
Using ExtractText is similar to pdftotext – just pass in the PDF file and the text comes back. Awesome.
Anyway, hats off to Ben Litchfield, who wrote the ExtractText example. I rebuilt the ExtractText.java file as a standalone project and packaged it as a JAR.
I’ve attached the JAR and Eclipse project if anyone wants a copy of either.
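For anyone who wants to call the JAR from Java code rather than from a shell, here is a rough sketch of one way to do it. It rests on a few assumptions that are not confirmed above: that the JAR is runnable with `java -jar`, that it keeps the argument order of the PDFBox example (input PDF first, output text file second), and that the output file is UTF-8. The class name `ExtractTextRunner` and its methods are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs a standalone ExtractText JAR as a child process.
class ExtractTextRunner {

    // Builds the command line. ASSUMPTION: the JAR takes
    // "<input PDF> <output text file>" as its two arguments.
    static List<String> buildCommand(Path jar, Path pdf, Path textOut) {
        return Arrays.asList("java", "-jar",
                jar.toString(), pdf.toString(), textOut.toString());
    }

    // Runs the JAR, then reads the extracted text back from a temporary file.
    static String extract(Path jar, Path pdf)
            throws IOException, InterruptedException {
        Path textOut = Files.createTempFile("extracted", ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(jar, pdf, textOut))
                    .inheritIO() // pass the child's stdout/stderr through to ours
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("ExtractText exited with code " + exitCode);
            }
            // ASSUMPTION: the output file is UTF-8 encoded.
            return new String(Files.readAllBytes(textOut), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(textOut);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: ExtractTextRunner <jar> <pdf>");
            return;
        }
        System.out.println(extract(Paths.get(args[0]), Paths.get(args[1])));
    }
}
```

The string returned by `extract` is what would then be saved to the database table. Going through a temporary file keeps the sketch independent of whether the JAR can also print to standard output.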