Extracting text from PDFs without pdftotext

For a recent project, I had to extract the text out of a PDF so that I could save it into a database table.

Normally, I would have used the popular pdftotext program, but it wasn’t available in the particular environment I was working in. I contacted support and they advised that the Xpdf package has several X Window dependencies, which is why they had not installed it. Fair enough.

I poked around a bit and found Apache’s PDFBox library. I downloaded the package and looked at the examples. Sure enough there was a program called “ExtractText” that did exactly what I wanted.

Using ExtractText is similar to pdftotext – just pass in the PDF file and the text comes back. Awesome.

Anyway, hats off to Ben Litchfield, who wrote the ExtractText example. I rebuilt the ExtractText.java file as a standalone project and packaged it as a JAR.

I’ve attached the JAR and Eclipse project if anyone wants a copy of either.

The JAR

The Eclipse Project

Loading Different Javascript/CSS Files in Different Environments

Today we were working on minifying our JavaScript and CSS files for a site we are launching tomorrow. We ran into a problem: we didn’t want the repository to get out of sync by editing view.yml to include only our single minified style sheet and JavaScript file every time we moved between the development environment and production. What we wanted was for the javascripts: and stylesheets: parameters in view.yml to change depending on which environment is loaded. However, using prod: and dev: sections in view.yml doesn’t work the way it does in app.yml. An example of what we wanted would be:

all:
  title: My site
dev:
  stylesheets: [styleone.css, styletwo.css, stylethree.css]
  javascripts: [one.js, two.js, three.js]
prod:
  stylesheets: [style.min.css]
  javascripts: [main.min.js]

This way the files would load much more quickly in production, while still being easy to debug in development environments. After looking around a bit we couldn’t find a standard, current solution, so we came up with the following. In app.yml we defined two variables, javascript_files and css_files, for each of the dev and production environments. There you put the list of files you want loaded in each environment. view.yml then looks like:

all:
  title: My Title
  stylesheets: [<?php echo sfConfig::get('app_css_files');?>]
  javascripts: [<?php echo sfConfig::get('app_javascript_files');?>]

The configuration variables get their values from the environment you are running in, so the proper files are loaded. With this configuration we can easily debug in the development environment and still serve minified versions of the CSS and JavaScript in the production environment. Hopefully this will save some of you time and make your life easier.

Taking Your Application International

Many clients develop their applications with only one language in mind. Recently one of our clients, after we had developed two different applications for them, decided that the applications needed to be translated into seven different languages. At first the client said, “We’ll just make seven copies of it, and update each one separately.” While this may at first seem to be a quick, simple solution, think of the long-term effects. First, if the application is large, you are going to waste a lot of space. Second, and most important, with this approach you are going to be stuck maintaining seven different copies of the same application; every update will have to be made seven times. Not only is this error prone, it is also inefficient.

Our solution? Use a common practice called internationalization (i18n) and localization (l10n). Internationalization covers text translation (from page content to form labels to error messages), and localization covers presenting content in a locale-specific format (dates, currency, numbers, etc.). For many applications this is not an easy process, and it can often require going back and rewriting much of the code. However, we use Symfony, which makes the task much easier. Symfony allows you to use dictionary files and the database to handle this. Symfony can scan your entire project looking for specific markup, __("Text here"), and pull the strings out into a simple XML file in which you can enter the translations. The file looks like the following:

With this file, all you need to do is change the <target /> to <target>My Translated Text</target>. When the string “My String To Be Converted” is output in the application, it will be converted to “My Translated Text”. You’d save this file as, for example, messages.es.xml if it were the Spanish translation file.

To read more on internationalization with Symfony, visit http://www.symfony-project.org/book/1_2/13-I18n-and-L10n. We use Symfony because in situations like this it allows us to quickly adapt our products to our customers’ needs.

Always think ahead when developing your applications.  Planning for the future work on an application can save you hours of rewriting.

Monkeys and Shakespeare: genetic algorithms with Jenes

The other night at a bar, we started talking about evolution, which somehow sparked a discussion about the law of large numbers and the probability that humanity is just a cosmic fluke. Eventually, someone brought up the “monkeys on a typewriter” argument, which caused an uproar among the philosophers in the group.

This morning, I decided to see what Wikipedia had to say about monkeys and typewriters and eventually stumbled across an article about the “Weasel program” which Richard Dawkins wrote to demonstrate “random variation and non-random cumulative selection in natural and artificial evolutionary systems.” Basically, it simulates the monkeys on a typewriter to produce a line from Hamlet. At this point, I was hooked – I wanted to make one.

I’d experimented with genetic algorithms in a class I took at Tufts and I’ve been increasingly curious since the “evolving Mona Lisa” code got out on the web.

Anyway, I decided to use the Jenes library to whip up some code to “evolve” strings. The Jenes library is absolutely fantastic. It is easy to set up, easy to use, and the documentation is well written and easy to follow.
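
For anyone who wants to see the idea without pulling in a library, here is a rough plain-Java sketch of a Weasel-style loop. To be clear, this is not the Jenes-based code behind the demo: the class name Weasel, the alphabet, the 100 children per generation, the 5% mutation rate, and the 10,000-generation cap are all assumptions picked for illustration. Each generation mutates the current best string, scores every child by how many characters match the target, and keeps a child only if it beats the best so far.

```java
import java.util.Random;

class Weasel {
    // Assumed alphabet: uppercase letters plus space.
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ ";

    // Fitness: number of positions that already match the target.
    static int fitness(String candidate, String target) {
        int score = 0;
        for (int i = 0; i < target.length(); i++) {
            if (candidate.charAt(i) == target.charAt(i)) {
                score++;
            }
        }
        return score;
    }

    static char randomChar(Random rng) {
        return ALPHABET.charAt(rng.nextInt(ALPHABET.length()));
    }

    // Random variation: each character is replaced with probability `rate`.
    static String mutate(String parent, double rate, Random rng) {
        StringBuilder child = new StringBuilder(parent.length());
        for (int i = 0; i < parent.length(); i++) {
            child.append(rng.nextDouble() < rate ? randomChar(rng) : parent.charAt(i));
        }
        return child.toString();
    }

    // Cumulative selection: the best string so far always survives.
    // The target must only contain characters from ALPHABET.
    static String evolve(String target, long seed) {
        Random rng = new Random(seed);
        StringBuilder start = new StringBuilder();
        for (int i = 0; i < target.length(); i++) {
            start.append(randomChar(rng));
        }
        String best = start.toString();
        for (int gen = 0; gen < 10000 && !best.equals(target); gen++) {
            String parent = best;
            for (int i = 0; i < 100; i++) {
                String child = mutate(parent, 0.05, rng);
                if (fitness(child, target) > fitness(best, target)) {
                    best = child;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(evolve("METHINKS IT IS LIKE A WEASEL", 42L));
    }
}
```

The key design point is the selection step: pure random typing almost never hits the target, while keeping the best candidate each round lets small improvements accumulate.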

My implementation is online at: http://setfive.com/evolve.php

And it evolves Dawkins’s Hamlet line in about 3 seconds – link

The code to run the genetic algorithms is written in Java and uses a Jetty container to accept and process HTTP requests. Using an embedded Jetty container proved to be seamless and the application server seems to be running pretty smoothly.

A zip file containing an Eclipse project for the code is available here.

Additionally, a self-contained JAR for the server is available here. Start it with java -jar wordga-jetty.jar

As always, questions and comments are welcome.

MS SQL starts on wrong port

Over the past few days we’ve been working with one of our clients to develop an enterprise search solution for one of their databases. Due to various historical reasons, the database is on MS SQL and has to stay that way. No worries right? Fail.

I decided to move forward with Solr because of its DataImportHandler feature and its ability to easily expose search results in various formats (JSON, XML, PHP serialized objects, etc).

For some reason I couldn’t get MS SQL Server to listen on the port it was configured to use. This proved particularly hairy to debug because I couldn’t tell if my JDBC DSN binding was failing, the JDBC driver was failing, or if the DB server was actually mucking it up.

The JDBC URL I’m using is:

url="jdbc:sqlserver://localhost:1170;databaseName=dbName;User=test;Password=*****"

To try and narrow things down, I used the Eclipse Data Tools package to let me use a JDBC driver from within Eclipse to connect to the server. Using this, I could clearly see that the exception being thrown was that the MS SQL server was not accepting connections on port 1433.
As far as I can tell MS SQL is set up correctly.

I checked the MS SQL start up log and it was showing this:

Server is listening on ['any' 1170]

I tried using 1170 in Eclipse and Solr and, bang, everything fell into place.
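
If you hit something similar, a plain TCP probe is a quick way to separate “is anything listening on this port?” from JDBC driver or URL problems. The sketch below is only an illustration and is not the tooling used above (that was Eclipse Data Tools): the class name PortProbe, the 2-second timeout, and the choice of localhost are assumptions, and the two ports are simply the configured one and the one from the startup log.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PortProbe {
    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: nothing reachable on that port.
            return false;
        }
    }

    public static void main(String[] args) {
        // 1433 is the configured port; 1170 is the one reported in the startup log.
        for (int port : new int[] {1433, 1170}) {
            System.out.println("localhost:" + port + " listening: "
                    + isListening("localhost", port, 2000));
        }
    }
}
```

This only tells you that something accepts connections on the port, not that it is SQL Server, but it is enough to rule the JDBC driver in or out before digging further.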

What is weird is that the configuration is showing that the server should be running on 1433.
If anyone has any idea why this is happening I’d love to know for future reference.

I replicated this behavior on a production machine. In both instances I was using MS SQL Server 2008 Express. On the dev box I was running Windows XP SP2+ and the production box was running Windows Server 2003.