Extracting text from PDFs without pdftotext

For a recent project, I had to extract the text out of a PDF so that I could save it into a database table.

Normally, I would of used the popular pdftotext program but it wasn’t available in the particular environment I was working in. I contact support and they advised that the XPDF package has several X windows dependencies and that’s why they had not installed it. Fair enough.

I poked around a bit and found Apache’s PDFBox library. I downloaded the package and looked at the examples. Sure enough there was a program called “ExtractText” that did exactly what I wanted.

Using ExtractText is similar to pdftotext – just pass in the PDF file and the text comes back. Awesome.

Anyway, hats off to Ben Litchfield who wrote the ExtractText example. I rebuilt the ExtactText.java file as a standalone project and packaged it as a JAR.

I’ve attached the JAR and Eclipse project if anyone wants a copy of either.

The JAR

The Eclipse Project

Monkeys and shakespeare: genetic algorithms with Jenes

The other night at a bar, we started talking about evolution which somehow sparked a discussion about the law of large numbers and the probability that humanity is just a cosmic fluke. Eventually, someone brought up the “monkeys on a typewriter” argument which caused uproar among the philosophers in the group.

This morning, I decided to see what Wikipedia had to say about monkeys and typewriters and eventually stumbled across an article about the “Weasel program” which Richard Dawkins wrote to demonstrate “random variation and non-random cumulative selection in natural and artificial evolutionary systems.” Basically, it simulates the monkeys on a typewriter to produce a line from Hamlet. At this point, I was hooked – I wanted to make one.

I’d experimented with genetic algorithms in a class I took at Tufts and I’ve been increasingly curious since the “evolving Mona Lisa” code got out on the web.

Anyway, I decided to use the Jenes library to whip up some code to “evolve” strings. The Jenes library is absolutely fantastic. It is easy to setup, easy to use, and the documentation is well written and easy to follow.

My implementation is online at: http://setfive.com/evolve.php

And it evolves Dawkin’s Hamlet line in about 3 seconds – link

The code to run the genetic algorithms is written in Java and uses a Jetty container to accept and processes HTTP requests. Using an embedded Jetty container proved to be seamless and the application server seems to running pretty smoothly.

A zip file containing an Eclipse project for the code is available here.

Additionally, a self contained JAR for the server is available here . Start it with java -jar wordga-jetty.jar

As always, questions and comments are welcome.

MS SQL starts on wrong port

Over the past few days we’ve been working with one of our clients to develop an enterprise search solution for one of their databases. Due to various historical reasons, the database is on MS SQL and has to stay that way. No worries right? Fail.

I decided to move forward with Solr because of its DataImportHandler feature and its ability to easily expose search results in various formats (JSON, XML, PHP serialized objects, etc).

For some reason I can’t seem to get MS SQL server to open on the port it is configured to. This proved particularly hairy to debug because I couldn’t tell if my JDBC DSN binding was failing, the JDBC driver was failing, or if the DB server was actually mucking it up.

The JDBC URL I’m using is:

url=”jdbc:sqlserver://localhost:1170;databaseName=dbName;User=test;Password=*****”

To try and narrow things down, I used the Eclipse Data Tools package to let me use a JDBC driver from within Eclipse to connect to the server. Using this, I could clearly see that the exception being thrown was that the MS SQL server was not accepting connections on port 1433.
As far as I can tell MS SQL is set up correctly.

sql-server-1

sql-server-2

I checked the MS SQL start up log and it was showing this:

Server is listening on [‘any’ 1170]

I tried using 1170 in Eclipse and Solr and bang everything fell into place.

What is weird is that the configuration is showing that the server should be running on 1433.
If anyone has any idea why this is happening I’d love to know for future reference.

I replicated this behavior on a production machine. In both instances I was using MS SQL Server 2008 Express. On the dev box I was running Windows XP SP2+ and the production box was running Windows Server 2003.

sfWidgetjQueryTimepickr – Symfony timepickr widget

A couple of months ago John Resig posted on his blog about a “new” way for users to pick time.

The component is a jQuery plugin called timepickr and I thought it was particularly neat. Anyway, I finally needed to write a form which had a time component so I figured I’d drop in the timepickr jQuery plugin.

It didn’t look like there was a Symfony Forms widget for it so I whipped one up. You can grab it here.

The only issue with timepickr is that it introduces a ton of dependencies. It requires jQuery, jQuery-ui, jQuery.utils, jQuery.string, and ui.dropslide. Additionally, it needs the dropslide css as well as the timepickr css.

The form widget assumes that you will include jquery-ui by yourself since everyone usually has different naming conventions. It will include the other JS files and CSS files for you.

You can download a package with all of the JS and CSS you need here. Again this DOES NOT include the jQuery-ui stuff. YOU have to include that yourself in your Symfony project as well as on the page that you deploy this widget.

To use it, just drop it in your Symfony project somewhere where classes get autoloaded (projectdir/lib works) and then instantiate a widget with new sfWidgetjQueryTimepickr() . It currently will support all of the timepickr options passed in as widget options (on the constructor). Full documentation for timepickr is here.

Have fun picking time.

Gmail fail

Looks like someone at gmail forgot to make sure the system won’t display negative time for future timestamped emails. I’m not sure where/how this email got a bad timestamp but gmail is displaying it as being sent in the future!
See it!