<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>{5} Setfive - Talking to the World &#187; pdftotext</title>
	<atom:link href="http://shout.setfive.com/tag/pdftotext/feed/" rel="self" type="application/rss+xml" />
	<link>http://shout.setfive.com</link>
	<description></description>
	<lastBuildDate>Wed, 18 Jan 2012 21:09:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Extracting text from PDFs without pdftotext</title>
		<link>http://shout.setfive.com/2009/06/16/extracting-text-from-pdfs-without-pdftotext/</link>
		<comments>http://shout.setfive.com/2009/06/16/extracting-text-from-pdfs-without-pdftotext/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 06:39:26 +0000</pubDate>
		<dc:creator>Ashish Datta</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[pdftotext]]></category>

		<guid isPermaLink="false">http://shout.setfive.com/?p=179</guid>
		<description><![CDATA[For a recent project, I had to extract the text out of a PDF so that I could save it into a database table. Normally, I would have used the popular pdftotext program, but it wasn&#8217;t available in the particular environment I was working in. I contacted support and they advised that the XPDF package [...]]]></description>
			<content:encoded><![CDATA[<p>For a recent project, I had to extract the text out of a PDF so that I could save it into a database table.</p>
<p>Normally, I would have used the popular pdftotext program, but it wasn&#8217;t available in the particular environment I was working in. I contacted support and they advised that the <a href="http://www.foolabs.com/xpdf/">XPDF</a> package has several X Window System dependencies, which is why they had not installed it. Fair enough.</p>
<p>I poked around a bit and found Apache&#8217;s <a href="http://incubator.apache.org/pdfbox/">PDFBox</a> library. I downloaded the package and looked at the examples. Sure enough, there was a program called &#8220;ExtractText&#8221; that did exactly what I wanted.</p>
<p>Using ExtractText is similar to pdftotext &#8211; just pass in the PDF file and the text comes back. Awesome.</p>
<p>Anyway, hats off to <a href="mailto:ben@benlitchfield.com">Ben Litchfield</a>, who wrote the ExtractText example. I rebuilt the ExtractText.java file as a standalone project and packaged it as a JAR.</p>
<p>I&#8217;ve attached the JAR and the Eclipse project in case anyone wants a copy of either.</p>
<p><a href="http://shout.setfive.com/wp-content/uploads/2009/06/ExtractTextSF.jar">The JAR</a></p>
<p><a href="http://shout.setfive.com/wp-content/uploads/2009/06/ExtractText.rar">The Eclipse Project</a> </p>
]]></content:encoded>
			<wfw:commentRss>http://shout.setfive.com/2009/06/16/extracting-text-from-pdfs-without-pdftotext/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

