Tutorial: Create a HTML scraper with PhantomJS and PHP

This simple tutorial will show you how to create a PhantomJS script that will scrape the state/population html table data from http://www.ipl.org/div/stateknow/popchart.html and output it in a PHP application.  For those of you who don’t know about PhantomJS, it’s basically a headless WebKit scriptable with a JavaScript API.


1.  Create the PhantomJS Script

The first step is to create a script that will be executed by PhantomJS. This script will do the following:

  • Take in a JSON “configuration” object with the site URL and a CSS selector of the HTML element that contains the target data
  • Load up the page based on the Site URL from the JSON configuration object
  • Include jQuery on the page (so we can use it even if the target site doesn’t have it!)
  • Use jQuery and CSS selector from configuration object to find and alert the html of the target element. You’ll notice on line 37 that we wrap the target element in a paragraph tag then traverse to it in order to pull the entire table html.
  • We can save this file as ‘phantomJsBlogExample.js’
  • One thing to note is that on line 24 below we set a timeout inside the evaluate function to allow for the page to fully load before we call the pullHtmlString function. To learn more about the ins and outs of PhantomJS functions read here http://phantomjs.org/documentation/

2.  Create PHP function to run PhantomJS script and convert output into a SimpleXmlElement Object

Next, we want to create a PHP function that actually executes the above script and converts the html to a SimpleXmlElement object.

  • On line 3 below you’ll construct a “configuration” object that we’ll pass into the PhantomJS script above that will contain the site url and CSS selector
  • Next on line 10 we’ll actually read in the base PhantomJs Script we created in step 1. Notice that we actually make a copy of the script so that we leave the base script intact. This becomes important if you are executing this multiple times in production using different site urls each time.
  • On line 20 we prepend the configuration object onto the copied version of the phantomJS script, make sure you json_encode this so it’s inserted as a proper json object.
  • Next on line 29 we execute the phantomJs script using the PHP exec function and save the output into an $output array.  Each time the PhantomJS script alerts a string, it’s added as an element in this array. Alerted html strings will split out as one line per element in the array. After we get the output from the script we can go ahead and delete the copied version of the script.
  • Starting on line 38, we clean up the $output array a bit, for example when we initially inject jQuery in PhantomJS a line is alerted into the output array which we do not want as it doesn’t represent the actual html data we are scraping. Similarly, want to remove the last element of the $output array where we alert (‘EXIT’) to end the script.
  • Now that it’s cleaned up, we have an array of individual html strings representing our target data. We’ll want to remove the whitespace and also join all the elements into one big html string to use for constructing a SimpleXmlElement on line 49.

3.  Call the function and iterate through the SimpleXmlElement Object to get to the table data

  • Call the function from step 2 making sure to pass in the target site url and CSS selector
  • Now that we have the SimpleXmlObject on line 7 we’ll want to iterate through the rows of the table body and pull out the state name and population table cells. It may help to var_dump the entire SimpleXmlObject to get a sense for what the structure looks like.
  • For purposes of this example we’ll just echo out the state name and population but you could really do anything you wanted with the data at this point (i.e., persist to database etc.)

4.  Final Output

Finally, running the function from step 3 should result in something like this.

Javascript: Using Canvas to cut an area of an image

Over the few weeks I’ve been working on a Canvas based side project (more on that soon) that involved cutting a mask out of a source image and placing it on a Canvas. In Photoshop parlance, this would be similar to creating a clipping mask and then using it to extract a path from the image into a new layer. So visually, we’re looking to achieve something similar to:

At face value, it looks like doing this with Canvas is pretty straightforward using the getImageData function. Unfortunately, if you look at the parameters that function accepts it’ll only support slicing out rectangular areas which isn’t what we’re looking to do. Luckily, if you look a bit further in the docs it turns out Canvas supports setting globalCompositeOperation which allows you to control how image data is drawn onto the canvas. The idea is to draw the mask on a canvas, turn on the “source-in” setting, and then draw on the image that you want to generate the slice off. The big thing to note here is that putImageData isn’t effected by the globalCompositeOperation setting so you have to use drawImage to draw the mask and image data.

So concretely how do you do this? Well check it out:

The code is running over at http://symf.setfive.com/canvas_puzzle/grass.html if you want to see it in action.

Anyway, happy canvasing!

Symfony2: Outputting form checkboxes in a hierarchy

Recently when I was working on a client project we had a bunch of permissions which had a hierarchy (or tree structure). For example, you needed Permission 1 to have Permission 1a and Permission 1b. In the examples below lets assume `$choices` is equal to the following:

At first, I used the built in in optgroups of a the select box to output the form, so it was clear what permissions fell where. My form would look similar to:

Multiple select boxes aren’t the easiest to work with as we all know. Also, it isn’t as easy to visually see the difference as the height of the select box could not be long enough to show you what an optgroup’s title is. Instead, I decided to use the checkbox approach. Issue with this, the current Symfony2 form themes don’t output checkboxes in groups or with any visual indication of the hierarchy. I ended up creating my own custom field type so I could customize the way it renders globally via the form themeing. My custom type just always set the choice options to expanded and multiple as true. For the actual rendering, below is what I ended up with.

The above is assuming you are using bootstrap to render your forms as it has those classes. My listless class just sets the ul list style to none. The code should be fairly easy to follow, basically it goes through and any sub-array (an optgroup) it will nest in the list from the previous option. This method does assume that you have the ‘parent’ node before the nested array. I also in the bottom have some javascript that basically makes sure that you can’t check off a sub-group if the parent is not checked. When you first check the parent, it selects all the children. For the example I just put the javascript in there, it uses and id attribute, so you can only have one of these per page. If you were using this globally, I’d recommend tagging the UL with a data attribute and moving the javascript into a global JS file.

Since a picture is worth a thousand words, here is an example of what it looks like working:


Let me know if you have any questions! Happy Friday.

JavaScript: Using ChipmunkJS to build ping pong

A couple of weeks ago, we picked up a RaspberryPi for the office and started brainstorming ideas for cool hacks for it. One of the ideas that was floated was using the Pi as a server for some sort of multiplayer JavaScript game. I’ve actually never written a game before so I figured it would be an interesting project. We ended up trying to use node-qt for graphics along with ws to receive input via websockets but the refresh rate of Qt was just to low. A story for another time.

Anyway, so back to ping pong. One of the things we wanted to avoid was writing our own code to manage the game objects, we really wanted to use a game engine for this. The only caveat we had was that the engine had to be UI agnostic since we were planning to output graphics on Qt instead of a HTML5 Canvas. After looking around, most of the JavaScript game engines have strong dependencies on Canvas so we started looking at other options. The two strongest options were Box2JS and ChipmunkJS. We decided to go with ChipmunkJS in part because it was handwritten while Box2JS was automatically converted from ActionScript so the resulting code is harder to follow.

Getting ChipmunkJS setup is relatively painless, just include the file and you’re off to the races. Since we were looking to eventually use Qt, we tried our best to cleanly separate the Chipmunk code from what we’d be using to draw the objects on the screen. Because of that, it’s pretty easy to follow what’s going on with the Chipmunk code. If you take a look at cpPong.js it basically initializes the physics space and then starts the game.

With the physics space setup, the next step is setting up the drawing. Following the Chipmunk demo’s lead, I added specific “draw” methods to the Prototypes of each shape. Doing this, allows you to just iterate over each shape in the scene and call “draw” to have it appear on Canvas or Qt. Check out pong.html to see how we added the draw methods to the shapes. And that’s about it! The only other interesting part is using “requestAnimationFrame” to avoid using setTimeout to update the scene.

There’s a live demo running here and of course the code on GitHub.

Any questions or comments welcome!

Phonegap: Fixing black bars on iOS7/iPhone5

Last week, we were using Phonegap Build to build an iOS IPA for a project an ran into an odd issue. When we launched the app on an iPhone 5 running iOS7 black bars appeared at the top and bottom of the screen. On top of that, when launching the app we were observing the same issue with the splash screen.

In typical Phonegap fashion, Googling for people suffering from similar issues brought back dozens of results across several versions each with a different root cause and solution. One of the first promising leads we noticed was this comment in the top of the default Phonegap template:

Unfortunately, that comment seems either be invalid or the issue has since been resolved since removing those meta attributes had no effect.

As it turns out, the issue is that the “config.xml” file created by the default Phonegap “Hello world” project is missing an entry for the iPhone5’s screen size. Oddly enough, there’s actually a splash screen image of the correct height in the demo project but its not referenced in the config file. To resolve this issue, you just need to add this line to your config.xml:

Just make sure that you have a file named “res/screen/ios/screen-iphone-portrait-568h-2x.png” where the rest of your splash screens are and you should be good to go.