Tutorial: Create a HTML scraper with PhantomJS and PHP

This simple tutorial will show you how to create a PhantomJS script that will scrape the state/population html table data from http://www.ipl.org/div/stateknow/popchart.html and output it in a PHP application.  For those of you who don’t know about PhantomJS, it’s basically a headless WebKit scriptable with a JavaScript API.

Prerequisites:

1.  Create the PhantomJS Script

The first step is to create a script that will be executed by PhantomJS. This script will do the following:

  • Take in a JSON “configuration” object with the site URL and a CSS selector of the HTML element that contains the target data
  • Load up the page based on the Site URL from the JSON configuration object
  • Include jQuery on the page (so we can use it even if the target site doesn’t have it!)
  • Use jQuery and CSS selector from configuration object to find and alert the html of the target element. You’ll notice on line 37 that we wrap the target element in a paragraph tag then traverse to it in order to pull the entire table html.
  • We can save this file as ‘phantomJsBlogExample.js’
  • One thing to note is that on line 24 below we set a timeout inside the evaluate function to allow for the page to fully load before we call the pullHtmlString function. To learn more about the ins and outs of PhantomJS functions read here http://phantomjs.org/documentation/

2.  Create PHP function to run PhantomJS script and convert output into a SimpleXmlElement Object

Next, we want to create a PHP function that actually executes the above script and converts the html to a SimpleXmlElement object.

  • On line 3 below you’ll construct a “configuration” object that we’ll pass into the PhantomJS script above that will contain the site url and CSS selector
  • Next on line 10 we’ll actually read in the base PhantomJs Script we created in step 1. Notice that we actually make a copy of the script so that we leave the base script intact. This becomes important if you are executing this multiple times in production using different site urls each time.
  • On line 20 we prepend the configuration object onto the copied version of the phantomJS script, make sure you json_encode this so it’s inserted as a proper json object.
  • Next on line 29 we execute the phantomJs script using the PHP exec function and save the output into an $output array.  Each time the PhantomJS script alerts a string, it’s added as an element in this array. Alerted html strings will split out as one line per element in the array. After we get the output from the script we can go ahead and delete the copied version of the script.
  • Starting on line 38, we clean up the $output array a bit, for example when we initially inject jQuery in PhantomJS a line is alerted into the output array which we do not want as it doesn’t represent the actual html data we are scraping. Similarly, want to remove the last element of the $output array where we alert (‘EXIT’) to end the script.
  • Now that it’s cleaned up, we have an array of individual html strings representing our target data. We’ll want to remove the whitespace and also join all the elements into one big html string to use for constructing a SimpleXmlElement on line 49.

3.  Call the function and iterate through the SimpleXmlElement Object to get to the table data

  • Call the function from step 2 making sure to pass in the target site url and CSS selector
  • Now that we have the SimpleXmlObject on line 7 we’ll want to iterate through the rows of the table body and pull out the state name and population table cells. It may help to var_dump the entire SimpleXmlObject to get a sense for what the structure looks like.
  • For purposes of this example we’ll just echo out the state name and population but you could really do anything you wanted with the data at this point (i.e., persist to database etc.)

4.  Final Output

Finally, running the function from step 3 should result in something like this.

Symfony2: Outputting form checkboxes in a hierarchy

Recently when I was working on a client project we had a bunch of permissions which had a hierarchy (or tree structure). For example, you needed Permission 1 to have Permission 1a and Permission 1b. In the examples below lets assume `$choices` is equal to the following:

At first, I used the built in in optgroups of a the select box to output the form, so it was clear what permissions fell where. My form would look similar to:

Multiple select boxes aren’t the easiest to work with as we all know. Also, it isn’t as easy to visually see the difference as the height of the select box could not be long enough to show you what an optgroup’s title is. Instead, I decided to use the checkbox approach. Issue with this, the current Symfony2 form themes don’t output checkboxes in groups or with any visual indication of the hierarchy. I ended up creating my own custom field type so I could customize the way it renders globally via the form themeing. My custom type just always set the choice options to expanded and multiple as true. For the actual rendering, below is what I ended up with.

The above is assuming you are using bootstrap to render your forms as it has those classes. My listless class just sets the ul list style to none. The code should be fairly easy to follow, basically it goes through and any sub-array (an optgroup) it will nest in the list from the previous option. This method does assume that you have the ‘parent’ node before the nested array. I also in the bottom have some javascript that basically makes sure that you can’t check off a sub-group if the parent is not checked. When you first check the parent, it selects all the children. For the example I just put the javascript in there, it uses and id attribute, so you can only have one of these per page. If you were using this globally, I’d recommend tagging the UL with a data attribute and moving the javascript into a global JS file.

Since a picture is worth a thousand words, here is an example of what it looks like working:

permissions-example

Let me know if you have any questions! Happy Friday.

Javascript: Hijacking document.form.submit()

Earlier this week, I was helping a client of ours interface with a 3rd party widget on a site they work with. What the widget basically does is allow the user to input some information which is then POST’ed to another 3rd party site.

What our clients were looking to do was capture the information in the form before it was submitted, process it before the user left the page, and set any cookies on the user if necessary. Simple enough right? Use jQuery to trap the form’s submit event, do the processing dance, and then allow the form to submit normally.

So I implemented the code as described but for some reason the jQuery submit() handler was never being triggered. Perplexed, I looked through the actual widget code and it turns out that the widget was using a <a> tag with an onclick handler which eventually called document.someForm.submit(). Turns out, the jQuery submit() handler won’t trigger when a form is submitted in this fashion.

Thankfully, it’s relatively straightforward to get around this. You just need to override the form element’s submit() function with one of your own and then eventually call the original function once you’re done.

Well thats about it – As always, questions and comments welcome.

Javascript: Using Bootstrap tooltips with jQuery UI’s slider

Earlier today, I was adding a “slider” UI element to a project that was using Twitter Bootstrap as well as jQuery UI. Although they weren’t designed to work together, the two projects generally stay out of each other’s way since their CSS classes are namespaced pretty well.

Since jQuery UI was already loaded I naturally decided to just use the jQuery UI slider control to power the slider. One of the limitations with the jQuery UI slider is that it has no native way to show the current slider value over the slider handle, as a developer you have to display that number somewhere. Fortunately, the control has the event hooks neccesary to make this happen – specifically the slide event which is triggered everytime the slider is moved.

With Bootstrap also loaded, I decided to try and use the tooltip plugin to dynamically display the current value of the slider above the handle. Getting the initual tooltip setup was pretty straightforward. Check it out here.

But the issue is that even with the “slide” event, the Bootstrap tooltip plugin has no exposed method to force a reposition. The only way to get a tooltip to reposition itself is to hide it and then show it again. Obviously, that’s less than ideal since you get a noticible “jump” as the tooltip is hid and then shown again.

With this issue in mind, I decided to take a look at how the Tooltip plugin actually does the positioning. It turns out it’s really simple, the relevant code is on GitHub. Ok, so its easy to reposition them but how do you get the “right” tooltip div incase you have multiple sliders? Looking through the code, the Tooltip plugin actually uses the jQuery.data() function to store config options and additionally stores a reference to the correct div there. Getting a handle to the correct div is as easy as $(“#slider .ui-slider-handle:first”).data(“tooltip”).$tip

Looking at the actual plugin code, it’s simple enough to just copy that out and use it to reposition the tooltips. Check it out in action at http://jsfiddle.net/cqVPM/7/

Anyway, let me know if you run into any issues in the comments.

Custom Templates with jQuery File Upload

Recently I was working on a project which was using the jQuery File Upload Plugin to do multiple file uploads. I needed to show the progress of each upload (in this case just images). Looking through their documentation it shows how to a few ways to do custom templates. By default it uses the Javascript Templates Engine for all of its templating. I wanted to use just a div on the page for my template. Here’s how I ended up doing it.

First My basic markup for the html:

Basically my “#template” div was a hidden div on the page which I used as the photo being uploaded. Now for the javascript:

It ended up being pretty simple and self explanatory. The ‘progress’ function is called each time there is a progress update. You can do more advanced templates using their templating engine, however as I was adapting the code to an existing layout and was on a time constraint this was the route I took.

Hope this saves you sometime if you are looking to just quickly add a progress template for the uploaded images.