Category: Tips n’ Tricks

Recently we’ve been working with one of our clients to build an application for use with AppNexus. The project required a number of different technologies to come together and work in concert. Below I’ll outline how we approached it and the additional challenges we faced.

First came the obvious challenge: how to handle at least 25,000 requests per second. Our usual language of choice is PHP, and we knew it was not a good candidate for this project. Instead we benchmarked a number of other languages and frameworks: Rusty/Nginx/Lua, Go, Scala, and Java. After some testing, Java looked like the best bet for us. We initially loaded up Jetty. We knew it had a bit more baked in than we needed, but it was also the quickest way to get up and running and could be migrated away from fairly easily. The overall idea was to keep the request-parsing logic separate from the business logic. In our initial tests we were able to get around 20,000 requests per second using Jetty, which was good, but we wanted better.

Jetty was great at breaking down incoming HTTP requests into something easy to work with, and it even provided a general statistics package out of the box. However, we didn’t need much heavy lifting on the HTTP side; what we were building required very little complexity with regard to the HTTP protocol. In the end, Jetty was spending too many CPU cycles for what we needed. We looked to Netty next.

Out of the box, Netty is not as friendly as Jetty because it is much lower level. That said, it wasn’t too much work to get Netty up and running and responding to HTTP requests. We ported over most of the business logic from our Jetty code and were off to the races. We did have to add our own statistics layer, as Netty didn’t have an embedded one for what we were looking for. After some fine-tuning, Netty was handling over 40,000 requests per second. This part of the puzzle was solved.

On the database side, we had heard great things about Aerospike in terms of performance and features, so we ended up using it on the backend. When we query Aerospike we set the timeout to 3ms. We get around one or two request timeouts per second, so we time out about 0.0025% of the time. Not too shabby. One of the nice features of Aerospike is the XDR function of the enterprise version. With it we can have multiple Aerospike clusters that all stay in sync with a master cluster. This lets us load our data onto one machine, which isn’t serving requests, and have it replicated to the machines that are.
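As a quick sanity check on that figure, the timeout rate is simply timeouts per second divided by requests per second. The sketch below assumes the roughly 40,000 requests per second quoted elsewhere in this post; the function name is made up for illustration.

```javascript
// Hypothetical helper: percentage of requests that time out.
function timeoutRatePercent(timeoutsPerSecond, requestsPerSecond) {
  return (timeoutsPerSecond / requestsPerSecond) * 100;
}

// Assumed load: roughly 40,000 requests per second.
const requestsPerSecond = 40000;
const low = timeoutRatePercent(1, requestsPerSecond).toFixed(4);
const high = timeoutRatePercent(2, requestsPerSecond).toFixed(4);

console.log(`${low}% to ${high}%`);
```

One timeout per second works out to 0.0025% of requests, and two per second to 0.005%.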

All in all we’ve had a great experience with the Netty and Aerospike integration. We’re able to consistently handle around 40,000 requests per second with an average response time (including network time) of 4ms.

Posted In: General, Tips n' Tricks

This simple tutorial will show you how to create a PhantomJS script that will scrape the state/population html table data from http://www.ipl.org/div/stateknow/popchart.html and output it in a PHP application.  For those of you who don’t know about PhantomJS, it’s basically a headless WebKit scriptable with a JavaScript API.

Prerequisites:

1.  Create the PhantomJS Script

The first step is to create a script that will be executed by PhantomJS. This script will do the following:

  • Take in a JSON “configuration” object with the site URL and a CSS selector of the HTML element that contains the target data
  • Load up the page based on the Site URL from the JSON configuration object
  • Include jQuery on the page (so we can use it even if the target site doesn’t have it!)
  • Use jQuery and CSS selector from configuration object to find and alert the html of the target element. You’ll notice on line 37 that we wrap the target element in a paragraph tag then traverse to it in order to pull the entire table html.
  • We can save this file as ‘phantomJsBlogExample.js’
  • One thing to note is that on line 24 below we set a timeout inside the evaluate function to allow for the page to fully load before we call the pullHtmlString function. To learn more about the ins and outs of PhantomJS functions read here http://phantomjs.org/documentation/

2.  Create PHP function to run PhantomJS script and convert output into a SimpleXmlElement Object

Next, we want to create a PHP function that actually executes the above script and converts the html to a SimpleXmlElement object.

  • On line 3 below you’ll construct a “configuration” object that we’ll pass into the PhantomJS script above that will contain the site url and CSS selector
  • Next on line 10 we’ll actually read in the base PhantomJs Script we created in step 1. Notice that we actually make a copy of the script so that we leave the base script intact. This becomes important if you are executing this multiple times in production using different site urls each time.
  • On line 20 we prepend the configuration object onto the copied version of the PhantomJS script. Make sure you json_encode it so it’s inserted as a proper JSON object.
  • Next on line 29 we execute the phantomJs script using the PHP exec function and save the output into an $output array.  Each time the PhantomJS script alerts a string, it’s added as an element in this array. Alerted html strings will split out as one line per element in the array. After we get the output from the script we can go ahead and delete the copied version of the script.
  • Starting on line 38, we clean up the $output array a bit. For example, when we initially inject jQuery in PhantomJS, a line is alerted into the output array that we don’t want, because it isn’t part of the html data we are scraping. Similarly, we want to remove the last element of the $output array, where we alert (‘EXIT’) to end the script.
  • Now that it’s cleaned up, we have an array of individual html strings representing our target data. We’ll want to remove the whitespace and also join all the elements into one big html string to use for constructing a SimpleXmlElement on line 49.
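
The cleanup described in the last two bullets is not specific to PHP. The sketch below shows the same idea in JavaScript, purely for illustration. It assumes the jQuery-injection notice is the first captured line and the ‘EXIT’ marker is the last; the function name and the sample lines are hypothetical.

```javascript
// Hypothetical cleanup of the lines captured from the PhantomJS run.
// Assumes the first line is the jQuery-injection notice and the last is 'EXIT'.
function cleanOutput(lines) {
  return lines
    .slice(1, -1)                    // drop the first and last lines
    .map((line) => line.trim())      // strip surrounding whitespace
    .filter((line) => line !== '')   // skip lines that were only whitespace
    .join('');                       // one html string for the XML parser
}

// Made-up sample output, one alerted line per array element.
const sample = [
  'jQuery loaded',
  '  <table>',
  '    <tr><td>State A</td><td>123</td></tr>',
  '  </table>',
  'EXIT',
];

const html = cleanOutput(sample);
console.log(html);
```

If the notice can appear anywhere other than the first line, filtering on its exact text would be a safer choice than slicing by position.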

3.  Call the function and iterate through the SimpleXmlElement Object to get to the table data

  • Call the function from step 2 making sure to pass in the target site url and CSS selector
  • Now that we have the SimpleXmlElement on line 7, we’ll want to iterate through the rows of the table body and pull out the state name and population table cells. It may help to var_dump the entire SimpleXmlElement to get a sense of what the structure looks like.
  • For purposes of this example we’ll just echo out the state name and population, but you could do anything you wanted with the data at this point (e.g., persist it to a database).

4.  Final Output

Finally, running the function from step 3 should result in something like this.

Posted In: Javascript, jQuery, PhantomJS, PHP, Tips n' Tricks

Over the past few weeks I’ve been working on a Canvas-based side project (more on that soon) that involved cutting a mask out of a source image and placing it on a Canvas. In Photoshop parlance, this would be similar to creating a clipping mask and then using it to extract a path from the image into a new layer. So visually, we’re looking to achieve something similar to:

At face value, it looks like doing this with Canvas is pretty straightforward using the getImageData function. Unfortunately, if you look at the parameters that function accepts, it only supports slicing out rectangular areas, which isn’t what we’re looking to do. Luckily, if you look a bit further in the docs, it turns out Canvas supports setting globalCompositeOperation, which lets you control how image data is drawn onto the canvas. The idea is to draw the mask on a canvas, turn on the “source-in” setting, and then draw the image that you want to generate the slice of. The big thing to note here is that putImageData isn’t affected by the globalCompositeOperation setting, so you have to use drawImage to draw both the mask and the image data.

So concretely how do you do this? Well check it out:
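
A minimal sketch of the approach, assuming a 2D canvas context plus a mask and a source image that are already loaded. The function name and parameters are hypothetical, and both images are drawn at the full canvas size for simplicity.

```javascript
// Draws `image` clipped to the opaque pixels of `mask` on a 2D context.
// `mask` and `image` can be anything drawImage accepts (img, canvas, ...).
function drawMasked(ctx, mask, image, width, height) {
  ctx.save();
  ctx.clearRect(0, 0, width, height);

  // Draw the mask normally; its opaque pixels define the cut-out shape.
  ctx.globalCompositeOperation = 'source-over';
  ctx.drawImage(mask, 0, 0, width, height);

  // With "source-in", new pixels are kept only where they overlap
  // what is already on the canvas, i.e. the mask.
  ctx.globalCompositeOperation = 'source-in';
  ctx.drawImage(image, 0, 0, width, height);

  // Restore the previous drawing state, including the composite mode.
  ctx.restore();
}

// Browser usage, with hypothetical element variables:
// drawMasked(canvas.getContext('2d'), maskImg, photoImg, canvas.width, canvas.height);
```

Wrapping the calls in save and restore keeps the “source-in” mode from leaking into whatever is drawn on the canvas afterwards.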

The code is running over at http://symf.setfive.com/canvas_puzzle/grass.html if you want to see it in action.

Anyway, happy canvasing!

Posted In: Javascript, Tips n' Tricks

Recently I was working on a project where part of the work was data exports. On the surface, exports are quick and easy: query the database, put the results into the export format, and send them over to the user. However, as a data set grows, exports become more complicated. Processing them in real time no longer works because the export takes too long or uses too much memory. This is why I’ll almost always use a background process (notified via Gearman) to process the data and notify the user when the export is ready for download. In a separate background process you can set a different memory limit and not worry about a request timeout. I also suggest not using Doctrine’s objects for the export and instead getting the query results back in array format (via getArrayResult). Doctrine objects are great to work with, but they are expensive to populate in terms of both time and memory. If you don’t need the object graph, results in array format are much quicker and use less memory.

On this specific export I was exporting an entity that had a foreign key to another table, and that key needed to be in the export. I didn’t want to join over the entire data set, as it was unnecessary. As an example, take a project that has a “created by” user as a relation. If I simply did the following:

I’d end up with an array that had all the project columns except those defined as foreign keys. This means my export couldn’t output the “Created by user id”, as it wasn’t included in the array. It turns out that Doctrine already has this exact situation accounted for. To include the FK columns, set the Query::HINT_INCLUDE_META_COLUMNS hint to true on the query (via setHint). The updated query code would look similar to:

Now you can include the foreign key columns without doing any joins on a query that returns an array result set.

Posted In: Doctrine, General, PHP, Symfony, Tips n' Tricks

Because sometimes it’s just fun to make something absurd: http://taken.setfive.com/.

If you aren’t familiar, the Taken movie series (http://www.imdb.com/title/tt0936501/) is a set of ridiculous action dramas that have acquired a strong cult following over the years (deservedly or not). All of the movies basically involve someone close to a retired CIA agent, played by Liam Neeson, being “Taken” and Neeson subsequently raining hell on anyone involved in wronging him. As a result of the series, Liam Neeson has made a formidable run at Kiefer Sutherland’s title of today’s Chuck Norris. Neeson employs his signature throat punch in place of Norris’ roundhouse kick.

Our team seems to share a fondness for laughable action movies and actually took the afternoon off to watch the latest film, Taken 3, when it debuted a couple of weeks ago. With a little bit of vacation downtime prior to the debut and the urge to develop a web application as preposterous as the film series, we came up with the idea of creating Taken Audio MadLibs (http://taken.setfive.com/). For those of you not familiar, MadLibs (http://en.wikipedia.org/wiki/Mad_Libs) is a phrasal template word game where one player prompts others for a list of words to substitute for blanks in a story, before reading the (often comical or nonsensical) story aloud.

So our take on the MadLibs program works something like this:

-There are four different “story lines” which all involve dialog between Liam Neeson’s character and a second character
-Neeson’s lines are actual lines from the Taken movies
-The other character’s lines are transformed into audio clips by Google’s unofficial Text To Speech (TTS) API and are based on the words you enter into a simple webform
-The lines are combined together into a hilarious back and forth dialog between Neeson and your character

Technically, there really isn’t too much “magic” going on with the program. It’s built on the Symfony2 framework and employs a simple one-page parallax scrolling design for the web forms. Once the user submits the web form for their story, we send it off to the controller, where the user-entered words are inserted into a set of templated lines. These text lines are then sent off to the Google TTS API, which returns an .mp3 file with the audio representation of the text. We then splice the Google TTS .mp3 lines together with the Taken audio files that we have stored on the server and combine them into one audio file. The audio file is returned to the UI in the form of an HTML5 audio tag where the user can play or download the file. We also give the user the option of emailing the audio file to a friend.

There were two problems that we ran into worth mentioning for those of you playing around with Google’s TTS api or combining multiple audio files of different formats.

1. Google’s TTS API only accepts 100 characters per call so you’ll have to split a given line or sentence into 100 character chunks and then combine the multiple mp3s back into one. This isn’t too difficult to do but worth mentioning if you ever plan to play around with this API.

2. We did run into a bit of trouble trying to combine the .mp3 files that Google returned with the Taken audio .mp3 files we got from (http://www.soundboard.com/sb/Taken_sound_clips). The problem is that the frame rate of the Google .mp3 files is different from that of the Taken files, so when we tried to combine them into one file some audio players would not render the result. To get around these issues we took the following steps to combine and massage the audio files via a couple of different server-side Linux-based audio programs (avconv and oggCat):

  • combine the “chunked” Google .mp3s into one .ogg representing a single line in the dialog: “avconv -i [mp3_input_file] -acodec libvorbis -q:a 5 [ogg_output_file]”
  • convert the Taken “line” .mp3 files to individual .ogg files: “avconv -i [mp3_input_file] -acodec libvorbis -q:a 5 [ogg_output_file]”
  • combine the final Google “line” .ogg files with the Taken “line” .ogg files: “oggCat [ogg_output_file] [ogg_input_file_1] [ogg_input_file_2] [ogg_input_file_3] ….”
  • convert the final .ogg file to .wav so all browser types will play nice (shakes fist at safari): “avconv -i [ogg_input_file] [wav_output_file]”
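
The 100-character limit from the first problem above means each line has to be split before it is sent to the TTS API. Here is a minimal sketch of one way to do that, breaking on spaces so words stay intact. The function name is hypothetical, and the hard split for over-long words is an assumption about how to handle that edge case.

```javascript
// Splits text into chunks of at most `limit` characters, breaking on spaces.
// A single word longer than `limit` is hard-split as a fallback.
function chunkText(text, limit = 100) {
  const chunks = [];
  let current = '';

  for (const word of text.split(/\s+/).filter(Boolean)) {
    let rest = word;

    // A word that can never fit in one chunk is cut into limit-sized pieces.
    while (rest.length > limit) {
      if (current) {
        chunks.push(current);
        current = '';
      }
      chunks.push(rest.slice(0, limit));
      rest = rest.slice(limit);
    }

    // Append the word if it fits, otherwise start a new chunk with it.
    const candidate = current ? current + ' ' + rest : rest;
    if (candidate.length <= limit) {
      current = candidate;
    } else {
      chunks.push(current);
      current = rest;
    }
  }

  if (current) chunks.push(current);
  return chunks;
}

// Example: a 249-character line made of 50 short words.
const line = 'word '.repeat(50).trim();
const chunks = chunkText(line);
console.log(chunks.map((chunk) => chunk.length));
```

Each chunk would then be sent as its own TTS request, and the resulting audio files combined in the same order.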

Anyways, if you haven’t checked out the final program yet it can be seen here http://taken.setfive.com/ and if you have any questions, comments, or feedback feel free to leave them below.

Posted In: Tips n' Tricks