Doctrine2 QueryBuilder Executable SQL Without Running The Query

On one of the projects I'm working on I ran into the following problem: I needed to create an aggregate temporary table in the database from a few different queries while still using Doctrine2. I needed to aggregate the results in the database rather than in memory, as the result set could be very large and cause the PHP process to run out of memory. The reason I still wanted Doctrine to build the base queries is that the application passes around a QueryBuilder object so that restrictions can be added to the query outside of the current function; every query in the application goes through this process for security purposes.

After looking around a bit, it was clear that Doctrine did not support (and shouldn't support) what I was trying to do. My next step was to figure out how to get an executable query from Doctrine2 without ever running it. Doctrine2 has a built-in SQL logger interface which basically lets you listen for executed queries and see what the actual SQL and parameters were for each one. The problem was that I didn't want to actually execute the query I had built in Doctrine; I just wanted the SQL that would be executed via PDO. After digging through the code a bit further I found the routines that Doctrine uses to build the query and parameters for PDO to execute; however, those methods are all private and internal. I came up with the following class, which takes a Doctrine Query and returns the SQL statement, parameters, and parameter types that can be used to execute it via PDO.

<?php

namespace Example\Doctrine\Util;

use Doctrine\ORM\Query;

class QueryUtils
{
    /**
     * @param Query $query
     * @return array An array with 3 indexes: 'sql' the SQL statement with parameters as ?,
     *               'params' the ordered parameters, and 'paramTypes' the type of each parameter.
     */
    public static function getRunnableQueryAndParametersForQuery(Query $query)
    {
        $sql = $query->getSQL();

        // The parser result (which holds the parameter mappings and the result
        // set mapping) is private, so we have to pull it out via reflection.
        $c = new \ReflectionClass('Doctrine\ORM\Query');
        $parser = $c->getProperty('_parserResult');
        $parser->setAccessible(true);
        /** @var \Doctrine\ORM\Query\ParserResult $parser */
        $parser = $parser->getValue($query);
        $resultSet = $parser->getResultSetMapping();

        // Change the aliases back to what was originally specified in the QueryBuilder.
        $sql = preg_replace_callback('/AS\s([a-zA-Z0-9_]+)/', function ($matches) use ($resultSet) {
            $ret = 'AS ';
            if ($resultSet->isScalarResult($matches[1])) {
                $ret .= $resultSet->getScalarAlias($matches[1]);
            } else {
                $ret .= $matches[1];
            }
            return $ret;
        }, $sql);

        // processParameterMappings() is also private; it flattens the query's
        // parameters into the ordered arrays PDO expects.
        $m = $c->getMethod('processParameterMappings');
        $m->setAccessible(true);
        list($params, $types) = $m->invoke($query, $parser->getParameterMappings());

        return ['sql' => $sql, 'params' => $params, 'paramTypes' => $types];
    }
}
<?php
// Example Usage

use Example\Doctrine\Util\QueryUtils;

// Get a Query Builder
$qb = $this->getDoctrine()
    ->getRepository('AppBundle:Example')
    ->getQueryBuilderForParams($params);

$queryInfo = QueryUtils::getRunnableQueryAndParametersForQuery($qb->getQuery());

// Use the query to insert into a temp table.
$this->getDoctrine()
    ->getManager('default')
    ->getConnection()
    ->executeQuery(
        "INSERT INTO my_temp_table (col1,col2,col3,col4) " . $queryInfo['sql'],
        $queryInfo['params'],
        $queryInfo['paramTypes']
    );

In the example usage above I take a query builder, get the runnable query, and then insert its results into my temporary table. In my case I had about 3-4 of these types of statements.

If you look at the QueryUtils::getRunnableQueryAndParametersForQuery function, it does a number of things.

  • First, it uses reflection to access private members of the Query. This breaks a lot of programming principles, and Doctrine could change the inner workings of the Query class and break this class. It's not good practice to flip private members public, as they are generally private for a reason.
  • Second, Doctrine re-aliases any alias you give it in your select. For example, if you do "SELECT u.myField AS my_field", Doctrine may re-alias that to "my_field_0". This makes it difficult to read specific columns out of the results without going back through Doctrine. This class flips the aliases back to your original aliases, so you can reference 'my_field', for example (see the sketch after this list).
  • Third, it returns an array of parameters and their types. The Doctrine Connection class uses these arrays to execute the query via PDO. I did not want to re-implement the mapping of Doctrine's parameters and types to PDO, so I opted to pass them back through the Doctrine Connection class.
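
To make the alias flipping concrete, here's a rough before-and-after (the entity and field names are hypothetical):

// DQL built through the QueryBuilder:
//   SELECT u.myField AS my_field FROM AppBundle:Example u
// SQL generated by Doctrine, with the column re-aliased:
//   SELECT e0_.my_field AS my_field_0 FROM example e0_
// SQL after QueryUtils::getRunnableQueryAndParametersForQuery() flips the alias back:
//   SELECT e0_.my_field AS my_field FROM example e0_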

Overall this was the best solution I could find at the time for what I was trying to do.  If I had been OK with running the query first, capturing the actual SQL via an SQL logger would have been the proper route to go; however, I did not want to run the query.

Hope this helps if you find yourself in a similar situation!

S3Grep – Searching S3 Files and Buckets

On a project we were working on recently it appeared that we had data coming into our Extract, Transform, Load (ETL) processes which should have been filtered out. In this particular case the imported files would exist for at most 7 days, and on any given day tens of thousands of files would be created and imported. This made it difficult to trace down whether something inside our ETL had gone awry or whether we were being fed bad data. Furthermore, as the files were always deleted after importing, we didn't keep track of where a data point came from.

Instead of updating our ETL process to track where a specific piece of data originated, we wanted to basically 'grep' the files in S3. After looking around it didn't appear that anyone had built a "grep for S3", so we built one. The reason we didn't simply download the files locally and then process them one at a time is that it would take forever to transfer and then grep each one sequentially. Instead we wanted to do the search in parallel without holding the entire files on local disk.
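
Conceptually the search looks something like the sketch below: a single-threaded PHP approximation using the AWS SDK's S3 stream wrapper (the actual tool is Java and spreads the per-file work across a thread pool). The credentials, bucket, and search term are placeholders.

<?php
// Rough sketch of the streaming "grep for S3" idea: list every object in a
// bucket and scan each one as a stream instead of downloading it to disk.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'version' => 'latest',
    'region' => 'us-east-1',
    'credentials' => ['key' => 'XXXX', 'secret' => 'YYYY'],
]);
$s3->registerStreamWrapper(); // allows fopen() on s3:// URLs

$bucket = 'my-bucket';
$term = 'my_search';

foreach ($s3->getIterator('ListObjects', ['Bucket' => $bucket]) as $object) {
    $handle = fopen("s3://{$bucket}/{$object['Key']}", 'r');
    $lineNumber = 0;
    while (($line = fgets($handle)) !== false) {
        $lineNumber++;
        if (strpos($line, $term) !== false) {
            // Same FILE:LINE_NUMBER:line format the tool logs.
            echo "{$object['Key']}:{$lineNumber}:" . trim($line) . "\n";
        }
    }
    fclose($handle);
}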

With this we came up with our simple S3Grep Java app (a pre-built jar is located in the releases) which will search all files in a specific bucket for a specific string. It currently supports both regex and plain-text search strings. You can specify how many threads you want it to use to process the files; by default it will use the same number of threads as CPUs on your machine. It utilizes the S3 Java adapter to read each file as a stream rather than transferring it in full and then reading it from disk. Using the tool is very simple:

java -jar s3-grep-1.0-SNAPSHOT.jar s3grep.properties

The s3grep.properties file is a config file where you set up what you are searching for. An example:

s3.access_key=XXXX
s3.secret_key=YYYY
s3.bucket=my-bucket
search_term=my_search
is_regex_search=false
log_level=INFO
logger_pattern=%d{dd MMM yyyy HH:mm:ss} [%p] %m%n

For the most part this is self-explanatory. The log level defaults to INFO; however, if you specify DEBUG it will output more information, such as which file it is currently checking. The logger_pattern parameter defaults to "%d{dd MMM yyyy HH:mm:ss} [%p] %m%n" and can be any pattern you want. For more information on the formatting, see the PatternLayout documentation.

The default output format would look something like this:

22 Sep 2016 09:59:18 [INFO] my-logs/2016/09/14/23/MyApp.1473894000059:15674:abcefawed27be2305c30049c48552f8c46673340,2016-09-14 19:59:11
22 Sep 2016 09:59:18 [INFO] my-logs/2016/09/14/23/MyApp.1473894000059:15675:efwefwef8ce2b4bca75d8e00b8100815d9e2838b,2016-09-14 19:59:11
22 Sep 2016 09:59:18 [INFO] my-logs/2016/09/14/23/MyApp.1473894000059:15679:8b0cdwewewewed0831b4aa0e023f4874fc0f5799,2016-09-14 19:59:11
22 Sep 2016 09:59:18 [INFO] my-logs/2016/09/14/23/MyApp.1473894000059:15680:c5b4d66b9c30eeeeeeeea5b3d6b566877b2216dd,2016-09-14 19:59:11

If you want less verbose output, with more of just the log lines, you can update the logger_pattern to just %m%n and end up with something similar to:

my-logs/2016/09/14/23/MyApp.1473894000059:15674:abcefawed27be2305c30049c48552f8c46673340,2016-09-14 19:59:11
my-logs/2016/09/14/23/MyApp.1473894000059:15675:efwefwef8ce2b4bca75d8e00b8100815d9e2838b,2016-09-14 19:59:11
my-logs/2016/09/14/23/MyApp.1473894000059:15679:8b0cdwewewewed0831b4aa0e023f4874fc0f5799,2016-09-14 19:59:11
my-logs/2016/09/14/23/MyApp.1473894000059:15680:c5b4d66b9c30eeeeeeeea5b3d6b566877b2216dd,2016-09-14 19:59:11

The format of the output is FILE:LINE_NUMBER:matching_string.

Anyways, hope this helps if you are trying to hunt down which file contains a text string in your S3 buckets. Let us know if you have any questions or if we can help!

Making Doctrine and Symfony Logged Queries Runnable

On many of our projects we use Gearman to do background processing.  One of the problems with doing things in the background is that the web debug toolbar isn't available to help with debugging problems, including queries.  Normally when you want to see your queries you can look at the debug toolbar and quickly get a runnable version of the query.  However, when it's running in the background, you have to look at the application logs to see what the query was.  The logs don't contain a runnable format of the query; for example, they may look like this:

SELECT
t0.username AS username1, t0.username_canonical AS username_canonical2, t0.email AS email3,
t0.email_canonical AS email_canonical4, t0.enabled AS enabled5, t0.salt AS salt6, t0.password AS password7,
t0.last_login AS last_login8, t0.locked AS locked9, t0.expired AS expired10, t0.expires_at AS expires_at11,
t0.confirmation_token AS confirmation_token12, t0.password_requested_at AS password_requested_at13, t0.roles AS roles14,
t0.credentials_expired AS credentials_expired15, t0.credentials_expire_at AS credentials_expire_at16, t0.id AS id17,
t0.first_name AS first_name18, t0.last_name AS last_name19
FROM app_user t0
WHERE t0.id = ? LIMIT 1 [1] []

The problem is you can't quickly take that to your database and run it to see the results. Plugging in the parameters is easy enough, but it takes time. I decided to quickly whip up a script that will take a query logged like the one above and convert it into a runnable format. I've posted it over at http://code.setfive.com/doctrine-query-log-converter/ . Hopefully this will save you some time when you are debugging your background processes.
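
The core of the conversion is simple enough to sketch out. Below is a minimal PHP illustration of the idea, not the actual code behind the linked tool; it assumes the logged line ends with the bracketed parameter and type lists, and it glosses over edge cases like string parameters that contain commas:

<?php
// Inline the logged parameters (the trailing "[1] []") back into the SQL's
// ? placeholders so the query can be pasted straight into the database.
function makeRunnable($loggedQuery)
{
    // Logged format: "<SQL with ? placeholders> [param, param, ...] [types]"
    if (!preg_match('/^(.*?)\s*\[(.*)\]\s*\[(.*)\]$/s', $loggedQuery, $m)) {
        return $loggedQuery; // no parameter suffix, nothing to substitute
    }

    $sql = $m[1];
    $params = $m[2] === '' ? array() : array_map('trim', explode(',', $m[2]));

    foreach ($params as $param) {
        // Quote anything that isn't purely numeric.
        $value = is_numeric($param) ? $param : "'" . addslashes($param) . "'";
        $pos = strpos($sql, '?');
        if ($pos !== false) {
            $sql = substr_replace($sql, $value, $pos, 1);
        }
    }

    return $sql . ';';
}

echo makeRunnable('SELECT * FROM app_user t0 WHERE t0.id = ? LIMIT 1 [1] []');
// => SELECT * FROM app_user t0 WHERE t0.id = 1 LIMIT 1;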

It should work with both Doctrine 1.x/symfony 1.x and Doctrine 2.x/Symfony 2.x. If you find any issues with it, let me know.

Good luck debugging!

High Performance With Netty and Aerospike

Recently we've been working with one of our clients to build an application for use with AppNexus.  We were faced with a challenge which required a bunch of different technologies to all come together and work together.  Below I'll try to lay out how we approached it and what additional challenges we faced.

First came the obvious challenge: how to handle at least 25,000 requests per second.  Our usual language of choice is PHP, and we knew it was not a good candidate for the project.  Instead we wanted to benchmark a number of other languages and frameworks.  We looked at Rusty/Nginx/Lua, Go, Scala, and Java.  After some testing it appeared that Java was the best bet for us.  We initially loaded up Jetty.  We knew that this had a bit more baked in than we needed, but it was also the quickest way to get up and running and could be migrated away from fairly easily.  The overall idea was to keep the request-parsing logic separate from the business logic.  In our initial tests we were able to get around 20,000 requests a second using Jetty, which was good, but we wanted better.

Jetty was great at breaking down the incoming HTTP requests into something easy to work with; it even provided an out-of-the-box general statistics package.  However, we didn't need much heavy lifting on the HTTP side; what we were building required very little complexity with regards to the HTTP protocol.  Jetty ended up spending too many CPU cycles for what we needed, so we looked to Netty next.

Netty out of the box is not as friendly as Jetty, as it is much lower level.  That said, it wasn't too much work to get Netty up and running and responding to HTTP requests.  We ported over most of the business logic from our Jetty code and were off to the races.  We did have to add our own statistics layer, as Netty didn't have one baked in that did what we were looking for.  After some fine-tuning, with Netty we were able to handle over 40,000 requests per second.  This part of the puzzle was solved.

On the DB side we had heard great things about Aerospike in terms of performance and some of its features, so we ended up using it on the backend.  When we query Aerospike we have the timeout set at 3ms.  We get around one or two request timeouts per second, or a timeout about 0.0025% of the time; not too shabby.  One of the nice features of Aerospike is the XDR function of the enterprise version.  With this we can have multiple Aerospike clusters which all stay in sync with a master cluster.  This lets us load our data onto one machine, which isn't handling all the requests, and have it replicated to the machines which are.

All in all we’ve had a great experience with the Netty and Aerospike integration.  We’re able to consistently handle around 40,000 requests a second with the average response time (including network time) of 4ms.

Tutorial: Create an HTML scraper with PhantomJS and PHP

This simple tutorial will show you how to create a PhantomJS script that will scrape the state/population HTML table data from http://www.ipl.org/div/stateknow/popchart.html and output it in a PHP application.  For those of you who don't know about PhantomJS, it's basically a headless WebKit browser scriptable with a JavaScript API.

Prerequisites: you'll need PhantomJS installed and available on your path, along with a working PHP environment to run the scraper from.

1.  Create the PhantomJS Script

The first step is to create a script that will be executed by PhantomJS. This script will do the following:

  • Take in a JSON “configuration” object with the site URL and a CSS selector of the HTML element that contains the target data
  • Load up the page based on the site URL from the JSON configuration object
  • Include jQuery on the page (so we can use it even if the target site doesn’t have it!)
  • Use jQuery and the CSS selector from the configuration object to find and alert the HTML of the target element. You’ll notice in the pullHtmlString function that we wrap the target element in a paragraph tag and then traverse to it in order to pull the entire table HTML.
  • We can save this file as ‘phantomJsBlogExample.js’
  • One thing to note is that inside the evaluate function below we set a timeout to give the page time to fully load before we call the pullHtmlString function. To learn more about the ins and outs of PhantomJS functions read here http://phantomjs.org/documentation/ . The config variable the script references is prepended onto the file by the PHP function in step 2; an example of what that prepended line looks like is shown right after this list.
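
For reference, after the PHP function in step 2 prepends the configuration (here with values matching the example in step 3), the first line of the copied script would look something like this:

var config = {"selector":"table.sk_popcharttable","url":"http://www.ipl.org/div/stateknow/popchart.html"};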

// 'config' (the site URL and CSS selector) is prepended onto this file by
// the PHP function in step 2 before PhantomJS runs it.
var page = require('webpage').create();

page.onError = function (msg, trace) {
    phantom.exit();
};

// Alerted strings bubble up here; they're echoed for the PHP side to capture,
// and the special "EXIT" message ends the script.
page.onAlert = function (msg) {
    console.log(msg);
    if (msg == "EXIT") {
        phantom.exit();
    }
};

page.open(config.url, function (status) {
    page.includeJs('https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js', function () {
        page.evaluate(function (config) {
            // page.evaluate runs sandboxed in the page's own context, so
            // pullHtmlString has to be defined in here to be callable.
            function pullHtmlString(config) {
                // Wrap the target element in a <p> and traverse back to it so
                // we alert the element's entire HTML, then signal we're done.
                alert($(config.selector).wrap('<p/>').parent().html());
                alert("EXIT");
            }
            // Give the page time to fully load before pulling the HTML.
            window.setTimeout(function () {
                setInterval(function () {
                    pullHtmlString(config);
                }, 2000);
            }, 1);
        }, config);
    });
});

2.  Create PHP function to run PhantomJS script and convert output into a SimpleXmlElement Object

Next, we want to create a PHP function that actually executes the above script and converts the html to a SimpleXmlElement object.

  • In the function below you’ll construct a “configuration” object containing the site URL and CSS selector that we’ll pass into the PhantomJS script above
  • Next we read in the base PhantomJS script we created in step 1. Notice that we actually make a copy of the script so that we leave the base script intact. This becomes important if you are executing this multiple times in production using different site URLs each time.
  • We then prepend the configuration object onto the copied version of the PhantomJS script; make sure you json_encode this so it’s inserted as a proper JSON object.
  • Next we execute the PhantomJS script using the PHP exec function and save the output into an $output array.  Each time the PhantomJS script alerts a string, it’s added as an element in this array; alerted HTML strings are split out as one line per element. After we get the output from the script we can go ahead and delete the copied version of the script.
  • Then we clean up the $output array a bit. For example, when we initially inject jQuery in PhantomJS, a line is alerted into the output array which we do not want, as it doesn’t represent the actual HTML data we are scraping. Similarly, we want to remove the last element of the $output array, where we alert (‘EXIT’) to end the script.
  • Now that it’s cleaned up, we have an array of individual HTML strings representing our target data. We’ll want to trim the whitespace and join all the elements into one big HTML string to use for constructing a SimpleXmlElement.

public function pullXmlObjBlogExample($siteUrl, $cssSelector)
{
    // Create configuration object containing the jQuery selector and target
    // site URL to pass to the phantom script.
    $config = array(
        "selector" => $cssSelector,
        "url" => $siteUrl
    );

    // Read in the base phantom script and create a copy of it so we don't
    // mess with the original base script.
    $templateScript = "phantomJsBlogExample.js";
    $templateFileCopy = "phantomJsBlogExample-copy-" . time() . ".js";
    if (!copy($templateScript, $templateFileCopy)) {
        echo "failed to copy $templateFileCopy";
        return false;
    }

    // Prepend the configuration object onto the copied script.
    $configObj = file_get_contents($templateFileCopy);
    $configObj = 'var config = ' . json_encode($config, JSON_UNESCAPED_SLASHES) . ';' . "\n" . $configObj;
    file_put_contents($templateFileCopy, $configObj);

    // Run the phantom script with PHP's exec function, redirecting the
    // script's output into an $output array.
    echo exec("phantomjs $templateFileCopy 2>&1", $output);

    // Delete the copied version of the phantom script as we don't need it anymore.
    if (!unlink($templateFileCopy)) {
        echo "failed to delete $templateFileCopy";
        return false;
    }

    // The first element of the output is the message about adding jQuery and
    // the last element is the 'EXIT' message from the script; remove those so
    // all we have left are the HTML lines.
    array_shift($output);
    array_pop($output);

    // Trim whitespace from the array elements and join all the HTML lines
    // into one string of all the HTML.
    $output = array_map('trim', $output);
    $output = join("", $output);

    // Construct an XML element from the HTML string.
    $xmlObj = new \SimpleXMLElement($output);
    return $xmlObj;
}

3.  Call the function and iterate through the SimpleXmlElement Object to get to the table data

  • Call the function from step 2, making sure to pass in the target site URL and CSS selector
  • Now that we have the SimpleXmlElement object, we’ll want to iterate through the rows of the table body and pull out the state name and population table cells. It may help to var_dump the entire object to get a sense of what the structure looks like.
  • For the purposes of this example we’ll just echo out the state name and population, but you could really do anything you wanted with the data at this point (i.e., persist it to a database, etc.)

private function scrapePopulationsByState()
{
    $cssSelector = "table.sk_popcharttable";
    $siteUrl = "http://www.ipl.org/div/stateknow/popchart.html";
    $tableXmlObject = $this->pullXmlObjBlogExample($siteUrl, $cssSelector);

    $cnt = 0;
    foreach ($tableXmlObject->tbody->tr as $tableRow) {
        // The first two rows are the header and "All United States" rows, so disregard them.
        if ($cnt++ < 2) {
            continue;
        }
        // Grab the state and population from the corresponding table cells of the row and output!
        $state = (string) $tableRow->td[1]->a;
        $population = (string) $tableRow->td[2];
        echo $state . " has a population of " . $population . "\n";
    }
}

4.  Final Output

Finally, running the function from step 3 should result in something like this.

California has a population of 37,253,956
Texas has a population of 25,145,561
New York has a population of 19,378,102
Florida has a population of 18,801,310
Illinois has a population of 12,830,632
Pennsylvania has a population of 12,702,379
Ohio has a population of 11,536,504
Michigan has a population of 9,883,640
Georgia has a population of 9,687,653
North Carolina has a population of 9,535,483
New Jersey has a population of 8,791,894
Virginia has a population of 8,001,024
Washington has a population of 6,724,540
Massachusetts has a population of 6,547,629
Indiana has a population of 6,483,802
Arizona has a population of 6,392,017
Tennessee has a population of 6,346,105
Missouri has a population of 5,988,927
Maryland has a population of 5,773,552
Wisconsin has a population of 5,686,986
Minnesota has a population of 5,303,925
Colorado has a population of 5,029,196
Alabama has a population of 4,779,736
South Carolina has a population of 4,625,364
Louisiana has a population of 4,533,372
Kentucky has a population of 4,339,367
Oregon has a population of 3,831,074
Oklahoma has a population of 3,751,351
Connecticut has a population of 3,574,097
Iowa has a population of 3,046,355
Mississippi has a population of 2,967,297
Arkansas has a population of 2,915,918
Kansas has a population of 2,853,118
Utah has a population of 2,763,885
Nevada has a population of 2,700,551
New Mexico has a population of 2,059,179
West Virginia has a population of 1,852,994
Nebraska has a population of 1,826,341
Idaho has a population of 1,567,582
Hawaii has a population of 1,360,301
Maine has a population of 1,328,361
New Hampshire has a population of 1,316,470
Rhode Island has a population of 1,052,567
Montana has a population of 989,415
Delaware has a population of 897,934
South Dakota has a population of 814,180
Alaska has a population of 710,231
North Dakota has a population of 672,591
Vermont has a population of 625,741
Washington, D. C. has a population of 601,723
Wyoming has a population of 563,626