Regex To Extract URLs From Plain Text

Recently for a project we had the problem that it pulled data from numerous API’s and sometimes the data would contain urls that were not HTML links (ie. they were just http://www.mysite.com instead of <a href=”http://www.mysite.com”>http://mysite.com</a> .  I searched around the web for a while and had no luck finding a regex that would extract only urls that are not currently wrapped already inside of a html tag.  I came up with the following regex:

/(?<![\>https?:\/\/|href=\"'])(?<http>(https?:[\/][\/]|www\.)([a-z]|[A-Z]|[0-9]|[\/.]|[~])*)/

Parts of it are taken from other examples of URL extractors.  However none of the examples I found had lookarounds to make sure it isn’t already linked.  I am not a master of regex, so there may be a better expression than I wrote.  The above expression is written to be compatible with PHP’s preg_replace method.  A more generic one is as follows:

(?<![\>https?://|href="'])(?<http>(https?:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)

This expression will match http://www.mysite.com and www.mysite.com and any subdomains of a website.  The first matched group is the URL.  One thing to note is if you are using this that you need to check if the URL that is matched has an http:// on the front of it, if it does not, append one otherwise the link will be relative and cause something like http://www.mysite.com/www.mysite.com .

One tool that was very helpful in making this was http://gskinner.com/RegExr it is incredibly helpful.  It gives you a visual representation in real time as you create your expression of what it will match.

Note: You will lose the battle in trying to extract URL’s using regex. For example the above expression will fail on a style=”background:url(http://mysite.com/image.jpg)”. For a more robust solution it may be worth while looking into parsing the DOM and running regex per element then.

internOwl Launched!

Today we are proud to unveil  internOwl.  internOwl is a site for students to research internships and find them.  As the site grows students will be able to gain invaluable insight into the quality of different internships around the country.   Currently the site is being launched with a focus on targeting Massachusetts’ students.  We are excited to see how it performs.

If you are a student in the Amherst or Northampton area you can get a FREE burrito via the following url: http://www.internowl.com/bueno

We hope you all enjoy and there will be more updates about the site to follow as well as the technology used behind the site!

FOSS Fridays – Tracking Your Users

The other night I thought it’d be helpful to see how people browse your site.  I think you can probably learn a lot about how a user moves over your site.  You can tell a lot about your user’s experience by watching their mouse.  You can see where they look on the page for specific information. I’ve created a demo of the tracking.

I know that there are some products out there that already do user session tracking and replays.  Also there are click heat maps which are interesting when you are looking on your site to see what links the user clicks the most.  I decided to just rebuild the session replay just out of curiousity on how difficult it’d be to do.  It was fairly simple and took me only 15-20 minutes. There are a number of improvements you could make to the script such as the window.unload handling is not 100% depending on your browser.  You could also do much more parsing on the client side of the information by using JSON.  If you wanted to store more than one page of tracking data you could quickly modify the script to pass the name page which it was tracking and to store the data separately for each page.

To use the script all you need to do is add a little Javascript on the bottom of the page you want to track a user.   The script tracks a users mouse movement as soon as they open the site.  It keeps track of time so that during the replay you can get the proper mouse movement at the right times to replay the users session.  While the user moves their mouse it continues to store all the data client side.  Once the window is closed it sends all the information to the server.  The server simply parses the data string.  For session replay it is done via setTimeout and it moved an image(of a cursor) at different intervals to simulate the users session.

While the script is not very pretty it was written very quickly and just as a proof of concept.  It goes to show you can easily track your user’s session without having to purchase expensive products, and that it can be done fairly simply.

The code is below with descriptions of what each snippet does. The script uses jQuery. To deploy this on numerous pages all you really would need to add would be a script tag that pulls in the tracking javascript.

Tracking the users movements and sending the server the information javascript:

var points='';
var timeSeconds=0;
$(document).ready(function(){
  $('body').append('');
  timer();
   $().mousemove(function (e){
    points=points+e.pageX+","+e.pageY+","+timeSeconds+"|";
  });
  $(window).unload(function(){
    sendData();
  });
});
function timer()
{
  timeSeconds=parseInt(timeSeconds)+1;
  setTimeout("timer()",10);
}
function sendData(){
    $.post('index.php','data='+points);
}

To parse the information on the server:

if($_SERVER['REQUEST_METHOD']=='POST')
{
  $fp=fopen($_SERVER['REMOTE_ADDR'].'tracking.dat','w+');
   fwrite($fp,$_POST['data']);
  fclose($fp);
  exit(1);
}

To replay the session first get the data:

if(file_exists($_SERVER['REMOTE_ADDR'].'tracking.dat'))
{
 $data=explode("|",file_get_contents($_SERVER['REMOTE_ADDR'].'tracking.dat'));
}

The moving of the cursor image javascript:

function moveMouse(x,y){
 $('#cursor').attr('style','position:absolute;left:'+x+"px;top:"+y+"px;");
}

Create different calls for each time the mouse was moved and have them execute at the times the user moved the mouse:

foreach($data as $d)
{
  $parts=explode(',',$d);
  if(count($parts)==3)
  echo 'setTimeout("moveMouse('.$parts[0].','.$parts[1].')",'.($parts[2]*10).");\n";
}

And you are all set.  As I said the script is only for proof of concept and not too pretty.  Let me know if you have any questions.

Skinning your jQuery UI Components quick and easily – ThemeRoller

We use jQuery on almost every project we do. As many know updating your theme for your website widgets can take a long time. Recently we found the jQuery UI – ThemeRoller. This allows you to quickly skin all of your jQuery UI widgets within a matter of couple of mouse clicks.  For those of us who can’t pick matching colors for their life, ThemeRoller has many template themes. ThemeRoller allows you to start with a templated theme, and to easily modify it via the GUI.

This will save you time and money as hand editing the CSS files to update your jQuery UI widgets is slow and tedicious.

FOSS Fridays: sfSCMIgnoresTaskPlugin version 1.0.3 released – Doctrine Supported

In the past we have always used Propel as our main ORM for our Symfony projects. Recently with Doctrine becoming the default ORM for Symfony 1.3 we decided we should make sure our plugin supports Doctrine. For those who don’t know about the plugin, it automatically creates ignores for your Source Code Management(SCM) if you use Git or CVS. When you have a large project it gets tiring to create all the different ignores for all the bases, logs, configuration files and such.

Let us know if you find any problems with Doctrine support or have any additional suggestions for the plugin.

More information on the plugin can be found on the Symfony Plugins site: http://www.symfony-project.org/plugins/sfSCMIgnoresTaskPlugin

You can download the most recent pear package manually at: http://plugins.symfony-project.org/get/sfSCMIgnoresTaskPlugin/sfSCMIgnoresTaskPlugin-1.0.3.tgz

or you can install it via:

./symfony plugin:install sfSCMIgnoresTaskPlugin