# URL extraction

Recently, on a project that pulled data from numerous APIs, we ran into a problem: sometimes the data contained URLs that were not HTML links (i.e. they were just http://www.mysite.com instead of <a href="http://www.mysite.com">http://mysite.com</a>).  I searched the web for a while and had no luck finding a regex that extracts only URLs that are not already wrapped in an HTML tag.  I came up with the following regex:

/(?<![\>https?:\/\/|href=\"'])(?<http>(https?:[\/][\/]|www\.)([a-z]|[A-Z]|[0-9]|[\/.]|[~])*)/

Parts of it are taken from other examples of URL extractors.  However, none of the examples I found had lookarounds to make sure the URL isn't already linked.  I am not a master of regex, so there may be a better expression than the one I wrote.  The above expression is written to be compatible with PHP's preg_replace function, which is why the slashes and the double quote are escaped.  A more generic one is as follows:

(?<![\>https?://|href="'])(?<http>(https?:[/][/]|www\.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)

This expression will match http://www.mysite.com and www.mysite.com, as well as URLs on any subdomain, as long as they start with http://, https:// or www.  The first matched group is the URL.  One thing to note if you use it: check whether the matched URL has an http:// on the front.  If it does not, prepend one; otherwise the link will be relative and produce something like http://www.mysite.com/www.mysite.com .
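To make the http:// check concrete, here is a minimal Python sketch of the replacement step. It is an illustration only, not the code from the project: the names BARE_URL and linkify are made up for this example, and since Python's re module writes named groups as (?P<http>...), the generic expression is adjusted to that spelling.

```python
import re

# The generic expression from above, with the named group in Python's
# (?P<name>...) spelling. BARE_URL is a name invented for this sketch.
BARE_URL = re.compile(
    r"""(?<![\>https?://|href="'])(?P<http>(https?:[/][/]|www\.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)"""
)


def linkify(text):
    """Wrap bare URLs in <a> tags, adding http:// when the match lacks a scheme."""

    def to_anchor(match):
        url = match.group("http")  # the first (named) group is the URL
        if url.startswith(("http://", "https://")):
            href = url
        else:
            # Without a scheme the browser would treat the link as relative.
            href = "http://" + url
        return '<a href="{0}">{1}</a>'.format(href, url)

    return BARE_URL.sub(to_anchor, text)


print(linkify("Visit www.mysite.com today"))
```

Keep in mind that the character class in the expression only covers letters, digits, /, . and ~, so a URL containing a hyphen or a query string is cut off at the first such character.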

One tool that was incredibly helpful in making this was http://gskinner.com/RegExr .  As you build your expression, it shows you in real time what it will match.
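If you would rather check the expression from a script than in the browser, a few sample strings can be run through it. This is only a rough Python sketch: the helper find_urls and the sample strings are invented for illustration, and the named group is again written in Python's (?P<http>...) form.

```python
import re

# Generic expression from this post, adapted to Python's named-group syntax.
PATTERN = re.compile(
    r"""(?<![\>https?://|href="'])(?P<http>(https?:[/][/]|www\.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)"""
)


def find_urls(text):
    """Return every URL the expression extracts from text (hypothetical helper)."""
    return [match.group("http") for match in PATTERN.finditer(text)]


samples = [
    "go to http://www.mysite.com now",                            # bare URL with scheme
    "go to www.mysite.com now",                                   # bare URL without scheme
    '<a href="http://www.mysite.com">http://www.mysite.com</a>',  # already linked
    'style="background:url(http://mysite.com/image.jpg)"',        # URL inside an attribute
]

for sample in samples:
    print(repr(sample), "->", find_urls(sample))
```

The lookbehind is a single character class, so it only inspects the one character directly in front of a candidate URL. That is enough to skip the already-linked sample, but not the URL inside the style attribute.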

Note: You will lose the battle trying to extract URLs using regex alone. For example, the above expression fails on style="background:url(http://mysite.com/image.jpg)": it matches the URL inside the attribute as if it were a bare URL. For a more robust solution it may be worthwhile to parse the DOM and then run the regex on each element's text.
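As a rough idea of what that could look like, here is a sketch using the HTMLParser class from Python's standard library. It is not a full DOM parser and not something I have run in production; the names TextNodeLinker and link_text_nodes are made up. The markup is re-emitted as it was read, and the regex only ever sees text that sits outside tags and outside existing a, script and style elements, so the lookbehind is no longer needed.

```python
import re
from html.parser import HTMLParser

# Same URL body as the expression in this post; no lookbehind, because
# attributes and already-linked text never reach the regex here.
URL = re.compile(r"(?P<http>(https?://|www\.)[a-zA-Z0-9/.~]*)")
SKIP = {"a", "script", "style"}  # elements whose text is left untouched


def to_anchor(match):
    url = match.group("http")
    href = url if url.startswith(("http://", "https://")) else "http://" + url
    return '<a href="%s">%s</a>' % (href, url)


class TextNodeLinker(HTMLParser):
    """Hypothetical helper: copies markup through and links URLs in text nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside an element listed in SKIP

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # raw tag, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data if self.skip_depth else URL.sub(to_anchor, data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def link_text_nodes(html):
    parser = TextNodeLinker()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(link_text_nodes(
    '<div style="background:url(http://mysite.com/image.jpg)">see www.mysite.com</div>'
))
```

The URL in the style attribute passes through unchanged because start tags are copied as raw text, while the bare www.mysite.com in the element's text gets wrapped in a link.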
