Saturday, 21 February 2009

Regex: Matching a word, how hard can it be?

You would be tempted to think that matching a word in a sentence would be an easy thing to do with a regex.. but it is not as simple as you might expect.

Let's take a typical Twitter tweet as an example, where we want to pull out the URLs:

I found this cool site http://tinurl/somevalue and this one too http://tinurl/somevalue

I first tried :

(http:.*?)\s

But alas no, you don't get the last URL… which is obvious, since the last one does not have a trailing space. So the knee-jerk reaction is to only look for the space if it is there:

(http:.*?)\s*

Umm, d'oh.. now I only get the http: part (duh!), because an optional \s* lets the lazy .*? stop straight away. Right, now I will get you; it should be just as simple as searching for a space or the end of the line, so:

(http:.*?)[\s|$] – fail
(http:.*?)[\s$] – fail
(http:.*?)[\s|\b] – fail
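To see why each attempt behaves the way it does, here is a minimal sketch that runs them against the example tweet. It assumes Python's re module rather than the RegExr tool used for this post, so the details may differ slightly in other regex engines; the variable names are made up for illustration.

```python
import re

tweet = ("I found this cool site http://tinurl/somevalue "
         "and this one too http://tinurl/somevalue")

# Attempt 1: a mandatory trailing whitespace misses the URL at the very end.
first_try = re.findall(r"(http:.*?)\s", tweet)

# Attempt 2: optional whitespace lets the lazy .*? match nothing at all,
# so only the literal "http:" prefix is captured.
second_try = re.findall(r"(http:.*?)\s*", tweet)

# Attempts 3-5: inside [...], "|" and "$" are literal characters and "\b"
# is a backspace, so each class still demands a real character after the URL.
class_tries = [
    re.findall(pattern, tweet)
    for pattern in (r"(http:.*?)[\s|$]", r"(http:.*?)[\s$]", r"(http:.*?)[\s|\b]")
]

print(first_try)
print(second_try)
print(class_tries)
```

In other words, a character class always has to consume one character, so it can never stand in for "end of the line".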

Right so I suck at regex, and someone better at this would probably solve this in their sleep…

What should have been a simple regex is starting to look overly complex, so I found a cunning way to work around this: put a space at the end of the sentence, and hey presto, you get them both with the first, simple-looking regex. Ahhh.

Want to play around with regexes with nice visual feedback? Then check out http://www.gskinner.com/RegExr/
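The trailing-space trick can be sketched like this, again assuming Python's re module. For comparison, the second pattern is one possible alternative that is not from the original attempts: it moves the "space or end of line" choice out of the character class and into a non-capturing group, so the input does not need padding.

```python
import re

tweet = ("I found this cool site http://tinurl/somevalue "
         "and this one too http://tinurl/somevalue")

# Workaround from the post: pad the input so every URL is followed by a space.
padded = re.findall(r"(http:.*?)\s", tweet + " ")

# Assumed alternative: alternation outside a character class,
# meaning "whitespace OR end of string".
anchored = re.findall(r"(http:.*?)(?:\s|$)", tweet)

print(padded)
print(anchored)
```

Padding keeps the regex short but changes the input; the alternation leaves the input alone at the cost of a slightly busier pattern.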

1 comment:

  1. You can use (http://.[^\s]+) which will work for any urls that don't contain spaces.
