You would be tempted to think that matching a word in a sentence would be an easy thing to do with a regex... but it is not as simple as you might have expected.
Let's take a typical Twitter tweet, for example, where we want to pull out the URLs:
I found this cool site http://tinurl/somevalue and this one too http://tinurl/somevalue
I first tried:
(http:.*?)\s
But alas no, you don’t get the last URL… which is obvious, since the last one does not have a trailing space. So the knee-jerk reaction is to only look for the space if it is there:
(http:.*?)\s*
Umm, dope.. now I only get the http: part (duh!), because the lazy .*? grabs as little as it can and \s* is perfectly happy to match nothing. Right, now I will get you; it should be just as simple as searching for a space or the end of the line, so:
(http:.*?)[\s|$] – fail
(http:.*?)[\s$] – fail
(http:.*?)[\s|\b] – fail
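For what it's worth, here is a rough sketch of why the attempts so far come up short. It uses Python's re module as a stand-in, which is an assumption on my part (the tool I was playing with may differ in small details), and the variable names are made up for illustration. The short version: inside square brackets, | and $ are just literal characters and \b means a backspace character, so none of the bracketed versions ever test for the end of the line.

```python
import re

# The example tweet from above.
tweet = ("I found this cool site http://tinurl/somevalue "
         "and this one too http://tinurl/somevalue")

# The five attempts, in the order they appear in the post.
attempts = [
    r"(http:.*?)\s",       # needs whitespace after the URL
    r"(http:.*?)\s*",      # lazy .*? and \s* can both match nothing
    r"(http:.*?)[\s|$]",   # in a class, | and $ are literal characters
    r"(http:.*?)[\s$]",    # same again: $ is a literal dollar sign here
    r"(http:.*?)[\s|\b]",  # in a class, \b is a backspace character
]

# Collect what each pattern captures from the tweet.
results = {pattern: re.findall(pattern, tweet) for pattern in attempts}

for pattern, urls in results.items():
    print(pattern, "->", urls)
```

In this sketch the second attempt captures only http: twice, and every other attempt finds the first URL but misses the one at the end of the tweet.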
Right, so I suck at regex, and someone better at this would probably solve this in their sleep…
What should have been a simple regex is starting to look overly complex, so I found a cunning way to work around this: put a space at the end of the sentence, and hey presto, you get them both with the first, simple-looking regex. Ahhh.
Want to play around with regexes with nice visual feedback? Then check out http://www.gskinner.com/RegExr/
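Here is a minimal sketch of that trailing-space trick, again assuming Python's re module and using made-up variable names. I have also included one alternative that is not from my original attempts: putting the "space or end of line" choice in a non-capturing group rather than square brackets, so that $ keeps its end-of-string meaning.

```python
import re

# The example tweet from above.
tweet = ("I found this cool site http://tinurl/somevalue "
         "and this one too http://tinurl/somevalue")

first_attempt = r"(http:.*?)\s"

# Workaround from the post: pad the text with a trailing space so the
# last URL is also followed by whitespace.
padded_urls = re.findall(first_attempt, tweet + " ")

# Alternative (an assumption, not from the post): use a group instead of
# a character class, so | means "or" and $ means end of string.
alternation_urls = re.findall(r"(http:.*?)(?:\s|$)", tweet)

print(padded_urls)
print(alternation_urls)
```

Both approaches should pick up the two URLs in this tweet; the padding trick just avoids having to touch the pattern at all.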
You can use (http://.[^\s]+) which will work for any URLs that don't contain spaces.
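A quick way to try this suggestion, sketched in Python (my choice of language, not the commenter's; the variable names are hypothetical). The idea is to match the URL by what it is made of, a run of non-whitespace characters, instead of by what comes after it. The \S shorthand in the second pattern is my own near-equivalent and not part of the comment.

```python
import re

# The example tweet from the post.
tweet = ("I found this cool site http://tinurl/somevalue "
         "and this one too http://tinurl/somevalue")

# Pattern exactly as suggested in the comment.
comment_pattern = r"(http://.[^\s]+)"

# Near-equivalent shorthand (an assumption): \S means "not whitespace".
shorthand_pattern = r"(http://\S+)"

comment_urls = re.findall(comment_pattern, tweet)
shorthand_urls = re.findall(shorthand_pattern, tweet)

print(comment_urls)
print(shorthand_urls)
```

Because the match simply stops at the first whitespace character or at the end of the text, no special case is needed for the last URL.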