Using Regular Expressions to add links to tweets

Twitter's own web interface and the most recent clients have accustomed users to seeing linked @usernames and #hashtags inside every status update they read.
When you use the Twitter API on your own website, service or app, you have to deal with plain-text tweets: no tags, and therefore no links. With regular expressions you can create those links quickly and easily in almost any programming language.

Since what really matters is the regular expression itself and not the programming language, I will show just two examples, one in PHP and one in Perl.

A quick introduction

Let's imagine we queried the Twitter API and got back a tweet like the following one:


just saw @johndoe talking at #someforum about his product http://bit.ly/foo

To do a good job we need to create 3 links:

  • @johndoe will be linked to his Twitter profile
  • #someforum to the related search on Twitter
  • http://bit.ly/foo to its own link

What we want to achieve is to replace each of these strings with an HTML link that points to the correct URL. We obviously need to consider that a tweet may contain more than one keyword of each type.

Replacing functions in PHP and Perl

As we said before, we need to replace some text with a link. Let's see which functions we're going to use.

PHP

PHP lets developers perform regex substitutions with the native function preg_replace. The syntax is shown below:


preg_replace('/search/','replace',$source);

In our case search will be the regular expression that we are going to build, replace is the link and source our initial tweet.
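
To make the call concrete, here is a minimal sketch of preg_replace on a toy pattern; the sample text and the user: prefix are made up for illustration. It suggests that every match is replaced by default, which is what we need when a tweet holds several keywords of the same type, and that the optional fourth and fifth arguments limit and count the replacements.

```php
<?php
// Toy input, invented for this example (not real API output).
$source = 'hello @alice and @bob';

// By default every match is replaced; $1 is the first captured group.
$all = preg_replace('/@([a-z]+)/', 'user:$1', $source);

// The optional 4th argument caps the number of replacements.
$first = preg_replace('/@([a-z]+)/', 'user:$1', $source, 1);

// The optional 5th argument receives how many replacements were made.
preg_replace('/@([a-z]+)/', 'user:$1', $source, -1, $count);

echo $all, "\n";   // hello user:alice and user:bob
echo $first, "\n"; // hello user:alice and @bob
echo $count, "\n"; // 2
```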

Perl

Perl uses a different syntax: the substitution operator s///, bound to a variable with the native =~ operator.


$tweet =~ s/search/replace/g;

where $tweet is both our source and the output, search is our regex and replace is our link. The trailing g modifier makes Perl replace every match, not just the first one.

Look for @usernames

First of all, let's create a regular expression able to recognize usernames mentioned in a tweet:


/@([a-zA-Z0-9_]*)/

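As Twitter usernames allow only alphanumeric characters plus the underscore, the regular expression was built to match just those characters after the @. Be aware that it only inspects what follows the @: in an email address such as john@example.com the dot stops the match, but @example is still recognized.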

REGULAR EXPRESSION DETAILS

  • the initial @ is needed to find where the username starts. It sits outside the parentheses, so it is not captured by the regular expression and we need to remember to write it again in the output.
  • everything inside the parentheses () is captured and returned to us.
  • inside the square brackets [] are the characters we allow in the username: a-z is any lowercase Latin letter, A-Z any uppercase letter, 0-9 any digit and _ is, well, the underscore.
  • the asterisk * indicates that the preceding set of characters may repeat zero or more times. This just means we do not know in advance how long the username will be.
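
The points above can be checked with preg_match_all, which collects what the parentheses capture. This is only a sketch on an invented sentence. The variant with + instead of * is a suggested alternative, not part of the original pattern; it avoids matching a lone @.

```php
<?php
// Invented sentence with one lone "@" and two real mentions.
$text = 'meet me @ noon with @johndoe and @jane_doe';

// Original pattern: * allows zero characters, so the lone "@" matches too.
preg_match_all('/@([a-zA-Z0-9_]*)/', $text, $star);

// Suggested variant: + requires at least one allowed character.
preg_match_all('/@([a-zA-Z0-9_]+)/', $text, $plus);

// Index 1 holds the captured group, i.e. the username without the @.
print_r($star[1]); // '', 'johndoe', 'jane_doe'
print_r($plus[1]); // 'johndoe', 'jane_doe'
```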

CHOOSE THE RIGHT OUTPUT

The link needed for any @mention is the Twitter profile of that user. As our regular expression returns just one value, we'll use that value to construct the link. Remember that the regular expression returns just the username, without the @!


<a href="http://twitter.com/$1" title="$1 profile on Twitter" rel="nofollow">@$1</a>

Note that $1 is the username returned by the regular expression, and that rel="nofollow" asks search engines not to pass page rank to the linked page.

Finally, having both the regular expression to search for and the replacement link, we can write the code:

PHP


$tweet ='just saw @johndoe talking at #someforum about his product http://bit.ly/foo';
$regex = '/@([a-zA-Z0-9_]*)/';
$link_pattern = '<a href="http://twitter.com/$1" title="$1 profile on Twitter" rel="nofollow">@$1</a>';
$tweet = preg_replace($regex,$link_pattern,$tweet);

Perl


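my $tweet = 'just saw @johndoe talking at #someforum about his product http://bit.ly/foo';
# s{...}{...} delimiters avoid clashing with the slashes in the URL; \@ stops Perl from reading @$1 as an array
$tweet =~ s{\@([a-zA-Z0-9_]*)}{<a href="http://twitter.com/$1" title="$1 profile on Twitter" rel="nofollow">\@$1</a>}g;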

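Both snippets above (if echoed or printed) would output the following HTML, which a browser renders with @johndoe as a link:

just saw <a href="http://twitter.com/johndoe" title="johndoe profile on Twitter" rel="nofollow">@johndoe</a> talking at #someforum about his product http://bit.ly/foo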
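
Email addresses deserve a quick check, because nothing in the pattern looks at the character before the @. The sketch below compares the original pattern with a possible refinement, offered as an assumption rather than as part of the original recipe: a negative lookbehind that rejects an @ preceded by a username character.

```php
<?php
// Invented tweet containing both an email address and a real mention.
$tweet = 'write to john@example.com or ping @johndoe';
$link  = '<a href="http://twitter.com/$1" title="$1 profile on Twitter" rel="nofollow">@$1</a>';

// Original pattern: also links "@example" inside the email address.
$loose = preg_replace('/@([a-zA-Z0-9_]*)/', $link, $tweet);

// Hypothetical stricter pattern: (?<![a-zA-Z0-9_]) fails when the
// character just before the @ is a letter, a digit or an underscore.
$strict = preg_replace('/(?<![a-zA-Z0-9_])@([a-zA-Z0-9_]+)/', $link, $tweet);

echo substr_count($loose, '<a '), "\n";  // 2
echo substr_count($strict, '<a '), "\n"; // 1
```

The lookbehind is deliberately simple: an address whose local part ends with a dot or another symbol would still slip through.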

Look for #hashtags

The regular expression needed for hashtags is just the same as the one we saw above for @usernames, obviously replacing the @ with a hash #. The output will be similar too, but changing from the Twitter domain to the search one, like this:


<a href="http://search.twitter.com/search?q=%23$1" title="search for $1 on twitter" rel="nofollow">#$1</a>

where %23 is the URL-encoded hash symbol and all the other parameters have already been explained.

As you should now understand everything about the previous regex, I will just list the complete code.

PHP


$tweet ='just saw @johndoe talking at #someforum about his product http://bit.ly/foo';
$regex = '/#([a-zA-Z0-9_]*)/';
$link_pattern = '<a href="http://search.twitter.com/search?q=%23$1" title="search for $1 on Twitter" rel="nofollow">#$1</a>';
$tweet = preg_replace($regex,$link_pattern,$tweet);

Perl


my $tweet = 'just saw @johndoe talking at #someforum about his product http://bit.ly/foo';
# s{...}{...} delimiters keep the slashes in the URL from ending the expression early
$tweet =~ s{\#([a-zA-Z0-9_]*)}{<a href="http://search.twitter.com/search?q=%23$1" title="search for $1 on Twitter" rel="nofollow">#$1</a>}g;

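Both snippets would output the following HTML, which a browser renders with #someforum as a link:

just saw @johndoe talking at <a href="http://search.twitter.com/search?q=%23someforum" title="search for someforum on Twitter" rel="nofollow">#someforum</a> about his product http://bit.ly/foo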

Look for links

The last substitution we're going to perform is about links. To match them we're going to look for any run of characters between http and a space or a closing parenthesis. Here's the regex:


/http([s]?):\/\/([^\ \)$]*)/

REGULAR EXPRESSION DETAILS

  • http is outside our capturing groups; it's just needed to find URLs in the tweet
  • ([s]?) is a simple trick to match both http and https URLs. It captures the 's' only if it is present; the "?" makes it optional
  • :// is just the next part of the URL
  • ([^\ \)$]*) matches any number of characters ( * ) that can be anything except ( "^" ) a space "\ ", a closing parenthesis "\)" or a dollar sign "$". Inside square brackets "$" is a literal character, not the end-of-string anchor; the match still stops at the end of the string simply because no characters are left
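
To see what each group captures, the pattern can be run through preg_match_all. This is a small sketch on an invented sentence with example.com-style addresses; it also illustrates that the match stops at a space, at a closing parenthesis and at a literal dollar sign.

```php
<?php
$regex = '/http([s]?):\/\/([^\ \)$]*)/';
// Invented text: one https URL, one URL closed by ")", one followed by "$".
$text  = 'see https://example.com/a (mirror http://example.org/b) or http://example.net/c$5';

// PREG_SET_ORDER yields one array per match: [full match, $1, $2].
preg_match_all($regex, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo "s='{$m[1]}' rest='{$m[2]}'\n";
}
// s='s' rest='example.com/a'
// s='' rest='example.org/b'
// s='' rest='example.net/c'
```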

CHOOSE THE RIGHT OUTPUT

Obviously for a link we just need to put the <a> tag before the URL and close it immediately after. However, because we match the protocol precisely (http or https), we need to use 2 variables in our output: $1 will be the s (if present) and $2 will be the rest of the URL. Below is the complete output:


<a href="http$1://$2" rel="nofollow" title="$2">http$1://$2</a>

So here are the two examples

PHP


$tweet ='just saw @johndoe talking at #someforum about his product http://bit.ly/foo';
$regex = '/http([s]?):\/\/([^\ \)$]*)/';
$link_pattern = '<a href="http$1://$2" rel="nofollow" title="$2">http$1://$2</a>';
$tweet = preg_replace($regex,$link_pattern,$tweet);

Perl


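my $tweet = 'just saw @johndoe talking at #someforum about his product http://bit.ly/foo';
# s{...}{...} delimiters avoid escaping every slash; \$ keeps the dollar sign literal inside the brackets
$tweet =~ s{http([s]?)://([^\ \)\$]*)}{<a href="http$1://$2" rel="nofollow" title="$2">http$1://$2</a>}g;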

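Both snippets would output the following HTML, which a browser renders with the URL as a link:

just saw @johndoe talking at #someforum about his product <a href="http://bit.ly/foo" rel="nofollow" title="bit.ly/foo">http://bit.ly/foo</a>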

The complete code

Just for reference, here is the complete code of this example in each language.

PHP


$tweet ='just saw @johndoe talking at #someforum about his product http://bit.ly/foo';
$regex = '/http([s]?):\/\/([^\ \)$]*)/';
$link_pattern = '<a href="http$1://$2" rel="nofollow" title="$2">http$1://$2</a>';
$tweet = preg_replace($regex,$link_pattern,$tweet);
$regex = '/@([a-zA-Z0-9_]*)/';
$link_pattern = '<a href="http://twitter.com/$1" title="$1 profile on Twitter" rel="nofollow">@$1</a>';
$tweet = preg_replace($regex,$link_pattern,$tweet);
$regex = '/\#([a-zA-Z0-9_]*)/';
$link_pattern = '<a href="http://search.twitter.com/search?q=%23$1" title="search for $1 on Twitter" rel="nofollow">#$1</a>';
$tweet = preg_replace($regex,$link_pattern,$tweet);
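
The order of the three steps matters: links go first, otherwise the link pass would find the http://twitter.com/... address inside the freshly added profile link. Even so, a URL that itself contains # or @ (for example a page anchor) is scanned again by the later passes. One possible way around this, offered as a sketch and not as the method used in this post, is a single pass with preg_replace_callback. The function name linkify_tweet and the use of + instead of * for usernames and hashtags are assumptions made for this example.

```php
<?php
// Hypothetical helper, not part of the original code above.
function linkify_tweet(string $tweet): string
{
    // Groups 1-2: URL parts, group 3: username, group 4: hashtag.
    $regex = '/http([s]?):\/\/([^\ \)$]*)|@([a-zA-Z0-9_]+)|#([a-zA-Z0-9_]+)/';

    $result = preg_replace_callback($regex, function (array $m): string {
        if (($m[4] ?? '') !== '') {
            // Hashtag branch matched.
            return '<a href="http://search.twitter.com/search?q=%23' . $m[4]
                . '" title="search for ' . $m[4] . ' on Twitter" rel="nofollow">#' . $m[4] . '</a>';
        }
        if (($m[3] ?? '') !== '') {
            // Username branch matched.
            return '<a href="http://twitter.com/' . $m[3]
                . '" title="' . $m[3] . ' profile on Twitter" rel="nofollow">@' . $m[3] . '</a>';
        }
        // Otherwise the URL branch matched; rebuild the address from its parts.
        $url = 'http' . $m[1] . '://' . $m[2];
        return '<a href="' . $url . '" rel="nofollow" title="' . $m[2] . '">' . $url . '</a>';
    }, $tweet);

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $tweet;
}

$html = linkify_tweet('see http://example.com/page#top via @johndoe #php');
echo substr_count($html, '<a '), "\n"; // 3
```

Because each piece of text is matched only once, the #top inside the URL stays part of the link instead of being turned into a hashtag search.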

Perl

my $tweet = 'just saw @johndoe talking at #someforum about his product http://bit.ly/foo';
# Links go first, so the URLs inside the HTML added by the later steps are not matched again
$tweet =~ s{http([s]?)://([^\ \)\$]*)}{<a href="http$1://$2" rel="nofollow" title="$2">http$1://$2</a>}g;
$tweet =~ s{\@([a-zA-Z0-9_]*)}{<a href="http://twitter.com/$1" title="$1 profile on Twitter" rel="nofollow">\@$1</a>}g;
$tweet =~ s{\#([a-zA-Z0-9_]*)}{<a href="http://search.twitter.com/search?q=%23$1" title="search for $1 on Twitter" rel="nofollow">#$1</a>}g;

Andrea Olivato