Post your Google Buzz updates to Twitter with buzz2tw, a libre Perl script

Since Google Buzz was released, I have found it much handier than Twitter or FriendFeed because it is always there, in my constantly open Gmail page. Nevertheless, with so many contacts on Twitter I could not move completely from one platform to the other, so I searched the internet for a post importer.

I already knew TwitterFeed, so I tried giving it my Google Profile atom feed, but the result was terribly ugly: the "Google Buzz" string was repeated a couple of times and there was too little space left for the post. My timeline looked like a spammer's.

So I decided to create my own importer, using Perl and a MySQL database. I will guide you through the whole programming process but, in case you're impatient, below are the Git repository and the instructions to get it running in 2 minutes. The result is a nice, clean post, using goo.gl as the url shortener.

GitHub Project Homepage : Buzz2tw Source Code
Quick Start guide : README
Quick Download : buzz2tw.pl
Mysql Structure : structure.sql

Below are 10 chapters of a step-by-step how-to that will guide you through the code I created. Every line of code is Libre Software released under the GPL3 licence.

Chap 0 - Setting up the Database Structure

A database is essential to correctly manage the posts. In fact, to avoid duplicates, we always need to know which posts we already published on Twitter and which ones we still need to sync. We only need one table for this purpose, so it makes no difference at all if you create it on an existing database or you create a new one.

To create the table, just log in to your database from the command line

mysql -p -u username databasename

then paste the following code and press Return

CREATE TABLE IF NOT EXISTS `buzzs` (  
  `link` varchar(32) NOT NULL,
  PRIMARY KEY  (`link`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

As you can see, there's just one string field in our table, and it will contain the md5 hash of the link. Unlike the title, timestamp or description, the link is the only field retrieved from the Google atom feed which is unique and stays unique over time.

The purpose of the table is simple: every time we post a new article to Twitter, we create the md5 hash of the Google Buzz link and then put the result inside the table. Every time we're about to post a new update, we check the table first to be sure we're not posting a duplicate.
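As a rough sketch of what ends up in the table, the snippet below computes the hash in Perl with Digest::MD5 rather than with MySQL's md5() function; for the same bytes both produce a 32-character hexadecimal string, which is why the column is a varchar(32). The link is a made-up example and "link_key" is a hypothetical helper, not part of buzz2tw.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: returns the 32-character hex digest used as the table key
sub link_key {
    my ($link) = @_;
    return md5_hex($link);
}

# Made-up link, for illustration only
my $key = link_key("http://www.google.com/buzz/example/post1");
print length($key), " characters: $key\n";
```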

If you don't know how to use the command line, or don't want to, feel free to import the SQL code via your preferred GUI (e.g. phpMyAdmin)

Chap 1 - User Configuration

The first line of a Perl script in a Unix-like environment should be the path to the Perl interpreter, usually located at /usr/bin/perl or /usr/local/bin/perl. So, in my code you can read

#!/usr/bin/perl

This is very useful because as long as you make the script executable you will be able to launch it as follows

chmod +x ./buzz2tw.pl  
./buzz2tw.pl

After the interpreter line and the licence, we're going to declare some variables to simplify the configuration of the script. The comments in the code are self-explanatory: we just need the Google username, the Twitter access data and obviously the database connection parameters.

The only parameter you need to pay a bit more attention to is called "lmpost", which I use to limit the number of posts published every time the script runs. Let's imagine, for example, that you're starting the script for the first time. If you don't limit the number of tweets, you will flood your timeline with a lot of posts. This limit prevents too many updates from being posted at once.

Below you can find example code with placeholder parameters

# Google Buzz Username  
my $bzuser = ""; # EG andrea.olivato

# Twitter username and password
my $twuser = ""; # EG andreaolivato  
my $twpass = "";

# Database Configuration
my $dbhost = "localhost";  
my $dbuser = "buzz2tw";  
my $dbpass = "";  
my $dbname = "buzz2tw";

# Running parameters
# lmpost: how many posts to publish for each execution of the script
my $lmpost = 3;

Chap 2 - System configuration

There are a couple more parameters we need to set up before proceeding with the "running" part of the script. Perl, as you should know, can be extended with modules, which are loaded with the use statement. We need some of them to keep the script simple.

First of all we need a working MySQL connection. This is achieved by "using" the DBI module. Please note that you need to install the MySQL driver for the DBI module (DBD::mysql) using cpan or your distribution's package manager.

For parsing the Google atom feed containing the posts we need the XML parser, provided by the XML::Simple module. Moreover we are going to make some HTTP calls, so we are using the LWP::UserAgent module.

Furthermore, we said we are going to store the md5 hashes of the links in our table, so we need the Digest::MD5 module. Last but not least, we declare that we are using strict Perl syntax, to avoid stupid errors and to learn the language correctly.

All these requirements are covered by the following code

use strict;  
use XML::Simple;  
use LWP::UserAgent;  
use DBI;  
use Digest::MD5 qw(md5 md5_hex md5_base64);

At the end of this section, which we call system configuration because it sits outside the running code but the end user is not required to change it, we declare the url of the atom feed containing the Google Buzz posts. I did so because the Google Buzz APIs are still in development and this url is subject to change. As you may notice, the url contains the previously declared "bzuser" variable.

my $url = "http://buzz.googleapis.com/feeds/".$bzuser."/public/posted";

Chap 4 - XML parsing and the article Cycle

The first thing our script needs to do is connect to the MySQL database we set up before and keep the connection ready for our queries. The connection procedure is very easy thanks to the DBI module. Please note that we're using our previously declared database configuration variables: the database name and host go inside the connection string, while the username and password are passed as separate arguments

# Connecting to the database
my $DSN = "dbi:mysql:database=".$dbname.";host=".$dbhost;
my $dbh = DBI->connect($DSN, $dbuser, $dbpass) or die "Cannot connect to the database: ".$DBI::errstr;

We can now proceed by downloading the feed from Google's servers and parsing it with our XML::Simple module. To do so, we first create a new object for the connection

my $ua = new LWP::UserAgent;

then tell the object that we want to reach the $url using a GET request

my $req = new HTTP::Request GET => $url;

and finally that we want to store the whole response into a local variable called $content

my $content = $ua->request($req)->content();

With the complete XML code stored in our variable, we can proceed by parsing it into a hash as shown below

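# ForceArray keeps "entry" and "link" as arrays even when only one is present
my $xml = new XML::Simple (KeyAttr=>[], ForceArray=>['entry','link']);
my $data = $xml->XMLin($content);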

The "data" variable is now a hash containing the various nodes of the XML file. If you have a look at the XML tree you will notice that it's made of several entries, each of which represents a single article. Each article in turn has some parameters, from which we are going to extract the plain text of the post and the link.
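To make the shape of the parsed data more concrete, here is a hand-written stand-in for the kind of hash XMLin might return for a feed with two entries. The values are invented and the real feed contains many more keys; only the paths used later in this tutorial are shown, so take it as an illustration rather than the exact structure Google returns.

```perl
use strict;
use warnings;

# Hand-written stand-in for the parsed feed (all values are invented)
my $data = {
    entry => [
        {
            summary => { type => "text", content => "First post" },
            link    => [ { rel => "alternate", href => "http://example.com/1" } ],
        },
        {
            summary => { type => "text", content => "Second post" },
            link    => [ { rel => "alternate", href => "http://example.com/2" } ],
        },
    ],
};

# Each entry exposes its text and its first link through these paths
my @pairs = map { [ $_->{summary}{content}, $_->{link}[0]{href} ] } @{ $data->{entry} };
print "$_->[0] => $_->[1]\n" for @pairs;
```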

The next step is to cycle through the XML, opening each article. To do so, we are going to use the foreach statement. We'll also set up a "limit" variable, which counts how many articles we are publishing. Here is the resulting code

# Cycling each feed post  
my $limit = 0;  
foreach my $item (@{$data->{entry}}) {  
    # NEXT CODE HERE
    $limit++;
}

Chap 5 - Inside the cycle, getting each article's info

The following code has to be inserted inside the cycle we previously started, and refers to the "item" variable, which now represents a single article imported from the XML. Because we set up a limiting variable ("lmpost") and a counter ("limit"), the first step of our cycle is to skip the current iteration once we have reached the limit. Below is how to code this using the next statement, which is very similar to PHP's continue

    # Skip if I already published too many posts  
    next if($limit>=$lmpost);

Once we are sure we have not passed the limit, we can continue with our parsing. As I said before, we can retrieve the information we want (text and link) directly from the item variable. Because the text that the atom feed returns is in plain format, it contains \n newline characters, which would create ugly effects in our tweet. To avoid them, we are going to replace each \n char with a space using a substitution, applied to the variable with Perl's binding operator "=~"

    # Getting content of the post and its link  
    my $text = $item->{summary}->{content};
    $text =~ s/\n/ /gi;
    my $link = $item->{link}[0]->{href};

Chap 6 - Checking for duplicates

Wonderful! We got everything we need for our article. It's now time to check for duplicates in our database. Using the connection we opened before, we are going to perform a SELECT statement that counts the articles having the same link as the one we're analyzing. To do so, we'll hash the link using the md5 function as follows

    # Checking the link in the database. Don't want duplicate posts
    # The link is passed as a bound parameter, so quotes in it cannot break the query
    my $checkmd5 = "SELECT count(*) FROM buzzs WHERE link = md5(?)";
    my $go_c = $dbh->prepare($checkmd5);
    $go_c->execute($link);
    my ($fe_c) = $go_c->fetchrow_array();
    $go_c->finish();

As you can see from the above code, the query is prepared and executed, and its result is stored in the "fe_c" variable. We use this variable to check whether we found a duplicate or not. Furthermore, we are going to move the "$limit++;" expression inside the if statement we are building. In fact, we don't want to stop after N articles, but after N articles which we haven't put in the database yet. Below you can find the new cycle structure.

# Cycling each feed post  
my $limit = 0;  
foreach my $item (@{$data->{entry}}) {

    # Skip if I already published too many posts
    next if($limit>=$lmpost);

    # Getting content of the post and its link
    my $text = $item->{summary}->{content};
    $text =~ s/\n/ /gi;
    my $link = $item->{link}[0]->{href};

    # Checking the link in the database. Don't want duplicate posts
    # The link is passed as a bound parameter, so quotes in it cannot break the query
    my $checkmd5 = "SELECT count(*) FROM buzzs WHERE link = md5(?)";
    my $go_c = $dbh->prepare($checkmd5);
    $go_c->execute($link);
    my ($fe_c) = $go_c->fetchrow_array();
    $go_c->finish();

    # If no duplicates are found
    if (defined($fe_c) && $fe_c<1) {
        # MAIN CODE HERE
        $limit++;
    }
}
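To see why the counter only moves inside the if, here is a self-contained simulation of the cycle, with a plain hash standing in for the buzzs table and invented links standing in for the feed. Nothing here touches the database or Twitter; it only illustrates how already-published links do not count towards "lmpost".

```perl
use strict;
use warnings;

my $lmpost = 2;                                        # same meaning as in the script
my %seen   = ("http://example.com/1" => 1);            # stand-in for the buzzs table
my @links  = map { "http://example.com/$_" } 1 .. 5;   # invented feed links

my @published;
my $limit = 0;
foreach my $link (@links) {
    # Skip if enough new posts were already published in this run
    next if $limit >= $lmpost;

    # Only links that are not in the "table" count towards the limit
    if (!$seen{$link}) {
        push @published, $link;
        $seen{$link} = 1;
        $limit++;
    }
}
print join(", ", @published), "\n";
```

With these values, link 1 is skipped as a duplicate, links 2 and 3 are published, and links 4 and 5 are left for the next run.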

Chap 7 - Shortening the link

There are plenty of services offering APIs to shorten links for posting on Twitter. I chose goo.gl because I find it much more buzz-themed than the others. However, as the service is not properly public, gaining access to its API is not immediate. Searching on Google I found GGL Shortener, a third-party API service which connects to the goo.gl APIs and returns the shortened url. Integrating with it is very easy, much easier than creating a class to connect to goo.gl directly, so I definitely preferred it.

To interface our script with their API system we just need to urlencode the destination link, make a GET request and retrieve the shortened url. Please note again that the following code has to be placed inside the foreach loop, inside the if checking for duplicates.

First of all, to urlencode the url much like PHP's urlencode does, we need to perform a regular expression replacement that turns every non-alphanumeric character into its %XX hexadecimal form, as in the code below.

        # Urlencoding the link for GET request to the url shortener  
        my $urlencodedlink = $link;
        $urlencodedlink =~ s/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/seg;

I found this regex here, all credits go to the original author.

Having our link urlencoded we can proceed performing another GET request and retrieving the answer into another variable.

        # Calling ggl-shortener APIs to get a goo.gl shortened url
        my $ua2 = new LWP::UserAgent;
        my $req2 = new HTTP::Request GET => 'http://ggl-shortener.appspot.com/?url='.$urlencodedlink;
        # Send the request once and keep the body of the response
        my $res = $ua2->request($req2);
        my $content = $res->content();

The "content" variable contains the JSON returned by the API, which is something like

{"short_url":"http://goo.gl/HTsh"}

To be precise, we should parse this with a JSON module (e.g. JSON from CPAN). However, this string always has the same structure, so we can easily get our shortened url with a regular expression match, as shown below

        # $shortlink stays undefined if the response contains no short_url
        my ($shortlink) = $content =~ m/short_url":"([^"]*)"/;
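For comparison, this is roughly what the "precise" route could look like. The sketch assumes the JSON::PP module, which ships with Perl since version 5.14, and uses the sample response shown above rather than a live call to the shortener; buzz2tw itself sticks to the regular expression.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Sample response copied from the example above
my $content = '{"short_url":"http://goo.gl/HTsh"}';

# Decode the JSON text into a hash reference and read the short_url key
my $shortlink = decode_json($content)->{short_url};
print "$shortlink\n";
```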

Chap 8 - Tweeting the post

We got the text and the shortened link, so we can safely post our tweet. Because of Twitter's limitations, we have to ensure that our tweet won't be longer than 140 chars. To do so, we're going to cut the text to 106 chars and then append an ellipsis and the link to it. Moreover, as we are going to post this via another HTTP request, we are going to urlencode the whole thing as we did before

        # Cutting text of the post and adding the shortened link  
        my $tweet = substr($text,0,106).'... '.$shortlink;
        $tweet  =~ s/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/seg;
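The arithmetic behind the 106 can be checked with a short sketch: 106 chars of text plus the 4 chars of "... " leave 30 chars for the link within the 140 limit. The "build_tweet" helper is hypothetical, and the short link is the sample one from the previous chapter (18 chars); the length is measured before urlencoding, since that is what Twitter counts.

```perl
use strict;
use warnings;

# Hypothetical helper: cut the text and append the short link
sub build_tweet {
    my ($text, $shortlink) = @_;
    return substr($text, 0, 106) . '... ' . $shortlink;
}

# A deliberately long invented text: 300 times the letter "x"
my $tweet = build_tweet("x" x 300, "http://goo.gl/HTsh");
print length($tweet), "\n";   # 106 + 4 + 18 = 128
```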

There we go. Now we create a new HTTP call, but this time it has to be a POST one, so we change the object and we add the "content" variable, which contains our post. Moreover we need to use our twitter username and password to post, so we're going to use the authorization_basic method of our HTTP object. Here's how to do so

        # Posting to Twitter
        my $ua3 = new LWP::UserAgent;
        my $req3 = new HTTP::Request POST => 'http://twitter.com/statuses/update.xml';
        $req3->authorization_basic($twuser, $twpass);
        $req3->content_type('application/x-www-form-urlencoded');
        $req3->content("status=$tweet");
        # Send the request only once, otherwise the tweet is submitted twice
        my $content3 = $ua3->request($req3)->content();

Yes! We posted it!

Chap 9 - Avoiding future duplicates

With the tweet posted, we can take care of avoiding future duplicates. In exactly the same way we checked for the md5 of the article before, we now need to put the md5 of this link into the database. The procedure is very similar to the previous one, but instead of the SELECT statement we are going to use an INSERT, and obviously we don't need to retrieve any data from the database.

        # Inserting the post in the database to avoid duplicates
        # The link is passed as a bound parameter instead of being pasted into the SQL
        my $insertmd5 = "INSERT INTO buzzs VALUES(md5(?))";
        my $go_i = $dbh->prepare($insertmd5);
        $go_i->execute($link);
        $go_i->finish();

Chap 10 - Closing the script

We're done, we just need to close the connection as follows. This has to be placed outside the if, after the end of the foreach cycle

# Closing dbs connection  
$dbh->disconnect();

If you want to look at the complete code of this tutorial please check it out on github : buzz2tw.pl.

google buzz, perl, Twitter
Andrea Olivato