Archive for the ‘Perl’ Category

Facebook Spider

Thursday, January 26th, 2006

I’ve written a Perl program to spider Facebook. I was looking for a way to quickly generate statistics about the University of California, Berkeley student population, and I figured that since almost everybody had a Facebook account, I could dump all of Facebook’s information into a database and generate reports from that information. Since this program has proven useful, I’ve decided to release it to the general public.

How It Works

If you’re unfamiliar with the term spider, I recommend that you read the Wikipedia page on web spiders for a thorough discussion of how a spider works. In a nutshell, my program goes to a Facebook user’s profile, scans their friends list for other profiles, visits each of those profiles, scans their friends lists in turn, and so on. Along the way, my program also scans each user’s profile for information, parses it, and inserts it into a SQL database.
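As a rough sketch of the crawl described above (not the original script), the traversal amounts to a breadth-first walk over the friend graph. The %FRIENDS table and the crawl() function are hypothetical stand-ins; a real spider would download and parse a profile page where the lookup happens.

```perl
use strict;
use warnings;

# Hypothetical stand-in for fetching a profile: maps a UID to the UIDs found
# in its friends list. A real spider would download and parse HTML here.
my %FRIENDS = (
    1 => [2, 3],
    2 => [1, 4],
    3 => [1],
    4 => [2, 5],
    5 => [],
);

# Breadth-first crawl from a starting UID; returns UIDs in visit order.
sub crawl {
    my ($start, $friends_of) = @_;
    my %seen  = ($start => 1);
    my @queue = ($start);
    my @visited;
    while (defined(my $uid = shift @queue)) {
        push @visited, $uid;            # profile parsing would happen here
        for my $friend (@{ $friends_of->($uid) }) {
            next if $seen{$friend}++;   # skip profiles already queued
            push @queue, $friend;
        }
    }
    return @visited;
}

my @order = crawl(1, sub { $FRIENDS{ $_[0] } // [] });
print join(', ', @order), "\n";   # prints "1, 2, 3, 4, 5"
```

The %seen hash is what keeps the crawl from looping forever, since friendships point in both directions.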

Features

I’m only aware of one other Facebook spider: a Perl script written by Michael Kelly. However, Michael’s script only collects information about a user’s friends. My script captures all the information available in a user’s profile (except for the ‘About Me’ field). Furthermore, my script provides the following enhancements:

  • Multi-threaded support. Each user’s profile is processed in its own thread. The total number of threads can be set using a command-line parameter, and the program uses semaphores to enforce the maximum number of threads.
  • SQL database storage. My script stores user information in a SQL database ordered by Facebook UID. I’ve used relatively simple queries throughout the script, so any SQL database should be supported (e.g., MySQL and PostgreSQL should work). However, I’ve chosen SQLite3 as the default database. If you wish to use another database type, install the appropriate DBD driver and modify the database handle line to use that driver.
  • Easy data processing. Since all data is stored in a SQL database, it should be relatively easy to write programs that query the database for information.
  • Sleep between threads. It’s possible to provide a value, in seconds, that my script should wait before spawning a new thread. This should prevent the script from overloading the Facebook servers.
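Switching database types mostly comes down to the data source name (DSN) passed to DBI. The sketch below only builds the DSN strings, following the usual conventions of DBD::SQLite, DBD::mysql, and DBD::Pg; dsn_for() and the database names are hypothetical, and the spider’s actual handle line is not shown in this post.

```perl
use strict;
use warnings;

# Build a DBI data source name (DSN) for a given driver.
# dsn_for() is a hypothetical helper, not part of the original script.
sub dsn_for {
    my ($driver, $name, $host) = @_;
    my %template = (
        SQLite => "dbi:SQLite:dbname=$name",
        mysql  => "dbi:mysql:database=$name",
        Pg     => "dbi:Pg:dbname=$name",
    );
    my $dsn = $template{$driver} or die "Unsupported driver: $driver\n";
    $dsn .= ";host=$host" if defined $host && $driver ne 'SQLite';
    return $dsn;
}

print dsn_for('SQLite', 'database.db'), "\n";
print dsn_for('Pg', 'facebook', 'localhost'), "\n";

# With DBI and the matching DBD module installed, the handle would be opened
# along these lines (user name and password are placeholders):
#   my $dbh = DBI->connect(dsn_for('Pg', 'facebook', 'localhost'),
#                          $user, $password, { RaiseError => 1 });
```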

Quick Start

Assuming you have all the necessary Perl modules and sqlite3 installed:

  1. Create a SQLite3 database:
    $ sqlite3 database.db 'CREATE TABLE userdata ( uid integer, name, friends, school, status, sex, concentration, residence, hometown, highschool, screenname, mobile, website, lookingfor, interestedin, relationshipstatus, politicalviews, interests, clubsjobs, favoritemusic, favoritemovies, favoritebooks );'
  2. Create a facebook.conf:
    $ cp facebook.conf.sample facebook.conf
    $ vim facebook.conf
  3. Start the script:
    $ ./facebook.pl -t 2 -s 10 -f database.db [SOME FACEBOOK UID]
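For reference, flags like the ones in step 3 can be parsed with the core Getopt::Long module. This is a sketch, not the original code: the meaning of each flag (-t threads, -s seconds to sleep, -f database file) is inferred from the feature list, and parse_args(), the defaults, and the UID 12345 are made up for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse the flags shown in the Quick Start command. The flag meanings are
# inferred from the post: -t threads, -s sleep seconds, -f database file.
sub parse_args {
    my @args = @_;
    my %opt = (threads => 1, sleep => 0, file => 'database.db');
    GetOptionsFromArray(
        \@args,
        't=i' => \$opt{threads},
        's=i' => \$opt{sleep},
        'f=s' => \$opt{file},
    ) or die "Usage: $0 [-t threads] [-s seconds] [-f dbfile] UID\n";
    die "A starting Facebook UID is required\n" unless @args == 1;
    $opt{uid} = $args[0];
    return \%opt;
}

# 12345 is a placeholder UID.
my $opt = parse_args(qw(-t 2 -s 10 -f database.db 12345));
printf "threads=%d sleep=%d file=%s uid=%s\n",
    @{$opt}{qw(threads sleep file uid)};
```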

I Want It!

The script has been removed at Facebook’s request.

Notes

I haven’t tested the script lately, but it should still work. If it doesn’t, post a comment, and I’ll release an update.

Since my script parses the HTML returned from Facebook, if Facebook makes any changes to their profile layouts, I’ll have to make major modifications to the code.

Future

I’m in the process of designing an interface to Facebook that resembles Google Maps. Users will be able to interactively visualize their friend network, and clicking a user’s “node” should bring up their Facebook profile in a new window. More details will be forthcoming.

Daily del.icio.us Links Script for Wordpress 1.0

Saturday, October 29th, 2005

I’ve released a new version of my Daily del.icio.us Links Script for Wordpress. This new version introduces the use of the Net::Delicious and DateTime Perl modules to parse del.icio.us links and handle timezone differences between GMT and WordPress. As all known issues have been resolved and all basic functionality implemented, I am tagging this release version 1.0. Please refer to the latest “Links” post to see an example of the new version’s output.

Get the script here and check here if you need help.

Back in Business

Thursday, October 20th, 2005

Now that I’ve survived my first wave of midterms, I plan on devoting more time to my extracurricular activities.

I’ve noticed that my daily del.icio.us script for Wordpress is broken, so I’ll be updating it to work again, and I’ll finally be addressing the nagging issue of time sync in that script. I also just finished setting up a Xen server for the System Administration for the Web class, so I’ll be sharing my experiences building a Debian Xen server via a comprehensive Xen 2.0.7 HOWTO; one cannot begin to describe the lack of documentation for Xen. Lastly, I’ll be updating some projects that are currently off-limits to the public. I’ll be releasing more information about them when they reach a usable state.

UPDATED: Daily del.icio.us Links Script for Wordpress

Tuesday, May 10th, 2005

I’ve released a new version of my Daily del.icio.us Links Script for Wordpress. This new version introduces a code cleanup and the ability to show the tags associated with a link as well as links to your del.icio.us page of those tags. Please refer to the latest “Links” post (see below) to see an example of the new version’s output.

I’m aware of the timing issue associated with the script — del.icio.us uses Zulu time, which makes it hard to determine which links should fall under which day. Once I’m done with finals, I’ll hopefully have time to release a new version with a fix for this bug.
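To make the timing issue concrete, here is a small sketch that works out which calendar day a Zulu timestamp falls on for a blog at a fixed offset from UTC. It assumes timestamps of the form 2005-05-10T03:15:00Z; local_day() and the fixed-offset approach are illustrative only and are not how the script itself handles time.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);
use POSIX qw(strftime);

# Return the calendar day (YYYY-MM-DD) that a UTC timestamp falls on for a
# blog whose clock is $offset_hours away from UTC.
sub local_day {
    my ($stamp, $offset_hours) = @_;
    my ($y, $mo, $d, $h, $mi, $s) =
        $stamp =~ /^(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/
        or die "Unrecognized timestamp: $stamp\n";
    my $epoch = timegm($s, $mi, $h, $d, $mo - 1, $y);
    # Shift by the offset, then format with gmtime so that the machine's own
    # timezone setting does not affect the result.
    return strftime('%Y-%m-%d', gmtime($epoch + $offset_hours * 3600));
}

print local_day('2005-05-10T03:15:00Z',  0), "\n";   # prints "2005-05-10"
print local_day('2005-05-10T03:15:00Z', -7), "\n";   # prints "2005-05-09"
```

The same link lands on May 10th in Zulu time but on May 9th for a blog seven hours behind, which is exactly the kind of mismatch described above.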

UPDATE: I’ve moved hosting of my Daily del.icio.us Links Script to labs.evilcoder.com. Find the script here.

Backup A Directory With Email

Saturday, May 7th, 2005

I wasn’t able to sleep, so I wrote this Perl script to back up a directory and mail yourself a copy. Change the variables in CAPS to suit your needs. It’s best to run this script as a cronjob.

Download a plaintext copy here.

Updated 5/8/2005: I’ve modified the code to use the File::Find library instead of a glob. My previous iteration of the code ignored hidden files (easily fixable by adding a second glob) and sub-directories. This new version fixes those two issues.

#!/usr/bin/perl

# This script creates an in-memory gzipped tar archive of the $BACKUP_DIR
# and emails the archive to an email account.

# Copyright Stephen Le
# Last Modified: May 8, 2005
# http://stephen.evilcoder.com

use strict;
use warnings;
use MIME::Lite;
use Archive::Tar;
use Compress::Zlib;
use File::Find;

my $SENDER      = 'sender@domain.com';
my $RECIPIENT   = 'recipient@domain.com';
my $BACKUP_DIR  = '/directory/to/backup';

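# Walk $BACKUP_DIR recursively and add every entry (hidden files and
# sub-directories included) to the in-memory archive. no_chdir keeps the
# working directory fixed so that a relative $BACKUP_DIR also works.
my $tar = Archive::Tar->new;
find({ wanted => \&wanted, no_chdir => 1 }, $BACKUP_DIR);
sub wanted {
    $tar->add_files($File::Find::name);
}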

my $msg = MIME::Lite->new(
    From    => $SENDER,
    To      => $RECIPIENT,
    Subject => "Backup of " . $BACKUP_DIR . " at " . localtime,
    Type    => "multipart/mixed",
);

$msg->attach(
    'Type'     => 'application/octet-stream',
    'Encoding' => 'base64',
    'Filename' => "backup.tar.gz",
    'Data'     => Compress::Zlib::memGzip( $tar->write )
);

$msg->send;
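
Before trusting a cronjob with this, it can help to check the archive step on its own, without sending any mail. The sketch below builds a gzipped tar in memory from literal strings and runs two basic sanity checks; build_archive() and the file names are hypothetical and not part of the script above.

```perl
use strict;
use warnings;
use Archive::Tar;
use Compress::Zlib;

# Build a gzipped tar archive in memory from a list of name => contents pairs.
sub build_archive {
    my (%files) = @_;
    my $tar = Archive::Tar->new;
    $tar->add_data($_, $files{$_}) for sort keys %files;
    return Compress::Zlib::memGzip($tar->write);
}

my $gz = build_archive('notes.txt' => "hello\n", 'todo.txt' => "backup\n");

# A gzip stream starts with the magic bytes 0x1f 0x8b.
my $magic_ok = substr($gz, 0, 2) eq "\x1f\x8b";

# After decompression, a tar stream is a whole number of 512-byte blocks.
my $raw       = Compress::Zlib::memGunzip($gz);
my $blocks_ok = defined $raw && length($raw) % 512 == 0;

printf "gzip magic: %s, tar blocks: %s\n",
    ($magic_ok ? 'ok' : 'bad'), ($blocks_ok ? 'ok' : 'bad');
```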

Daily del.icio.us Links Script for Wordpress

Sunday, February 27th, 2005

I’ve been using this PHP script to generate the daily “Links” posts from my del.icio.us links. Although the script works, it has some annoying bugs. For example, if there are no links for the day, it’ll still make a post which I have to manually delete. Furthermore, it’s written in PHP, a language I haven’t gotten around to learning, so I can’t fix or maintain it.

I decided to re-write the script in Perl, a language that I’m much more comfortable using. And so, I present my Daily del.icio.us Links Script for Wordpress. The code is pretty self-explanatory, and there are a few variables you’ll have to set.

Once the configuration variables are set, just add the script to your crontab; it should run once a day, around midnight GMT. It will use Net::Delicious to get a list of your del.icio.us links, filter out the links that don’t match the current date, create a nifty HTML version of your links, and post them to your Wordpress blog. If you’d like to see a sample of the output, click here.
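
The filter-and-format steps can be sketched roughly as follows. This is not the script’s actual code: it works on plain hashes rather than Net::Delicious objects, and the field names (href, title, time) and all three function names are assumptions made for the example.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text and attributes.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Keep only links whose UTC timestamp starts with the given YYYY-MM-DD day.
sub links_for_day {
    my ($day, @links) = @_;
    return grep { index($_->{time}, $day) == 0 } @links;
}

# Render links as an HTML list; returns '' when there is nothing to post.
sub render_links {
    my (@links) = @_;
    return '' unless @links;
    my @items = map {
        sprintf '<li><a href="%s">%s</a></li>',
            escape_html($_->{href}), escape_html($_->{title})
    } @links;
    return "<ul>\n" . join("\n", @items) . "\n</ul>\n";
}

# Made-up sample data in place of a del.icio.us API response.
my @links = (
    { href => 'http://example.com/a', title => 'Perl & threads', time => '2005-05-10T03:15:00Z' },
    { href => 'http://example.com/b', title => 'Older link',     time => '2005-05-09T22:40:00Z' },
);
print render_links(links_for_day('2005-05-10', @links));
```

Returning an empty string when no links match gives the caller an easy way to skip posting on days without links.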

Enjoy.

Note: It has come to my attention that del.icio.us now offers a service that does pretty much the same thing as my script, automatically. Natalie Downe has posted instructions on her blog for how to get the service to work with Wordpress blogs. If you want more control over your posts, though, I still recommend my script (OK, and maybe I’m a bit biased ;) ).

WordPress 2.3 Support: Thanks to everyone for bringing the WordPress 2.3 incompatibility to my attention. I am no longer an active user of del.icio.us. However, I do plan on releasing an update to my script that will provide support for the latest version of WordPress sometime during my Thanksgiving break. In the meantime, Edward de Leau has modified my script to support WordPress 2.3 and added some other nifty features.

Features:

  • Completely self-contained, designed to run as a cronjob.
  • Automatically filters out links not matching the current date.
  • Adds links to tags used.
  • Users may specify post slug, post title, and trackback/comment status.
  • Uses Net::Delicious to maintain future compatibility with any del.icio.us API changes.

Changelog:

  • 10/29/2005: Modified code to use Net::Delicious and DateTime Perl modules to maintain compatibility with future del.icio.us API changes and resolve timezone issues.
  • 8/29/2005: Fixed reversal of gmtime and localtime. Thanks squish!
  • 5/9/2005: I’ve cleaned up the code and added the ability to add links back to the tags you used for a link.

Versions: