Archive for tag "RSS"

As alluded to last week, Blogawa is undergoing a reskinning. The new site will be pretty much the same in terms of functionality, except that it should look a bit nicer. The only major change will be the use of Gravatars to provide avatars for authors.

The URL for the RSS feed will change. I should be able to set up a forward to send your RSS reader to the appropriate place, but I may not. So if you find that your feed reader breaks, come back to Blogawa and resubscribe.

Aaaany day now…

The web is full of data. Many, many websites are thin scripts pulling data out of a database, formatting it in HTML, and presenting it in a way that is (hopefully) human readable. In an ideal world these websites would provide an API that would give programmers a way of sucking that information into their own creations. Sadly, banks, libraries, online stores, and other data providers aren’t that altruistic. They are only going to provide APIs when their customers demand it.

Now imagine that we could impose an API onto websites. We could say “every bank will provide this API to give access to your financial history.” Obviously banks aren’t going to posse up and agree on a common interface, so we programmers will have to do it ourselves. Enter the idea of a “portable scraper.”

A portable scraper is a script that will consume one or more web pages, parse them, normalize the data, and then present the transformed data in a well-defined schema. That sounds like a scraper, right? Where does the whole portable thing come in?

Portability would allow programmers to write a scraper once and share it with other coders. The scraper would be

  1. Well defined, meaning that the inputs (i.e., credentials and parameters) are easy to read, and the output is in a format easily consumable by other programs
  2. Language neutral, meaning that it would run in any host interpreter (e.g., Python, Perl, PHP, Java, Ruby, JavaScript, etc.)
  3. Auditable, meaning that programmers can read a scraper quickly and get a good idea of what it’s doing
  4. Secure, meaning that programmers have a reasonable assurance that a scraper isn’t leaking sensitive data to unauthorized third parties

I’m essentially suggesting a domain specific language that runs in a virtual machine. The language defines how to get information out of some website, while the virtual machine limits who the scraper can talk to.
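To make the "virtual machine limits who the scraper can talk to" part a bit more concrete, here is a rough Python sketch of how a host might check every URL a scraper asks for against the list it declared up front. The `Sandbox` class, the `AccessDenied` exception, and the `bank.example` URLs are all made up for illustration; a real VM would also have to worry about redirects, cookies, and so on.

```python
from fnmatch import fnmatchcase


class AccessDenied(Exception):
    """Raised when a scraper asks for a URL outside its declared access list."""


class Sandbox:
    """Hypothetical gatekeeper sitting between a scraper and the network."""

    def __init__(self, allowed_patterns):
        # Patterns are shell-style globs; '*' matches any run of characters,
        # including '/', so a trailing '/*' covers a whole hierarchy.
        self.allowed_patterns = list(allowed_patterns)

    def check(self, url):
        """Return the URL unchanged if it is allowed, otherwise raise."""
        if not any(fnmatchcase(url, p) for p in self.allowed_patterns):
            raise AccessDenied(url)
        return url


# Assumed declarations, in the spirit of a scraper's preamble.
sandbox = Sandbox([
    "https://bank.example/login",
    "https://accounts.bank.example/*",
])
```

The host would call `sandbox.check(url)` before every fetch, so a scraper that tried to post credentials to some other host would fail loudly instead of quietly leaking data.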

Let’s look at an example. In some alternate universe, I have an account with the PiePalace National Bank (PPNB). Every week, I want to receive an SMS message telling me if I’m spending beyond my budget. Since the PPNB doesn’t provide an API for their customers (shame!), I have to use a portable scraper to pull my account history from the PPNB website. Happily, some other programmer has already faced this problem and has published a portable scraper to do the job. It looks like:

// Tell our caller which website we need to be able to access.
// The scraper won't be able to access pages outside of this hierarchy
require access 'https://natbank.pp/login.php'; // Login page
require access 'https://accounts.natbank.pp/*'; // Portion of the website providing acct history
	
// Tell the caller that we need certain parameters
require input 'bankCardNumber'; // The user's debit card number
require input 'password'; // The user's password
require input 'accountNumber'; // The number of the account, as shown on PPNB web pages
	
// Provide an interface for the return value. Quasi-BNF.
export output {
  HISTORY = row*; // We must provide a history element that has zero or more rows
  row = String[4]; // Indicate that each row has exactly four string elements
};
	
// Start at the login page to get our cookies
$browser = new Browser();
$browser get 'https://natbank.pp/login.php';
$form = $browser chooseform 'LoginForm';
$form set 'username' $INPUT{'bankCardNumber'};
$form set 'password' $INPUT{'password'};
$browser submit $form;
	
// Follow a link to our account history
$browser follow ($browser chooseLink $INPUT{'accountNumber'});
	
// Consume the account history
$toReturn = new Array();
	
while (true) {
  $table = $browser chooseTable '#acctHistory';
	
  $table runOnEachRow [ $row |
    // Parse each row in the table containing our account history
    $date = $row get 0;
    $amount = $row get 1;
    $who = $row get 2;
    $balanceAfter = $row get 3;
	
    $toReturn push new Array($date, $amount, $who, $balanceAfter);
  ];
	
  // Follow the link to the next page, (if it exists)
  $nextPageLink = $browser chooseLink 'Next Page';
  if ($nextPageLink == nil) {
    break;
  }
	
  // Move to the next page and keep reading until the history is exhausted
  $browser follow $nextPageLink;
}
	
return $toReturn;

(The above is kinda Smalltalk: each line starts with an object reference, followed by the method to call. Each line is terminated by a ‘;’. if, while, and break behave as you’d expect. Closures are defined between [], with their parameters listed before the ‘|’.)

Notice that the script is written in a domain-specific language to handle the HTMLisms of the data being parsed. It contains a preamble that states which websites the script will need to visit (which are enforced by the VM), the input parameters, and the output format. A programmer using the scraper just has to properly call the script and use its return values.
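On the calling side, the declared inputs and output format could be checked mechanically. The following is only a sketch of what a Python host might do with the PPNB script's preamble; `check_inputs` and `check_history` are hypothetical helpers, not part of any existing library, and the sample row is invented.

```python
# Input names taken from the 'require input' lines of the example script.
REQUIRED_INPUTS = ("bankCardNumber", "password", "accountNumber")


def check_inputs(inputs):
    """Fail early if the caller forgot one of the declared inputs."""
    missing = [name for name in REQUIRED_INPUTS if name not in inputs]
    if missing:
        raise ValueError("missing inputs: " + ", ".join(missing))
    return inputs


def check_history(history):
    """Enforce 'HISTORY = row*; row = String[4]' on the scraper's return value."""
    for i, row in enumerate(history):
        if len(row) != 4 or not all(isinstance(cell, str) for cell in row):
            raise ValueError(f"row {i} is not four strings: {row!r}")
    return history


# An invented result with the declared shape: date, amount, who, balance after.
sample = [["2008-04-01", "-12.50", "Pie Shop", "87.50"]]
check_history(sample)
```

An empty history passes the check, since the schema allows zero or more rows; a row with three cells or a non-string cell is rejected before the caller ever sees it.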

Use Cases

Here are a few examples of scenarios that portable scrapers could excel at:

  1. Securely finding the current balance on a bank account
  2. Querying the price of an item in an online store
  3. Querying a website for a list of upcoming events
  4. Querying OC Transpo for a bus schedule
  5. Querying the OC Transpo Travel Planner for a travel plan between two points

Portable Scrapers vs. Plagger

When I floated this idea to dave0, he rightly asked “how is this different from Plagger?” As far as I can tell, Plagger is intended to be a processor for RSS feeds: it sequentially runs a series of plugins on a blackboard that contains at least one RSS feed. Portable scrapers would differ in that:

  1. they would handle arbitrary data types. Whereas Plagger (only?) dumps an RSS feed, a portable scraper would hand arbitrary data back to the caller.
  2. Plagger is the application. As far as I can tell, it doesn’t accept parameters (outside those written into its config files), and it isn’t designed to pass data back to another process. In other words, Plagger is intended to do the full computation, whereas a portable scraper is only intended to get data for further processing.
  3. Plagger modules are fully trusted. There is no programmatic mechanism to stop a Plagger plugin from leaking data (either through files or across the network).

Comments?

I’ve put together a new version of Miniposts2. It’s now WordPress 2.5 compatible, and supports filtering miniposts from feeds. Along the way I found the migration doc useful. And I came to discover that WordPress doesn’t really maintain any kind of backward compatibility between minor revisions.

On the off chance you’re interested in the RSS feeds that I read, here’s a quick rundown:

Local

Blogawa.ca
Blog aggregator for Ottawa-related blogs. I wrote the aggregator, so you should read it. =)
Runesmith’s Canadian Content
The rambling of Jennifer Smith. I enjoy her ongoing outrage at the Conservative government.
Ottawa LiveJournal Community
It’s more of a “where can I get X” listing, but it’s sort of interesting to see what the kids are up to.
THE CANADIAN DESIGN RESOURCE
A near-daily posting of random bits of Canadian design from the past hundred or so years. I have no idea why their name is in ALL CAPS, but that’s the way it’s presented in their feed.

Geekery

Lila’s Dreams Blog
Lila’s Dreams is a dev blog for an upcoming web-based MMOG. The setting is inside the psyche of an 11 year old girl. I’m not sure what the game is going to end up being, but it sounds like gardening should be a large part of game play, which sounds quite neat.
Dubroy.com/blog
I went to school with Pat, and he’s blogging as a grad student, which is a lifestyle that’s dear to my heart. He opines about usability, the evils of hierarchical filesystems, and difficulties installing stuff on Macs. I disagree with most things he says, but he’s well read and he comes at problems from the right angle.
datalibre.ca
Breathless open data zealots who think freely available data is a really good thing. They don’t trouble themselves with the hard questions of data ownership (curation, metadata, dealing with licensing/access restrictions) but approach the problem from a public interest standpoint. I’m not sure why I read this blog.
The Online Photographer (TOP) and Photoborg
I’m not sure why I read these sites. They’re kinda/sorta about photography. I’m looking for something with a few more tips, but I do enjoy the opining.

Funnies: Defective Yeti, xkcd, I Can Has Cheezburger?

My pet project, Blogawa.ca, now produces an RSS feed of its contributors’ feeds. Huzzah! Next project: Make Blogawa track/republish events in Ottawa.