Archive for May, 2008

Those who enjoy our fairly walkable downtown might want to visit the Sun poll asking if Ottawa needs another pedestrian bridge. Of course, the poll doesn’t really matter, but you might as well vote. And it’s an excuse for you to ogle the Sunshine girl. Hat tip: Gawp.

Blake Batson has said on his blog that he wants to “float ideas on how to improve our system that others will be free to vet or claim them as their own.” In that spirit, I’d like to present my first suggestion for our pals in the City of Ottawa: intensification.

Our city was supposed to be squeezed into the Greenbelt. But since the 60s, development has occurred outside the Greenbelt and our city has been surrounded by a fluffy pink tutu of sprawl. Looking at a Statistics Canada map of population density around Ottawa, we see that the population per square kilometre is mostly in the 500-2999 person range. Only in the core does the density rise beyond 5000 people/km². Worryingly, looking at the population change map between 2001 and 2006, we see that the population outside the Greenbelt is growing quickly, while the population in the no man’s land between exurbia and downtown is shrinking.

Given the received wisdom that city services (water delivery, sewage disposal, transit) work best in dense urban areas, Ottawa should be looking to the orange areas on that map to lower their cost per taxpayer.

Happily, I’m not the only person suggesting this. The transit experts hired by the city to evaluate our transit plan said the same thing: our suburbs need higher densities to make rail transit a viable option. In a surprising moment of lucidity, the city’s own transportation committee endorsed the idea of improving density along the new light rail route.

Our current transportation plan isn’t very different from what we have today. Hopefully, if City council can keep focused on building a more urban city, we can look at a much better transit scenario in 2031.

Thanks to Blake Batson for the idea of this series.

The web is full of data. Many, many websites are thin scripts pulling data out of a database, formatting it in HTML, and presenting it in a way that is (hopefully) human readable. In an ideal world these websites would provide an API that would give programmers a way of sucking that information into their own creations. Sadly, banks, libraries, online stores, and other data providers aren’t that altruistic. They are only going to provide APIs when their customers demand it.

Now imagine that we could impose an API onto websites. We could say “every bank will provide this API to give access to your financial history.” Obviously banks aren’t going to posse up and agree on a common interface, so we programmers will have to do it ourselves. Enter the idea of a “portable scraper.”

A portable scraper is a script that will consume one or more web pages, parse them, normalize the data, and then present the transformed data in a well-defined schema. That sounds like a scraper, right? Where does the whole portable thing come in?

Portability would allow programmers to write a scraper once and share it with other coders. The scraper would be

  1. Well defined, meaning that the inputs (ie, credentials and parameters) are easy to read, and the output is in a format easily consumable by other programs
  2. Language neutral, meaning that it would run in any host interpreter (eg: Python, Perl, PHP, Java, Ruby, JavaScript, etc)
  3. Auditable, meaning that programmers can read a scraper quickly and get a good idea of what it’s doing
  4. Secure, meaning that programmers have a reasonable assurance that a scraper isn’t leaking sensitive data to unauthorized third parties

I’m essentially suggesting a domain specific language that runs in a virtual machine. The language defines how to get information out of some website, while the virtual machine limits who the scraper can talk to.
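To make the "virtual machine limits who the scraper can talk to" idea concrete, here is a minimal Python sketch of how a host might enforce an access list. Everything in it (the `Sandbox` class, `AccessDenied`, the trailing-`*` prefix rule) is my own assumption for illustration, not part of any existing tool; the example patterns are the ones the PiePalace scraper below declares.

```python
from urllib.parse import urlsplit


class AccessDenied(Exception):
    """Raised when a scraper asks for a URL outside its declared access list."""


class Sandbox:
    """Hypothetical gatekeeper that every scraper request would pass through."""

    def __init__(self, allowed_patterns):
        self.allowed_patterns = list(allowed_patterns)

    @staticmethod
    def _matches(pattern, url):
        # Assumed rule: a trailing '*' means "anything under this prefix";
        # any other pattern must match the URL exactly.
        if pattern.endswith("*"):
            return url.startswith(pattern[:-1])
        return url == pattern

    def is_allowed(self, url):
        parts = urlsplit(url)
        # Compare scheme, host and path only; the query string is ignored.
        normalized = f"{parts.scheme}://{parts.netloc}{parts.path}"
        return any(self._matches(p, normalized) for p in self.allowed_patterns)

    def check(self, url):
        """Return the URL if it is permitted, otherwise raise AccessDenied."""
        if not self.is_allowed(url):
            raise AccessDenied(url)
        return url


sandbox = Sandbox([
    "https://natbank.pp/login.php",
    "https://accounts.natbank.pp/*",
])
```

A real VM would call something like `sandbox.check(url)` before every request, redirects included, so that a scraper cannot quietly post credentials to a third party.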

Let’s look at an example. In some alternate universe, I have an account with the PiePalace National Bank (PPNB). Every week, I want to receive an SMS message telling me if I’m spending beyond my budget. Since the PPNB doesn’t provide an API for their customers (shame!), I have to use a portable scraper to pull my account history from the PPNB website. Happily, some other programmer has already faced this problem and has published a portable scraper to do the job. It looks like:

// Tell our caller which website we need to be able to access.
// The scraper won't be able to access pages outside of this hierarchy
require access 'https://natbank.pp/login.php'; // Login page
require access 'https://accounts.natbank.pp/*'; // Portion of the website providing acct history
	
// Tell the caller that we need certain parameters
require input 'bankCardNumber'; // The user's debit card number
require input 'password'; // The user's password
require input 'accountNumber'; // The number of the account, as shown on PPNB web pages
	
// Provide an interface for the return value. Quasi-BNF.
export output {
  HISTORY = row*; // We must provide a history element that has zero or more rows
  row = String[4]; // Indicate that each row has exactly four string elements
};
	
// Start at the login page to get our cookies
$browser = new Browser();
$browser get 'https://natbank.pp/login.php';
$form = $browser chooseform 'LoginForm';
$form set 'username' $INPUT{'bankCardNumber'};
$form set 'password' $INPUT{'password'};
$browser submit $form;
	
// Follow a link to our account history
$browser follow ($browser chooseLink $INPUT{'accountNumber'});
	
// Consume the account history
$toReturn = new Array();
	
while (true) {
  $table = $browser chooseTable '#acctHistory';
	
  $table runOnEachRow [ $row |
    // Parse each row in the table containing our account history
    $date = $row get 0;
    $amount = $row get 1;
    $who = $row get 2;
    $balanceAfter = $row get 3;
	
    $toReturn push new Array($date, $amount, $who, $balanceAfter);
  ];
	
  // Follow the link to the next page, (if it exists)
  $nextPageLink = $browser chooseLink 'Next Page';
  if ($nextPageLink == nil) {
    break;
  }
	
  // Move on to the next page so that we read the entire history
  $browser follow $nextPageLink;
}
	
return $toReturn;

(The above is kinda Smalltalk: each statement starts with an object reference, followed by the method to call, and is terminated by a ‘;’. if, while, and break behave as you’d expect. Closures are defined between [], with their parameters named before the ‘|’.)

Notice that the script is written in a domain-specific language to handle the HTMLisms of the data being parsed. It contains a preamble that states which websites the script will need to visit (which are enforced by the VM), the input parameters, and the output format. A programmer using the scraper just has to properly call the script and use its return values.
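As a rough idea of what "use its return values" could look like on the caller's side, here is a Python sketch that checks a result against the declared output format (zero or more rows of exactly four strings) and then totals the spending for the budget check. The function names, the sample rows, and the assumption that withdrawals show up as negative decimal strings are all mine; the scraper above says nothing about how PPNB formats its amounts.

```python
from decimal import Decimal


def validate_history(history):
    """Check the result against HISTORY = row*; row = String[4]."""
    if not isinstance(history, list):
        raise ValueError("HISTORY must be a list of rows")
    for i, row in enumerate(history):
        if not isinstance(row, (list, tuple)) or len(row) != 4:
            raise ValueError(f"row {i} does not have exactly four elements")
        if not all(isinstance(cell, str) for cell in row):
            raise ValueError(f"row {i} contains a non-string element")
    return history


def total_spent(history):
    """Add up withdrawals, assuming they appear as negative amounts."""
    total = Decimal("0")
    for _date, amount, _who, _balance_after in history:
        value = Decimal(amount)
        if value < 0:
            total += -value
    return total


# Made-up rows in the (date, amount, who, balanceAfter) order the scraper uses.
sample = [
    ["2008-05-01", "-12.50", "Pie Shop", "87.50"],
    ["2008-05-02", "100.00", "Payroll", "187.50"],
    ["2008-05-03", "-40.00", "Grocer", "147.50"],
]

spent = total_spent(validate_history(sample))
```

Validating before use means a scraper that breaks when the bank redesigns its pages fails loudly, instead of feeding garbage into the budget calculation.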

Use Cases

Here are a few examples of scenarios that portable scrapers could excel at:

  1. Securely finding the current balance on a bank account
  2. Querying the price of an item in an online store
  3. Querying a website for a list of upcoming events
  4. Querying OC Transpo for a bus schedule
  5. Querying the OC Transpo Travel Planner for a travel plan between two points

Portable Scrapers vs. Plagger

When I floated this idea to dave0, he rightly asked “how is this different from Plagger?” As far as I can tell, Plagger is intended to be a processor for RSS feeds: it sequentially runs a series of plugins on a blackboard that contains at least one RSS feed. Portable scrapers would differ in that:

  1. Portable scrapers would handle arbitrary data types. Whereas Plagger (only?) dumps an RSS feed, a portable scraper would hand arbitrary data back to the caller.
  2. Plagger is the application. As far as I can tell, it doesn’t accept parameters (outside those written into its config files), and it isn’t designed to pass data back to another process. In other words, Plagger is intended to do the full computation, whereas a portable scraper is only intended to get data for further processing.
  3. Plagger modules are fully trusted. There is no programmatic mechanism to stop a Plagger plugin from leaking data (either through files or across the network).

Comments?

I’m taking a photojournalism course at SPAO. Last class our instructor told us to get used to shooting in a few different modes. His suggestions: for indoor, use a default of f/4 at 1/60 of a second; for outdoor, use f/8 at 1/250 of a second.

So we aren’t at the point where we have Diamond Age-style fabricators, but it’s neat to see online fabrication. I’d seen eMachineshop a couple of years back, but their prices were prohibitive for my crappy creations. I hadn’t heard about Ponoko: it’s similar to eMachineshop, but it seems to specialize in more consumer-grade materials.