Archive for tag "trusted computing"

The web is full of data. Many, many websites are thin scripts pulling data out of a database, formatting it in HTML, and presenting it in a way that is (hopefully) human readable. In an ideal world these websites would provide an API that would give programmers a way of sucking that information into their own creations. Sadly, banks, libraries, online stores, and other data providers aren’t that altruistic. They are only going to provide APIs when their customers demand it.

Now imagine that we could impose an API onto websites. We could say “every bank will provide this API to give access to your financial history.” Obviously banks aren’t going to posse up and agree on a common interface, so we programmers will have to do it ourselves. Enter the idea of a “portable scraper.”

A portable scraper is a script that will consume one or more web pages, parse them, normalize the data, and then present the transformed data in a well-defined schema. That sounds like a scraper, right? Where does the whole portable thing come in?

Portability would allow programmers to write a scraper once and share it with other coders. The scraper would be

  1. Well defined, meaning that the inputs (i.e., credentials and parameters) are easy to read, and the output is in a format easily consumable by other programs
  2. Language neutral, meaning that it would run in any host interpreter (e.g., Python, Perl, PHP, Java, Ruby, JavaScript, etc.)
  3. Auditable, meaning that programmers can read a scraper quickly and get a good idea of what it’s doing
  4. Secure, meaning that programmers have a reasonable assurance that a scraper isn’t leaking sensitive data to unauthorized third parties

I’m essentially suggesting a domain specific language that runs in a virtual machine. The language defines how to get information out of some website, while the virtual machine limits who the scraper can talk to.
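
To make the "limits who the scraper can talk to" part concrete, here is a minimal Python sketch of how a host VM might check each outgoing request against the access patterns a scraper declares. The names `is_allowed` and `ALLOWED_PATTERNS` and the example.com URLs are invented for illustration. The matching rules (exact scheme and host, shell-style wildcards on the path only) are an assumption, not a defined part of the proposal.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical access declarations, in the wildcard style a scraper might use.
ALLOWED_PATTERNS = [
    "https://example.com/login.php",
    "https://accounts.example.com/*",
]


def is_allowed(url: str, patterns: list[str]) -> bool:
    """Return True if url matches at least one declared access pattern."""
    target = urlsplit(url)
    for pattern in patterns:
        rule = urlsplit(pattern)
        # Assumption: scheme and host must match exactly, so a wildcard
        # can never widen access to another server.
        if target.scheme != rule.scheme:
            continue
        if target.netloc.lower() != rule.netloc.lower():
            continue
        # Wildcards apply to the path only; the query string is ignored.
        if fnmatchcase(target.path or "/", rule.path or "/"):
            return True
    return False
```

A VM built along these lines would call `is_allowed` before every request the scraper makes and refuse anything that returns `False`, which is what would stop a scraper from shipping credentials to a third-party server.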

Let’s look at an example. In some alternate universe, I have an account with the PiePalace National Bank (PPNB). Every week, I want to receive an SMS message telling me if I’m spending beyond my budget. Since the PPNB doesn’t provide an API for their customers (shame!), I have to use a portable scraper to pull my account history from the PPNB website. Happily, some other programmer has already faced this problem and has published a portable scraper to do the job. It looks like:

// Tell our caller which websites we need to be able to access.
// The scraper won't be able to access pages outside of this hierarchy
require access 'https://natbank.pp/login.php'; // Login page
require access 'https://accounts.natbank.pp/*'; // Portion of the website providing acct history
	
// Tell the caller that we need certain parameters
require input 'bankCardNumber'; // The user's debit card number
require input 'password'; // The user's password
require input 'accountNumber'; // The number of the account, as shown on PPNB web pages
	
// Provide an interface for the return value. Quasi-BNF.
export output {
  HISTORY = row*; // We must provide a history element that has zero or more rows
  row = String[4]; // Indicate that each row has exactly four string elements
};
	
// Start at the login page to get our cookies
$browser = new Browser();
$browser get 'https://natbank.pp/login.php';
$form = $browser chooseform 'LoginForm';
$form set 'username' $INPUT{'bankCardNumber'};
$form set 'password' $INPUT{'password'};
$browser submit $form;
	
// Follow a link to our account history
$browser follow ($browser chooseLink $INPUT{'accountNumber'});
	
// Consume the account history
$toReturn = new Array();
	
while (true) {
  $table = $browser chooseTable '#acctHistory';
	
  $table runOnEachRow [ $row |
    // Parse each row in the table containing our account history
    $date = $row get 0;
    $amount = $row get 1;
    $who = $row get 2;
    $balanceAfter = $row get 3;
	
    $toReturn push new Array($date, $amount, $who, $balanceAfter);
  ];
	
  // Follow the link to the next page (if it exists)
  $nextPageLink = $browser chooseLink 'Next Page';
  if ($nextPageLink == nil) {
    break;
  }
	
  // Keep going until we have read the entire history
  $browser follow $nextPageLink;
}
	
return $toReturn;

(The above is kinda Smalltalk: each line starts with an object reference, followed by the method to call. Each line is terminated by a ‘;’. if, while, and break behave as you’d expect. Closures are defined between [], with their parameters named before the ‘|’.)

Notice that the script is written in a domain-specific language to handle the HTMLisms of the data being parsed. It contains a preamble that states which websites the script will need to visit (which are enforced by the VM), the input parameters, and the output format. A programmer using the scraper just has to properly call the script and use its return values.
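
On the calling side, the host program can hold the scraper to its declared interface. The Python sketch below assumes the host receives the result as a list of rows. It checks that the three declared inputs are present, verifies that every row has exactly four strings (matching the `row = String[4]` declaration), and then totals the debits for the budget check. The helper names, the sample rows, and the convention that amounts are decimal strings with debits negative are all assumptions made for illustration.

```python
from decimal import Decimal

# Input names taken from the scraper's "require input" lines.
REQUIRED_INPUTS = ("bankCardNumber", "password", "accountNumber")


def check_inputs(supplied: dict) -> None:
    """Raise KeyError if any input the scraper requires is missing."""
    missing = [name for name in REQUIRED_INPUTS if name not in supplied]
    if missing:
        raise KeyError("missing scraper inputs: " + ", ".join(missing))


def validate_history(result):
    """Check that result is a list of rows, each holding exactly four strings."""
    if not isinstance(result, list):
        raise TypeError("HISTORY must be a list of rows")
    for index, row in enumerate(result):
        if not isinstance(row, (list, tuple)) or len(row) != 4:
            raise ValueError(f"row {index} must have exactly four elements")
        if not all(isinstance(cell, str) for cell in row):
            raise ValueError(f"row {index} must contain only strings")
    return result


def total_spent(history) -> Decimal:
    """Sum the debits, assuming column 1 is a decimal string, negative for debits."""
    amounts = (Decimal(row[1]) for row in history)
    return sum((-amount for amount in amounts if amount < 0), Decimal("0"))


# Invented sample rows in the scraper's column order:
# date, amount, who, balance after the transaction.
SAMPLE_HISTORY = [
    ["2009-03-01", "-12.50", "Pie Shop", "87.50"],
    ["2009-03-02", "100.00", "Payroll", "187.50"],
    ["2009-03-03", "-30.25", "Grocer", "157.25"],
]
```

With a validated history in hand, the weekly budget check reduces to comparing `total_spent(validate_history(rows))` against a limit and sending the SMS when it is exceeded.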

Use Cases

Here are a few examples of scenarios that portable scrapers could excel at:

  1. Securely finding the current balance on a bank account
  2. Querying the price of an item in an online store
  3. Querying a website for a list of upcoming events
  4. Querying OC Transpo for a bus schedule
  5. Querying the OC Transpo Travel Planner for a travel plan between two points

Portable Scrapers vs. Plagger

When I floated this idea to dave0, he rightly asked “how is this different from Plagger?” As far as I can tell, Plagger is intended to be a processor for RSS feeds: it sequentially runs a series of plugins on a blackboard that contains at least one RSS feed. Portable scrapers would differ in that:

  1. they would handle arbitrary data types. Whereas Plagger (only?) dumps an RSS feed, a portable scraper would hand arbitrary data back to the caller.
  2. Plagger is the application. As far as I can tell, it doesn’t accept parameters (outside those written into its config files), and it isn’t designed to pass data back to another process. In other words, Plagger is intended to do the full computation, whereas a portable scraper is only intended to get data for further processing.
  3. Plagger modules are fully trusted. There is no programmatic mechanism to stop a Plagger plugin from leaking data (either through files or across the network).

Comments?

Pascal Meunier has written an essay about loyalty in software. It’s a riff on the idea of trusted computing (and the resulting crippled software), which asks about software’s loyalty. Is the software loyal to its user (as it should be for personal use), or is it loyal to its producer/distributor? The brief discussion of loyalty in free software is interesting. It would be interesting if loyalty could be quantified or expressed somehow. I’d like to be able to tag stuff that I write with a loyalty signature. Update: Thanks to dave0 for pointing out that I’d failed to include a link. Now I do.