The web is full of data. Many, many websites are thin scripts pulling data out of a database, formatting it in HTML, and presenting it in a way that is (hopefully) human readable. In an ideal world these websites would provide an API that would give programmers a way of sucking that information into their own creations. Sadly, banks, libraries, online stores, and other data providers aren’t that altruistic. They are only going to provide APIs when their customers demand it.
Now imagine that we could impose an API onto websites. We could say “every bank will provide this API to give access to your financial history.” Obviously banks aren’t going to posse up and agree on a common interface, so we programmers will have to do it ourselves. Enter the idea of a “portable scraper.”
A portable scraper is a script that will consume one or more web pages, parse them, normalize the data, and then present the transformed data in a well-defined schema. That sounds like a scraper, right? Where does the whole portable thing come in?
Portability would allow programmers to write a scraper once and share it with other coders. The scraper would be:
- Well defined, meaning that the inputs (i.e., credentials and parameters) are easy to read, and the output is in a format easily consumable by other programs
- Language neutral, meaning that it would run in any host language (e.g., Python, Perl, PHP, Java, Ruby, JavaScript, etc.)
- Auditable, meaning that programmers can read a scraper quickly and get a good idea of what it’s doing
- Secure, meaning that programmers have a reasonable assurance that a scraper isn’t leaking sensitive data to unauthorized third parties
I’m essentially suggesting a domain specific language that runs in a virtual machine. The language defines how to get information out of some website, while the virtual machine limits who the scraper can talk to.
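To make the "virtual machine limits who the scraper can talk to" part concrete, here is a minimal Python sketch of how a host might gate every outgoing request against a scraper's declared access list. Everything in it is an assumption for illustration: the Gatekeeper class, the AccessDenied exception, and the example.org URLs are hypothetical, and shell-style wildcard matching via fnmatch is just one possible way to interpret a trailing '*'.

```python
from fnmatch import fnmatchcase


class AccessDenied(Exception):
    """Raised when a scraper requests a URL outside its declared access list."""


class Gatekeeper:
    """Hypothetical VM component: every fetch must pass through check()."""

    def __init__(self, allowed_patterns):
        # Patterns are shell-style wildcards, e.g. 'https://data.example.org/*'.
        self.allowed_patterns = list(allowed_patterns)

    def is_allowed(self, url):
        # A URL is allowed if it matches at least one declared pattern.
        return any(fnmatchcase(url, pattern) for pattern in self.allowed_patterns)

    def check(self, url):
        # Return the URL unchanged if allowed, otherwise refuse loudly.
        if not self.is_allowed(url):
            raise AccessDenied(url)
        return url


# Hypothetical access list, as a scraper's preamble might declare it.
gate = Gatekeeper([
    "https://example.org/login",
    "https://data.example.org/*",
])
```

A real VM would call something like gate.check(url) before every request, including redirects, so a scraper could not quietly post credentials to a host it never declared. A production version would probably parse the URL and compare scheme and host explicitly rather than rely on string patterns.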
Let’s look at an example. In some alternate universe, I have an account with the PiePalace National Bank (PPNB). Every week, I want to receive an SMS message telling me if I’m spending beyond my budget. Since the PPNB doesn’t provide an API for their customers (shame!), I have to use a portable scraper to pull my account history from the PPNB website. Happily, some other programmer has already faced this problem and has published a portable scraper to do the job. It looks like:
require access 'https://natbank.pp/login.php';
require access 'https://accounts.natbank.pp/*';
require input 'bankCardNumber';
require input 'password';
require input 'accountNumber';
export output {
    HISTORY = row*;
    row = String[4];
};
$browser = new Browser();
$browser get 'https://natbank.pp/login.php';
$form = $browser chooseForm 'LoginForm';
$form set 'username' $INPUT{'bankCardNumber'};
$form set 'password' $INPUT{'password'};
$browser submit $form;
$browser follow ($browser chooseLink $INPUT{'accountNumber'});
$toReturn = new Array();
while (true) {
    $table = $browser chooseTable '#acctHistory';
    $table runOnEachRow [ $row |
        $date = $row get 0;
        $amount = $row get 1;
        $who = $row get 2;
        $balanceAfter = $row get 3;
        $toReturn push new Array($date, $amount, $who, $balanceAfter);
    ];
    $nextPageLink = $browser chooseLink 'Next Page';
    if ($nextPageLink == nil) {
        break;
    }
    $browser follow $nextPageLink;
}
return $toReturn;
(The above is kinda Smalltalk: each statement starts with an object reference, followed by the method to call, and is terminated by a ‘;’. if, while, and break behave as you’d expect. Closures are written between [ and ], with their parameters listed before the ‘|’.)
Notice that the script is written in a domain-specific language to handle the HTMLisms of the data being parsed. It contains a preamble that states which websites the script will need to visit (a list the VM enforces), the input parameters, and the output format. A programmer using the scraper just has to call the script with the required inputs and use its return values.
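From the caller's side, the preamble is effectively a contract. Here is a rough Python sketch of what a host wrapper might check before and after running the PPNB scraper: that all three declared inputs were supplied, and that the result matches HISTORY = row* with row = String[4]. The function names (check_inputs, check_history, run_scraper) and the fake_scraper stand-in are hypothetical; nothing here actually interprets the scraper language.

```python
# Names taken from the 'require input' lines of the example preamble.
REQUIRED_INPUTS = ("bankCardNumber", "password", "accountNumber")
ROW_WIDTH = 4  # from 'row = String[4]'


def check_inputs(inputs):
    """Refuse to run unless every declared input was supplied."""
    missing = [name for name in REQUIRED_INPUTS if name not in inputs]
    if missing:
        raise ValueError("missing inputs: " + ", ".join(missing))


def check_history(history):
    """Validate HISTORY = row*, where each row is exactly four strings."""
    for index, row in enumerate(history):
        if len(row) != ROW_WIDTH or not all(isinstance(cell, str) for cell in row):
            raise ValueError(f"row {index} does not match String[{ROW_WIDTH}]")
    return history


def run_scraper(scraper, inputs):
    """Hypothetical host entry point: validate inputs, run, validate output."""
    check_inputs(inputs)
    return check_history(scraper(inputs))


def fake_scraper(inputs):
    # Stand-in for the real interpreter: returns one made-up history row
    # (date, amount, who, balance after) so the wrapper can be exercised.
    return [["2009-01-05", "-12.50", "Pie Shop", "487.50"]]


history = run_scraper(
    fake_scraper,
    {"bankCardNumber": "1234", "password": "secret", "accountNumber": "42"},
)
```

Because the output shape is checked at the boundary, the budget-checking code that consumes history can index into each row without defensive checks of its own.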
Use Cases
Here are a few examples of scenarios that portable scrapers could excel at:
- Securely finding the current balance on a bank account
- Querying the price of an item in an online store
- Querying a website for a list of upcoming events
- Querying OC Transpo for a bus schedule
- Querying the OC Transpo Travel Planner for a travel plan between two points
Portable Scrapers vs. Plagger
When I floated this idea to dave0, he rightly asked “how is this different from Plagger?” As far as I can tell, Plagger is intended to be a processor for RSS feeds: it sequentially runs a series of plugins on a blackboard that contains at least one RSS feed. Portable scrapers would differ in that:
- they would handle arbitrary data types. Whereas Plagger (only?) dumps an RSS feed, a portable scraper would hand arbitrary data back to the caller.
- Plagger is the application. As far as I can tell, it doesn’t accept parameters (outside those written into its config files), and it isn’t designed to pass data back to another process. In other words, Plagger is intended to do the full computation, whereas a portable scraper is only intended to get data for further processing.
- Plagger modules are fully trusted. There is no programmatic mechanism to stop a Plagger plugin from leaking data (either through files or across the network).
Comments?