Wednesday, 1 May 2013

Screen-scraping Addresses from the USPS

As I mentioned in my posts about Web Tools, I wasn’t entirely sure I’d get approval from the Post Office to use the Web Tools servers, so I came up with a plan B: screen-scraping the USPS’ website. As with the others, I originally wrote the code in 2010. As it happened, a change to the site’s design recently made me revisit the code…and rewrite it almost entirely.

PHP is a good candidate for this, as it has capabilities that make screen-scraping very simple. To start with, you need an address to validate, and then you need to assemble a URL from which the contents will be retrieved. In my case, I put the addresses in a MySQL table and ran the whole thing in a loop. So I will assume you have that much in place, and dive right into the screen-scraping itself.

The URL can be assembled like so:

$haxUrl = 'https://tools.usps.com/go/ZipLookupResultsAction!input.action?resultMode=0&companyName=&';
$haxMsg = 'address1=' . trim($row['street']) . '&address2=' . trim($row['suite']) . '&city=' . trim($row['city']) . '&state=' . trim($row['state']) . '&urbanCode=&postalCode=&zip=';

I’ve already returned the address from MySQL as an associative array ($row). I’ve learned to use associative arrays where possible – it makes rewriting code to add or remove fields a very simple matter. The next step is to rework the query string you’ve just assembled: the address values need encoding, but the USPS’ site doesn’t like a fully URL-encoded string, so the = and & separators have to be put back. That can be done like so:

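// encode the whole query string, then restore the separators
$haxMsg = urlencode($haxMsg);
$haxMsg = str_replace('%3D','=',$haxMsg);
$haxMsg = str_replace('%26','&',$haxMsg);
// urlencode() already writes spaces as '+', so this is only a safeguard
$haxMsg = str_replace('%20','+',$haxMsg);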
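
As an alternative sketch, the same query string can be built with PHP's http_build_query(), which encodes each value separately and so never touches the = and & separators. The function name buildLookupUrl() and the sample address are made up for illustration, and whether the USPS page accepts values encoded this way (for instance, an & inside a street name becomes %26 here) is an assumption that has not been checked against the site.

```php
<?php
// Hypothetical helper: builds the lookup URL from one address row.
function buildLookupUrl(array $row): string
{
    $base = 'https://tools.usps.com/go/ZipLookupResultsAction!input.action';
    $params = [
        'resultMode'  => 0,
        'companyName' => '',
        'address1'    => trim($row['street']),
        'address2'    => trim($row['suite']),
        'city'        => trim($row['city']),
        'state'       => trim($row['state']),
        'urbanCode'   => '',
        'postalCode'  => '',
        'zip'         => '',
    ];
    // http_build_query() percent-encodes each value and, with its default
    // encoding type (PHP_QUERY_RFC1738), writes spaces as '+'.
    return $base . '?' . http_build_query($params);
}

// Made-up sample row, shaped like the associative array from MySQL.
$row = ['street' => ' 123 Main St ', 'suite' => 'Ste 4', 'city' => 'Anytown', 'state' => 'NY'];
echo buildLookupUrl($row), "\n";
```

Keeping the fields in one array also fits the point about associative arrays: adding or removing a parameter is a one-line change.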

Next, off to the USPS with it:

$newurl = $haxUrl . $haxMsg;
// retrieves the response from the server
$raw = file_get_contents($newurl);
if ($raw) { // gets around non-responsive server error
  // strips tabs, line breaks, double spaces, null bytes, and vertical tabs
  $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
  $content = str_replace($newlines, "", html_entity_decode($raw));
  $errorNotFound = strpos($content, '<li class="error">');
  $errorFoundNotFound = strpos($content, 'class="error clearfix">');
  $errorMultipleAdd = strpos($content, 'Several addresses matched the information you provided');
  $start = strpos($content, '<span class="address1 range">');
  $end = strpos($content, '<p class="show-details">');
  // chop down the result to just what's between start and end
  $valid = substr($content, $start, $end-$start);

The block continues a bit, so apologies for the open bracket without its closing one. What you’re doing here is chopping the returned data down to the relevant part – the address that’s been returned from the Post Office. From there, it’s just a matter of finding the address parts by their <span> tags.
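
The chopping step can be sketched as a small standalone function. sliceBetween() is a hypothetical name, and the sample HTML is invented to resemble the markers used above rather than copied from a real response. Unlike the inline version, it returns null when either marker is missing instead of passing false to substr(). The nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the text from $startMarker up to (but not
// including) $endMarker, or null when either marker is missing.
function sliceBetween(string $haystack, string $startMarker, string $endMarker): ?string
{
    $start = strpos($haystack, $startMarker);
    if ($start === false) {
        return null;
    }
    // Search for the end marker only after the start marker.
    $end = strpos($haystack, $endMarker, $start);
    if ($end === false) {
        return null;
    }
    return substr($haystack, $start, $end - $start);
}

// Invented sample, not a real USPS response.
$sample = '<div><span class="address1 range">123 MAIN ST</span><p class="show-details">more</p></div>';
var_dump(sliceBetween($sample, '<span class="address1 range">', '<p class="show-details">'));
```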

if ($errorNotFound === false && $errorFoundNotFound === false && $errorMultipleAdd === false) { // strpos() returns false when a marker is absent
  $validLen = strlen($valid);
  $cityStart = strpos($valid, '<span class="city range">');
  $stateStart = strpos($valid, '<span class="state range">');
  $zip5Start = strpos($valid, '<span class="zip" style="">');
  $zip5End = strpos($valid, '<span class="hyphen">');
  $zip4Start = strpos($valid, '<span class="zip4">');

And so forth. Check for address1 and address2, add handling to split suite/room/floor/apt so that it always lands on the suite line of the address, etc. Following that, in my case, I write the result back to another MySQL table. The addresses I’m validating belong to commercial accounts, so I use the account number and, where applicable, the department number as a ‘keys’ field, for a unique identifier. I grab it in the original query and write it back with the validated address.
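
One way the span lookup and the suite split might look is sketched below, using regular expressions where the snippets above use strpos() offsets. Both function names, the list of unit keywords, and the sample HTML are assumptions made for illustration. The unit pattern is deliberately naive and would misread a street such as "500 FLOOR ST", so treat it as a starting point only.

```php
<?php
// Hypothetical helper: text of the first <span> whose class attribute is
// exactly $class, or null if there is none.
function spanText(string $html, string $class): ?string
{
    $pattern = '/<span class="' . preg_quote($class, '/') . '"[^>]*>(.*?)<\/span>/s';
    return preg_match($pattern, $html, $m) ? trim($m[1]) : null;
}

// Hypothetical helper: moves a trailing unit designator (e.g. "STE 400")
// off the street line so it can always be stored on the suite line.
function splitUnit(string $street): array
{
    $pattern = '/^(.*?)\s+((?:STE|SUITE|RM|ROOM|FL|FLOOR|APT|UNIT)\s+\S+)$/i';
    if (preg_match($pattern, trim($street), $m)) {
        return ['street' => $m[1], 'suite' => $m[2]];
    }
    return ['street' => trim($street), 'suite' => ''];
}

// Invented sample shaped like the markup searched for above.
$valid = '<span class="city range">ANYTOWN</span><span class="state range">NY</span>'
       . '<span class="zip" style="">12345</span><span class="hyphen">-</span><span class="zip4">6789</span>';
print_r([spanText($valid, 'city range'), spanText($valid, 'zip'), spanText($valid, 'zip4')]);
print_r(splitUnit('123 MAIN ST STE 400'));
```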

In the case that an address comes back as either unrecognized or with a (frustratingly common) ‘multiple addresses exist’ error, I simply write back ‘address error’ or ‘multiple address’ in the street column of the validated table. Once I’ve pulled the data down into Excel, I can filter down to those and correct manually.
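
The error handling can be sketched as a single function that maps a response to the placeholder written into the street column. classifyResponse() is a hypothetical name. The marker strings are the ones searched for earlier, while checking for the multiple-match message first is an assumption, made in case that page also carries an error class.

```php
<?php
// Hypothetical helper: returns the placeholder for the street column, or
// null when none of the error markers appear in the page.
function classifyResponse(string $content): ?string
{
    if (strpos($content, 'Several addresses matched the information you provided') !== false) {
        return 'multiple address';
    }
    if (strpos($content, '<li class="error">') !== false
        || strpos($content, 'class="error clearfix">') !== false) {
        return 'address error';
    }
    return null;
}

// Invented snippets standing in for real responses.
var_dump(classifyResponse('<li class="error">Not found</li>'));
var_dump(classifyResponse('<span class="address1 range">123 MAIN ST</span>'));
```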

I run this script on the command line, so as not to have any time limitations. I typically put a short sleep in the loop, so as to be a little kinder to the Post Office’s servers. I used rand(4, 30) to make it slightly less obvious that a script was hitting their servers for more than 72 hours straight. When I wrote a little variation of this script without a sleep, I found that I could kick it off and have an Excel workbook with 25 addresses back in roughly 15 seconds. With a sleep of between 1 and 4 seconds, I ran over 27k addresses in just over 24 hours.
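
A minimal sketch of such a throttled loop follows. processPolitely() and its parameters are hypothetical. The sleeper is passed in as an argument so the loop can be tried out without actually waiting, and it falls back to PHP's sleep() when omitted. For scale, 27k addresses in 24 hours works out to roughly 3.2 seconds per address (86,400 / 27,000).

```php
<?php
// Hypothetical helper: runs $work on each item, pausing a random number of
// seconds between items. $sleeper defaults to PHP's sleep().
function processPolitely(array $items, callable $work, int $minSec = 1, int $maxSec = 4, ?callable $sleeper = null): array
{
    $sleeper = $sleeper ?? 'sleep';
    $items = array_values($items);
    $last = count($items) - 1;
    $results = [];
    foreach ($items as $i => $item) {
        $results[] = $work($item);
        if ($i < $last) {
            // No pause is needed after the final item.
            $sleeper(rand($minSec, $maxSec));
        }
    }
    return $results;
}

// Dry run: record the pauses instead of sleeping.
$pauses = [];
$recorder = function (int $seconds) use (&$pauses) {
    $pauses[] = $seconds;
};
$out = processPolitely(['a', 'b', 'c'], 'strtoupper', 1, 4, $recorder);
print_r($out);
print_r($pauses);
```

In a real run, $work would be the function that fetches and parses one address, and the last argument would simply be left off.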

So, you can take advantage of PHP to ease the hideous tediousness that is validating addresses. However, be a little kind to the Post Office’s servers. After all, they’re not in the best sort of position now to add equipment or manpower. It could be I’ll be rewriting this code again in a couple years to scrape the site of FedEx or UPS.


Source: http://bendustries.co/wp/?p=39

