Pull Parsing ARIN WHOIS Data in PHP

At JerseyConnect, we’ve been using the Bulk WHOIS Data from ARIN for a few years now, mostly as a diagnostic tool for libraries concerned about their bandwidth usage, but also for capacity planning and judging peering options. After years of using the flat text format dumps, this week I decided to take the plunge and change to the much larger — but much better structured — XML formatted files. Here’s how I did it.

TL;DR? Skip to the solution!

The Old Way – The Flat Text Dump

ARIN bulk WHOIS data was originally available only as a flat text dump from their database. This format made it easy to convert into individual records into a data structure, as it came in a pseudo-tab delimited format. But it lacked any structure, and you had to convert each record and inspect its properties to determine its type. This slowed down processing significantly.

I had written a parser for the flat text dump format when we first started using the data and it worked well enough that we’ve stuck with it. However, not every record in the flat text file has every key, and some fields are used differently between records, so it had to be built with a maze of filters and regular expressions and ended up an unmanageable mess with sections like this:

/** create or re-create any missing fields */
 
/** sometimes street address is its own field, but sometimes not */
if(!array_key_exists('Street',$record)) {
	if(array_key_exists('City',$record)) {
		if(preg_match('/(.*?)[\s]{2,}(.*?)/',$record['City'])) {
			/** if there is a clear break, use it to delineate city from street */
			$addr = preg_split('/[\s]{2,}/',$record['City']);
			$record['Street'] = $addr[0];
			$record['City'] = $addr[1];
		} else {
			/** 
			 * if we can't tell just grab everything but the last word
			 * This misses cases like "Los Angeles" and "Palm Beach" but it will have to do
			 */
			$record['Street'] = substr($record['City'],0,strrpos($record['City'],' '));
			$record['City'] = substr($record['City'],strrpos($record['City'],' '));
		}
	} else {
		/** if there is no data, just put in blanks */
		$record['Street'] = '';
		$record['City'] = '';
	}
}
 
if(!array_key_exists('Company',$record) && strpos($record['Street'],"\n") !== FALSE) {
	$record['Company'] = substr($record['Street'],0,strpos($record['Street'],"\n"));
	$record['Street'] = substr($record['Street'],strpos($record['Street'],"\n")+1);
}
if(!array_key_exists('FirstName',$record)) {
	if(array_key_exists('LastName',$record)) {
 
		if(strpos($record['LastName'],',') !== FALSE) {
			$name = explode(',',$record['LastName']);
			$record['LastName'] = trim($name[0]);
			$record['FirstName'] = trim($name[1]);
		} else {
			$record['FirstName'] = '';
		}
	}
}

The New Way – XML Format

In XML, the bulk WHOIS data is huge — at over 3 GB and growing by over 25 MB per month, it is about 3x the size of the flat file — but is far more consistent in terms of structure. This makes writing the parser MUCH more straightforward. The biggest problem is the sheer size, and for that we have XMLReader.

Parsing a Large XML File

Since the file is so large, we need to read it piece by piece. Fortunately, XMLReader makes this simple. No more chunking with fread() or fget* variants! While this post is about parsing ARIN data, not general XML, here’s an example of how easy it is to get started with XMLReader:

$xmlHandle = new XMLReader();
$xmlStatus = $xmlHandle->open( 'any_xml_file.xml' );
while($xmlHandle->read()) {
	/** any processing on current node */
	echo $xmlHandle->nodeValue;
}
if($xmlStatus) $xmlHandle->close();

The modest code above will scan an XML file of any size without running out of memory. Of course, depending on how deeply nested your data is, it may take a VERY long time.

For more information on XML processing with PHP, see this excellent IBM series, “XML for PHP developers” (part 1) (part 2).

Parsing the ARIN WHOIS XML File

Our parser needs to be tailored to the ARIN WHOIS data. We want it to locate the four types of record (ASN, NET, ORG, and POC) that may be in the file we’re scanning and hand them off to subordinate functions as efficiently as possible.

Here’s the finished skeleton parser:

ARIN XML Parser – Skeleton
/**
 * ARIN Bulk XML Data Parser class
 */
class Parse_ARIN {
 
	var $xmlStatus = false;
	var $xmlHandle;
	var $count;
	var $startTime = null;
 
	/**
	 * @param string XMLFile File containing bulk whois data in XML format
	 */
	function __construct( $xmlFile = null ) {
		$this->xmlHandle = new XMLReader();
		if(ob_get_level() == 0) ob_start();
		if(empty($xmlFile)) {
			return true;
		} else if(!is_file($xmlFile)) {
			return false;
		}
		$this->count = 0;
 
		$this->xmlStatus = $this->xmlHandle->open( $xmlFile );
		return $this->xmlStatus;
	}
 
	function __destruct() {
		if($this->xmlStatus) $this->xmlHandle->close();
 
		/** Report time taken */
		$endTime = microtime(true);
		echo 'Processed ' . $this->count . ' records in: ' . ($endTime - $this->startTime) . ' seconds.';
	}
 
	public function setSource( $xmlFile ) {
		if(is_file($xmlFile)) {
			$this->xmlStatus = $this->xmlHandle->open( $xmlFile );
			return $this->xmlStatus;
		}
		return false;
	}
 
	public function process() {
		if(!$this->xmlStatus) {
			return false;
		}
		$this->startTime = microtime(true);
		$this->scan();
	}
 
	/**
	 * Echo progress to the browser and continue execution
	 * @param string Message Message to be echoed
	 */
	public function report( $message ) {
		echo str_pad($message . '<br />', 4096) . "\n";
		ob_flush();
		flush();
	}
 
	protected function scan() {
		while($this->xmlHandle->read()) {
			if(in_array( $this->xmlHandle->name, array('asn','net','poc','org') ) && $this->xmlHandle->nodeType == XMLREADER::ELEMENT) {
				switch($this->xmlHandle->name) {
					case 'asn':
						$this->saveASNRecord( $this->xmlHandle->expand() );
						break;
					case 'net':
						$this->saveNETRecord( $this->xmlHandle->expand() );
						break;
					case 'org':
						$this->saveORGRecord( $this->xmlHandle->expand() );
						break;
					case 'poc':
						$this->savePOCRecord( $this->xmlHandle->expand() );
						break;
				}
				$this->count++;
				if($this->count % 10000 == 0) {
					$this->report('Processed: ' . $this->count . ' records.');
				}
				$this->xmlHandle->next();
			}
		}
	}
 
	/**
	 * Convert the DOMNode into a SimpleXML object for manipulation
	 * @param DOMNode RecordData XML data from the file being scanned
	 * @return SimpleXMLElement
	 */
	protected function processRecord( $recordData ) {
 
		/** From Listing 9: http://www.ibm.com/developerworks/xml/library/x-xmlphp2/index.html */
		$dom = new DomDocument();
		$n = $dom->importNode($recordData,true);
		$dom->appendChild($n);
		$result = simplexml_import_dom($n);
		return $result;
	}
 
	protected function saveASNRecord( $asnNode ) {
		$record = $this->processRecord( $asnNode );
 
	}
 
	protected function saveNETRecord( $netNode ) {
		$record = $this->processRecord( $netNode );
 
	}
 
	protected function saveORGRecord( $orgNode ) {
		$record = $this->processRecord( $orgNode );
 
	}
 
	protected function savePOCRecord( $pocNode ) {
		$record = $this->processRecord( $pocNode );
 
	}
}

Calling the ARIN Parser

The class we created keeps the surface area as small as possible. We can now use the parser with just a few lines of code.

Calling the ARIN Parser – Sample Code
$ARIN = new Parse_ARIN('arin_db.xml');
 
$ARIN->report('Starting process...');
$ARIN->process();

Processing the XML Data

In order to do something useful with the data, like store it in a database or display it to the screen, you can alter the save{TYPE}Record functions (or extend the parser) to suit your needs. Here’s an example:

protected function savePOCRecord( $pocNode ) {
 
	$record = $this->processRecord( $pocNode );
 
	/**
	 * DB structure:
	 * POCHandle (PK)
	 * IsRole
	 * Company
	 * LastName
	 * FirstName
	 * RoleName
	 * Street
	 * City
	 * State
	 * Country
	 * PostalCode
	 * RegDate
	 * Updated
	 * OfficePhone
	 * Mailbox
	 */
 
	$POCrecord = array(
		'POCHandle'		=> (string)$record->handle,
		'IsRole'		=> (int)($record->isRoleAccount == 'Y'),
		'Company'		=> (string)$record->companyName,
		'LastName'		=> (string)$record->lastName,
		'FirstName'		=> (string)$record->firstName,
		'RoleName'		=> '',
		'Street'		=> '',
		'City'			=> (string)$record->city,
		'State'			=> (string)$record->{'iso3166-2'},
		'Country'		=> (string)$record->{'iso3166-1'}->code2,
		'PostalCode'		=> (string)$record->postalCode,
		'RegDate'		=> (string)$record->registrationDate,
		'Updated'		=> (string)$record->updateDate,
		'OfficePhone'		=> (isset($record->phones->phone) ? (string)$record->phones->phone->number->phoneNumber : ''),
		'Mailbox'		=> (string)$record->emails->email
	);
 
	/** create or re-create any missing fields */
	if(count($record->streetAddress->line) > 1) {
		$streetAddress = get_object_vars($record->streetAddress);
		if(empty($POCrecord['companyName']))
			$POCrecord['Company'] = array_shift($streetAddress['line']);
		$POCrecord['Street'] = join("\n",$streetAddress['line']);
	} else if(count($record->streetAddress->line) == 1) {
		$POCrecord['Street'] = (string)$record->streetAddress->line;
	}
 
	$POC_DB->add($POCrecord);
}

Other Notes – Processing Speed

Finally, this data set is big enough that I thought a quick analysis might be a useful contribution to the age-old if/else vs. switch debate.

Here are the results for iterating over the approx. 4.6 million records in this file without further processing. Tests were run on Win7/PHP 5.3.5:

Method Min. time (seconds) Max. time (seconds)
No Test 141.1 141.6
If/Else 149.1 150.2
Switch 147.7 147.7

Copyright 2011, David Dean. All rights reserved. The code on this page is provided under the terms of the Modified BSD License.