Tagged: hbase

USPS AIS bulk data loading with Hadoop mapreduce

Today I pushed some source up to GitHub for a utility I had been working on to load data from USPS AIS data files into HBase/MySQL using Hadoop mapreduce as well as simpler data loaders. Source @ https://github.com/bitsofinfo/usps-ais-data-loader

This project was originally started to create a framework for loading data files from the USPS AIS suite of data products (zipPlus4, cityState). The project has not been worked on in a while, but I figured I'd open-source it and maybe some folks would like to team up to work on it further; if so, let me know! I'm throwing it out there under the Apache 2.0 license. Some of the libs need updating as well; for instance, it was originally developed w/ Spring 2.5.

USPS AIS data files are fixed-length format records. This framework was created to handle bulk loading/updating of this data into a structured/semi-structured store of address data (i.e. MySQL or HBase). It is wired together using Spring and built w/ Maven. A key package is "org.bitsofinfo.util.address.usps.ais", which defines the POJOs for the records and leverages a custom annotation that binds record properties to the locations within the fixed-length records where the data lives.
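
To illustrate the annotation idea, here is a minimal sketch; the actual annotation, field names, and offsets in the project differ, so treat everything below as hypothetical:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// hypothetical annotation marking where a property lives within a fixed-length record
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface FixedLengthField {
	int start();  // zero-based offset of the value within the record
	int length(); // number of characters the value occupies
}

// hypothetical record POJO; a reflective parser would substring each
// annotated field's slice out of the raw record line
class CityStateRecord {
	@FixedLengthField(start = 0, length = 5)
	String zipCode;

	@FixedLengthField(start = 5, length = 28)
	String cityName;
}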

Initial loader implementations include a single-JVM multi-threaded version as well as a second one that leverages Hadoop MapReduce to split the AIS files up across HDFS and process them in parallel on the mapreduce nodes, ingesting the data much faster than on just one box. Both operate asynchronously given a load job submission, and ingestion times are significantly faster using Hadoop.
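
As a rough illustration of the single-JVM approach (a hedged sketch only; this class is not the project's actual API, and the record width is a placeholder), the fixed record width makes it easy to split one file among worker threads by record offset:

import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelFileLoaderSketch {

	// record widths vary per AIS product; 182 is just a placeholder
	static final int RECORD_LENGTH = 182;

	public static void main(String[] args) throws Exception {
		final String path = args[0];
		RandomAccessFile probe = new RandomAccessFile(path, "r");
		long totalRecords = probe.length() / RECORD_LENGTH;
		probe.close();

		int threads = Runtime.getRuntime().availableProcessors();
		long perThread = totalRecords / threads;
		ExecutorService pool = Executors.newFixedThreadPool(threads);

		for (int i = 0; i < threads; i++) {
			final long first = i * perThread;
			final long count = (i == threads - 1) ? (totalRecords - first) : perThread;
			pool.submit(new Runnable() {
				public void run() {
					byte[] record = new byte[RECORD_LENGTH];
					try {
						RandomAccessFile raf = new RandomAccessFile(path, "r");
						raf.seek(first * RECORD_LENGTH);
						for (long r = 0; r < count; r++) {
							raf.readFully(record);
							// parse the record here and write it to MySQL/HBase
						}
						raf.close();
					} catch (Exception e) {
						e.printStackTrace();
					}
				}
			});
		}
		pool.shutdown();
	}
}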

This project also needed a Hadoop InputFormat/RecordReader that could read from fixed-length data files (none existed), so I created one for this project (FixedLengthInputFormat); it was also contributed as a patch to the Hadoop project. The source is included here, updated for Hadoop 0.23.1 (not yet tested); however, the patch submitted to the Hadoop project is still pending and was compiled under 0.20.x. The 0.20.x version in the patch files was tested and ran functionally on a 4-node Hadoop and HBase cluster.
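
For reference, wiring a fixed-length input format into a job looks roughly like this. This is a hedged sketch against the 0.20.x-era mapreduce API: the record-length property key is an assumption, and the import for FixedLengthInputFormat is omitted since its package differs between this project's source and the Hadoop patch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class AisLoadJobSketch {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		// tell the input format how wide each record is; verify the actual
		// property key (or setter method) against the FixedLengthInputFormat
		// version you build with -- this key is an assumption
		conf.setInt("fixedlengthinputformat.record.length", 182);

		Job job = new Job(conf, "usps-ais-load");
		// FixedLengthInputFormat comes from this project's source
		// (or the MAPREDUCE-1176 patch)
		job.setInputFormatClass(FixedLengthInputFormat.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));

		// set the mapper/reducer that do the actual ingest here, then:
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}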

You can read more about the fixed-length record reader patch @

https://bitsofinfo.wordpress.com/2009/11/01/reading-fixed-length-width-input-record-reader-with-hadoop-mapreduce/

https://issues.apache.org/jira/browse/MAPREDUCE-1176 

The USPS AIS products have some sample data sets available online at the USPS website; however, for the full product data files you need to pay for the data and/or a subscription for delta updates. Some of the unit tests reference files from the real data sets; those files have been omitted, so you will have to replace them with the real ones. Other unit tests reference the sample files freely available via USPS or other providers.

Links where USPS data files can be purchased:

https://www.usps.com/business/address-information-systems.htm

http://www.zipinfo.com/products/natzip4/natzip4.htm

HBase examples on OS X and Maven

Ok, so today I needed to get HBase 0.20.0 running on my local OS X box, simply in standalone mode. I am starting a project where I need to manage 50-100 million records and I wanted to try out HBase.

Here are the steps I took; they are a consolidation of some pointers found in the HBase and Hadoop quick start guides.

A) Download HBase 0.20.X (currently 0.20.0), extract and install to /my/dir/hbase

B) Make sure your shell environment is set up to point to your Java 1.6 home and your PATH is set up correctly, which should be something like:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
export HBASE_HOME=/my/dir/hbase
PATH=$PATH:$HBASE_HOME/bin:$JAVA_HOME/bin
export PATH

C) Even though we are running in standalone mode, HBase is built on top of Hadoop, and Hadoop uses SSH to communicate with masters/slaves. So we need to make sure the process can ssh to localhost without a passphrase. (My standalone setup of HBase would not start properly without this.)

Let's check to see if you can SSH locally without a password: type ssh localhost. If this fails, we need to permit it, so run the following two commands:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
ssh localhost   # you should be able to connect now

D) Ok, at this point you should be able to fire up HBase, so let's do the following:

/my/dir/hbase/bin/start-hbase.sh

Once it has started up, type /my/dir/hbase/bin/hbase shell. This brings up the interactive console, sort of like the mysql console, where you can directly interact with the database (or in this case not a database, but the HBase KV store). While on the console, type status 'detailed'. If you get successful output we are good to go: HBase is running. Type exit to get back to the bash shell, but let's leave HBase running.

E) MAVEN INTEGRATION

Ok, now we need to set up your Java classpath to include the HBase jars, which are all located in /my/dir/hbase/lib. If you are using Maven and want to get HBase configured in your project, you can use the UN-OFFICIAL HBase Maven POM and deploy script listed below. These files were originally provided by a Fiveclouds post (linked in the script header below) and I upgraded them for HBase 0.20.0.

Deploy Script For Maven Dependencies

#! /bin/sh
#
# Deploy all HBase dependencies which are not available via the official
#	 maven repository at http://repo1.maven.org.
#
#
# This is for HBase 0.20.0
#
# Modified for HBase 0.20.0 from the original located at
# http://www.fiveclouds.com/2009/04/13/deploying-hbase-to-your-local-maven-repo/
#
# The maven repository to deploy to.
#

REPOSITORY_URL=file://$HOME/.m2/repository

if [ -z "$HBASE_HOME" ]; then
	echo "Error: HBASE_HOME is not set." 1>&2
	exit 1
fi

HBASE_LIBDIR=$HBASE_HOME/lib


# HBase
#
mvn deploy:deploy-file -Dfile=$HBASE_HOME/hbase-0.20.0.jar \
	-DpomFile=hbase.pom -Durl=$REPOSITORY_URL

#Hadoop
mvn deploy:deploy-file -DgroupId=org.apache -DartifactId=hadoop \
	-Dversion=0.20.0 -Dpackaging=jar -Durl=$REPOSITORY_URL \
	-Dfile=$HBASE_LIBDIR/hadoop-0.20.0-plus4681-core.jar

#thrift
mvn deploy:deploy-file -DgroupId=com.facebook -DartifactId=thrift \
	-Dversion=r771587 -Dpackaging=jar -Durl=$REPOSITORY_URL \
	-Dfile=$HBASE_LIBDIR/libthrift-r771587.jar

#apache commons cli
mvn deploy:deploy-file -DgroupId=commons-cli -DartifactId=commons-cli \
	-Dversion=2.0-SNAPSHOT -Dpackaging=jar -Durl=$REPOSITORY_URL \
	-Dfile=$HBASE_LIBDIR/commons-cli-2.0-SNAPSHOT.jar
	
#zookeeper
mvn deploy:deploy-file -DgroupId=org.apache.hadoop -DartifactId=zookeeper \
	-Dversion=r785019-hbase-1329 -Dpackaging=jar -Durl=$REPOSITORY_URL \
	-Dfile=$HBASE_LIBDIR/zookeeper-r785019-hbase-1329.jar


# EOF

Unofficial "hbase.pom"

<?xml version="1.0" encoding="UTF-8"?>

<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hbase</artifactId>
  <packaging>jar</packaging>
  <version>0.20.0</version>

  <name>Hadoop HBase</name>

  <dependencies>
  
    <dependency> 
      <groupId>org.apache.hadoop</groupId>
      <artifactId>zookeeper</artifactId>
      <version>r785019-hbase-1329</version>
    </dependency>
  
    <dependency>
      <groupId>commons-cli</groupId>
      <artifactId>commons-cli</artifactId>
      <version>2.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>commons-collections</groupId>
      <artifactId>commons-collections</artifactId>
      <version>3.2</version>
    </dependency>
    <dependency>
      <groupId>commons-httpclient</groupId>
      <artifactId>commons-httpclient</artifactId>
      <version>3.0.1</version>
    </dependency>
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>1.1.1</version>
    </dependency>
    <dependency>
      <groupId>commons-math</groupId>
      <artifactId>commons-math</artifactId>
      <version>1.1</version>
    </dependency>

    <dependency>
      <groupId>org.apache</groupId>
      <artifactId>hadoop</artifactId>
      <version>0.20.0</version>
    </dependency>

    <dependency>
      <groupId>jetty</groupId>
      <artifactId>org.mortbay.jetty</artifactId>
      <version>5.1.4</version>
    </dependency>
    <dependency>
      <groupId>jline</groupId>
      <artifactId>jline</artifactId>
      <version>0.9.91</version>
    </dependency>
    <dependency>
      <groupId>com.facebook</groupId>
      <artifactId>thrift</artifactId>
      <version>r771587</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>2.2.0</version>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.15</version>
    </dependency>
    <dependency>
      <groupId>xmlenc</groupId>
      <artifactId>xmlenc</artifactId>
      <version>0.52</version>
    </dependency>
    <dependency>
	    <groupId>org.apache.geronimo.specs</groupId>
	    <artifactId>geronimo-j2ee_1.4_spec</artifactId>
	    <version>1.0</version>
	    <scope>provided</scope>
    </dependency>

  </dependencies>

	<repositories>
		<repository>
			<id>virolab.cyfronet.pl</id>
			<name>virolab.cyfronet.pl (used for commons-cli-2.0)</name>
			<url>http://virolab.cyfronet.pl/maven2</url>
		</repository>
	</repositories>

</project>
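
Once the jars have been deployed by the script, your own project can depend on HBase using the coordinates the POM above declares:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hbase</artifactId>
      <version>0.20.0</version>
    </dependency>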

LET'S ACTUALLY USE HBASE...

F) Now it's time to fire up a Java app to do some basic HBase operations. The examples below are simple tests that are set up to run in JUnit within Spring on my box, so you can ignore the method names and the @Test annotations; the meat of the examples is in the method bodies.

CREATE A TABLE EXAMPLE

Ok, HBase is NOT an RDBMS but a Bigtable implementation. Think of it as a giant Hashtable with more advanced features. However, for the purposes of this example I will speak in terms of rows/columns etc., which are similar in concept to those of a database and are what most folks are familiar with.

The MOST important thing is that you COPY the /my/dir/hbase/conf/*.xml default HBase configuration files to someplace on your classpath. These files can be customized, but for straight-out-of-the-box testing they work as-is. Just MAKE SURE they are on your classpath before starting, as HBaseConfiguration instances look for them there.

// imports used across the examples below (HBase 0.20.x client API)
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.MasterNotRunningException;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

HBaseConfiguration config = new HBaseConfiguration();
HBaseAdmin admin = null;
try {
	// HBaseAdmin is where all the "DDL"-like operations take place in HBase
	admin = new HBaseAdmin(config);
} catch (MasterNotRunningException e) {
	throw new Exception("Could not set up HBaseAdmin as no master is running, did you start HBase?");
}

if (!admin.tableExists("testTable")) {
	admin.createTable(new HTableDescriptor("testTable"));
	
	// disable so we can make changes to it
	admin.disableTable("testTable");
	
	// lets add 2 columns
	admin.addColumn("testTable", new HColumnDescriptor("firstName"));
	admin.addColumn("testTable", new HColumnDescriptor("lastName"));
	
	// enable the table for use
	admin.enableTable("testTable");

}

	
// get the table so we can use it in the next set of examples
HTable table = new HTable(config, "testTable");

After running the above code, fire up the HBase shell (/my/dir/hbase/bin/hbase shell) and once up, type "list" at the hbase shell prompt; you should see your testTable listed! Yeah!

ADD A ROW TO THE TABLE

// let's put a new object with a unique "row" identifier; this is the key.
// HBase stores everything in bytes, so you need to convert strings to bytes
Put row = new Put(Bytes.toBytes("myID"));

/* let's start adding data to this row. The first parameter
is the "familyName", which is essentially the column (family) name; the second
parameter is the qualifier, a way to sub-qualify values within a
particular column family. For now we won't use that, so we just make the
qualifier name the same as the column name. The last parameter is the actual
value to store */

row.add(Bytes.toBytes("firstName"),Bytes.toBytes("firstName"),Bytes.toBytes("joe"));
row.add(Bytes.toBytes("lastName"),Bytes.toBytes("lastName"),Bytes.toBytes("smith"));

try {
	// add it!
	table.put(row);
} catch(Exception e) {
    // handle me!
}

Ok, now go back to the HBase shell and type count 'testTable' and you should get one record accounted for. Good to go!

GET A ROW FROM THE TABLE

// a GET fetches a row by its identifier key
Get get = new Get(Bytes.toBytes("myID"));

Result result = null;
try {
	// exec the get
	 result = table.get(get);
} catch(Exception e) {
	// handle me
}

// not found? note that table.get() returns an empty Result rather
// than null when no row matches
if (result == null || result.isEmpty()) {
	// NOT FOUND!
}

// Again the result speaks in terms of a familyName (column)
// and a qualifier; since ours are both the same, we pass the same
// value for both
byte[] firstName = Bytes.toBytes("firstName");
byte[] lastName = Bytes.toBytes("lastName");
byte[] fnameVal = result.getValue(firstName,firstName);
byte[] lnameVal = result.getValue(lastName,lastName);

System.out.println(new String(fnameVal) + " " + new String(lnameVal));

DELETE A ROW FROM THE TABLE

Delete d = new Delete(Bytes.toBytes("myID"));

try {
	table.delete(d);
} catch(Exception e) {
	// handle me
}

Now fire up the hbase shell again and type "count 'testTable'", you should now get zero rows.

OK, well I hope that helped you get up and running with some HBase basics!