
Astyanax -> Cassandra PoolTimeoutException during Authentication failure?

Recently I was working on implementing a custom IAuthenticator and IAuthority for Cassandra 1.1.1, because there really is not much (if any) security out of the box. For those of you familiar with Cassandra, its distribution used to include simple property-file based implementations of IAuthenticator and IAuthority that you could reference in your cassandra.yaml file; however, they were removed from the main distribution and placed under the examples/ section due to weak-security concerns. They are a decent starting point to reference when building your own implementations, but they are not recommended for real production use; hence why I started to implement my own.

Doing this, I came across a situation while trying to use the Netflix Astyanax client API to talk to Cassandra, with Cassandra running with the custom IAuthenticator and IAuthority that I made. When testing the initialization of connections to Cassandra while (intentionally) specifying invalid credentials, instead of seeing some sort of AuthenticationException dumped to my client Astyanax log file, I was getting "PoolTimeoutException"s instead... which was odd. I scratched my head on this for a while, so I cloned Astyanax from GitHub and began digging into the source. I suspected that the Thrift AuthenticationException might be suppressed somewhere... well, after reading the source, I realized it wasn't being suppressed per se, but rather sent to Astyanax's ConnectionPoolMonitor, which is something you can configure programmatically when defining your client code's AstyanaxContext object (which manages all connectivity to Cassandra). Out of the box Astyanax ships with a few ConnectionPoolMonitor implementations: one is the CountingConnectionPoolMonitor (does no logging, just collects stats) and the second is the Slf4jConnectionPoolMonitorImpl (logs to SLF4J). Depending on which one you specify in your context's configuration, you may or may not see AuthenticationException information in your client's logs/console.

In my case, I was specifying the CountingConnectionPoolMonitor, which was receiving the AuthenticationException but not doing anything with it other than incrementing a counter, effectively hiding it from me. The pool ran out of connections (it could not create any), and the code waiting on getting a connection just threw a PoolTimeoutException, adding to my confusion.

To correct this, since I was using Log4J, I just created a custom ConnectionPoolMonitor which logged everything to Log4J instead (see Astyanax's SLF4J monitor implementation as an example of how to create one for Log4j). Creating your own ConnectionPoolMonitor implementation is easy and pretty self-explanatory; see below for how to specify the monitor.

Below is an example of setting up an AstyanaxContext and how you specify the ConnectionPoolMonitor that should be used. Once I used the correct monitor for my needs, I was able to see the true source of the PoolTimeoutExceptions (i.e. the AuthenticationExceptions) because now my monitor was logging them. (NOTE: the example below is just a test context, not something for a robust setup)


AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
    .forCluster(clusterName)
    .forKeyspace(keyspaceName)
    .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
        .setDiscoveryType(NodeDiscoveryType.NONE))
    .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl(clusterName + "-" + keyspaceName + "_CONN_POOL")
        .setPort(defaultConnectionPoolHostPort)
        .setInitConnsPerHost(1)
        .setMaxConnsPerHost(2)
        .setSeeds(connectionPoolSeedHosts)
        .setAuthenticationCredentials(
            new SimpleAuthenticationCredentials(new String(principal), new String(credentials))))
    // this is where you plug in the ConnectionPoolMonitor of your choice
    .withConnectionPoolMonitor(new Log4jConnPoolMonitor())
    .buildKeyspace(ThriftFamilyFactory.getInstance());

context.start();

USPS AIS bulk data loading with Hadoop mapreduce

Today I pushed some source up to GitHub for a utility I was previously working on to load data from USPS AIS data files into HBase/MySQL using Hadoop mapreduce as well as simpler data loaders. Source @ https://github.com/bitsofinfo/usps-ais-data-loader

This project was originally started to create a framework for loading data files from the USPS AIS suite of data products (zipPlus4, cityState). The project has not been worked on in a while, but I figured I'd open-source it and maybe some folks would like to team up to work on it further; if so, let me know! I'm throwing it out there under the Apache 2.0 license. Some of the libs need updating as well; for instance, it was originally developed w/ Spring 2.5.

USPS AIS data files are fixed-length format records. This framework was created to handle bulk loading/updating of this data into a structured/semi-structured data store of address data (i.e. MySql or HBase). It is wired together using Spring and built w/ Maven. A key package is "org.bitsofinfo.util.address.usps.ais", which defines the POJOs for the records and leverages a custom annotation that binds record properties to locations within the fixed-length records containing the data being loaded; the general idea looks something like the sketch below.
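
To illustrate the approach (a minimal sketch only; the FixedLengthField annotation, the field offsets, and the binder class below are hypothetical stand-ins for illustration, not the project's actual code), a custom annotation can record where each property lives inside the fixed-length record, and a small reflection-based binder can populate the POJO:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

// Hypothetical annotation: marks where a property starts within the
// fixed-length record and how many characters it occupies.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface FixedLengthField {
    int start();   // zero-based offset into the record
    int length();  // number of characters
}

// Hypothetical record POJO; real offsets would come from the USPS AIS file spec.
class CityStateRecord {
    @FixedLengthField(start = 0, length = 5)
    String zipCode;

    @FixedLengthField(start = 5, length = 28)
    String cityName;
}

class FixedLengthBinder {
    // Populate a POJO's annotated fields from one raw fixed-length record.
    static <T> T bind(String rawRecord, Class<T> type) throws Exception {
        T instance = type.getDeclaredConstructor().newInstance();
        for (Field field : type.getDeclaredFields()) {
            FixedLengthField meta = field.getAnnotation(FixedLengthField.class);
            if (meta == null) {
                continue;
            }
            String value = rawRecord.substring(meta.start(), meta.start() + meta.length()).trim();
            field.setAccessible(true);
            field.set(instance, value);
        }
        return instance;
    }
}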

Initial loader implementations include a single-JVM multi-threaded version as well as a second one that leverages Hadoop mapreduce to split the AIS files up across HDFS and process them in parallel on Hadoop mapreduce nodes, ingesting the data much faster than on just one box. Both of these operate asynchronously given a load job submission. Ingestion times are significantly faster using Hadoop.

This project also needed a Hadoop InputFormat/RecordReader that could read from fixed-length data files (none existed), so I created one for this project (FixedLengthInputFormat). This was also contributed as a patch to the Hadoop project. The source is included here and updated for Hadoop 0.23.1 (not yet tested); however, the patch that was submitted to the Hadoop project is still pending and was compiled under 0.20.x. The 0.20.x version in the patch files was tested and functionally running on a 4-node Hadoop and HBase cluster.

You can read more about the fixed length record reader patch @

https://bitsofinfo.wordpress.com/2009/11/01/reading-fixed-length-width-input-record-reader-with-hadoop-mapreduce/

https://issues.apache.org/jira/browse/MAPREDUCE-1176 

The USPS AIS products have some sample data-sets available online at the USPS website; however, for the full product data-files you need to pay for the data and/or a subscription for delta updates. Some of the unit-tests reference files from the real data-sets; those files have been omitted, so you will have to replace them with the real ones. Other unit tests reference the sample files freely available via USPS or other providers.

Links where USPS data files can be purchased:

https://www.usps.com/business/address-information-systems.htm

http://www.zipinfo.com/products/natzip4/natzip4.htm

How to access your OpenShift MongoDB database remotely on OS-X

I recently started playing around with Red Hat's OpenShift PaaS and installed the MongoDB and RockMongo cartridges on my application. My use case was just to leverage the OpenShift platform to run my MongoDB instance for me, and I really wasn't ready (nor needing) to push an actual application out to the app running @ OpenShift; instead I just wanted my local Java program to leverage the remote MongoDB instance, pump some data into it, and then view it in RockMongo (also running on the app at OpenShift).

It turns out you can do this by enabling port forwarding locally on the computer you want to connect from. This is on OS-X:

  • First ensure that your SSH key is available for use by the command we will run: "ssh-add /path/to/your/openshift/ssh.key". This should be the key that you created when you initially signed up for OpenShift.
  • Attempt to run the port forwarding command "rhc port-forward -a [yourAppName] -l [yourOpenShiftLoginId]". It will prompt you for your OpenShift credentials, then will likely output something like the below:

Checking available ports...

Binding httpd -> 127.5.198.2:8080…
Binding mongod -> 127.5.198.1:27017…

Use ctl + c to stop

bind: Can’t assign requested address
channel_setup_fwd_listener: cannot listen to port: 27017
bind: Can’t assign requested address
channel_setup_fwd_listener: cannot listen to port: 8080
Could not request local forwarding.

  • The above error is because we now need to enable loopback aliases for each failed 127.x address listed above (those addresses will be different in your output). To do this, run the following for EACH loopback that failed: "sudo ifconfig lo0 alias 127.x.x.x". Note that even after running this, make sure you don't already have something else locally bound on those ports (i.e. a local dev mongodb instance or something else on 8080).
  • After you add the loopback aliases, re-run the port forwarding command above; this time it should complete successfully. If you go to https://127.x.x.x:8080 your request will go directly to your OpenShift instance.
  • You can also now connect to your mongodb instance programmatically, using the new address:ports forwarded above plus the appropriate credentials (see the sketch after this list).
  • Interestingly enough, the port forwarding goes over an SSH tunnel to your app there, and it appears that OpenShift itself runs on AWS!
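
For example, here is a minimal sketch of connecting from local Java code through the forwarded address using the MongoDB Java driver of that era (2.x); the 127.5.198.1 address and port are taken from the sample rhc output above, and the database name, user, and password are placeholders you would replace with your own cartridge's values:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;

public class ForwardedMongoTest {
    public static void main(String[] args) throws Exception {
        // connect to the locally forwarded loopback alias (from the rhc output above)
        Mongo mongo = new Mongo("127.5.198.1", 27017);
        try {
            // placeholder database name and credentials for your OpenShift MongoDB cartridge
            DB db = mongo.getDB("yourAppName");
            boolean ok = db.authenticate("yourDbUser", "yourDbPassword".toCharArray());
            if (!ok) {
                throw new IllegalStateException("MongoDB authentication failed");
            }

            // pump a little data in, then go look at it in RockMongo
            DBCollection collection = db.getCollection("testCollection");
            collection.insert(new BasicDBObject("hello", "openshift"));
        } finally {
            mongo.close();
        }
    }
}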

Reading fixed length/width input records with Hadoop mapreduce

While working on a project where I needed to quickly import 50-100 million records, I ended up using Hadoop for the job. Unfortunately, the input files I was dealing with contained fixed width/length records: they had no delimiters separating records, nor any CR/LFs between records. Each record was exactly 502 bytes in size. Hadoop provides a TextInputFormat out of the box for reading input files; however, it requires that your files contain CR/LFs or some combination thereof.

This patch source is available on GitHub @ https://github.com/bitsofinfo/hadoopSandbox/tree/master/fixedLengthInputFormat

So... I went ahead and wrote a couple of classes to support fixed-length, fixed-width (same thing) records in input files. These classes were inspired by Hadoop's TextInputFormat and LineRecordReader. The two classes are FixedLengthInputFormat and FixedLengthRecordReader; they are presented below. I have also created a Hadoop JIRA issue to contribute these classes to the Hadoop project.

This input format overrides computeSplitSize() in order to ensure that InputSplits do not contain any partial records since with fixed records there is no way to determine where a record begins if that were to occur. Each InputSplit passed to the FixedLengthRecordReader will start at the beginning of a record, and the last byte in the InputSplit will be the last byte of a record. The override of computeSplitSize() delegates to FileInputFormat’s compute method, and then adjusts the returned split size by doing the following: (Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) * fixedRecordLength)
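
As a quick worked example of that adjustment (the 64MB default split size here is just an assumed value for illustration; the 502-byte record length is the one used in this post):

// assume FileInputFormat computed a 64MB split and our fixed record length is 502 bytes
long defaultSize = 64L * 1024 * 1024;  // 67,108,864
int recordLength = 502;

long splitSize = ((long) Math.floor((double) defaultSize / (double) recordLength)) * recordLength;
// splitSize == 67,108,364 (133,682 whole records), so each split ends exactly on a record boundary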

FixedLengthInputFormat does NOT support compressed files. To use this input format, you do so as follows:

// setup your job configuration etc
...

// be sure to set the length of your fixed length records, so the
// FixedLengthRecordReader can extract the records correctly.
myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, 502);

// OR alternatively you can set it this way, the name of the property is
// "mapreduce.input.fixedlengthinputformat.record.length"
myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length",502);

// create your job
Job job = new Job(myJobConf);
job.setInputFormatClass(FixedLengthInputFormat.class);

// do the rest of your job setup, specifying input locations etc
...

job.submit();

Below are the two classes which you are free to use. Hope this helps you out if you have a need to read fixed width/length records out of input files using Hadoop MapReduce! Enjoy.

FixedLengthInputFormat.java

package org.bitsofinfo.hadoop.mapreduce.lib.input;

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

/**
 * FixedLengthInputFormat is an input format which can be used
 * for input files which contain fixed length records with NO
 * delimiters and NO carriage returns (CR, LF, CRLF) etc. Such
 * files typically only have one gigantic line and each "record"
 * is of a fixed length, and padded with spaces if the record's actual
 * value is shorter than the fixed length.
 * <p>
 * Users must configure the record length property before submitting
 * any jobs which use FixedLengthInputFormat:
 * <p>
 * myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length",[myFixedRecordLength]);
 * <p>
 * This input format overrides <code>computeSplitSize()</code> in order to ensure
 * that InputSplits do not contain any partial records since with fixed records
 * there is no way to determine where a record begins if that were to occur.
 * Each InputSplit passed to the FixedLengthRecordReader will start at the beginning
 * of a record, and the last byte in the InputSplit will be the last byte of a record.
 * The override of <code>computeSplitSize()</code> delegates to FileInputFormat's
 * compute method, and then adjusts the returned split size by doing the following:
 * <code>(Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) * fixedRecordLength)</code>
 * <p>
 * This InputFormat returns a FixedLengthRecordReader.
 * <p>
 * Compressed files currently are not supported.
 *
 * @see	FixedLengthRecordReader
 *
 * @author bitsofinfo.g (AT) gmail.com
 *
 */
public class FixedLengthInputFormat extends FileInputFormat<LongWritable, Text> {

	/**
	 * When using FixedLengthInputFormat you MUST set this
	 * property in your job configuration to specify the fixed
	 * record length.
	 * <p>
	 * i.e. myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length",[myFixedRecordLength]);
	 */
	public static final String FIXED_RECORD_LENGTH = "mapreduce.input.fixedlengthinputformat.record.length";

	// our logger reference
	private static final Log LOG = LogFactory.getLog(FixedLengthInputFormat.class);

	// the default fixed record length (-1), error if this does not change
	private int recordLength = -1;

	/**
	 * Return the int value from the given Configuration found
	 * by the FIXED_RECORD_LENGTH property.
	 *
	 * @param config
	 * @return	int record length value
	 * @throws IOException if the record length found is 0 (non-existant, not set etc)
	 */
	public static int getRecordLength(Configuration config) throws IOException {
		int recordLength = config.getInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, 0);

		// this would be an error
		if (recordLength == 0) {
			throw new IOException("FixedLengthInputFormat requires the Configuration property:" + FIXED_RECORD_LENGTH + " to" +
					" be set to something > 0. Currently the value is 0 (zero)");
		}

		return recordLength;
	}

	/**
	 * This input format overrides <code>computeSplitSize()</code> in order to ensure
	 * that InputSplits do not contain any partial records since with fixed records
	 * there is no way to determine where a record begins if that were to occur.
	 * Each InputSplit passed to the FixedLengthRecordReader will start at the beginning
	 * of a record, and the last byte in the InputSplit will be the last byte of a record.
	 * The override of <code>computeSplitSize()</code> delegates to FileInputFormat's
	 * compute method, and then adjusts the returned split size by doing the following:
	 * <code>(Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) * fixedRecordLength)</code>
	 *
	 * @inheritDoc
	 */
	@Override
	protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
		long defaultSize = super.computeSplitSize(blockSize, minSize, maxSize);

		// 1st, if the default size is less than the length of a
		// raw record, lets bump it up to a minimum of at least ONE record length
		if (defaultSize < recordLength) {
			return recordLength;
		}

		// determine the split size, it should be as close as possible to the
		// default size, but should NOT split within a record... each split
		// should contain a complete set of records with the first record
		// starting at the first byte in the split and the last record ending
		// with the last byte in the split.

		long splitSize = ((long)(Math.floor((double)defaultSize / (double)recordLength))) * recordLength;
		LOG.info("FixedLengthInputFormat: calculated split size: " + splitSize);

		return splitSize;

	}

	/**
	 * Returns a FixedLengthRecordReader instance
	 *
	 * @inheritDoc
	 */
	@Override
	public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
			TaskAttemptContext context) throws IOException, InterruptedException {
		return new FixedLengthRecordReader();
	}

	/**
	 * @inheritDoc
	 */
 	@Override
 	protected boolean isSplitable(JobContext context, Path file) {

 		try {
			if (this.recordLength == -1) {
				this.recordLength = getRecordLength(context.getConfiguration());
			}
			LOG.info("FixedLengthInputFormat: my fixed record length is: " + recordLength);

 		} catch(Exception e) {
 			LOG.error("Error in FixedLengthInputFormat.isSplitable() when trying to determine the fixed record length, returning false, input files will NOT be split!",e);
 			return false;
 		}

 		CompressionCodec codec = new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
	 	if (codec != null) {
	 		return false;
	 	}

	 	return true;
	 }

}

FixedLengthRecordReader.java

package org.bitsofinfo.hadoop.mapreduce.lib.input;

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Seekable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.MapContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * FixedLengthRecordReader is returned by FixedLengthInputFormat. This reader
 * uses the record length property set within the FixedLengthInputFormat to
 * read one record at a time from the given InputSplit. This record reader
 * does not support compressed files.
 * <p>
 * Each call to nextKeyValue() updates the LongWritable KEY and Text VALUE.
 * <p>
 * KEY = byte position in the file the record started at<br>
 * VALUE = the record itself (Text)
 *
 * @author bitsofinfo.g (AT) gmail.com
 *
 */
public class FixedLengthRecordReader extends RecordReader<LongWritable, Text> {

	// reference to the logger
	private static final Log LOG = LogFactory.getLog(FixedLengthRecordReader.class);

	// the start point of our split
	private long splitStart;

	// the end point in our split
	private long splitEnd;

	// our current position in the split
	private long currentPosition;

	// the length of a record
	private int recordLength;

	// reference to the input stream
	private FSDataInputStream fileInputStream;

	// the input byte counter
	private Counter inputByteCounter;

	// reference to our FileSplit
	private FileSplit fileSplit;

	// our record key (byte position)
	private LongWritable recordKey = null;

	// the record value
	private Text recordValue = null;

	@Override
	public void close() throws IOException {
		if (fileInputStream != null) {
			fileInputStream.close();
		}
	}

	@Override
	public LongWritable getCurrentKey() throws IOException,
			InterruptedException {
		return recordKey;
	}

	@Override
	public Text getCurrentValue() throws IOException, InterruptedException {
		return recordValue;
	}

	@Override
	public float getProgress() throws IOException, InterruptedException {
		if (splitStart == splitEnd) {
			return (float)0;
		} else {
			return Math.min((float)1.0, (currentPosition - splitStart) / (float)(splitEnd - splitStart));
		}
	}

	@Override
	public void initialize(InputSplit inputSplit, TaskAttemptContext context)
			throws IOException, InterruptedException {

		// the file input fileSplit
		this.fileSplit = (FileSplit)inputSplit;

		// the byte position this fileSplit starts at within the splitEnd file
		splitStart = fileSplit.getStart();

		// splitEnd byte marker that the fileSplit ends at within the splitEnd file
		splitEnd = splitStart + fileSplit.getLength();

		// log some debug info
		LOG.info("FixedLengthRecordReader: SPLIT START="+splitStart + " SPLIT END=" +splitEnd + " SPLIT LENGTH="+fileSplit.getLength() );

		// the actual file we will be reading from
		Path file = fileSplit.getPath();

		// job configuration
		Configuration job = context.getConfiguration();

		// check to see if compressed....
		CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
	 	if (codec != null) {
	 		throw new IOException("FixedLengthRecordReader does not support reading compressed files");
	 	}

		// for updating the total bytes read in
	 	inputByteCounter = ((MapContext)context).getCounter("FileInputFormatCounters", "BYTES_READ");

	 	// THE JAR COMPILED AGAINST 0.20.1 does not contain a version of FileInputFormat with these constants (but they exist in trunk)
	 	// uncomment the below, then comment or discard the line above
	 	//inputByteCounter = ((MapContext)context).getCounter(FileInputFormat.COUNTER_GROUP, FileInputFormat.BYTES_READ);

		// the size of each fixed length record
		this.recordLength = FixedLengthInputFormat.getRecordLength(job);

		// get the filesystem
		final FileSystem fs = file.getFileSystem(job);

		// open the File
		fileInputStream = fs.open(file,(64 * 1024));

		// seek to the splitStart position
		fileInputStream.seek(splitStart);

		// set our current position
	 	this.currentPosition = splitStart;
	}

	@Override
	public boolean nextKeyValue() throws IOException, InterruptedException {
		if (recordKey == null) {
		 	recordKey = new LongWritable();
	 	}

		// the Key is always the position the record starts at
	 	recordKey.set(currentPosition);

	 	// the recordValue to place the record text in
	 	if (recordValue == null) {
	 		recordValue = new Text();
	 	} else {
	 		recordValue.clear();
	 	}

	 	// if the currentPosition is less than the split end..
	 	if (currentPosition < splitEnd) {

	 		// setup a buffer to store the record
	 		byte[] buffer = new byte[this.recordLength];
	 		int totalRead = 0; // total bytes read
	 		int totalToRead = recordLength; // total bytes we need to read

	 		// while we still have record bytes to read
	 		while(totalRead != recordLength) {
	 			// read in what we need
	 			int read = this.fileInputStream.read(buffer, 0, totalToRead);

	 			// if we hit EOF before filling a complete record, the file is truncated
	 			if (read == -1) {
	 				throw new IOException("Hit end of file before reading a complete fixed length record");
	 			}

	 			// append to the buffer
	 			recordValue.append(buffer,0,read);

	 			// update our markers
	 			totalRead += read;
	 			totalToRead -= read;
	 			//LOG.info("READ: just read=" + read +" totalRead=" + totalRead + " totalToRead="+totalToRead);
	 		}

	 		// update our current position and log the input bytes
	 		currentPosition = currentPosition +recordLength;
	 		inputByteCounter.increment(recordLength);

	 		//LOG.info("VALUE=|"+fileInputStream.getPos()+"|"+currentPosition+"|"+splitEnd+"|" + recordLength + "|"+recordValue.toString());

	 		// return true
	 		return true;
	 	}

	 	// nothing more to read....
		return false;
	}

}

HBase examples on OS-X and Maven

Ok, so today I needed to get HBase 0.20.0 running on my local os-x box, simply in standalone mode. I am starting a project where I need to manage 50-100 million records and I wanted to try out HBase.

Here are the steps I took, the steps below are a consolidation of some pointers found in the HBase and Hadoop quick start guides.

A) Download HBase 0.20.X (currently 0.20.0), extract and install to /my/dir/hbase

B) Make sure your shell environment is set up to point to your Java 1.6 home and your PATH is set up correctly, something like:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
export HBASE_HOME=/my/dir/hbase
PATH=$PATH:$HBASE_HOME/bin:$JAVA_HOME/bin
export PATH

C) Even though we are running in standalone mode, HBase is built on top of Hadoop, and Hadoop uses SSH to communicate with masters/slaves. So we need to make sure the process can ssh to localhost without a passphrase. (My standalone setup of HBase would not start properly without this.)

Let's check to see if you can SSH locally without a password. Type ssh localhost. If this fails, we need to permit this, so run the following two commands:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
ssh localhost (you should be able to connect now)

D) OK, at this point you should be able to fire up HBase; let's do the following:

/my/dir/hbase/bin/start-hbase.sh

Once it has started up, type /my/dir/hbase/bin/hbase shell. This brings up the interactive console, sort of like the mysql console, where you can directly interact with the database (or in this case, not a database, but the HBase KV store). While on the console, type status 'detailed'. If you get successful output we are good to go! HBase is running; type exit to get back to the bash shell. Let's leave HBase running.

MAVEN INTEGRATION

Ok, now we need to set up your Java classpath to include the HBase jars. They are all located at /my/dir/hbase/lib. If you are using Maven and want to get HBase configured in your project, you can use the following UN-OFFICIAL HBase Maven POM and deploy script listed below. These files were originally provided by Fivecloud's post located here, and I upgraded them for HBase 0.20.0.

Deploy Script For Maven Dependencies

#! /bin/sh
#
# Deploy all HBase dependencies which are not available via the official
#	 maven repository at http://repo1.maven.org.
#
#
# This is for HBase 0.20.0
#
# Modified for HBase 0.20.0 from the original located at
# http://www.fiveclouds.com/2009/04/13/deploying-hbase-to-your-local-maven-repo/
#
# The maven repository to deploy to.
#

REPOSITORY_URL=file:///$HOME/.m2/repository

if [ -z "$HBASE_HOME" ]; then
	echo "Error: HBASE_HOME is not set." >&2
	exit 1
fi

HBASE_LIBDIR=$HBASE_HOME/lib


# HBase
#
mvn deploy:deploy-file -Dfile=$HBASE_HOME/hbase-0.20.0.jar \
	-DpomFile=hbase.pom -Durl=$REPOSITORY_URL

#Hadoop
mvn deploy:deploy-file -DgroupId=org.apache -DartifactId=hadoop \
	-Dversion=0.20.0 -Dpackaging=jar -Durl=$REPOSITORY_URL \
	-Dfile=$HBASE_LIBDIR/hadoop-0.20.0-plus4681-core.jar

#thrift
mvn deploy:deploy-file -DgroupId=com.facebook -DartifactId=thrift \
	-Dversion=r771587 -Dpackaging=jar -Durl=$REPOSITORY_URL \
	-Dfile=$HBASE_LIBDIR/libthrift-r771587.jar

#apache commons cli
mvn deploy:deploy-file -DgroupId=commons-cli -DartifactId=commons-cli \
	-Dversion=2.0-SNAPSHOT -Dpackaging=jar -Durl=$REPOSITORY_URL \
	-Dfile=$HBASE_LIBDIR/commons-cli-2.0-SNAPSHOT.jar
	
#zookeeper
mvn deploy:deploy-file -DgroupId=org.apache.hadoop -DartifactId=zookeeper \
	-Dversion=r785019-hbase-1329 -Dpackaging=jar -Durl=$REPOSITORY_URL \
	-Dfile=$HBASE_LIBDIR/zookeeper-r785019-hbase-1329.jar


# EOF

Unofficial "hbase.pom"

<?xml version="1.0" encoding="UTF-8"?>

<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hbase</artifactId>
  <packaging>jar</packaging>
  <version>0.20.0</version>

  <name>Hadoop HBase</name>

  <dependencies>
  
    <dependency> 
      <groupId>org.apache.hadoop</groupId>
      <artifactId>zookeeper</artifactId>
      <version>r785019-hbase-1329</version>
    </dependency>
  
    <dependency>
      <groupId>commons-cli</groupId>
      <artifactId>commons-cli</artifactId>
      <version>2.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>commons-collections</groupId>
      <artifactId>commons-collections</artifactId>
      <version>3.2</version>
    </dependency>
    <dependency>
      <groupId>commons-httpclient</groupId>
      <artifactId>commons-httpclient</artifactId>
      <version>3.0.1</version>
    </dependency>
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>1.1.1</version>
    </dependency>
    <dependency>
      <groupId>commons-math</groupId>
      <artifactId>commons-math</artifactId>
      <version>1.1</version>
    </dependency>

    <dependency>
      <groupId>org.apache</groupId>
      <artifactId>hadoop</artifactId>
      <version>0.20.0</version>
    </dependency>

    <dependency>
      <groupId>jetty</groupId>
      <artifactId>org.mortbay.jetty</artifactId>
      <version>5.1.4</version>
    </dependency>
    <dependency>
      <groupId>jline</groupId>
      <artifactId>jline</artifactId>
      <version>0.9.91</version>
    </dependency>
    <dependency>
      <groupId>com.facebook</groupId>
      <artifactId>thrift</artifactId>
      <version>r771587</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>2.2.0</version>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.15</version>
    </dependency>
    <dependency>
      <groupId>xmlenc</groupId>
      <artifactId>xmlenc</artifactId>
      <version>0.52</version>
    </dependency>
    <dependency>
	    <groupId>org.apache.geronimo.specs</groupId>
	    <artifactId>geronimo-j2ee_1.4_spec</artifactId>
	    <version>1.0</version>
	    <scope>provided</scope>
    </dependency>

  </dependencies>

	<repositories>
		<repository>
			<id>virolab.cyfronet.pl</id>
			<name>virolab.cyfronet.pl (used for commons-cli-2.0)</name>
			<url>http://virolab.cyfronet.pl/maven2</url>
		</repository>
	</repositories>

</project>

LET'S ACTUALLY USE HBASE...

F) Now it's time to fire up a Java app to do some basic HBase operations. The examples below are simple tests that are set up to run in JUnit within Spring on my box, so you can ignore the method names and the @Test annotations; the meat of the examples is in the method bodies.

CREATE A TABLE EXAMPLE

Ok, HBase is NOT an RDBMS but a Big Table implementation. Think of it as a giant Hashtable with more advanced features. However, for the purposes of this example I will speak in terms of rows/columns etc., which are similar in concept to those of a database and are what most folks are familiar with.

The MOST important thing is that you COPY the /my/dir/hbase/conf/*.xml default HBase configuration files to someplace on your classpath. These files can be customized, but for straight out-of-the-box testing they work as-is. Just MAKE SURE they are on your classpath before starting, as HBaseConfiguration instances look for them there.

HBaseConfiguration config = new HBaseConfiguration(); 
HBaseAdmin admin = null;
try {
	// HBaseAdmin is where all the "DDL" like operations take place in HBase
	admin = new HBaseAdmin(config);
} catch(MasterNotRunningException e) {
	throw new Exception("Could not setup HBaseAdmin as no master is running, did you start HBase?...");
}

if (!admin.tableExists("testTable")) {
	admin.createTable(new HTableDescriptor("testTable"));
	
	// disable so we can make changes to it
	admin.disableTable("testTable");
	
	// lets add 2 columns
	admin.addColumn("testTable", new HColumnDescriptor("firstName"));
	admin.addColumn("testTable", new HColumnDescriptor("lastName"));
	
	// enable the table for use
	admin.enableTable("testTable");

}

	
// get the table so we can use it in the next set of examples
HTable table = new HTable(config, "testTable");

After running the above code, fire up the HBase shell (/my/dir/hbase/bin/hbase shell) and once it's up, type "list" at the hbase shell prompt; you should see your testTable listed! Yeah!

ADD A ROW TO THE TABLE

// lets put a new object with a unique "row" identifier, this is the key
// HBase stores everything in bytes so you need to convert string to bytes
Put row = new Put(Bytes.toBytes("myID"));

/* lets start adding data to this row. The first parameter
is the "familyName" which essentially is the column name, the second
parameter is the qualifier, think of it as a way to subqualify values
within a particular column. For now we won't so we just make the qualifier
name the same as the column name. The last parameter is the actual value
to store */

row.add(Bytes.toBytes("firstName"),Bytes.toBytes("firstName"),Bytes.toBytes("joe"));
row.add(Bytes.toBytes("lastName"),Bytes.toBytes("lastName"),Bytes.toBytes("smith"));

try {
	// add it!
	table.put(row);
} catch(Exception e) {
    // handle me!
}

Ok, now go back to the HBase shell and type count 'testTable' and you should get one record accounted for. Good to go!

GET A ROW FROM THE TABLE

// a GET fetches a row by its identifier key
Get get = new Get(Bytes.toBytes("myID"));

Result result = null;
try {
	// exec the get
	 result = table.get(get);
} catch(Exception e) {
	// handle me
}

// not found??
if (result == null) {
	// NOT FOUND!
}

// Again the result speaks in terms of a familyName(column)
// and a qualifier; since ours are both the same, we pass the same
// value for both
byte[] firstName = Bytes.toBytes("firstName");
byte[] lastName = Bytes.toBytes("lastName");
byte[] fnameVal = result.getValue(firstName,firstName);
byte[] lnameVal = result.getValue(lastName,lastName);

System.out.println(new String(fnameVal) + " " + new String(lnameVal));

DELETE A ROW FROM THE TABLE

Delete d = new Delete(Bytes.toBytes("myID"));

try {
	table.delete(d);
} catch(Exception e) {
	// handle me
}

Now fire up the hbase shell again and type "count 'testTable'"; you should now get zero rows.

OK, well I hope that helped you get up and running with some HBase basics!