
Tuesday, June 13, 2017

NodeJS with HBase via HBase Thrift2 Part 1: Connect

Motivation


Each in their own way, NodeJS and HBase are powerful tools: NodeJS for spinning up efficient APIs fast, and HBase for holding large amounts of data (in its own somewhat particular way). More importantly, HBase also solves the small-file issue on Hadoop. So combining them can make sense, but the combination is fairly poorly documented.

HBase comes with a REST API and a Thrift API. The Thrift API is the more efficient of the two, even though the REST API returns instantiated JavaScript objects (hence JSON). The reason is that Thrift uses binary transmission, which is more compact than the JSON used by the REST API. There is an older GitHub page with some benchmarking: https://github.com/stelcheck/node-hbase-vs-thrift

At the time of writing, the latest stable version of HBase is 1.2.6. It has two Thrift interfaces, called HBase Thrift and HBase Thrift2. HBase Thrift is more general/administrative purpose, where tables can be created, deleted and data manipulated - the stuff I prefer to do in the HBase Shell, and not from a service. HBase Thrift2 is data only: CRUD and even batch operations, which are not found in HBase Thrift.

HBase Part

To make this post complete, we'll go from table creation in HBase to connecting to it from NodeJS.


Table creation in HBase from the HBase shell

 create_namespace 'foo'  
 create 'foo:bar', 'family1'  

Start HBase Thrift2 API from OS shell

 bin/hbase-daemon.sh start thrift2  

NB! By default HBase Thrift and HBase Thrift2 are set up to use ports 9095 and 9090. If you want them to run concurrently, it is possible to set custom port numbers for the APIs.
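For example, something along these lines should start Thrift2 on non-default ports. The -p and --infoport options are from memory, so check bin/hbase thrift2 --help for your version; the port numbers are just examples:

 bin/hbase-daemon.sh start thrift2 -p 9091 --infoport 9096  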

NB! HBase Thrift API can crash due to lack of heap memory, the heap memory can be increased in the config file: conf/hbase-env.sh
 # The maximum amount of heap to use. Default is left to JVM default.  
 export HBASE_HEAPSIZE=8G  

Good to go

NodeJS part

Prerequisites, besides having NodeJS installed, are the Thrift compiler and the HBase Thrift definition file. A Thrift definition file acts both as documentation and as a definition file for building service/client proxies.

The Thrift compiler can be found on Apache's Thrift homepage: https://thrift.apache.org/ 
The HBase Thrift definition file can be found in the HBase source package from the HBase homepage: https://hbase.apache.org/

Start the NodeJS project and add the Thrift package

 mkdir node_hbase  
 cd node_hbase  
 npm init  
 npm install thrift  

Create the proxy client package from the HBase Thrift definition file

 thrift-0.10.0.exe --gen js:node hbase-1.2.6-src\hbase-1.2.6\hbase-thrift\src\main\resources\org\apache\hadoop\hbase\thrift2\Hbase.thrift  
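This generates a gen-nodejs folder containing, among other files, THBaseService.js and HBase_types.js, which are required in the script below.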

Create the index.js file (you can call it whatever you want)

 var thrift = require('thrift');  
 var HBaseService = require('./gen-nodejs/THBaseService.js');  
 var HBaseTypes = require('./gen-nodejs/HBase_types.js');  
 var connection = thrift.createConnection('IP or DNS to your HBase server', 9090); 
 
 connection.on('connect', function () {  
   var client = thrift.createClient(HBaseService, connection);  
   client.getAllRegionLocations('foo:bar', function (err, data) {  
     if (err) {  
       console.log('error:', err);  
     } else {  
       console.log('All region locations for table:' + JSON.stringify(data));  
     }  
     connection.end();  
   });  
 });
  
 connection.on('error', function (err) {  
   console.log('error:', err);  
 });  

Run the script and get some results

 node index.js  
 All region locations for table:[{"serverName":{"hostName":"localhost","port":49048,"startCode":{"buffer":{"type":"Buffer","data":[0,0,1,92,160,234,132,254]},"offset":0}},"regionInfo":{"regionId":{"buffer":{"type":"Buffer","data":[0,0,0,0,0,0,0,0]},"offset":0},"tableName":{"type":"Buffer","data":[102,111,111,58,98,97,114]},"startKey":{"type":"Buffer","data":[]},"endKey":{"type":"Buffer","data":[]},"offline":false,"split":false,"replicaId":0}}]  
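Since HBase Thrift2 is data only, the same client can also be used for reads once connected. Below is a hedged sketch of a single-row get, meant to replace the getAllRegionLocations call inside the 'connect' handler. The TGet struct and the client.get signature are per my reading of the Thrift2 definition file, and 'rowkey1' is just a made-up row key - verify against your generated HBase_types.js:

 // fetch a single row from foo:bar by row key  
 var tget = new HBaseTypes.TGet({ row: 'rowkey1' });  
 client.get('foo:bar', tget, function (err, result) {  
   if (err) {  
     console.log('error:', err);  
   } else {  
     console.log('Get result: ' + JSON.stringify(result));  
   }  
   connection.end();  
 });  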





Wednesday, January 22, 2014

Data Modelling with Immutability

Level: 3 where 1 is noob and 5 is totally awesome
Disclaimer: If you use anything from any of my blog entries, it is on your own responsibility.

Intro


Data modelling with immutability is something I have been thinking about for a while now, and if I had a greenfield project I would design it in from the start. The concept is simple: instead of changing state on objects, we create new ones (well, sometimes it makes more sense to use mutable entities, but I'll get back to that). It is a concept which makes it possible to create better data models, models that resemble the things we want to model more precisely.

Throughout this blog post, I'll use the following Person-Address model when clarifying the concept. The model is simple: it represents a person and an address for this person.

The Concept


The concept is, as mentioned before: instead of changing the state of entities in the data model, we create new ones. As an example, let us look at our Person-Address model. Let's say we have a person who decides to move, which is the same as changing address. A way to model this change of address could be updating the address entity with the new address. But not only is this wrong from a modelling perspective, it can also give some technical challenges later on, such as making it difficult to query the person's former addresses.

The right thing to do is to somehow mark the old address entity as obsolete (but not delete it), create a new address entity with the new address, and connect it to the person as the current address. Why is this right? It is right because when a person moves, they are moving to a new location/address. The new address has another house, while the old house is still standing at the old address. So by modelling the address as immutable, we have both the old address and the new address, just like in the real world. Actually, if the address entity were updated with a new address, it would be the same as saying we are dumping a new address and a house on top of the old address. Not really possible in reality, eh?

The tricky part of modelling with immutability is identifying what should be immutable and what should be mutable. Let's take a look at the Person-Address model again. Person: no matter the name of a person, the person would ideally always be the same. Thereby the person is mutable. So if a person changes name, the person entity should be updated. The entity should not be replaced, like in the address example. If the entity were replaced, it would be the same as saying it is a new person. So things which remain the same, no matter the state, should be mutable.
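To make the distinction concrete, here is a minimal sketch of the Person-Address model in JavaScript. The entity and field names are purely illustrative, not a prescribed schema:

 // A person is mutable: a name change updates the same entity.  
 var person = { id: 1, name: 'Jane Doe', currentAddressId: 1 };  
  
 // Addresses are immutable: moving marks the old one obsolete and creates a new one.  
 var addresses = [  
   { id: 1, personId: 1, street: 'Old Street 1', obsolete: false }  
 ];  
  
 function changeName(person, newName) {  
   person.name = newName;             // update in place, it is still the same person  
 }  
  
 function move(person, addresses, newStreet) {  
   var current = addresses.find(function (a) { return a.id === person.currentAddressId; });  
   current.obsolete = true;           // mark the old address obsolete, never delete it  
   var next = { id: addresses.length + 1, personId: person.id, street: newStreet, obsolete: false };  
   addresses.push(next);              // create a new address entity  
   person.currentAddressId = next.id; // point the person at the new current address  
 }  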

Versioning and event sourcing

First, a quick definition of versioning and event sourcing. Event sourcing is storing the event which caused an entity to change. A version is the result of an event.

Some might think that by using immutable entities in a model, we would get event sourcing as a positive side effect. Well, if we have to be absolutely correct, it wouldn't be event sourcing, because we are not storing an event; we are storing the result of an event, a.k.a. a version. We are also only getting versioning for the immutable entities, because we are creating a new entity for every change. Mutable entities' state changes should, like immutable entities' events, be event sourced.
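A small sketch of the difference, with made-up field names: the event records what happened, while the version records the resulting state:

 // Event: what happened (this is what event sourcing stores)  
 var event = { type: 'PersonMoved', personId: 1, newStreet: 'New Street 7', occurredAt: '2014-01-22' };  
  
 // Version: the resulting state (this is what immutable entities give us for free)  
 var version = { addressId: 2, personId: 1, street: 'New Street 7', obsolete: false };  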

By modelling this way, you should always have the truest possible data model. As a positive side effect, you should always be able to query versions and events for every entity in your model. Happy modelling.