Tuesday, June 13, 2017

NodeJS with HBase via HBase Thrift2 Part 1: Connect

Motivation


Each in their own way, NodeJS and HBase are powerful tools: NodeJS for spinning up efficient APIs fast, and HBase for holding large amounts of data (in its own somewhat peculiar way). More importantly, HBase also solves the small-file issue on Hadoop. So combining them can make sense, but the combination is poorly documented.

HBase comes with a REST API and a Thrift API. The Thrift API is the more efficient of the two, even though the REST API returns instantiated JavaScript objects (i.e. JSON). The reason is that Thrift uses binary transmission, which is more compact than the JSON used by the REST API. There is an older GitHub page with some benchmarking: https://github.com/stelcheck/node-hbase-vs-thrift 

At the time of writing, the latest stable version of HBase is 1.2.6, and it ships with two Thrift interfaces: HBase Thrift and HBase Thrift2. HBase Thrift is more general/administrative in purpose; tables can be created, deleted and data manipulated. That is the kind of work I prefer to do in the HBase shell, not from a service. HBase Thrift2 is data only: CRUD, plus batch operations that are not found in HBase Thrift. 

HBase Part

To make this post complete, we'll go from table creation in HBase to connecting to it from NodeJS.


Table creation in HBase from the HBase shell

 create_namespace 'foo'  
 create 'foo:bar', 'family1'  

Start HBase Thrift2 API from OS shell

 bin/hbase-daemon.sh start thrift2  

NB! By default, both HBase Thrift and HBase Thrift2 are set up to use ports 9090 and 9095. If you want them to run concurrently, it is possible to set custom port numbers for the APIs; a hedged sketch follows.
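A minimal sketch of running both interfaces side by side. I believe the Thrift servers accept -p/--port and --infoport options, but treat the exact flag names as assumptions and verify them with bin/hbase thrift --help and bin/hbase thrift2 --help on your version:

 # Assumed flags -- verify against your HBase version  
 bin/hbase-daemon.sh start thrift -p 9090 --infoport 9095  
 bin/hbase-daemon.sh start thrift2 -p 9091 --infoport 9096  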

NB! The HBase Thrift API can crash due to lack of heap memory. The heap size can be increased in the config file conf/hbase-env.sh:
 # The maximum amount of heap to use. Default is left to JVM default.  
 export HBASE_HEAPSIZE=8G  

Good to go

NodeJS part

The prerequisites, besides having NodeJS installed, are the Thrift compiler and the HBase Thrift definition file. A Thrift definition file acts both as documentation and as the definition used for generating service/client proxies.

The Thrift compiler can be found on Apache's Thrift homepage: https://thrift.apache.org/ 
The HBase Thrift definition file can be found in the HBase source package from the HBase homepage: https://hbase.apache.org/

Start the NodeJS project and add the Thrift package

 mkdir node_hbase  
 cd node_hbase  
 npm init  
 npm install thrift  

Create the proxy client package from the HBase Thrift definition file

 thrift-0.10.0.exe --gen js:node hbase-1.2.6-src\hbase-1.2.6\hbase-thrift\src\main\resources\org\apache\hadoop\hbase\thrift2\Hbase.thrift  

Create the index.js file (you can call it whatever you want)

 var thrift = require('thrift');  
 var HBaseService = require('./gen-nodejs/THBaseService.js');  
 var HBaseTypes = require('./gen-nodejs/HBase_types.js');  
 var connection = thrift.createConnection('IP or DNS to your HBase server', 9090); 
 
 connection.on('connect', function () {  
   var client = thrift.createClient(HBaseService, connection);  
   client.getAllRegionLocations('foo:bar', function (err, data) {  
     if (err) {  
       console.log('error:', err);  
     } else {  
       console.log('All region locations for table:' + JSON.stringify(data));  
     }  
     connection.end();  
   });  
 });
  
 connection.on('error', function (err) {  
   console.log('error:', err);  
 });  

Run the script and get some output

 node index.js  
 All region locations for table:[{"serverName":{"hostName":"localhost","port":49048,"startCode":{"buffer":{"type":"Buffer","data":[0,0,1,92,160,234,132,254]},"offset":0}},"regionInfo":{"regionId":{"buffer":{"type":"Buffer","data":[0,0,0,0,0,0,0,0]},"offset":0},"tableName":{"type":"Buffer","data":[102,111,111,58,98,97,114]},"startKey":{"type":"Buffer","data":[]},"endKey":{"type":"Buffer","data":[]},"offline":false,"split":false,"replicaId":0}}]  
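As a hedged peek at the data API itself, here is a minimal read sketch that could replace the getAllRegionLocations call above. The TGet/TResult shapes follow the Hbase.thrift definition, but verify the field names against your generated gen-nodejs files, and note that 'row1' is just a made-up row key:

 // Hedged sketch: fetch a single row from 'foo:bar' (assumes a row with key 'row1' exists)  
 var tGet = new HBaseTypes.TGet({ row: Buffer.from('row1') });  
 client.get('foo:bar', tGet, function (err, result) {  
   if (err) {  
     console.log('error:', err);  
   } else {  
     // result is a TResult; columnValues is a list of TColumnValue (family, qualifier, value)  
     console.log('row:', JSON.stringify(result));  
   }  
   connection.end();  
 });  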





Sunday, March 6, 2016

Hadoop from scratch notes: Preparing a minimal CentOS Linux Hyper-V image

Motivation

Hadoop on virtual machines? These posts describe how to set up a Hadoop home lab, to get hands-on with Hadoop. Yet, replace 'virtual' with 'dedicated physical', and you should be on your way to building a production cluster.

Hadoop is yet another good tool in the toolbox when working with data. Nowadays Hadoop is available as a cloud service, but that can be pretty expensive, especially if you just want to train and play with Hadoop. Some vendors, such as Cloudera, offer a single-node 'play' version of Hadoop, which is a great way to start. Yet the reason I'm writing these notes is that I found Cloudera very closed and slow, and it also required me to use a virtual machine system other than Hyper-V. Besides, it is not that hard to set up a Hadoop node or cluster from scratch.

Not that I have anything against e.g. VirtualBox. Even though I see all OSes as my playgrounds, I'm in a Microsoft period (due to my current work), and therefore my Windows box is best suited for virtualization. It already comes with Hyper-V, which actually works well in Windows 10 (earlier versions locked the CPU clock cycle and thereby disabled speed-step). I like my machines light, so I would hate to have more than one system for virtualization.

What is the goal?

The goal is to prepare a virtual machine with a minimal version of CentOS Linux. The reason I have selected CentOS is that it is supported by Microsoft and is Azure certified, and being Azure certified means it works better with Hyper-V through the Hyper-V Integration Services. I could have chosen Ubuntu (Azure's Hadoop cloud solution runs on Ubuntu), but I struggled with a very slow apt-get, and generally found CentOS more lightweight.

Once we have a fully configured virtual machine with CentOS and Hadoop, we are going to use it as a template for creating more Hadoop nodes.

I prefer to set up Hyper-V with PowerShell; it is good fun and practice, and it is more compact than screenshots of the GUI. If you are familiar with the Hyper-V GUI, you should have no trouble figuring out what to press.

Before we start, make sure Hyper-V is enabled, and get CentOS from https://www.centos.org/download/; the minimal ISO should be sufficient (CentOS 7 is currently the latest version).

A virtual switch

If you don't have a virtual switch configured in Hyper-V, you have to configure one. You are going to use it for connecting your Hadoop nodes, the internet and your working machine together (though the internet part is optional). Creating a so-called external virtual switch named "Virtual Switch" (yes, I know, the creative name is striking :-) ) is done with the following PowerShell:

New-VMSwitch -Name "Virtual Switch" -NetAdapterName "Wi-Fi" -AllowManagementOS 1

As NetAdapterName, use "Wi-Fi" or "Ethernet", depending on which NIC provides internet access. If in doubt, the sketch below lists the adapters on your machine.
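A quick way to see the available adapter names with the standard NetAdapter cmdlet:

# List network adapters to find the right value for -NetAdapterName
Get-NetAdapter | Select-Object Name, InterfaceDescription, Status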

The virtual machine and disk

The virtual machine and the disk are often thought of as one, but a virtual machine consists of the "machine" itself, a "disk image" holding the OS, and possibly some data disks. We are going to create the machine and the OS disk in one go. 

New-VM -Name "Hadoop01" -MemoryStartupBytes 4GB -NewVHDPath D:\VMs\Hadoop01.vhdx -NewVHDSizeBytes 10GB -SwitchName "Virtual Switch"

Memory (4 gigabytes) and disk (10 gigabytes) sizes are dynamic by default, but the machine is only configured with 1 CPU. That can be upgraded with:

Set-VMProcessor -VMName Hadoop01 -Count 2

Make the virtual DVD point to the downloaded CentOS image:

Set-VMDvdDrive -VMName Hadoop01 -Path D:\Downloads\CentOS-7-x86_64-Minimal-1511.iso

Let's go:

Start-VM Hadoop01

You have to connect to the virtual machine's console through the Hyper-V GUI (or from PowerShell, as sketched below).
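If you prefer staying in the console, the Virtual Machine Connection tool installed with the Hyper-V management tools can be launched directly (assuming vmconnect.exe is on your path):

vmconnect.exe localhost "Hadoop01"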

Installing CentOS

Press Enter. It might take a while before the next step appears.



Select your preferred language


Check that the properties circled in yellow are correct. That will make things easier for you in general. The properties circled in red are critical, so make sure to read below how to set them.

Press 'Done', that is all

Turn on the network. Failing to do this can require you to turn it on after every reboot.

Set the root password. Create a user for good practice.

After installation and reboot, log in so we can get the IP address of our new machine by typing the following command (ifconfig is not available on CentOS minimal):

ip addr

Note the IP address; it can be found under eth0. 
We are not going to use the Hyper-V viewer further. It can't copy-paste between guest and host, and the proper way to connect to a Linux/Unix server is via an SSH client. I recommend PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html), but Git Bash is just as fine.
In PuTTY, type in the IP and press Open.

If using the Git Bash, you can write:

ssh <ip> -l <user>

Where <ip> is the noted IP and <user> is either root or the user created earlier.

Installing/Upgrading Microsoft Linux Integration Services (LIS)

We don't have much in CentOS minimal, and Microsoft hasn't made it easy to download the LIS package without a browser.
Fortunately, it is GNU licensed, so I have made a script in my GitHub account to fetch it and install it, together with wget.

curl -O https://raw.githubusercontent.com/ChristianHenrikReich/automation-scripts/master/centos-minimal/install-hyperv-essentials.sh

chmod 755 install-hyperv-essentials.sh

sudo ./install-hyperv-essentials.sh

When the script is done, the virtual machine is fully Hyper-V prepped and ready to go, and it can be used for other things than Hadoop as well. A quick sanity check is sketched below.
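If you want to verify that the integration services are actually active after a reboot, a quick look at the loaded kernel modules usually does it (the hv_* module names can vary a bit between LIS versions):

lsmod | grep hv_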

Next: How to install Hadoop on the image

Sunday, May 3, 2015

Adding schema when using ASP.NET 5 Identity with Entity Framework 7

This post concerns bleeding-edge technology (Entity Framework 7 beta 4) and may be outdated in the foreseeable future.

Good database design separates responsibilities; otherwise you end up with a monolithic trash-bin database. The best way to achieve separation of responsibilities is to have each responsibility in its own database and disable cross-querying. That would be a database for Accounts, Products, Emails etc., depending on your business and its domain. One database for each domain.

In these cloud days, that can be quite expensive, so we can settle for the next best thing: schemas. The most famous schema on SQL Server is dbo. It is the default schema, and sadly it is the one used in 99% of the cases where schemas are applied.

When using ASP.NET 5 Identity with Entity Framework 7 and Migrations, you will see that the tables are put in the dbo schema. Changing this behavior is not straightforward.

The solution

You might recognize the code below; it comes from the template used when creating an ASP.NET 5 web site, though I have removed a little. I want to have the identity tables in the Accounts schema.

1. First, override the OnModelCreating(ModelBuilder builder) method.

 using Microsoft.AspNet.Identity.EntityFramework;  
 using Microsoft.Data.Entity;  
 namespace Example.Models  
 {  
   public class ApplicationUser : IdentityUser  
   {  
   }  
   public class ApplicationDbContext : IdentityDbContext<ApplicationUser>  
   {  
     public ApplicationDbContext()  
     {  
     }  
     protected override void OnModelCreating(ModelBuilder builder)  
     {  
      // Remember to create the schema in the DB manually, until EF7 can handle schemas correctly  
       builder.Entity<ApplicationUser>().ForRelational().Table("AspNetUsers", "Accounts");  
       builder.Entity<IdentityUserClaim<string>>().ForRelational().Table("AspNetUserClaims", "Accounts");  
       builder.Entity<IdentityUserLogin<string>>().ForRelational().Table("AspNetUserLogins", "Accounts");  
       builder.Entity<IdentityUserRole<string>>().ForRelational().Table("AspNetUserRoles", "Accounts");  
       builder.Entity<IdentityRole>().ForRelational().Table("AspNetRoles", "Accounts");  
       builder.Entity<IdentityRoleClaim<string>>().ForRelational().Table("AspNetRoleClaims", "Accounts");  
       base.OnModelCreating(builder);  
     }  
   }  
 }  

2. This step might not seem graceful, because it should be handled fully by Migrations in Entity Framework. But schemas and Entity Framework 7 are currently not working as desired, and they are not working with Migrations.

Go to SQL Server Management Studio. If your database is not created at this point, create it. Then run CREATE SCHEMA <schema name>; in this case it will be CREATE SCHEMA Accounts (a guarded version is sketched below). As mentioned, this part should have been handled by Migrations.
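For convenience, a guarded version that can be re-run without failing if the schema already exists (using the Accounts schema from the example):

IF NOT EXISTS (SELECT 1 FROM sys.schemas WHERE name = 'Accounts')
    EXEC('CREATE SCHEMA Accounts');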

3. Run Migrations

4. Continue with your project :-)

Taking this a step further, you could consider having a DbContext for each domain of your app, and each of these contexts could have its own schema. A hedged sketch of what that might look like follows.
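For illustration only, here is a hypothetical Products context reusing the same ForRelational().Table pattern as above; the Product entity and the schema/table names are made up for the example:

 using Microsoft.Data.Entity;  
 namespace Example.Models  
 {  
   public class Product  
   {  
     public int Id { get; set; }  
     public string Name { get; set; }  
   }  
   public class ProductsDbContext : DbContext  
   {  
     public DbSet<Product> Products { get; set; }  
     protected override void OnModelCreating(ModelBuilder builder)  
     {  
       // Same workaround as for the identity tables: pin the table to its own schema  
       builder.Entity<Product>().ForRelational().Table("Products", "Products");  
       base.OnModelCreating(builder);  
     }  
   }  
 }  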

Wednesday, April 29, 2015

Keeping your Azure Website warm and up to speed

So, you have deployed a web app to an Azure Website. As one might expect, the first request to the site is slow; it might take 10 seconds or maybe even more to respond. That is because the web site is unloaded (cold) and has to load in (warm up). This first request loads the site, and once it is loaded, it responds quite fast (depending on your code).

But after a period (around 30 minutes) without traffic, your web site unloads again. To get it loaded again, it needs a request, and again that request takes time.

In the Basic and Standard plans for Azure Websites, you can disable this behavior by enabling the Always On option. That is the quick fix :-). If you are using Free or Shared Azure Websites, you can consider the following strategies:
  1. Do nothing, if you can live with it. If/when your site is frequently visited, it is not a problem; only websites with low traffic, such as new sites, suffer from this issue.
  2. Use one of the 'ping my app' services on the web to request your site. I'm quite sure this is a solution you should avoid.
  3. Find/own/borrow/invent a machine which is on 24/7 and set up a job that makes a request to your site every 5-10 minutes.
  4. Keep it all in Azure: create an Azure WebJob that makes a request to your site every 5-10 minutes.

I'll explain solution 4; it might have some cons regarding pricing, but I'll cover that too.

Azure WebJobs

Azure WebJobs can handle the following file extensions:

.cmd, .bat, .exe, .ps1, .sh, .php, .py, .js, .jar

I'll show an example in C#, where we make a console .exe file. The Azure WebJob code:

 using System.Net;  
 namespace HeartBeat  
 {  
   class Program  
   {  
     static void Main(string[] args)  
     {  
       // Hit a lightweight endpoint so the site stays loaded  
       var webReq = (HttpWebRequest)WebRequest.Create("http://<your site>/special_ping_endpoint");  
       webReq.Method = "GET";  
       using (webReq.GetResponse())  
       {  
       }  
     }  
   }  
 }  

Now, this is important. Do not ping your landing page, e.g. www.example.com; it might cost more resources, especially if your landing page makes web calls and database lookups behind the scenes, and that might be noticeable on your Azure bill. Create a special minimal endpoint for the purpose, and make it return an empty result.

An example of such an endpoint in

ASP.NET MVC 5 and prior
 using System.Web.Mvc;  
 namespace ActionHandlers.Controllers  
 {  
   public class PingController : Controller  
   {  
     public ActionResult Get()  
     {  
       return new EmptyResult();  
     }  
   }  
 }  


Or ASP.NET 5 MVC 6
 using Microsoft.AspNet.Mvc;  
 namespace ActionHandlers.Controllers  
 {  
   public class ExampleController : Controller  
   {  
     public IActionResult Get()  
     {  
       return new EmptyResult();  
     }  
   }  
 }  


Azure WebJob considerations regarding costs

The WebJob is the thing that will probably cost you, but it depends. Azure WebJobs depend on the Azure Scheduler, and the Azure Scheduler comes in 3 plans: Free, Standard and Premium. The biggest difference for our case is how frequently a job can run. With the Free plan, a job can run once an hour, while with Standard and Premium it can run once a minute.

So for a newly created web site, the optimal choice would be the Standard plan with a ping every 5-10 minutes to keep your site warm, but that is a bit pricey. You could instead use the Free plan, ping once an hour, and hope for a visit within 15-30 minutes after the ping, so your site stays warm until the next ping. You could consider the following strategies for keeping your site warm:
  1. Completely new and fresh website: traffic is going to be very light, use the Standard plan.
  2. Website with light traffic: use the Free plan.
  3. Website with frequent traffic: no plan needed.
Using Free or Shared Websites with an Azure Scheduler is still cheaper than switching to Basic Websites and using Always On. Also, using a scheduler should only be a temporary solution until your site has good traffic. 

Alternative WebJob way

Making an Azure WebJob with a thread sleep of 5-10 minutes and running it continuously without a scheduler is not a recommendable solution, because Azure can unload websites even when they have associated unscheduled WebJobs.

WebJob Installation

Put your WebJob code into a zip; in our case that will be the compiled exe and config file from either the Debug or Release folder of your Visual Studio project (I'll take the liberty of assuming you're using Visual Studio).

Go to the dashboard for your Website, find the WebJobs tab, and add the WebJob.

Custom Action Results in ASP.NET 5 (VNEXT) (MVC6)

Before we start, you should be aware of this: the post is based on the ASP.NET 5 version that Visual Studio 2015 CTP 6 pulls down when it creates an ASP.NET 5 project, meaning a pre-release of ASP.NET 5. Things in ASP.NET 5 can change and outdate this post. It is highly unlikely, but it can happen.

Even though I keep referring to MVC 6, this is still a post about ASP.NET 5 (a.k.a. ASP.NET vNext, it's the same thing), and MVC 6 is a part of ASP.NET 5.

Why would I write a custom ActionResult

Yes, good question. Plenty is supported out of the box in ASP.NET 5, but sometimes you end up in a situation where you need something special. When I developed Your Favorite Snippet Tool, I needed to transfer binary data in a certain way to provide the best user experience, so I created a custom ActionResult to handle it.

ActionResult in MVC 6 compared to earlier versions

ActionResults have changed a bit since the prior versions of MVC. Yes, you still have to inherit from ActionResult, and yes, you still have to override an ExecuteResult method when making custom ActionResults.

The most noticeable difference is that in MVC 6, ExecuteResult has another signature compared to prior versions, and an ExecuteResultAsync has been added.

ExecuteResult for ASP.NET MVC 5 and Prior


public abstract void ExecuteResult(ControllerContext context)

ExecuteResult for ASP.NET MVC 6


public virtual Task ExecuteResultAsync(ActionContext context)
public virtual void ExecuteResult(ActionContext context)

Two things you might notice: the methods in MVC 6 are virtual, and they take an ActionContext instead of a ControllerContext. There is not much to say about the contexts; they are very similar. Because the methods are virtual, there is no override constraint: you can override either ExecuteResultAsync, ExecuteResult or both, but you are not forced to.

Which to override, ExecuteResultAsync or ExecuteResult

It depends, but preferably ExecuteResultAsync, because it is the one that is actually called. Inside ActionResult, which you have to inherit from, the following logic is happening:

 public abstract class ActionResult : IActionResult  
 {  
   public virtual Task ExecuteResultAsync(ActionContext context)  
   {  
     ExecuteResult(context);  
     return Task.FromResult(true);  
   }  
   public virtual void ExecuteResult(ActionContext context)  
   {  
   }  
 }  

So you see, ExecuteResultAsync is still called even if you only override ExecuteResult. Plus, it would not make sense to enforce overriding both methods.


Show me some code

I have made a string writer result; it's not the most exciting example, but it proves the point.

The custom ActionResult

 using Microsoft.AspNet.Mvc;  
 using System.Text;  
 using System.Threading.Tasks;  
 namespace CustomActionResults  
 {  
   internal class StringWriterResult : ActionResult  
   {  
     private byte[] _stringAsByteArray;  
     public StringWriterResult(string stringToWrite)  
     {  
       _stringAsByteArray = Encoding.ASCII.GetBytes(stringToWrite);  
     }  
     public override Task ExecuteResultAsync(ActionContext context)  
     {  
       context.HttpContext.Response.StatusCode = 200;  
       return context.HttpContext.Response.Body.WriteAsync(_stringAsByteArray, 0, _stringAsByteArray.Length);  
     }  
   }  
 }  


The string writer ActionResult in action:

 using CustomActionResults;  
 using Microsoft.AspNet.Mvc;  
 namespace ActionHandlers.Controllers  
 {  
   public class ExampleController : Controller  
   {  
     public IActionResult Get()  
     {  
       return new StringWriterResult("Hello World!");  
     }  
   }  
 }  

Insert the code in an ASP.NET 5 project, and you should get Hello World! when hitting ~/Example/Get.


Monday, April 27, 2015

Introducing Your Favorite Snippet Tool

Snippets in Visual Studio and SQL Server Management Studio are a great help and tremendous time savers. Unfortunately, the con of VS and SSMS snippets is that they are tedious to create. I have a feeling that makes custom snippets less appealing to use, because creating them involves working with XML, and VS and SSMS offer no help.

History

Well, a couple of days ago I decided to make a snippet of some SQL which I had realized I would have to write regularly in the future. I was pretty tired of writing this SQL, and then I remembered: to create a snippet you have to set up an XML document. Then I got really tired. Hoping for an easy solution, I searched the web for an online snippet creator. All I found were tools that had to be downloaded. No offence, these downloadables are probably mighty fine, but nowadays I think a tool like a snippet creator should be online: easy to reach, and not another downloaded tool to clutter your computer.

Priorities can be strange sometimes, and I decided that I would rather write an online tool myself that could make snippets than handcraft another snippet.

So here, after a small coding marathon, I present to you:

YOUR FAVORITE SNIPPET TOOL (that is the name)



Enjoy!

FAQ

Q: Why is the link www.snippettool.net and not www.yourfavoritesnippettool.com when the tool name is Your Favorite Snippet Tool?
A: For your convenience. It is much easier to remember www.snippettool.net and type it correctly. 

Q: The first release is version 0.8.0, is it production ready?
A: Yes. There are some extended features for VB which will come in a later release, and I have some ideas for UI improvements. Also, until I have had some more feedback, it wouldn't feel right to call it version 1.0.0.

Q: VSI packages are supported, what about VSIX packages?
A: VSIX does not support snippets by default. Hacks must be applied to make VSIX work with snippets.

Q: The Visual Studio Content Installer does not install the VSI package to Visual Studio 2xxx, why?
A: Because the Visual Studio Content Installer is a strange piece of software, especially if you have multiple versions of Visual Studio on your machine.

Q: I found a bug, what to do?
A: I would appreciate it if you would write to me about it. Contact information is at the bottom of the page. 


Saturday, April 11, 2015

SSIS: An easy SCD optimization for dev and prod

The value of reading this post depends on how you work with SSIS and how database maintenance is handled within your organization.

The optimization is a single index, but if you only maintain indexes in prod, you could waste a great deal of time when developing SCDs in SSIS. The method is simple: when you know the nature of your SCD, you can create an index right away and reduce your development waiting time, especially if you are testing with bigger volumes of data.

Let me show you

Let's say you have the following table definitions, and you are working in an SSIS project using Visual Studio:

-- Staging
CREATE TABLE Staging.Customers
(
    CustomerId UNIQUEIDENTIFIER,
    FirstName NVARCHAR(200),
    MiddleInitials NVARCHAR(200),
    LastName NVARCHAR(200),
    AccountId INT,
    CreationDate DATETIME2
)
GO

-- Dimension
CREATE TABLE dbo.dimCustomers
(
    CustomerDwhKey INT IDENTITY(1,1),
    [Current] BIT,
    CustomerId UNIQUEIDENTIFIER,
    FirstName NVARCHAR(200),
    MiddleInitials NVARCHAR(200),
    LastName NVARCHAR(200),
    AccountId INT,
    CreationDate DATETIME2,
    CONSTRAINT PK_CustomerId PRIMARY KEY CLUSTERED (CustomerDwhKey)
)
GO


You have a Data Flow, where you transfer data from Staging.Customers to the dimension dbo.dimCustomers using the built-in component Slowly Changing Dimension:


In our example setup, CustomerId will be a so-called business key, and Current will be the indicator of which row is current. It should also be noted that it is possible to have more than one business key.


We'll configure attributes as:


Now, the Slowly Changing Dimension component works in the following way:

For each entity it receives, it will search the dimension table for an entity that has the same business keys (in plural!) and is flagged as current, or in plain SQL:

SELECT  attribute[, attribute] FROM dimension_table WHERE current_flag = true AND business_key = input_business_key[, business_key = input_business_key]

Or as it will look like in our example

SELECT AccountId, CreationDate, FirstName, MiddleInitials, LastName FROM dbo.dimCustomers WHERE [Current] = 1 AND CustomerId = some_key

Further, in case of a historical change, the current entity in the dimension must be expired by setting Current = 0.

UPDATE dbo.dimCustomers SET [Current] = 0 WHERE [Current] = 1 AND CustomerId = some_key 

The solution

As you might have realized by now, we can improve performance tremendously by putting an index on the current flag and the business keys (again, plural!). For each entity passing through the Slowly Changing Dimension component, there will be at least one, but likely two, searches in the dimension table. By knowing the business keys, the current flag and the nature of the Slowly Changing Dimension component, you can predict the index that will improve performance.

The index for our sample will be

CREATE NONCLUSTERED INDEX IX_Current_CustomerId ON dbo.dimCustomers
(
    [Current],
    CustomerId -- Remember to include each business key
)

Should the index be filtered? I'll let that be up to you; a hedged sketch of the filtered variant follows.
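If you do go for the filtered variant, it could look something like the sketch below. Whether the optimizer actually uses it depends on how the Slowly Changing Dimension component parameterizes its lookups, so test it against your own load:

CREATE NONCLUSTERED INDEX IX_Current_CustomerId_Filtered ON dbo.dimCustomers
(
    CustomerId -- one column per business key
)
WHERE [Current] = 1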

Indexes, bulk and loading of dimensions

Some tend to drop indexes when loading a dimension, with the argument that bulk loading is fastest without indexes, which SQL Server otherwise has to maintain while loading. This argument has to be revised when working with the Slowly Changing Dimension component.

Because the component searches the dimension so heavily, it will (in general) be faster to load with indexes than without. If there are no indexes, each entity going through the component will require at least one table scan, which is quite expensive and gets more expensive as your dimension grows. 

That's all