Sunday, March 6, 2016

Hadoop from scratch notes: Preparing a minimal CentOS Linux Hyper-V image

Motivation

Hadoop on virtual machines? These posts describe how to set up a Hadoop homelab, to get hands-on with Hadoop. Yet, replace 'virtual' with 'dedicated physical', and you should be on your way to building a production cluster.

Hadoop is yet another good tool in the toolbox when working with data. Nowadays Hadoop is available as a cloud service, but that can be pretty expensive, especially if you just want to train and play with Hadoop. Some vendors, such as Cloudera, offer a single-node 'play' version of Hadoop, which is a great way to start. Yet the reason I'm writing these notes is that I found Cloudera very closed and slow, and it required a virtual machine system other than Hyper-V. Besides, it is not that hard to set up a Hadoop node or cluster from scratch.

Not that I have anything against e.g. VirtualBox. Even though I see all OSes as my playgrounds, I'm in a Microsoft period (due to my current work), and thereby my Windows box is best suited for virtualization. It already comes with Hyper-V, which actually works well in Windows 10 (earlier versions locked the CPU clock frequency, and thereby disabled SpeedStep). I like my machines light, so I would hate to have more than one system for virtualization.

What is the goal?

The goal is to prepare a virtual machine with a minimal version of CentOS Linux. The reason I have selected CentOS is that it is supported by Microsoft and is Azure certified, and being Azure certified means it works better with Hyper-V through the Hyper-V Integration Services. I could have chosen Ubuntu (Azure's Hadoop cloud solution runs on Ubuntu), but I had trouble with a very slow apt-get, and generally found CentOS more lightweight.

Once we have a fully configured virtual machine with CentOS and Hadoop, we are going to use it as a template for creating more Hadoop nodes.

I prefer to set up Hyper-V with PowerShell; it is good fun and practice, and it is more compact than screenshots of the GUI. If you are familiar with the Hyper-V GUI, you should have no trouble figuring out what to click.

Before we start, make sure Hyper-V is enabled, and get CentOS from https://www.centos.org/download/; the minimal ISO should be sufficient (CentOS 7 is currently the latest version).

A virtual switch

If you don't have a virtual switch configured in Hyper-V, you have to configure one. You are going to use it to connect your Hadoop nodes, your working machine, and the internet together (though the internet is optional). Creating a so-called external virtual switch called "Virtual Switch" (yes, I know, the creative name is striking :-) ) is done with the following PowerShell:

New-VMSwitch -Name "Virtual Switch" -NetAdapterName "Wi-Fi" -AllowManagementOS 1

As NetAdapterName, use "Wi-Fi" or "Ethernet", depending on which NIC provides internet.
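If you are unsure which name your NIC has, the adapters and their connection state can be listed with the standard `Get-NetAdapter` cmdlet:

```powershell
# List network adapters; pick the Name of the one that is 'Up'
# and provides your internet connection.
Get-NetAdapter | Select-Object Name, InterfaceDescription, Status
```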

The virtual machine and disk

Often the virtual machine and the disk are treated as one, but a virtual machine consists of the "machine" itself and a disk image with the OS, plus possibly some data disks. We are going to create the machine and the OS disk in one go.

New-VM -Name "Hadoop01" -MemoryStartupBytes 4GB -NewVHDPath D:\VMs\Hadoop01.vhdx -NewVHDSizeBytes 10GB -SwitchName "Virtual Switch"

Memory (4 GB) and disk (10 GB) sizes are dynamic by default, but the machine is configured with only 1 virtual CPU. That can be changed with:

Set-VMProcessor -VMName Hadoop01 -Count 2

Make the virtual DVD drive point to the downloaded CentOS image:

Set-VMDvdDrive -VMName Hadoop01 -Path D:\Downloads\CentOS-7-x86_64-Minimal-1511.iso
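Before starting the machine, it can be worth sanity-checking what we have built so far. A possible check, using the standard Hyper-V cmdlets:

```powershell
# Verify CPU count, startup memory and the mounted ISO
Get-VM -Name Hadoop01 | Select-Object Name, ProcessorCount, MemoryStartup
Get-VMDvdDrive -VMName Hadoop01 | Select-Object Path
```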

Let's go:

Start-VM Hadoop01

You have to connect to the virtual machine through the Hyper-V GUI.

Installing CentOS

Press Enter. It might take a while before you reach the next step.



Select your preferred language


Check that the properties circled in yellow are correct; that will make things easier for you in general. The properties circled in red are critical, so make sure to read below how to set them.

Press 'Done'; that is all.

Turn on the network. If you fail to do this, you may have to turn it on after every reboot.

Set the root password, and create a regular user, as good practice.

After installation and reboot, log in so we can get the IP address of our new machine by typing the following command (ifconfig is not available on CentOS minimal):

ip addr
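If you prefer to grab just the address instead of reading the full output, it can be filtered with awk. The interface name and the address below are sample values for illustration, not from the actual machine; on the VM you would pipe the real `ip -4 addr show eth0` output instead:

```shell
# Extract the IPv4 address from `ip addr`-style output.
# 'sample' stands in for the real command output (example values).
sample='2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
    inet 192.168.1.42/24 brd 192.168.1.255 scope global eth0'

printf '%s\n' "$sample" | awk '/inet /{sub(/\/.*/, "", $2); print $2}'
# prints 192.168.1.42
```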

Note the IP address; it can be found under eth0.
We are not going to use the Hyper-V viewer further. It can't copy-paste between guest and host, and the proper way to connect to a Linux/Unix server is via an SSH client. I recommend PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html), but Git Bash works just as well.
In PuTTY, type in the IP and press Open.

If using the Git Bash, you can write:

ssh <ip> -l <user>

Where <IP> is the noted IP and <user> is either root or the user created earliere.

Installing/Upgrading Microsoft Linux Integration Services (LIS)

We don't have much in the CentOS minimal install, and Microsoft hasn't made it easy to download the LIS package without a browser.
Fortunately, it is GPL licensed, so I have made a script, hosted on my GitHub account, that fetches and installs it, together with wget.

curl -O https://raw.githubusercontent.com/ChristianHenrikReich/automation-scripts/master/centos-minimal/install-hyperv-essentials.sh

chmod 755 install-hyperv-essentials.sh

sudo ./install-hyperv-essentials.sh

When the script is done, the virtual machine is fully Hyper-V prepped and ready to go, and it can be used for other things than Hadoop as well.
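A quick way to confirm that the Hyper-V drivers are actually active is to look for the hv_* kernel modules; on the VM itself, `lsmod | grep hv_` should list them. The helper below just counts the standard driver names in lsmod-style text, demonstrated here on a captured sample rather than a live system:

```shell
# Count the standard Hyper-V guest driver modules in lsmod-style output.
count_hv_modules() {
  grep -cE '^(hv_vmbus|hv_netvsc|hv_storvsc|hv_utils) '
}

# Demo on sample lsmod output (example values):
sample='hv_netvsc              57344  0
hv_storvsc             20480  2
hv_utils               28672  1
hv_vmbus               90112  4 hv_netvsc,hv_storvsc,hv_utils
ext4                  593920  1'
printf '%s\n' "$sample" | count_hv_modules
# prints 4
```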

Next: How to install Hadoop on the image