Lightweight Virtual Machine Hadoop Distributed File System Deployment


Posted by: Anas Katib
Last updated: 04-03-2017

Need more details? See References.

If you just want to experiment with the Hadoop Distributed File System (HDFS), there is no need to set it up permanently on your local machine. Furthermore, if Windows is your primary operating system, you do not have to install Linux on your machine in order to give HDFS a try.

You can set up HDFS on a virtual machine and interact with it through your local operating system. However, running a virtual machine can be very frustrating if you don't have a large amount of memory. Therefore, in this tutorial we are going to use a lightweight operating system for the virtual machine: Arch Linux.

Prerequisites:

  • Download and install VirtualBox for your host machine.
  • Download Arch Linux: use any of the HTTP direct download links to download the "archlinux-****.iso" file.
Add network interfaces from the general network preferences:
  • NAT Networks
Add a new NAT network if none exists.
Edit the network: enable the network and disable IPv6 support
  • Host-only Networks
Add a host-only network
Edit the network: do not enter anything for the IPv6 address
Set the DHCP server configuration.
Create a new virtual machine and set up the operating system and memory:
Give your VM a name, and select "Linux" as the type and "Arch Linux" as the version. 1 GB of memory is sufficient for this tutorial.
Create a fixed-size virtual hard disk.
Click on Settings to set up the created VM instance:
  • Storage
Select the "Empty" slot under Controller: IDE, then click on the CD icon and choose the downloaded "archlinux-****.iso" file.
Next, click on the network tab.
  • Network
Enable Adapter 1 and attach it to a Bridged Adapter.
Enable Adapter 2 and attach it to a Host-only Adapter.
  • Shared Folders (optional but preferred)
Click on "+" to add a new shared folder.
Create a new folder (or specify an existing one) and remember the folder name. Click OK.
VM instance setup is complete. Click on OK.
The following section is based on David Goguen's tutorial: How to Install Arch Linux.
I suggest that you watch the whole video first (at increased playback speed), then continue with this tutorial. There is only a slight variation in some steps.
Install the operating system on the virtual hard disk:
Start the VM instance.
Using the keyboard: select "Boot Arch Linux" and press enter.
Partition the /dev/sda Hard Disk:
Type "cfdisk /dev/sda" and press enter.
Select "dos" for the label type.
  • Swap Partition
Create a "New" (swap) partition
Set size to "512M".
Select as "primary" (not extended) partition.
  • Root Partition
Create a "New" (priamry) partition in the "Free space"
Set size to "4.5G".
Select as "primary" (not extended) partition.
Select the newly created root partition and make it "Bootable"
  • Swap Partition Type
Select the swap partition "/dev/sda1" and change its "Type".
Set partition type to "82 Linux swap".
  • Save Created Partitions
Select "Write" and press enter.
Type "yes" and press enter.
Select "Quit" and press enter.
  • Format and Mount Root Partition
Type "mkfs.ext4 /dev/sda2" and press enter.
Type "mount /dev/sda2 /mnt" and press enter.
  • Make and Enable Swap Area on Swap Partition
Type "mkswap /dev/sda1" and press enter.
Type "swapon /dev/sda1" and press enter.
  • Install OS on Root Partition
Type "pacstrap /mnt base base-devel" and press enter.
  • Configure OS
    • Users
Type "arch-chroot /mnt" and press enter.
(Optional) Type "passwd" and press enter. Enter a password.
Type "useradd -m -g users -s /bin/bash hduser" and press enter.
Type "passwd hduser" and press enter. Enter a password.
Type "nano /etc/sudoers" and press enter.
Add the line "hduser ALL=(ALL) ALL" then Write Out, enter and Exit.
    • Language
Type "nano /etc/locale.gen", press enter, and uncomment your language then Write Out, enter and Exit.
Type "locale-gen" and press enter.
    • Time Zone
Type "ln -f -s /usr/share/zoneinfo/US/Central /etc/localtime"
and press enter. (You can replace US/Central with your timezone.)
    • Hostname
Type "echo vbox > /etc/hostname" and press enter.
    • Bootloader
Type "pacman -S grub-bios" and press enter, then "Y" and enter.
Type "grub-install /dev/sda" and press enter.
Type "mkinitcpio -p linux" and press enter.
Type "grub-mkconfig -o /boot/grub/grub.cfg" and press enter.
    • FSTAB
Type "exit" and press enter to exit arch-chroot.
Type "genfstab /mnt >> /mnt/etc/fstab" and press enter.
Type "umount /mnt" and press enter.
Type "shutdown now" and press enter.
    • Boot Order
Click on "Settings" to set up the boot media.
Click on "System" and uncheck "Floppy" and "Optical" from the "Motherboard" : "Boot Order".
Then click "OK".
    • Login
Start the instance and select "Arch Linux" and press enter.
Log in with the "hduser" username and its password.
    • Internet
Test internet connectivity by executing "ping google.com". If you get a response, you can stop pinging by pressing "Ctrl+C".
If you get a "Name or service not known" error and you have internet access on your host machine, it is likely that the DHCP service was not started.
Therefore, type "sudo systemctl enable dhcpcd.service" and press enter. Next, type "sudo systemctl start dhcpcd.service" and press enter.
    • Refresh Packages
Type "sudo pacman -Sc" and press enter, then Y Y.
Type "sudo pacman -Syu" and press enter, then Y.
If you receive a "Failed to commit transaction (conflicting files)" error, rename (or delete) the file that is causing the error. For example: type "sudo mv /etc/ssl/certs/ca-certificates.crt new.crt" and press enter, then re-execute the previous command.
    • SSH
Type "sudo pacman -S openssh" and press enter, then Y.
Type "sudo systemctl enable sshd" and press enter, then type "sudo systemctl start sshd" and press enter.
Type "hostname --ip-address" and press enter to get the IP address of the virtual machine.
From a terminal (or PuTTY on Windows) on your host machine, type "ssh hduser@[VM IP]" and press enter, then "yes" and your password.
    • Headers and Guest Additions
Type "sudo pacman -S linux-headers" then press enter, then Y.
Type "sudo pacman -S virtualbox-guest-utils-nox" and press enter, then 1 and Y.
Type "uname -r" and press enter to find out your kernel release. Next, generate and load the required modules by executing the following commands, replacing KERNEL_RELEASE with the release string you just obtained:
"sudo depmod KERNEL_RELEASE"
"sudo modprobe -a vboxguest vboxsf vboxvideo"
    • Java
Type "sudo pacman -S jdk7-openjdk" then press enter, then Y.
    • Shared Folder
Create a mount point (i.e. a directory) and mount the shared folder by executing the following commands:
"mkdir ~/shared_folder"
"sudo mount -t vboxsf -o gid=1000,uid=1000 shared_folder ~/shared_folder"

If successful, executing the following command will create a file that is viewable from both the guest and host machines:
"echo 'Hello World' > ~/shared_folder/test_file.txt".
In order to make the mounting permanent, add an entry to /etc/fstab. Type "sudo nano /etc/fstab" and press enter.
Add the following line: "shared_folder /home/hduser/shared_folder vboxsf uid=1000,gid=1000 0 0" then Write Out, enter then Exit.
Enable the creation of symbolic links on the shared folder by executing the following command in your host machine's terminal:

- Unix-based host:
"VBoxManage setextradata VM_NAME VBoxInternal2/SharedFoldersEnableSymlinksCreate/SHARED_FOLDER_NAME 1"

- Windows host:
"VBoxManage.exe setextradata VM_NAME VBoxInternal2/SharedFoldersEnableSymlinksCreate/SHARED_FOLDER_NAME 1"
In this section, we are going to download the Hadoop binaries and configure HDFS. For convenience, you can download the Hadoop files into the shared folder and edit them using a text editor with a GUI.
Get a download link from an HTTP mirror for the binary tarball at hadoop.apache.org/releases.html (Hadoop 2.7.3).
  • Download Hadoop Binaries
Download the file directly into your shared folder.
Alternatively, you can copy the download URL and download the file from the terminal by executing:
"curl -O http://apache.claz.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz"
Extract the compressed tarball by executing:
"tar xzf hadoop-2.7.3.tar.gz"
  • Pseudo-Distributed Configuration

  • Edit the following files and copy the corresponding content:
    • core-site.xml
Type "nano hadoop-2.7.3/etc/hadoop/core-site.xml" and press enter to edit the file.

					 
edit core-site.xml
Enter the following configuration using your machine's IP address:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://YOUR.IP.ADDRESS:9000</value>
  </property>
</configuration>
    • yarn-site.xml
Type "nano hadoop-2.7.3/etc/hadoop/yarn-site.xml" and press enter to edit the file. Enter the following configuration:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
    • mapred-site.xml
Create mapred-site.xml from the provided mapred-site.xml.template file.
Copy template:
"cp hadoop-2.7.3/etc/hadoop/mapred-site.xml.template hadoop-2.7.3/etc/hadoop/mapred-site.xml"
Edit the file:
"nano hadoop-2.7.3/etc/hadoop/mapred-site.xml"
Enter the following configuration:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
    • hdfs-site.xml
Edit the hdfs-site.xml file
"nano hadoop-2.7.3/etc/hadoop/hdfs-site.xml"
and enter the following configuration:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>

  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>
  • Hadoop Variables
Type "nano ~/.bashrc" and enter the following lines:
#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_COMMON_LIB_NATIVE_DIR"
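To confirm the exports took effect after running "source ~/.bashrc", you can inspect PATH. A quick sanity-check sketch, simulated below with the same variable layout as the entries above:

```shell
# Same layout as the ~/.bashrc entries above.
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

# Both Hadoop directories should now appear on PATH; save the matches
# to a file for inspection.
echo "$PATH" | tr ':' '\n' | grep hadoop | tee hadoop_path_entries.txt
```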
Type "nano hadoop-2.7.3/etc/hadoop/hadoop-env.sh"
and modify the line that specifies the used Java implementation as follows:
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
  • Hadoop Directory
Execute the following commands:
"sudo mv hadoop-2.7.3 /usr/local/hadoop"
"sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/{namenode,datanode}"
"sudo chown hduser:users -R /usr/local/hadoop"
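Note that the {namenode,datanode} part of the mkdir command is bash brace expansion, which creates both directories in one call. Demonstrated here under /tmp rather than /usr/local so it needs no root privileges:

```shell
# Brace expansion (a bash feature) turns the single argument into
# .../hdfs/namenode and .../hdfs/datanode; -p creates parent directories.
bash -c 'mkdir -p /tmp/hadoop_demo/hadoop_data/hdfs/{namenode,datanode}'

# Both directories now exist.
ls /tmp/hadoop_demo/hadoop_data/hdfs
```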
  • Passphraseless SSH
Execute the following commands on the guest machine to enable ssh login without a passphrase:
"ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa"
"cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys"
"chmod 0600 ~/.ssh/authorized_keys"
Execute the following command on the guest machine to add the localhost to the list of known hosts:
"ssh-keyscan -H localhost,0.0.0.0,YOUR_IP_ADDRESS >> ~/.ssh/known_hosts"
  • Format HDFS
Execute the following commands on the guest machine:
"source ~/.bashrc"
"hdfs namenode -format"
  • Start HDFS and YARN
Execute the following command on the guest machine:
"start-dfs.sh && start-yarn.sh"
  • Check out the Web UI for HDFS and YARN
Hadoop (NameNode):
YOUR.IP.ADDRESS:50070

YARN:
YOUR.IP.ADDRESS:8088

Note that the HDFS master itself listens on port 9000:
YOUR.IP.ADDRESS:9000
  • Test HDFS
Create a directory on HDFS:
"hadoop fs -mkdir /helloDir"

Copy a local file to the created directory:
"hadoop fs -copyFromLocal shared_folder/test_file.txt /helloDir/"

Print its content:
"hadoop fs -cat /helloDir/test_file.txt"
Test your setup from your host machine with Java code in an IDE.

Sample Code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class App {
  public static void main(String[] args) throws IOException {
    Logger.getRootLogger().setLevel(Level.OFF);

    Configuration config = new Configuration();
    config.set("fs.default.name", "hdfs://YOUR.IP.ADDRESS:9000/");

    Path inputFilePath = new Path("/helloDir/test_file.txt");
    FileSystem dfs = FileSystem.get(config);
    BufferedReader br = new BufferedReader(new InputStreamReader(dfs.open(inputFilePath)));
    String line;
    while ( (line = br.readLine()) != null) {
      System.out.println(line);
    }

    dfs.close();
  }
}
Use the following pom.xml to automatically download all of the requirements:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>edu.umkc.sce.csee.dbis.hadoop</groupId>
  <artifactId>Hadoop</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>Hadoop</name>
  <url>http://maven.apache.org</url>
  <properties>
   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
   <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>3.8.1</version>
    <scope>test</scope>
   </dependency>
   <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
   </dependency>
  </dependencies>
</project>

  • End
Remember to stop DFS and YARN before you turn off the guest machine or before modifying the configuration files:
"stop-yarn.sh && stop-dfs.sh"

Turn off the guest machine and exit when done:
"sudo shutdown && exit"

The machine will be turned off in 60 seconds.