Lightweight Virtual Machine Hadoop Distributed File System Deployment


Posted by: Anas Katib
Last updated: 04-03-2017

Need more details? See References.

If you just want to experiment with the Hadoop Distributed File System (HDFS), there is no need to set it up permanently on your local machine. Furthermore, if Windows is your primary operating system, you do not have to install Linux on your machine in order to give HDFS a try.

You can set up HDFS on a virtual machine and interact with it through your local operating system. However, running a virtual machine can be very frustrating if you don't have a large amount of memory. Therefore, in this tutorial we are going to use a lightweight operating system for the virtual machine: Arch Linux.

Prerequisites:

  • Download and install VirtualBox for your host machine.
  • Download Arch Linux: use any of the HTTP direct download links to download the "archlinux-****.iso" file.
Add network interfaces from the general network preferences:
  • NAT Networks
Add a new NAT network if none exists.
Edit the network: enable the network and disable IPv6 support
  • Host-only Networks
Add a host-only network
Edit the network: do not enter anything for the IPv6 address
Set the DHCP server configuration.
Create a new virtual machine and set up the operating system and memory:
Give your VM a name, and select "Linux" as the type and "Arch Linux" as the version. 1 GB of memory is sufficient for this tutorial.
Create a fixed-size virtual hard disk.
Click on Settings to set up the created VM instance:
  • Storage
Select the "Empty" slot under Controller: IDE, then click on the CD icon and choose the downloaded "archlinux-****.iso" file.
Next, click on the network tab.
  • Network
Enable Adapter 1 and attach it to a Bridged Adapter.
Enable Adapter 2 and attach it to a Host-only Adapter.
  • Shared Folders (optional but preferred)
Click on "+" to add a new shared folder.
Create a new folder (or specify an existing one) and remember the folder name. Click OK.
VM instance setup is complete. Click on OK.
The following section is based on David Goguen's tutorial: How to Install Arch Linux.
I suggest that you watch the whole video first (at increased playback speed), then continue with this tutorial. There is only a slight variation in some steps.
Install the operating system on the virtual hard disk:
Start the VM instance.
Using the keyboard: select "Boot Arch Linux" and press enter.
Partition the /dev/sda Hard Disk:
Type "cfdisk /dev/sda" and press enter.
Select "dos" for the label type.
  • Swap Partition
Create a "New" (swap) partition
Set size to "512M".
Select as "primary" (not extended) partition.
  • Root Partition
Create a "New" (priamry) partition in the "Free space"
Set size to "4.5G".
Select as "primary" (not extended) partition.
Select the newly created root partition and make it "Bootable"
  • Swap Partition Type
Select the swap partition "/dev/sda1" and change its "Type".
Set partition type to "82 Linux swap".
  • Save Created Partitions
Select "Write" and press enter.
Type "yes" and press enter.
Select "Quit" and press enter.
  • Format and Mount Root Partition
Type "mkfs.ext4 /dev/sda2" and press enter.
Type "mount /dev/sda2 /mnt" and press enter.
  • Make and Enable Swap Area on Swap Partition
Type "mkswap /dev/sda1" and press enter.
Type "swapon /dev/sda1" and press enter.
  • Install OS on Root Partition
Type "pacstrap /mnt base base-devel" and press enter.
  • Configure OS
    • Users
Type "arch-chroot /mnt" and press enter.
(Optional) Type "passwd" and press enter. Enter a password.
Type "useradd -m -g users -s /bin/bash hduser" and press enter.
Type "passwd hduser" and press enter. Enter a password.
Type "nano /etc/sudoers" and press enter.
Add the line "hduser ALL=(ALL) ALL" then Write Out, enter and Exit.
    • Language
Type "nano /etc/locale.gen", press enter, and uncomment your language then Write Out, enter and Exit.
Type "locale-gen" and press enter.
    • Time Zone
Type "ln -f -s /usr/share/zoneinfo/US/Central /etc/localtime"
and press enter. (You can replace US/Central with your timezone.)
    • Hostname
Type "echo vbox > /etc/hostname" and press enter.
    • Bootloader
Type "pacman -S grub-bios" and press enter, then "Y" and enter.
Type "grub-install /dev/sda" and press enter.
Type "mkinitcpio -p linux" and press enter.
Type "grub-mkconfig -o /boot/grub/grub.cfg" and press enter.
    • FSTAB
Type "exit" and press enter to exit arch-chroot.
Type "genfstab /mnt >> /mnt/etc/fstab" and press enter.
Type "umount /mnt" and press enter.
Type "shutdown now" and press enter.
    • Boot Order
Click on "Settings" to set up the boot media.
Click on "System" and uncheck "Floppy" and "Optical" from the "Motherboard" : "Boot Order".
Then click "OK".
    • Login
Start the instance and select "Arch Linux" and press enter.
Log in with the "hduser" username and its password.
    • Internet
Test internet connectivity by executing "ping google.com". If you get a response, you can stop pinging by pressing "Ctrl+C".
If you get a "Name or service not known" error and you have internet access on your host machine, it is likely that the DHCP service was not started.
Therefore, type "sudo systemctl enable dhcpcd.service" and press enter. Next, type "sudo systemctl start dhcpcd.service" and press enter.
    • Refresh Packages
Type "sudo pacman -Sc" and press enter, then Y Y.
Type "sudo pacman -Syu" and press enter, then Y.
If you receive a "Failed to commit transaction (conflicting files)" error, rename (or delete) the file that is causing the error. For example: type "sudo mv /etc/ssl/certs/ca-certificates.crt new.crt" and press enter, then re-execute the previous command.
    • SSH
Type "sudo pacman -S openssh" and press enter, then Y.
Type "sudo systemctl enable sshd" and press enter, then type "sudo systemctl start sshd" and press enter.
Type "hostname --ip-address" and press enter to get the IP address of the virtual machine.
From a terminal (or PuTTY on Windows) on your host machine, type "ssh hduser@[VM IP]" and press enter, then "yes" and your password.
    • Headers and Guest Additions
Type "sudo pacman -S linux-headers" then press enter, then Y.
Type "sudo pacman -S virtualbox-guest-utils-nox" and press enter, then 1 and Y.
Type "uname -r" and press enter to find out your kernel release. Next, generate and load the required modules by executing the following commands, replacing KERNEL_RELEASE with the release string you just obtained:
"sudo depmod KERNEL_RELEASE"
"sudo modprobe -a vboxguest vboxsf vboxvideo"
    • Java
Type "sudo pacman -S jdk7-openjdk" then press enter, then Y.
    • Shared Folder
Create a mount point (i.e. a directory) and mount the shared folder by executing the following commands:
"mkdir ~/shared_folder"
"sudo mount -t vboxsf -o gid=1000,uid=1000 shared_folder ~/shared_folder"

If successful, executing the following command will create a file that is viewable from both the guest and host machines:
"echo 'Hello World' > ~/shared_folder/test_file.txt".
In order to make the mounting permanent, add an entry to /etc/fstab. Type "sudo nano /etc/fstab" and press enter.
Add the following line: "shared_folder /home/hduser/shared_folder vboxsf uid=1000,gid=1000 0 0" then Write Out, enter then Exit.
Enable the creation of symbolic links on the shared folder by executing the following command in your host machine's terminal:

- Unix-based host:
"VBoxManage setextradata VM_NAME VBoxInternal2/SharedFoldersEnableSymlinksCreate/SHARED_FOLDER_NAME 1"

- Windows host:
"VBoxManage.exe setextradata VM_NAME VBoxInternal2/SharedFoldersEnableSymlinksCreate/SHARED_FOLDER_NAME 1"
In this section, we are going to download the Hadoop binaries and configure HDFS. For convenience, you can download the Hadoop files into the shared folder and edit them using a text editor with a GUI.
Get a download link from an HTTP mirror for the binary tarball at hadoop.apache.org/releases.html (Hadoop 2.7.3).
  • Download Hadoop Binaries
Download the file directly into your shared folder.
Alternatively, you can copy the download URL and download the file from the terminal by executing:
"curl -O http://apache.claz.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz"
Extract the compressed tarball by executing:
"tar xzf hadoop-2.7.3.tar.gz"
  • Pseudo-Distributed Configuration

  • Edit the following files and copy the corresponding content:
    • core-site.xml
Type "nano hadoop-2.7.3/etc/hadoop/core-site.xml" and press enter to edit the file.

					 
edit core-site.xml
Enter the following configuration using your machine's IP address:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://YOUR.IP.ADDRESS:9000</value>
  </property>
</configuration>
    • yarn-site.xml
Type "nano hadoop-2.7.3/etc/hadoop/yarn-site.xml" and press enter to edit the file. Enter the following configuration:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
    • mapred-site.xml
Create mapred-site.xml from the provided mapred-site.xml.template file.
Copy template:
"cp hadoop-2.7.3/etc/hadoop/mapred-site.xml.template hadoop-2.7.3/etc/hadoop/mapred-site.xml"
Edit the file:
"nano hadoop-2.7.3/etc/hadoop/mapred-site.xml"
Enter the following configuration:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
    • hdfs-site.xml
Edit the hdfs-site.xml file
"nano hadoop-2.7.3/etc/hadoop/hdfs-site.xml"
and enter the following configuration:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>

  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>
  • Hadoop Variables
Type "nano ~/.bashrc" and enter the following lines:
#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_COMMON_LIB_NATIVE_DIR"
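To confirm the exports took effect after running "source ~/.bashrc", you can inspect PATH. A quick sanity-check sketch, simulated below with the same variable layout as the entries above:

```shell
# Same layout as the ~/.bashrc entries above.
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

# Both Hadoop directories should now appear on PATH; save the matches
# to a file for inspection.
echo "$PATH" | tr ':' '\n' | grep hadoop | tee hadoop_path_entries.txt
```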
Type "nano hadoop-2.7.3/etc/hadoop/hadoop-env.sh"
and modify the line that specifies the used Java implementation as follows:
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
  • Hadoop Directory
Execute the following commands:
"sudo mv hadoop-2.7.3 /usr/local/hadoop"
"sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/{namenode,datanode}"
"sudo chown hduser:users -R /usr/local/hadoop"
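Note that the {namenode,datanode} part of the mkdir command is bash brace expansion, which creates both directories in one call. Demonstrated here under /tmp rather than /usr/local so it needs no root privileges:

```shell
# Brace expansion (a bash feature) turns the single argument into
# .../hdfs/namenode and .../hdfs/datanode; -p creates parent directories.
bash -c 'mkdir -p /tmp/hadoop_demo/hadoop_data/hdfs/{namenode,datanode}'

# Both directories now exist.
ls /tmp/hadoop_demo/hadoop_data/hdfs
```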
  • Passphraseless SSH
Execute the following commands on the guest machine to enable ssh login without a passphrase:
"ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa"
"cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys"
"chmod 0600 ~/.ssh/authorized_keys"
Execute the following command on the guest machine to add the localhost to the list of known hosts:
"ssh-keyscan -H localhost,0.0.0.0,YOUR_IP_ADDRESS >> ~/.ssh/known_hosts"
  • Format HDFS
Execute the following commands on the guest machine:
"source ~/.bashrc"
"hdfs namenode -format"
  • Start HDFS and YARN
Execute the following command on the guest machine:
"start-dfs.sh && start-yarn.sh"
  • Check out the Web UI for HDFS and YARN
Hadoop (NameNode):
YOUR.IP.ADDRESS:50070

YARN:
YOUR.IP.ADDRESS:8088

Note that the HDFS master itself listens on port 9000:
YOUR.IP.ADDRESS:9000
  • Test HDFS
Create a directory on HDFS:
"hadoop fs -mkdir /helloDir"

Copy a local file to the created directory:
"hadoop fs -copyFromLocal shared_folder/test_file.txt /helloDir/"

Print its content:
"hadoop fs -cat /helloDir/test_file.txt"
Test your setup from your host machine with Java code in an IDE.

Sample Code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class App {
  public static void main(String[] args) throws IOException {
    Logger.getRootLogger().setLevel(Level.OFF);

    Configuration config = new Configuration();
    config.set("fs.default.name", "hdfs://YOUR.IP.ADDRESS:9000/");

    Path inputFilePath = new Path("/helloDir/test_file.txt");
    FileSystem dfs = FileSystem.get(config);
    BufferedReader br = new BufferedReader(new InputStreamReader(dfs.open(inputFilePath)));
    String line;
    while ( (line = br.readLine()) != null) {
      System.out.println(line);
    }

    dfs.close();
  }
}
Use the following pom.xml to automatically download all of the requirements:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>edu.umkc.sce.csee.dbis.hadoop</groupId>
  <artifactId>Hadoop</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>Hadoop</name>
  <url>http://maven.apache.org</url>
  <properties>
   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
   <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>3.8.1</version>
    <scope>test</scope>
   </dependency>
   <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
   </dependency>
  </dependencies>
</project>

  • End
Remember to stop DFS and YARN before you turn off the guest machine or before modifying the configuration files:
"stop-yarn.sh && stop-dfs.sh"

Turn off the guest machine and exit when done:
"sudo shutdown && exit"

The machine will be turned off in 60 seconds.