How to install HDFS and Hadoop using CDH3

Cloudera's Distribution Including Apache Hadoop (CDH) provides streamlined installation of Apache Hadoop via Cloudera Manager. Besides Apache Hadoop, CDH also lets you install other components such as Hive, Pig, HBase and ZooKeeper in a modular fashion. The free edition of Cloudera Manager allows you to build and monitor a Hadoop cluster of up to 50 nodes.

If you would like to install and configure HDFS/Hadoop on a small scale, I strongly recommend CDH.

You can install Cloudera Manager on Red Hat-compatible systems as well as Ubuntu/Debian systems. However, Cloudera Manager supports only cluster nodes that run Red Hat or CentOS. Therefore you need to install Red Hat or CentOS on every cluster node in order to have them managed by Cloudera Manager.

In this example, I will show you how to install and configure HDFS and Hadoop using CDH3 (CDH version 3). I assume that there is one Cloudera Manager node, five cluster nodes, and (optionally) one client node which will access the Hadoop cluster.

First, disable SELinux and the iptables firewall on all cluster nodes, and reboot them. To disable SELinux, set SELINUX=disabled in /etc/sysconfig/selinux:

$ sudo vi /etc/sysconfig/selinux
$ sudo chkconfig iptables off
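If you have several nodes to prepare, editing the file by hand on each one gets tedious. The following is a minimal sketch of a non-interactive alternative. The helper name set_selinux_disabled is my own and not part of any package, and it assumes the config file has a single active SELINUX= line.

```shell
#!/bin/sh
# Sketch: set SELINUX=disabled in an SELinux config file without an editor.
# set_selinux_disabled is a hypothetical helper name.
set_selinux_disabled() {
    conf="$1"
    tmp="$(mktemp)" || return 1
    # Rewrite only the active SELINUX= line; comment lines and
    # SELINUXTYPE= do not match this pattern and are left alone.
    sed 's/^SELINUX=.*/SELINUX=disabled/' "$conf" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back through the original path, so that a symlinked
    # config file stays a symlink.
    cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage on a cluster node (as root):
#   set_selinux_disabled /etc/sysconfig/selinux
#   chkconfig iptables off
#   reboot
```

The function takes the file path as an argument, so you can try it on a copy of the file first before touching the real one.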

Make sure that every cluster node, as well as the Cloudera Manager node, has a fully qualified domain name (FQDN) set in /etc/sysconfig/network and /etc/hosts. I recommend that the /etc/hosts file of every cluster node, as well as that of the Cloudera Manager node, include the FQDNs of all nodes. Otherwise, you may not be able to add cluster nodes to Cloudera Manager.

$ sudo vi /etc/hosts
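To illustrate, here is a sketch of what the entries could look like, together with a small check that every expected name is present. All IP addresses and host names are made up for the example, and missing_hosts is a hypothetical helper, not a standard tool.

```shell
#!/bin/sh
# Example /etc/hosts layout (addresses and names below are made up):
#   192.168.1.10   cm.example.com      cm
#   192.168.1.11   node1.example.com   node1
#   192.168.1.12   node2.example.com   node2

# missing_hosts FILE NAME...
# Prints every NAME that has no entry in FILE (a hosts-format file).
missing_hosts() {
    hosts_file="$1"
    shift
    for name in "$@"; do
        awk -v name="$name" '
            { sub(/#.*/, "") }   # drop comments before looking at fields
            { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
            END { exit !found }
        ' "$hosts_file" || echo "$name"
    done
}

# Usage:
#   missing_hosts /etc/hosts cm.example.com node1.example.com node2.example.com
```

If the function prints nothing, every name you passed has an entry in the file.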

Make sure to mount the partition used for data storage on each cluster node with the "noatime" option. With noatime, reading a file no longer updates the access time (atime) recorded for that file, which saves a disk write on every read. For example, /etc/fstab on each cluster node can have:

/dev/sdb1  [mount point of data partition]  ext4  noatime 1 1
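To confirm that the option is in place, you can look at the options field for the mount point. Below is a minimal sketch. has_noatime is a hypothetical helper and /data is an assumed mount point, so substitute your own. It relies on the fact that /etc/fstab and /proc/mounts both keep the mount options in the fourth field.

```shell
#!/bin/sh
# has_noatime TABLE MOUNT_POINT
# Succeeds if MOUNT_POINT is listed in TABLE (fstab or /proc/mounts format)
# with "noatime" among its comma-separated mount options.
has_noatime() {
    table="$1"
    mount_point="$2"
    awk -v mp="$mount_point" '
        $1 !~ /^#/ && $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "noatime") ok = 1
        }
        END { exit !ok }
    ' "$table"
}

# Usage (the mount point /data is an assumption):
#   has_noatime /etc/fstab /data   && echo "fstab entry is fine"
#   has_noatime /proc/mounts /data && echo "currently mounted with noatime"
```

Checking /proc/mounts in addition to /etc/fstab tells you whether the option is active right now, not just configured for the next mount.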

Make sure that every cluster node is accessible via ssh as root, and that all nodes share the same root password.

Next, download and run the Cloudera Manager installer on the Cloudera Manager node:

$ wget
$ chmod u+x cloudera-manager-installer.bin
$ sudo ./cloudera-manager-installer.bin

Now, open the Cloudera Manager web interface in your browser. The default login/password is admin/admin.

Add all cluster nodes, and then install and start HDFS/Hadoop on them through the Cloudera Manager interface. Once HDFS/Hadoop has been started by Cloudera Manager, the HDFS storage cluster will have a /tmp folder created by default.

Generate client configurations through the Cloudera Manager interface, and download the generated zip file.

On the client node (which will read/write files hosted in HDFS, and initiate Hadoop jobs), do the following.

Put the FQDNs of all cluster nodes in /etc/hosts.

Upload the downloaded client configuration zip file to the client node, and unzip it. This creates a hadoop-conf directory containing the HDFS/Hadoop configuration files.

Set the environment variable that points to the Hadoop configuration directory.

$ export HADOOP_CONF_DIR=[location of hadoop-conf directory]

Install Hadoop on the client node.

Finally, test if you can access HDFS from the client node as follows.

$ hadoop dfs -ls /tmp

If the above command shows the content of the local /tmp directory of the client node, instead of the /tmp directory created inside the storage cluster, the client is not picking up the cluster configuration: without a valid configuration, Hadoop falls back to the local file system. Double check that HADOOP_CONF_DIR is set correctly, and that the configuration files are sane. If the command successfully shows the /tmp directory created inside the storage cluster, you are ready to start a Hadoop job from the client node.
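When double checking the configuration, a quick first step is to see which file system the client is told to use. The following is a rough sketch, assuming that the unzipped hadoop-conf directory contains a core-site.xml which sets the fs.default.name property (the name used by the Hadoop 0.20 line that CDH3 is based on). check_hadoop_conf is a hypothetical helper of my own.

```shell
#!/bin/sh
# check_hadoop_conf [DIR]
# Prints the fs.default.name value found in DIR/core-site.xml and succeeds
# only if it is an hdfs:// URI. DIR defaults to $HADOOP_CONF_DIR.
check_hadoop_conf() {
    dir="${1:-$HADOOP_CONF_DIR}"
    [ -n "$dir" ] || { echo "HADOOP_CONF_DIR is not set" >&2; return 1; }
    [ -f "$dir/core-site.xml" ] || { echo "no core-site.xml in $dir" >&2; return 1; }
    # Join the file into one line, then pull out the <value> that
    # directly follows <name>fs.default.name</name>.
    uri="$(tr -d '\n' < "$dir/core-site.xml" |
        sed -n 's/.*<name>fs\.default\.name<\/name>[[:space:]]*<value>\([^<]*\)<\/value>.*/\1/p')"
    case "$uri" in
        hdfs://*) echo "$uri" ;;
        *) echo "fs.default.name is '$uri', not an hdfs:// URI" >&2; return 1 ;;
    esac
}

# Usage:
#   check_hadoop_conf            # uses $HADOOP_CONF_DIR
#   check_hadoop_conf ~/hadoop-conf
```

This is plain text matching rather than real XML parsing, so treat a failure as a hint to open the file and look, not as proof that the configuration is broken.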
