Cloudera's Distribution (CDH) provides streamlined installation of Apache Hadoop via Cloudera Manager. Besides Apache Hadoop, CDH also allows installation of other components such as Hive, Pig, HBase and ZooKeeper in a modular fashion. The free edition of Cloudera Manager allows you to build and monitor a Hadoop cluster consisting of up to 50 nodes.
If you would like to install and configure HDFS/Hadoop on a small scale, I strongly recommend CDH.
You can install Cloudera Manager on Red Hat-compatible systems as well as Ubuntu/Debian systems. However, Cloudera Manager can only support cluster nodes which are based on Red Hat/CentOS. Therefore you need to install Red Hat or CentOS on every cluster node so that Cloudera Manager can manage them.
In this example, I will show you how to install and configure HDFS and Hadoop using CDH3 (CDH version 3). I assume that there is one Cloudera Manager node, five cluster nodes, and (optionally) one client node (which will access the Hadoop cluster).
First, disable SELinux on all cluster nodes, and reboot them:
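One way to do this, as a minimal sketch: set SELINUX=disabled in the SELinux configuration file and then reboot. The helper name disable_selinux is made up for illustration, and the default path /etc/selinux/config is assumed to be where your Red Hat/CentOS nodes keep this setting.

```shell
# Hypothetical helper: set SELINUX=disabled in an SELinux config file.
# The default path is an assumption; pass another path as the first argument.
disable_selinux() {
    config_file="${1:-/etc/selinux/config}"
    # Rewrite only the SELINUX= line; SELINUXTYPE= and comments are left alone.
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$config_file"
}

# On each cluster node, as root:
#   disable_selinux && reboot
```

The function only edits the file, so the reboot stays an explicit step that you run yourself on each node.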
Make sure that every cluster node, as well as the Cloudera Manager node, has a fully qualified domain name (FQDN) set in /etc/sysconfig/network and /etc/hosts. I recommend that the /etc/hosts file on every cluster node, as well as on the Cloudera Manager node, include the FQDNs of all nodes as follows. Otherwise, you may not be able to add cluster nodes to Cloudera Manager.
192.168.212.10 manager.mydomain.com
192.168.212.11 node0.mydomain.com
192.168.212.12 node1.mydomain.com
192.168.212.13 node2.mydomain.com
192.168.212.14 node3.mydomain.com
192.168.212.15 node4.mydomain.com
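To confirm that a hosts file really lists every node before you add the nodes to Cloudera Manager, a small check like the following may help. It is only a sketch: missing_hosts is a hypothetical helper name, and it simply looks for each FQDN as a whole field on a non-comment line.

```shell
# Hypothetical helper: print every FQDN that is NOT listed in a hosts file.
# Usage: missing_hosts <hosts_file> <fqdn>...
missing_hosts() {
    hosts_file="$1"
    shift
    for fqdn in "$@"; do
        # Skip comment lines; compare each name field (2nd onward) exactly.
        awk -v h="$fqdn" '
            $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit !found }
        ' "$hosts_file" || echo "$fqdn"
    done
}

# Example, run on each node (prints nothing when all entries are present):
#   missing_hosts /etc/hosts manager.mydomain.com node0.mydomain.com \
#       node1.mydomain.com node2.mydomain.com node3.mydomain.com node4.mydomain.com
```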
Make sure to mount the partition used for data storage on each cluster node with the "noatime" option. With noatime, read access to a file no longer results in an update to the atime information associated with the file. For example, /etc/fstab on each cluster node can have:
/dev/sdb1 [mount point] ext4 noatime 1 1
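To verify that an fstab entry actually carries the noatime option, something along these lines could be used. This is a sketch under assumptions: has_noatime is a hypothetical helper, and /data stands in for whatever mount point you use for HDFS data.

```shell
# Hypothetical helper: succeed if the fstab entry for a mount point
# includes "noatime" among its comma-separated options (4th field).
# Usage: has_noatime <fstab_file> <mount_point>
has_noatime() {
    fstab_file="$1"
    mount_point="$2"
    awk -v mp="$mount_point" '
        $1 !~ /^#/ && $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "noatime") ok = 1
        }
        END { exit !ok }
    ' "$fstab_file"
}

# Example (the /data mount point is an assumption):
#   has_noatime /etc/fstab /data || echo "noatime is not set for /data"
```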
Make sure that every cluster node is accessible via ssh as root, using the same root password on all nodes.
Next, install CDH3 on Cloudera Manager node:
Now, go to http://manager.mydomain.com:7180/ in your browser to access the Cloudera Manager interface. The default login/password for Cloudera Manager is admin/admin.
Add all cluster nodes, and then install and start HDFS/Hadoop on them through the Cloudera Manager interface. Once HDFS/Hadoop has been started by Cloudera Manager, the HDFS storage cluster will have a /tmp folder created by default.
Generate client configurations through the Cloudera Manager interface, and download the generated global-clientconfig.zip.
On the client node (which will read/write files hosted in HDFS, and initiate Hadoop jobs), do the following.
Put the FQDNs of all cluster nodes in /etc/hosts.
Upload global-clientconfig.zip to the client node, and unzip it. This creates a hadoop-conf directory containing the HDFS/Hadoop configuration files.
Set the environment variable for the Hadoop configuration directory.
$ export HADOOP_CONF_DIR=[location of hadoop-conf directory]
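Before going further, it may be worth checking that the directory you exported looks like a Hadoop configuration directory. The sketch below assumes that the unzipped hadoop-conf directory contains the standard core-site.xml and hdfs-site.xml files; check_hadoop_conf is a hypothetical helper name.

```shell
# Hypothetical helper: verify that a Hadoop configuration directory exists
# and contains the files a client is assumed to need.
# Usage: check_hadoop_conf [conf_dir]   (defaults to $HADOOP_CONF_DIR)
check_hadoop_conf() {
    conf_dir="${1:-${HADOOP_CONF_DIR:-}}"
    if [ ! -d "$conf_dir" ]; then
        echo "not a directory: $conf_dir" >&2
        return 1
    fi
    for f in core-site.xml hdfs-site.xml; do
        if [ ! -f "$conf_dir/$f" ]; then
            echo "missing: $conf_dir/$f" >&2
            return 1
        fi
    done
}

# Example:
#   check_hadoop_conf && echo "HADOOP_CONF_DIR looks usable"
```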
Install Hadoop on the client node.
Finally, test if you can access HDFS from the client node as follows.
If the above command shows the content of the local /tmp directory of the client node instead of the /tmp directory created inside the storage cluster, something is wrong. Double check that HADOOP_CONF_DIR is set correctly and that the configuration files are sane. If the command successfully shows the /tmp directory created inside the storage cluster, you are ready to start a Hadoop job from the client node.
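A listing of the local /tmp usually means the client fell back to the local filesystem, so one quick sanity check is to look at the default filesystem in core-site.xml. The following is a rough sketch, not a real XML parser: default_fs is a hypothetical helper, and it assumes the fs.default.name property (the name used by the Hadoop 0.20 line that CDH3 is based on) is written as a plain name/value pair.

```shell
# Hypothetical helper: print the value of fs.default.name from core-site.xml.
# Usage: default_fs <path/to/core-site.xml>
default_fs() {
    conf_file="$1"
    # Join the file into one line so <name> and <value> can be matched together.
    tr -d '\n' < "$conf_file" |
        sed -n 's|.*<name>fs\.default\.name</name>[[:space:]]*<value>\([^<]*\)</value>.*|\1|p'
}

# Example: the value should start with hdfs:// rather than file:///
#   default_fs "$HADOOP_CONF_DIR/core-site.xml"
#   hadoop fs -ls /tmp
```

If the printed value is empty or starts with file://, the client is probably not reading the configuration generated by Cloudera Manager.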