Cloudera Distribution for Hadoop (CDH): November 2010

Running Cloudera in Distributed Mode

This section explains how to install Cloudera Distribution for Hadoop (CDH3) on Ubuntu. It is a quick-start tutorial for setting up CDH3 on Debian-based systems: it lists every command, with a short description, needed to install Cloudera in distributed mode (a multi-node cluster).

Prerequisite: Before running Cloudera in distributed mode, you must first set up Cloudera in pseudo-distributed mode. You also need at least two machines, one for the master and another for the slave (you can create more than one virtual machine on a single physical machine).

Deploy Cloudera (CDH3) on Cluster:

COMMAND	DESCRIPTION
for x in /etc/init.d/hadoop-* ; do sudo $x stop ; done	Before starting Cloudera in distributed mode, stop all running Hadoop services
update-alternatives --display hadoop-0.20-conf	List the alternative Hadoop configurations on your system
cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.cluster	Copy the default configuration to your custom directory
update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster 50	Activate the new configuration on your system
update-alternatives --display hadoop-0.20-conf	Check that the new configuration is active
update-alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster	Alternatively, set the configuration manually
vi /etc/hosts	Add one line per node in the form "IP-address hostname", e.g. 192.168.0.1 master and 192.168.0.2 slave
sudo apt-get install openssh-server openssh-client	Install SSH
ssh-keygen -t rsa -P ""	Generate an RSA key for passwordless SSH
ssh-copy-id -i $HOME/.ssh/id_rsa.pub slave	Copy the public key to the slave to enable passwordless SSH
Now go to your custom directory (conf.cluster) and edit the configuration files
vi masters then erase old contents and type master	The masters file lists the host that runs the secondary namenode in our multi-node cluster
vi slaves then erase old contents and type slave	The slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run
vi core-site.xml then type: <property> <name>fs.default.name</name> <value>hdfs://master:54310</value> </property>	Edit configuration file core-site.xml (the property goes inside the <configuration> element)
vi mapred-site.xml then type: <property> <name>mapred.job.tracker</name> <value>master:54311</value> </property>	Edit configuration file mapred-site.xml (the property goes inside the <configuration> element)
vi hdfs-site.xml then type: <property> <name>dfs.replication</name> <value>1</value> </property>	Edit configuration file hdfs-site.xml (the replication value should not exceed the number of slaves; here there is one slave)
Now copy the /etc/hadoop-0.20/conf.cluster directory to all nodes in your cluster
update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster 50	Set alternative rules on all nodes to activate your configuration
for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done ; for x in /etc/init.d/hadoop-* ; do sudo $x stop ; done	Start the daemons on all nodes using the service scripts so that the new configuration files are read, then stop them again
su -s /bin/bash - hdfs -c 'hadoop namenode -format'	Format the namenode manually (before starting the namenode)
Run the following commands on the correct server, according to each node's role
/etc/init.d/hadoop-0.20-namenode start ; /etc/init.d/hadoop-0.20-secondarynamenode start ; /etc/init.d/hadoop-0.20-jobtracker start	Start the namenode, secondary namenode, and jobtracker daemons on the master
/etc/init.d/hadoop-0.20-datanode start ; /etc/init.d/hadoop-0.20-tasktracker start	Start the datanode and tasktracker daemons on the slave
Congratulations, the Cloudera CDH setup is complete.
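
The configuration edits in the table above can be sketched as a small script. This is only a minimal sketch, not part of CDH: it writes the files into a scratch directory named by CONF_DIR (a hypothetical variable) instead of /etc/hadoop-0.20/conf.cluster, it assumes the host names master and slave and the ports used above, and it wraps each property in the <configuration> root element that Hadoop site files use.

```shell
#!/bin/sh
# Sketch: generate the cluster configuration files from the table above.
# CONF_DIR is a hypothetical scratch location, not a CDH setting.
CONF_DIR="${CONF_DIR:-./conf.cluster.sketch}"
mkdir -p "$CONF_DIR"

# write_site FILE NAME VALUE
# Writes a Hadoop site file containing a single property.
write_site() {
    cat > "$CONF_DIR/$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Values taken from the table; "master" must resolve via /etc/hosts.
write_site core-site.xml   fs.default.name    hdfs://master:54310
write_site mapred-site.xml mapred.job.tracker master:54311
write_site hdfs-site.xml   dfs.replication    1

# One host name per line, as in the masters and slaves rows.
echo master > "$CONF_DIR/masters"
echo slave  > "$CONF_DIR/slaves"
```

After reviewing the generated files, they could be copied into /etc/hadoop-0.20/conf.cluster on each node in place of the hand edits.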

Running Cloudera in Pseudo Distributed Mode

COMMAND	DESCRIPTION
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"	If you are using Ubuntu 10.04 LTS, run this command
sudo apt-get install sun-java6-jdk	Install Java
lsb_release -c	Print the codename of your distribution (referred to as DISTRO below), e.g. hardy or jaunty
vi /etc/apt/sources.list.d/cloudera.list then type two lines: (1) deb http://archive.cloudera.com/debian DISTRO-cdh3 contrib (2) deb-src http://archive.cloudera.com/debian DISTRO-cdh3 contrib	Add the Cloudera repository so that your package manager can install Cloudera; replace DISTRO with the name of your distribution
sudo apt-get -y install curl	Install curl
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -	Add the Cloudera public GPG key to your APT keyring
sudo apt-get update	Update the APT package index
sudo apt-get -y install hadoop-0.20-conf-pseudo	Install Hadoop in pseudo-distributed mode: A pseudo-distributed Hadoop installation is composed of one node running all five Hadoop daemons: namenode, jobtracker, secondarynamenode, datanode, and tasktracker
for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done	Start the Cloudera daemons
dpkg -L hadoop-0.20-conf-pseudo	View the files installed by the package on Debian systems
jps	It should give output like this: 14799 NameNode 14977 SecondaryNameNode 15183 DataNode 15596 JobTracker 15897 TaskTracker
Congratulations, the Cloudera setup is complete. Now let's run some examples
hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 10 100	Run the pi example
hadoop fs -mkdir input ; hadoop fs -put /etc/hadoop-0.20/conf/*.xml input ; hadoop-0.20 fs -ls input ; hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar grep input output 'dfs[a-z.]+'	Run the grep example
hadoop-0.20 fs -mkdir inputwords ; hadoop-0.20 fs -put /etc/hadoop-0.20/conf/*.xml inputwords ; hadoop-0.20 fs -ls inputwords ; hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar wordcount inputwords outputwords	Run the word count example
http://localhost:50070/	Web-based interface for the namenode
http://localhost:50030/	Web-based interface for the jobtracker
for x in /etc/init.d/hadoop-* ; do sudo $x stop ; done	Shut down the CDH3 Hadoop services
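
The jps check in the table above can be automated. The following is a minimal sketch, assuming jps prints one "PID Name" pair per line as in the sample output; missing_daemons is a hypothetical helper name, not a Hadoop or CDH command.

```shell
#!/bin/sh
# missing_daemons: read jps-style output on stdin and print the name of
# every expected Hadoop daemon that does not appear in it.
# Typical use on a running node would be:  jps | missing_daemons
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # Compare the second field exactly, so that "SecondaryNameNode"
        # is not mistaken for "NameNode".
        printf '%s\n' "$jps_output" |
            awk -v name="$daemon" '$2 == name { found = 1 } END { exit !found }' ||
            echo "$daemon"
    done
}
```

An empty result means all five daemons of the pseudo-distributed installation were listed; any printed name points at a daemon whose log is worth inspecting.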

Running Cloudera in Standalone Mode

COMMAND	DESCRIPTION
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"	If you are using Ubuntu 10.04 LTS, run this command
sudo apt-get install sun-java6-jdk	Install Java
lsb_release -c	Print the codename of your distribution (referred to as DISTRO below), e.g. hardy, jaunty, or lucid
vi /etc/apt/sources.list.d/cloudera.list then type two lines: (1) deb http://archive.cloudera.com/debian DISTRO-cdh3 contrib (2) deb-src http://archive.cloudera.com/debian DISTRO-cdh3 contrib	Add the Cloudera repository so that your package manager can install Cloudera; replace DISTRO with the name of your distribution
sudo apt-get -y install curl	Install curl
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -	Add the Cloudera public GPG key to your APT keyring
sudo apt-get update	Update the APT package index
apt-cache search hadoop	List Hadoop packages on Debian systems
apt-get -y install hadoop-0.20	Install Hadoop
dpkg -L hadoop-0.20	List the installed files
man hier	Read the description of the Linux filesystem hierarchy to see how the Hadoop package layout conforms to it
Congratulations, the Cloudera setup is complete. Now let's run some examples
hadoop jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar pi 10 100	Run the pi example
cd /tmp ; mkdir input ; cp /etc/hadoop/conf/*.xml input ; hadoop jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar grep input output 'dfs[a-z.]+' ; cat output/*	Run the grep example
cd /tmp ; mkdir inputwords ; cp /etc/hadoop/conf/*.xml inputwords ; hadoop jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar wordcount inputwords outputwords	Run the word count example
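
The repository step earlier in this table asks you to replace DISTRO by hand. As a minimal sketch of that substitution, the hypothetical helper cloudera_sources below (not a Cloudera tool) prints the two source lines for a given codename; lucid is used here only as an example value.

```shell
#!/bin/sh
# cloudera_sources CODENAME
# Prints the two APT source lines for the CDH3 repository, with DISTRO
# replaced by the given distribution codename.
cloudera_sources() {
    codename=$1
    printf 'deb http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$codename"
    printf 'deb-src http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$codename"
}

# Example with a fixed codename. On a real system one might pass
# "$(lsb_release -cs)" instead and write the result to
# /etc/apt/sources.list.d/cloudera.list with sudo tee.
cloudera_sources lucid
```

Printing the lines first makes it easy to check the codename before anything under /etc/apt is changed.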