Running Cloudera in Distributed Mode

This section explains how to install the Cloudera Distribution for Hadoop (CDH3) on Ubuntu and other Debian-based systems. It is a quick-start tutorial: it lists every command needed to run Cloudera in distributed mode (a multi-node cluster), with a short description of each.

Prerequisite: Before running Cloudera in distributed mode, you must set up Cloudera in pseudo-distributed mode. You also need at least two machines, one for the master and one for the slave (they can be virtual machines running on a single physical machine).


Deploy Cloudera (CDH3) on Cluster:
COMMAND DESCRIPTION
for x in /etc/init.d/hadoop-* ; do sudo $x stop ; done
Stop all Hadoop daemons on every node before switching to distributed mode
update-alternatives --display hadoop-0.20-conf
List the alternative Hadoop configurations on your system
sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.cluster
Copy the default configuration to your custom directory
sudo update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster 50
Activate the new configuration on your system (50 is its priority)
update-alternatives --display hadoop-0.20-conf
Check that the new configuration is active
sudo update-alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster
Alternatively, select the configuration manually
sudo vi /etc/hosts
On every node, add one line per machine in the form "IP-address hostname", for example:
192.168.0.1 master
192.168.0.2 slave
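The /etc/hosts entries above can also be added from a script instead of an editor. The following is a minimal sketch, not part of CDH: add_host_entry is a hypothetical helper that appends an "IP hostname" line only when the hostname is not listed yet, so it is safe to run more than once. It is demonstrated on a scratch file; the IP addresses are the example values used above.

```shell
#!/usr/bin/env bash
# add_host_entry FILE IP NAME
# Hypothetical helper: append "IP NAME" to FILE unless NAME is already listed.
add_host_entry() {
    local file="$1" ip="$2" name="$3"
    # Look for NAME as a whole field on any line, ignoring text after a '#'.
    if grep -Eq "^[^#]*[[:space:]]${name}([[:space:]]|\$)" "$file"; then
        echo "$name already present in $file"
    else
        printf '%s %s\n' "$ip" "$name" >> "$file"
    fi
}

# Demonstration on a scratch file (assumption: example addresses from this tutorial).
HOSTS_FILE="$(mktemp)"
printf '127.0.0.1 localhost\n' > "$HOSTS_FILE"
add_host_entry "$HOSTS_FILE" 192.168.0.1 master
add_host_entry "$HOSTS_FILE" 192.168.0.2 slave
add_host_entry "$HOSTS_FILE" 192.168.0.2 slave   # second call changes nothing
cat "$HOSTS_FILE"
```

To change the real file, the same function would have to run as root with /etc/hosts as its first argument.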
sudo apt-get install openssh-server openssh-client
Install SSH
ssh-keygen -t rsa -P ""
Generate an RSA key with an empty passphrase for passwordless SSH
ssh-copy-id -i $HOME/.ssh/id_rsa.pub slave
Copy the public key to the slave to enable passwordless SSH from the master
Now go to your custom directory (conf.cluster) and edit the configuration files
vi masters
Erase the old contents and type: master
The masters file lists the hosts on which the secondary namenode runs in our multi-node cluster
vi slaves
Erase the old contents and type: slave
The slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) run
vi core-site.xml
Add this property inside the <configuration> element:
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>
Edit core-site.xml so that the default file system points to the namenode on master
vi mapred-site.xml
Add this property inside the <configuration> element:
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>
Edit mapred-site.xml to set the host and port of the jobtracker
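vi hdfs-site.xml
Add this property inside the <configuration> element:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
Edit hdfs-site.xml to set how many copies of each block are kept; the value should not exceed the number of slaves (datanodes), which is 1 here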
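For reference, the five files edited above can be written in one go. This is only a sketch under stated assumptions: write_site_xml is a hypothetical helper, the host names and ports are the example values from this tutorial, and CONF_DIR defaults to a scratch directory so that nothing under /etc is touched. Note that it overwrites each file with a single-property configuration, which is only appropriate for a fresh copy of conf.empty.

```shell
#!/usr/bin/env bash
# CONF_DIR is an assumption: a scratch directory unless set by the caller.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
mkdir -p "$CONF_DIR"

# write_site_xml FILE NAME VALUE
# Hypothetical helper: write a complete site file containing one property.
write_site_xml() {
    cat > "$CONF_DIR/$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

write_site_xml core-site.xml   fs.default.name    hdfs://master:54310
write_site_xml mapred-site.xml mapred.job.tracker master:54311
write_site_xml hdfs-site.xml   dfs.replication    1

# One host name per line, as described for the masters and slaves files.
echo master > "$CONF_DIR/masters"
echo slave  > "$CONF_DIR/slaves"

echo "Configuration written to $CONF_DIR"
```

Running it as root with CONF_DIR=/etc/hadoop-0.20/conf.cluster would write the real configuration directory instead of the scratch one.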
Now copy the /etc/hadoop-0.20/conf.cluster directory to all nodes in your cluster
sudo update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster 50
Run this on all nodes to set the alternatives rule that activates your configuration
for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
for x in /etc/init.d/hadoop-* ; do sudo $x stop ; done
On all nodes in your cluster, start the daemons with the service scripts so that the new configuration files are read, then stop them again
sudo su -s /bin/bash - hdfs -c 'hadoop namenode -format'
Format the namenode manually on the master, before starting the namenode for the first time
You must run each of the following commands on the correct server, according to its role
sudo /etc/init.d/hadoop-0.20-namenode start
sudo /etc/init.d/hadoop-0.20-secondarynamenode start
sudo /etc/init.d/hadoop-0.20-jobtracker start
On the master: start the namenode, secondary namenode and jobtracker daemons
sudo /etc/init.d/hadoop-0.20-datanode start
sudo /etc/init.d/hadoop-0.20-tasktracker start
On the slave: start the datanode and tasktracker daemons
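The role-to-daemon mapping above can be captured in a small script, so that the same file is copied to every node and called with the node's role. This is a sketch only: daemons_for_role and start_role are hypothetical helpers, and by default the script just prints the commands; it runs them only when RUN=1 is set.

```shell
#!/usr/bin/env bash
# daemons_for_role ROLE
# Hypothetical helper: print the daemons that belong to a role (master or slave).
daemons_for_role() {
    case "$1" in
        master) echo "namenode secondarynamenode jobtracker" ;;
        slave)  echo "datanode tasktracker" ;;
        *)      echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# start_role ROLE
# Print the start command for each daemon of the role; execute it when RUN=1.
start_role() {
    local daemons d
    daemons="$(daemons_for_role "$1")" || return 1
    for d in $daemons; do
        if [ "${RUN:-0}" = "1" ]; then
            sudo "/etc/init.d/hadoop-0.20-$d" start
        else
            echo "sudo /etc/init.d/hadoop-0.20-$d start"
        fi
    done
}

# Dry run: show what would be started for each role.
start_role master
start_role slave
```

On a real node one would call only the matching role, for example RUN=1 start_role slave on the slave.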
Congratulations, the Cloudera CDH setup is complete.

Running Cloudera in Pseudo Distributed Mode

This section explains how to install the Cloudera Distribution for Hadoop (CDH3) on Ubuntu and other Debian-based systems. It is a quick-start tutorial: it lists every command needed to run Cloudera in pseudo-distributed mode (a single-node cluster), with a short description of each.


Deploy Cloudera (CDH3) in Pseudo Distributed mode:
COMMAND DESCRIPTION
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
Run this command if you are using Ubuntu 10.04 LTS
sudo apt-get install sun-java6-jdk
Install Java (run sudo apt-get update first so that the newly added repository is indexed)
lsb_release -c
Show the codename of your distribution (called DISTRO below), e.g. hardy or jaunty
sudo vi /etc/apt/sources.list.d/cloudera.list
Add these two lines, replacing DISTRO with the codename of your distribution:
deb http://archive.cloudera.com/debian DISTRO-cdh3 contrib
deb-src http://archive.cloudera.com/debian DISTRO-cdh3 contrib
This repository enables your package manager to install the Cloudera packages
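The DISTRO substitution can be done by a script rather than by hand. A minimal sketch, assuming the repository lines given above: cloudera_list_entries is a hypothetical helper that prints both lines for a given codename. The example uses lucid, the codename of Ubuntu 10.04.

```shell
#!/usr/bin/env bash
# cloudera_list_entries CODENAME
# Hypothetical helper: print the two APT source lines for the given codename.
cloudera_list_entries() {
    local distro="$1"
    printf 'deb http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$distro"
    printf 'deb-src http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$distro"
}

# Example for Ubuntu 10.04 (codename lucid).
cloudera_list_entries lucid

# On a real system the codename could be taken from lsb_release and the
# result written as root, for example:
#   cloudera_list_entries "$(lsb_release -cs)" | sudo tee /etc/apt/sources.list.d/cloudera.list
```

The -s option of lsb_release prints only the codename, without the "Codename:" label, which is why it is used in the commented example.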
sudo apt-get -y install curl
Install curl
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
Add the Cloudera public GPG key to your APT keyring
sudo apt-get update
Update the APT package index
sudo apt-get -y install hadoop-0.20-conf-pseudo
Install Hadoop in pseudo-distributed mode. A pseudo-distributed Hadoop installation consists of one node running all five Hadoop daemons: namenode, jobtracker, secondarynamenode, datanode and tasktracker
for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
Start the Cloudera daemons
dpkg -L hadoop-0.20-conf-pseudo
List the files installed by the package
jps
List the running Java processes; the output should look like this (the process IDs will differ):
14799 NameNode
14977 SecondaryNameNode
15183 DataNode
15596 JobTracker
15897 TaskTracker
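Checking the jps listing by eye is easy to get wrong, so the comparison against the five expected daemons can be scripted. A minimal sketch: missing_daemons is a hypothetical helper that reads jps-style output ("PID Name" per line) on standard input and prints every expected daemon that is absent. The sample input below is made up for the demonstration.

```shell
#!/usr/bin/env bash
# missing_daemons
# Hypothetical helper: read "PID Name" lines on stdin and print each of the
# five Hadoop daemons that does not appear in the Name column.
missing_daemons() {
    local names name
    names="$(awk '{print $2}')"
    for name in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$names" | grep -qx "$name"; then
            echo "$name"
        fi
    done
}

# Demonstration with made-up output in which only two daemons are running.
printf '14799 NameNode\n15183 DataNode\n' | missing_daemons

# On a real node: jps | missing_daemons
```

An empty result means all five daemons are running. If jps lists nothing, it may need to be run with sudo, since the daemons usually run under other user accounts.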
Congratulations, the Cloudera setup is complete. Now let's run some examples.
hadoop jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar pi 10 100
Run the pi example
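hadoop fs -mkdir input
hadoop fs -put /etc/hadoop-0.20/conf/*.xml input
hadoop-0.20 fs -ls input
hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
Run the grep example: copy the configuration files into HDFS, then count the strings that match the regular expression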
hadoop-0.20 fs -mkdir inputwords
hadoop-0.20 fs -put /etc/hadoop-0.20/conf/*.xml inputwords
hadoop-0.20 fs -ls inputwords
hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar wordcount inputwords outputwords
Run the word count example
http://localhost:50070/
Web interface for the namenode
http://localhost:50030/
Web interface for the jobtracker
for x in /etc/init.d/hadoop-* ; do sudo $x stop ; done
Shut down the CDH3 Hadoop services

Running Cloudera in Standalone Mode

This section explains how to install the Cloudera Distribution for Hadoop (CDH3) on Ubuntu and other Debian-based systems. It is a quick-start tutorial: it lists every command needed to run Cloudera in standalone mode (a single machine, with no Hadoop daemons running), with a short description of each.


Deploy Cloudera (CDH3) in Standalone mode:
COMMAND DESCRIPTION
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
Run this command if you are using Ubuntu 10.04 LTS
sudo apt-get install sun-java6-jdk
Install Java (run sudo apt-get update first so that the newly added repository is indexed)
lsb_release -c
Show the codename of your distribution (called DISTRO below), e.g. hardy, jaunty or lucid
sudo vi /etc/apt/sources.list.d/cloudera.list
Add these two lines, replacing DISTRO with the codename of your distribution:
deb http://archive.cloudera.com/debian DISTRO-cdh3 contrib
deb-src http://archive.cloudera.com/debian DISTRO-cdh3 contrib
This repository enables your package manager to install the Cloudera packages
sudo apt-get -y install curl
Install curl
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
Add the Cloudera public GPG key to your APT keyring
sudo apt-get update
Update the APT package index
apt-cache search hadoop
List the available Hadoop packages
sudo apt-get -y install hadoop-0.20
Install Hadoop
dpkg -L hadoop-0.20
List the installed files
man hier
Read about the standard file system hierarchy, to see that the Hadoop package has been laid out accordingly
Congratulations, the Cloudera setup is complete. Now let's run some examples.
hadoop jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar pi 10 100
Run the pi example
cd /tmp
mkdir input
cp /etc/hadoop-0.20/conf/*.xml input
hadoop jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
cat output/*
Run the grep example on local files and print the result
cd /tmp
mkdir inputwords
cp /etc/hadoop-0.20/conf/*.xml inputwords
hadoop jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar wordcount inputwords outputwords
Run the word count example on local files
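To get a feel for what the grep and word count examples compute, the same results can be approximated on a small local file with standard Unix tools. This is only an illustration, not how Hadoop implements the jobs: local_grep_count and local_wordcount are hypothetical helpers, and the sample file is made up. The first prints "count, tab, match" with the most frequent match first; the second prints "word, tab, count" for whitespace-separated words.

```shell
#!/usr/bin/env bash
# Use the C locale so that sorting is byte-wise and reproducible.
export LC_ALL=C

# local_grep_count REGEX FILE...
# Hypothetical helper: count each distinct string matching REGEX.
local_grep_count() {
    local regex="$1"; shift
    grep -ohE "$regex" "$@" | sort | uniq -c | sort -k1,1nr -k2 \
        | awk '{print $1 "\t" $2}'
}

# local_wordcount FILE...
# Hypothetical helper: count each whitespace-separated word.
local_wordcount() {
    cat "$@" | tr -s '[:space:]' '\n' | grep -v '^$' | sort | uniq -c \
        | awk '{print $2 "\t" $1}'
}

# Demonstration on a made-up sample file.
sample="$(mktemp)"
printf 'dfs.replication dfs.name.dir\ndfs.replication\n' > "$sample"
local_grep_count 'dfs[a-z.]+' "$sample"
local_wordcount "$sample"
```

Comparing this output with cat output/* and cat outputwords/* for the same input files is a quick sanity check of the Hadoop runs.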