Running Cloudera in Distributed Mode

This section contains instructions for installing Cloudera's Distribution for Hadoop (CDH3) on Ubuntu. It is a CDH quickstart tutorial for setting up CDH3 on Debian-based systems, and a short one: here you will find all the commands, with descriptions, required to install Cloudera in distributed mode (a multi-node cluster).

Prerequisite: Before starting Cloudera in distributed mode you must first set up Cloudera in pseudo-distributed mode, and you need at least two machines, one for the master and another for the slave (you can also create more than one virtual machine on a single physical machine to form the cluster).


Deploy Cloudera (CDH3) on a Cluster:
Each step below gives the command to run, followed by a description of what it does.

for x in /etc/init.d/hadoop-* ; do sudo $x stop ; done
    Before starting Cloudera in distributed mode, first stop the Hadoop daemons on every node.

update-alternatives --display hadoop-0.20-conf
    List the alternative Hadoop configurations on your system.

cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.cluster
    Copy the default configuration to your custom directory.

update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster 50
    Activate the new configuration on your system.

update-alternatives --display hadoop-0.20-conf
    Check the new configuration on your system.
or
update-alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster
    Manually set the configuration.

vi /etc/hosts
    Add one line per node, IP address followed by hostname, e.g.:
    192.168.0.1 master
    192.168.0.2 slave

sudo apt-get install openssh-server openssh-client
    Install the SSH server and client.

ssh-keygen -t rsa -P ""
    Generate an RSA key pair for passwordless SSH.

ssh-copy-id -i $HOME/.ssh/id_rsa.pub slave
    Copy the public key to the slave to set up passwordless SSH.
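If the key was copied correctly, you should now be able to log in to the slave without a password. A quick check (a sketch, assuming the slave hostname defined in /etc/hosts above):

ssh slave
    This should open a shell on the slave with no password prompt; type exit to return to the master.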
Now go to your custom configuration directory (/etc/hadoop-0.20/conf.cluster) and edit the configuration files there.

vi masters
    Erase the old contents and type: master
    The masters file defines the master node (namenode) of our multi-node cluster.

vi slaves
    Erase the old contents and type: slave
    The slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run.
vi core-site.xml
    Edit the configuration file core-site.xml and add:
    <property>
      <name>fs.default.name</name>
      <value>hdfs://master:54310</value>
    </property>

vi mapred-site.xml
    Edit the configuration file mapred-site.xml and add:
    <property>
      <name>mapred.job.tracker</name>
      <value>master:54311</value>
    </property>

vi hdfs-site.xml
    Edit the configuration file hdfs-site.xml and add:
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    (set the value to the number of slaves)
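Note that in each of these files the <property> blocks must sit inside the top-level <configuration> element. As a sketch, the complete core-site.xml in conf.cluster would then look roughly like this (hostname and port as used above):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
</configuration>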
Now copy the /etc/hadoop-0.20/conf.cluster directory to all nodes in your cluster.
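One way to do this (a sketch, assuming the passwordless SSH set up earlier and sudo rights on the slave) is to copy the directory to a temporary location on each node and then move it into place:

scp -r /etc/hadoop-0.20/conf.cluster slave:/tmp/
ssh -t slave "sudo cp -r /tmp/conf.cluster /etc/hadoop-0.20/"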
update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster 50
    Set the alternatives rules on all nodes to activate your configuration.

for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
for x in /etc/init.d/hadoop-* ; do sudo $x stop ; done
    Restart the daemons on all nodes in your cluster using the service scripts, so that the new configuration files are read, and then stop them again.

su -s /bin/bash - hdfs -c 'hadoop namenode -format'
    Format the namenode manually (before starting the namenode).

You must run the following commands on the correct server, according to each node's role:
/etc/init.d/hadoop-0.20-namenode start
/etc/init.d/hadoop-0.20-secondarynamenode start
/etc/init.d/hadoop-0.20-jobtracker start
    Start the daemons on the namenode (run these on the master).

/etc/init.d/hadoop-0.20-datanode start
/etc/init.d/hadoop-0.20-tasktracker start
    Start the daemons on the datanode (run these on each slave).
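As a quick sanity check (a sketch; exact output will vary), you can list the running Java daemons on each node with the JDK's jps tool and open the Hadoop web interfaces:

sudo jps
    On the master you would expect NameNode, SecondaryNameNode and JobTracker; on a slave, DataNode and TaskTracker.
http://master:50070
    Namenode (HDFS) web interface.
http://master:50030
    Jobtracker (MapReduce) web interface.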
Congratulations, your Cloudera CDH cluster setup is complete.

54 comments:

  1. Any Feedback and suggestions are invited

  2. Hi Rahul,

    I have configured my cluster with the above specifications, but I am not able to test it with any of the examples given in the pseudo-distributed tutorial.
    It is throwing some ACL errors and exceptions.
    Any idea?

    Thanks,
    Vishwesh

  3. Hi Vishwesh,
    did you enable ACLs?
    When ACLs are enabled on the jobtracker using the property mapred.acls.enabled, and a job is submitted to a queue name that does not exist in the mapred.queue.names property, the following exception is thrown:
    org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException

  4. Hi Rahul,

    Thanks for your response.
    But how do I disable ACLs on a CDH cluster?

  5. Hi Vishwesh,
    I think the problem you are getting is not a CDH issue.
    In CDH you can specify whether ACLs are enabled (and should be checked for various operations) with the property:

    <property>
      <name>mapred.acls.enabled</name>
      <value>false</value>
    </property>

    and on Ubuntu itself you can use the setfacl command

  6. I did all the settings and got some permission-denied problems when I start the jobtracker.

    I got the following error when I tried to start the jobtracker.

    2011-05-08 10:11:36,200 WARN org.apache.hadoop.mapred.JobTracker: Failed to operate on mapred.system.dir (hdfs://master:54310/tmp/hadoop-mapred/mapred/system) because of permissions.
    2011-05-08 10:11:36,200 WARN org.apache.hadoop.mapred.JobTracker: This directory should be owned by the user 'mapred'
    2011-05-08 10:11:36,201 WARN org.apache.hadoop.mapred.JobTracker: Bailing out ...
    org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=mapred, access=WRITE, inode="":hdfs:supergroup:rwxr-xr-x

    I think I need to set up the following properties in mapred-site.xml.
    Can you give us an example of the following?
    mapred.local.dir
    Determines where temporary MapReduce data is written. It also may be a list of directories.
    mapred.map.tasks
    As a rule of thumb, use 10x the number of slaves (i.e., number of tasktrackers).
    mapred.reduce.tasks
    As a rule of thumb, use 2x the number of slave processors (i.e., number of tasktrackers).
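    (For illustration only, a mapred-site.xml fragment with these three properties might look like the snippet below; the local directory path and the task counts are just placeholders for a one-slave cluster:)

    <property>
      <name>mapred.local.dir</name>
      <value>/var/lib/hadoop-0.20/cache/mapred/local</value>
    </property>
    <property>
      <name>mapred.map.tasks</name>
      <value>10</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>2</value>
    </property>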

  7. Thanks for posting it.
    Why don't you use the master machine as a datanode and tasktracker too? Of course, that is only reasonable if you have fewer than roughly 10 machines in the cluster.

  8. You can use the master as a slave by starting the datanode and tasktracker daemons on the master and putting an entry for it in the slaves file.

    Whether to add the master as a slave depends on your requirements.
    If you have a small cluster you can do this, but if the cluster is large you should not add the master as a slave.

  9. Sorry, but could you be more specific about which configuration steps must be done on the master only, the slave only, or on both master and slave?
    Sorry, I'm still new to Hadoop :D

  10. Your blog has helped me a lot. Thank you very much.

    One addition:
    1. you need to copy the hosts file to your slave if your network does not identify the machines "master" and "slave" by name.

    1. Also, the dfs.replication property means how many copies you want to have of your data. If you have 10 GB of data to upload to DFS, then you would need 30 GB of DFS space in total across all your datanodes.

    2. You can also use the IP address in place of "master" and "slave".

    3. yes 30 GB of dfs space is required for 10 GB of data, if replication factor is 3

  11. Hi, I am new to Hadoop. When I installed Cloudera Manager on Ubuntu 10.04, I got an error:

    1st step--chmod a+x cloudera-manager-installer.bin

    2nd step--sudo ./cloudera-manager-installer.bin

    Then I got an error like this:

    ./cloudera-manager-installer.bin:1.Syntax error ")" unexpected

    Please give me your suggestion.

    Thanking you

    1. Hi,
      Cloudera Manager does not work on Ubuntu;
      it can work on CentOS.

      But if you want to install Hadoop on Ubuntu, you can install it without Cloudera Manager by following the steps above.

    2. Thanks Rahul,
      Can you please tell me how we access HBase using Hadoop clusters in Cloudera Manager on Red Hat Linux?
      Please send me materials or any other related stuff.
      My mail id is : vaddi.ramu33@gmail.com
      Please send suggestions for me as well, as I am new to Hadoop.

    3. Hi,
      for hbase you can refer
      http://ankitasblogger.blogspot.in/2011/01/installing-hbase-in-cluster-complete.html

  12. Hi Rahul,

    I am unable to set up Cloudera Manager, so can you suggest any tutorials about Cloudera Manager other than cloudera.com?

  13. I installed successfully.
    After that I did the client configuration on the Cloudera agents
    and on the Cloudera server host.
    When I enter localhost:50070
    it shows the namenode,
    but when I enter localhost:60010
    it shows a "Page Not Found" error.
    Please can you give some suggestions to clear this error?
    Also, what version of Eclipse can I install on CentOS 5.3 to do projects using Python?

    1. As far as I know, no service runs on localhost:60010 in this setup.

      The Hadoop web interfaces mainly run on:
      50070 (namenode)
      50060 (tasktracker)
      50030 (jobtracker)

  14. Hi, how can we set the client configuration? Please explain briefly with an example.

    1. I am not able to tell which configuration you are asking about.

  15. Hi Rahul,
    I am working on Hadoop and my team members are all new to Hadoop.
    Can you suggest some books which will make it easy to start coding,
    and also some sites and books related to HBase commands?

    1. I think "Hadoop_The_Definitive_Guide_Cr.pdf" is a good book to start with;
      there is also a lot of content available on the net.

  16. Hi Rahul,
    I installed Cloudera Manager successfully.

    After that, when I check the status of the HBase master, I get the following error:


    Traceback (most recent call last):
    File "/usr/lib64/cmf/agent/src/cmf/monitor/master/__init__.py", line 87, in collect
    json = simplejson.load(urllib2.urlopen(self._metrics_url))
    File "/usr/lib64/python2.4/urllib2.py", line 130, in urlopen
    return _opener.open(url, data)
    File "/usr/lib64/python2.4/urllib2.py", line 358, in open
    response = self._open(req, data)
    File "/usr/lib64/python2.4/urllib2.py", line 376, in _open
    '_open', req)
    File "/usr/lib64/python2.4/urllib2.py", line 337, in _call_chain
    result = func(*args)
    File "/usr/lib64/python2.4/urllib2.py", line 1032, in http_open
    return self.do_open(httplib.HTTPConnection, req)
    File "/usr/lib64/python2.4/urllib2.py", line 1006, in do_open
    raise URLError(err)
    URLError:

  17. Hi Rahul
    when I check localhost:50070

    it shows the nodes as dead nodes;
    Live Nodes shows 0.
    Please help me resolve this problem.

  18. I think your datanode daemons are not running; please check the logs.

    If the datanode is running, then please run the following commands:

    $ bin/hadoop dfsadmin -refreshNodes
    $ bin/hadoop fsck /

    1. Hi Rahul,

      I want to chat with you; please come to Gmail and accept my chat request.
      I have so many doubts. It's very urgent and we are struggling a lot.
      Please help me out.

  19. Hi Rahul,

    I started Cloudera Manager. It shows all nodes in a good state,
    but when I try to view the status of a particular datanode or HBase region server,

    I get an error
    and it shows all nodes as dead nodes.

    Please tell me how to configure the client configuration.

    1. I don't use Cloudera Manager that much; I feel more comfortable with manual installation, so I cannot say why Cloudera Manager is giving a different result.
      Whatever result the Hadoop web interface gives, I feel that is correct.

      Also, please scan your log files to find the exact problem.

  20. Hi Rahul,
    how do I set the /etc/hosts file?

    I configured as
    In Master System:

    127.0.0.1 localhost
    192.168.1.13 hadoop1.com
    192.168.1.12 hadoop2.com
    192.168.1.16 hadoop4.com
    192.168.1.49 hadoop3.com

    In Slave System (hadoop2.com):

    127.0.0.1 localhost
    192.168.1.12 hadoop2.com
    192.168.1.13 hadoop1.com


    When I try the following command:

    host -v -t A 'hadoop1.com'

    it returns the global IP instead of the local IP.

    Please resolve this.

  21. It is not compulsory to put an entry in /etc/hosts;
    it is just for your convenience.

    If you put 192.168.1.13 hadoop1.com on a node, then run the following commands to check it:
    ping hadoop1.com
    or ssh hadoop1.com

  22. Hi rahul

    Thanks a lot

    When HBase is connected to one slave and at the same time I try to connect HBase to another slave, the second slave shows the error:

    INFO ipc:HbaceRPC: server at phxl-ss-2-lb.cnet.com/64.30.224.1
    could not be reached

    But it works on slave 1.

    We do not have the IP 64.30.224.1 on any of our systems.

  23. Hi Rahul

    I am getting a FATAL error in the HBase service of one client.

    How can I check whether the data in HBase is distributed or not?

    Please help me out.

    1. For HBase-related queries please post comments at

      http://ankitasblogger.blogspot.in/2011/01/installing-hbase-in-cluster-complete.html

  24. Hi Rahul

    Thanks for helping me with the cluster setup.

    I am thinking of writing MapReduce programs in Python.

    How do I do this, and what resources should I use?

    Please send me any materials regarding MapReduce in Python.

  25. Hi,

    I notice that there are some steps in

    https://ccp.cloudera.com/display/CDHDOC/CDH3+Deployment+on+a+Cluster#CDH3DeploymentonaCluster-ConfigurationFilesandProperties

    that you do not include in your tutorial. (An example is the configuration of local storage directories. Another difference is that the Cloudera info indicates that one must use fully-qualified domain names which your tutorial does not seem to require.)

    Can you comment upon these steps? Are they not required but nice to have?

  26. Hi,
    The configuration specified above is the minimum required.
    But yes, some of the configuration parameters on the link you gave should be considered; they are good practice.

    Thanks for pointing this out, I will update the tutorial.

  27. Hi Rahul,
    I am facing a problem starting hbase-master on CDH4 (YARN); the web page localhost:60010 is not opening. I followed the installation procedure from Cloudera's CDH4 installation guide for standalone/pseudo-distributed mode. Once I purged HBase and reinstalled it (maybe with some configuration changes), it was working well.
