Hadoop Overview

Table of Contents

1. What Hadoop Can Be Used For
2. What Components Does Hadoop Include
3. The Relationship Between CDH and Apache Hadoop
4. CDH Product Components
5. CDH Component Port Assignments and Configuration

1. What Hadoop Can Be Used For

Why Hadoop? http://www.cloudera.com/why-hadoop/

Simply put, Hadoop can transform the way you store and process data throughout your enterprise. According to analysts, about 80% of the data in the world is unstructured, and until Hadoop, it was essentially unusable in any systematic way. With Hadoop, for the first time you can combine all your data and look at it as one.

  • Make All Your Data Profitable. Hadoop enables you to gain insight from all the data you already have; to ingest the data flowing into your systems 24/7 and leverage it to make optimizations that were impossible before; to make decisions based on hard data, not hunches; to look at complete data, not samples; to look at years of transactions, not days or weeks. In short, Hadoop will change the way you run your organization.
  • Leverage All Types of Data, From All Types of Systems. Hadoop can handle all types of data from disparate systems: structured, unstructured, log files, pictures, audio files, communications records, email, just about anything you can think of. Even when different types of data are stored in unrelated systems, you can dump them all into your Hadoop cluster before you even know how you might take advantage of them in the future.
  • Scale Beyond Anything You Have Today. The largest social network in the world is built on the same open-source technology as Hadoop, and its data now exceeds 100 petabytes. It’s unlikely your organization has that much data. As you need more capacity, you just add more commodity servers, and Hadoop automatically incorporates the new storage and compute capacity.

2. What Components Does Hadoop Include

Apache Hadoop includes the following components:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Components related to Apache Hadoop include:

  • Avro: A data serialization system.
  • Cassandra: A scalable multi-master database with no single points of failure.
  • Chukwa: A data collection system for managing large distributed systems.
  • HBase: A scalable, distributed database that supports structured data storage for large tables.
  • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout: A scalable machine learning and data mining library.
  • Pig: A high-level data-flow language and execution framework for parallel computation.
  • ZooKeeper: A high-performance coordination service for distributed applications.

3. The Relationship Between CDH and Apache Hadoop

CDH Hadoop FAQ https://ccp.cloudera.com/display/SUPPORT/Hadoop+FAQ

  • What exactly is included in CDH? / Cloudera's Distribution Including Apache Hadoop (CDH) is a certified release of Apache Hadoop. We include some stable patches scheduled to be included in future releases, as well as some patches we have developed for our supported customers, and are in the process of contributing back to Apache.
  • What license is Cloudera's Distribution Including Apache Hadoop released under? / Just like Hadoop, Cloudera's Distribution Including Apache Hadoop is released under the Apache Public License version 2.
  • Is Cloudera forking Hadoop? / Absolutely not. Cloudera is committed to the Hadoop project and the principles of the Apache Software License and Foundation. We continue to work actively with current releases of Hadoop and deliver certified releases to the community as appropriate.
  • Does Cloudera contribute their changes back to Apache? / We do, and will continue to contribute all eligible changes back to Apache. We occasionally release code we know to be stable even if our contribution to Apache is still in progress. Some of our changes are not eligible for contribution, as they capture the Cloudera brand or link to our tools and documentation, but these do not affect compatibility with the core project.

4. CDH Product Components

5. CDH Component Port Assignments and Configuration

The CDH4 components, and third-party services such as Kerberos, use the ports listed in the tables below. Before you deploy CDH4, make sure these ports are open on each system.
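
Whether a given port is actually open is easy to verify with a TCP probe before deployment. Below is a minimal sketch using nc (netcat); the host name nn01.example.com is a placeholder, and 8020 is the NameNode RPC port from the HDFS table below:

    # Probe the NameNode RPC port on a hypothetical host;
    # exit status 0 means the port is open and reachable.
    nc -z -w 5 nn01.example.com 8020 && echo "open" || echo "closed or filtered"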

5.1. Hadoop HDFS

| Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| DataNode | | 50010 | TCP | External | dfs.datanode.address | |
| DataNode | Secure | 1004 | TCP | External | dfs.datanode.address | |
| DataNode | | 50075 | TCP | External | dfs.datanode.http.address | DataNode HTTP server port |
| DataNode | Secure | 1006 | TCP | External | dfs.datanode.http.address | |
| DataNode | | 50020 | TCP | External | dfs.datanode.ipc.address | |
| NameNode | | 8020 | TCP | External | fs.default.name or fs.defaultFS | fs.default.name is deprecated (but still works) |
| NameNode | | 50070 | TCP | External | dfs.http.address or dfs.namenode.http-address | dfs.http.address is deprecated (but still works) |
| NameNode | Secure | 50470 | TCP | External | dfs.https.address or dfs.namenode.https-address | dfs.https.address is deprecated (but still works) |
| Secondary NameNode | | 50090 | TCP | Internal | dfs.secondary.http.address or dfs.namenode.secondary.http-address | dfs.secondary.http.address is deprecated (but still works) |
| Secondary NameNode | Secure | 50495 | TCP | Internal | dfs.secondary.https.address | |
| JournalNode | | 8485 | TCP | Internal | dfs.namenode.shared.edits.dir | |
| JournalNode | | 8480 | TCP | Internal | dfs.journalnode.http-address | |
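
To illustrate how these properties are used, here is a minimal hdfs-site.xml sketch that pins a few of the ports above to their defaults from the table; the 0.0.0.0 bind addresses are assumptions, and in practice you only set these properties when overriding the defaults:

    <!-- hdfs-site.xml (sketch) -->
    <property>
      <name>dfs.datanode.address</name>
      <value>0.0.0.0:50010</value>   <!-- DataNode data transfer port -->
    </property>
    <property>
      <name>dfs.datanode.http.address</name>
      <value>0.0.0.0:50075</value>   <!-- DataNode HTTP server port -->
    </property>
    <property>
      <name>dfs.namenode.http-address</name>
      <value>0.0.0.0:50070</value>   <!-- NameNode web UI -->
    </property>

Note that fs.defaultFS (the NameNode RPC endpoint on port 8020) is set in core-site.xml, not hdfs-site.xml.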

5.2. Hadoop MRv1

| Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| JobTracker | | 8021 | TCP | External | mapred.job.tracker | |
| JobTracker | | 50030 | TCP | External | mapred.job.tracker.http.address | |
| JobTracker | Thrift Plugin | 9290 | TCP | Internal | jobtracker.thrift.address | Required by Hue and Cloudera Manager Activity Monitor |
| TaskTracker | | 50060 | TCP | External | mapred.task.tracker.http.address | |
| TaskTracker | | 0 | TCP | Localhost | mapred.task.tracker.report.address | Communicating with child (umbilical) |
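
A minimal mapred-site.xml sketch for the first two JobTracker rows above; jt01.example.com is a placeholder host name:

    <!-- mapred-site.xml (sketch), MRv1 -->
    <property>
      <name>mapred.job.tracker</name>
      <value>jt01.example.com:8021</value>   <!-- JobTracker RPC, host:port -->
    </property>
    <property>
      <name>mapred.job.tracker.http.address</name>
      <value>0.0.0.0:50030</value>           <!-- JobTracker web UI -->
    </property>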

5.3. Hadoop YARN

| Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| ResourceManager | | 8032 | TCP | | yarn.resourcemanager.address | |
| ResourceManager | | 8030 | TCP | | yarn.resourcemanager.scheduler.address | |
| ResourceManager | | 8031 | TCP | | yarn.resourcemanager.resource-tracker.address | |
| ResourceManager | | 8033 | TCP | | yarn.resourcemanager.admin.address | |
| ResourceManager | | 8088 | TCP | | yarn.resourcemanager.webapp.address | |
| NodeManager | | 8040 | TCP | | yarn.nodemanager.localizer.address | |
| NodeManager | | 8042 | TCP | | yarn.nodemanager.webapp.address | |
| NodeManager | | 8041 | TCP | | yarn.nodemanager.address | |
| MapReduce JobHistory Server | | 10020 | TCP | | mapreduce.jobhistory.address | |
| MapReduce JobHistory Server | | 19888 | TCP | | mapreduce.jobhistory.webapp.address | |
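
The YARN addresses are host:port pairs set in yarn-site.xml; the sketch below covers two of them, with rm01.example.com as a placeholder host. The two mapreduce.jobhistory.* properties, despite appearing in this table, are conventionally set in mapred-site.xml:

    <!-- yarn-site.xml (sketch) -->
    <property>
      <name>yarn.resourcemanager.address</name>
      <value>rm01.example.com:8032</value>   <!-- client-to-ResourceManager RPC -->
    </property>
    <property>
      <name>yarn.resourcemanager.webapp.address</name>
      <value>rm01.example.com:8088</value>   <!-- ResourceManager web UI -->
    </property>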

5.4. HBase

| Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| Master | | 60000 | TCP | External | hbase.master.port | IPC |
| Master | | 60010 | TCP | External | hbase.master.info.port | HTTP |
| RegionServer | | 60020 | TCP | External | hbase.regionserver.port | IPC |
| RegionServer | | 60030 | TCP | External | hbase.regionserver.info.port | HTTP |
| HQuorumPeer | | 2181 | TCP | | hbase.zookeeper.property.clientPort | HBase-managed ZooKeeper mode |
| HQuorumPeer | | 2888 | TCP | | hbase.zookeeper.peerport | HBase-managed ZooKeeper mode |
| HQuorumPeer | | 3888 | TCP | | hbase.zookeeper.leaderport | HBase-managed ZooKeeper mode |
| REST | REST Service | 8080 | TCP | External | hbase.rest.port | |
| ThriftServer | Thrift Server | 9090 | TCP | External | Pass -p <port> on CLI | |
| | Avro server | 9090 | TCP | External | Pass --port <port> on CLI | |
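
A minimal hbase-site.xml sketch covering three of the rows above; the values shown are simply the defaults from the table:

    <!-- hbase-site.xml (sketch) -->
    <property>
      <name>hbase.master.port</name>
      <value>60000</value>   <!-- Master IPC -->
    </property>
    <property>
      <name>hbase.master.info.port</name>
      <value>60010</value>   <!-- Master web UI -->
    </property>
    <property>
      <name>hbase.zookeeper.property.clientPort</name>
      <value>2181</value>    <!-- forwarded to the managed ZooKeeper's clientPort -->
    </property>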

5.5. ZooKeeper

| Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| Server (with CDH4 and/or Cloudera Manager 4) | | 2181 | TCP | External | clientPort | Client port |
| Server (with CDH4 only) | | 2888 | TCP | Internal | X in server.N=host:X:Y | Peer |
| Server (with CDH4 only) | | 3888 | TCP | Internal | Y in server.N=host:X:Y | Peer |
| Server (with CDH4 and Cloudera Manager 4) | | 3181 | TCP | Internal | X in server.N=host:X:Y | Peer |
| Server (with CDH4 and Cloudera Manager 4) | | 4181 | TCP | Internal | Y in server.N=host:X:Y | Peer |
| ZooKeeper FailoverController (ZKFC) | | 8019 | TCP | Internal | | Used for HA |
| ZooKeeper JMX port | | 9010 | TCP | Internal | | |

In addition to the JMX port, ZooKeeper also uses a second, randomly selected port for RMI. For Cloudera Manager to monitor ZooKeeper, you must therefore open all ports to connections originating from the Cloudera Manager server.
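
The server.N entries referenced in the Configuration column live in zoo.cfg. A minimal three-node ensemble sketch, with placeholder host names, showing where the X (peer) and Y (leader-election) ports appear:

    # zoo.cfg (sketch)
    clientPort=2181
    # server.N=host:X:Y -- X is the peer port, Y the leader-election port
    server.1=zk01.example.com:2888:3888
    server.2=zk02.example.com:2888:3888
    server.3=zk03.example.com:2888:3888

The JMX port (9010 above) is normally set through the JVM flag com.sun.management.jmxremote.port rather than in zoo.cfg.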

5.6. Other Components

Hive

| Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| Metastore | | 9083 | TCP | External | | |
| HiveServer | | 10000 | TCP | External | | |
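
The Configuration column is empty because both ports are fixed when the services are started; clients locate the metastore through hive.metastore.uris instead. A minimal hive-site.xml sketch with a placeholder host name:

    <!-- hive-site.xml (sketch), client side -->
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://meta01.example.com:9083</value>   <!-- remote metastore endpoint -->
    </property>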

Sqoop

| Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| Metastore | | 16000 | TCP | External | sqoop.metastore.server.port | |
| Sqoop 2 server | | 12000 | TCP | External | | |
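
A minimal sqoop-site.xml sketch pinning the Sqoop 1 shared metastore to the default port from the table:

    <!-- sqoop-site.xml (sketch) -->
    <property>
      <name>sqoop.metastore.server.port</name>
      <value>16000</value>   <!-- shared metastore port -->
    </property>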

Hue

| Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| Server | | 8888 | TCP | External | | |
| Beeswax Server | | 8002 | | Internal | | |
| Beeswax Metastore | | 8003 | | Internal | | |
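
Hue's web server port is set in hue.ini; a minimal sketch, assuming the stock [desktop] section:

    # hue.ini (sketch)
    [desktop]
    http_port=8888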

Oozie

| Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| Oozie Server | | 11000 | TCP | External | OOZIE_HTTP_PORT in oozie-env.sh | HTTP |
| Oozie Server | | 11001 | TCP | Localhost | OOZIE_ADMIN_PORT in oozie-env.sh | Shutdown port |
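
Both Oozie ports are overridden through environment variables in oozie-env.sh, as the Configuration column indicates; a minimal sketch:

    # oozie-env.sh (sketch)
    export OOZIE_HTTP_PORT=11000    # HTTP port for the web console and REST API
    export OOZIE_ADMIN_PORT=11001   # localhost-only admin/shutdown port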

Ganglia

| Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| ganglia-gmond | | 8649 | UDP/TCP | Internal | | |
| ganglia-web | | 80 | TCP | External | | Via Apache httpd |

Kerberos

| Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| KRB5 KDC Server | Secure | 88 | UDP/TCP | External | kdc_ports and kdc_tcp_ports in either the [kdcdefaults] or [realms] sections of kdc.conf | By default only UDP |
| KRB5 Admin Server | Secure | 749 | TCP | Internal | kadmind_port in the [realms] section of kdc.conf | |
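
A minimal kdc.conf sketch for the two rows above; EXAMPLE.COM is a placeholder realm, and kdc_tcp_ports is what enables TCP alongside the UDP-only default noted in the table:

    # kdc.conf (sketch)
    [kdcdefaults]
        kdc_ports = 88
        kdc_tcp_ports = 88

    [realms]
        EXAMPLE.COM = {
            kadmind_port = 749
        }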