Hadoop Overview

1. Hadoop可以用来做什么
2. Hadoop包括哪些组件
3. CDH和Apache Hadoop的关系
4. CDH产品组件构成
5. CDH产品组件端口分布和配置

Cloudera http://www.cloudera.com/
Apache Hadoop http://hadoop.apache.org/
CDH Downloads https://ccp.cloudera.com/display/SUPPORT/Downloads
CDH Documentation https://ccp.cloudera.com/display/DOC/Documentation
CDH Tutorial https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial

1. Hadoop可以用来做什么

Why Hadoop? http://www.cloudera.com/why-hadoop/

Simply put, Hadoop can transform the way you store and process data throughout your enterprise. According to analysts, about 80% of the data in the world is unstructured, and until Hadoop, it was essentially unusable in any systematic way. With Hadoop, for the first time you can combine all your data and look at it as one.

Make All Your Data Profitable. Hadoop enables you to gain insight from all the data you already have; to ingest the data flowing into your systems 24/7 and leverage it to make optimizations that were impossible before; to make decisions based on hard data, not hunches; to look at complete data, not samples; to look at years of transactions, not days or weeks. In short, Hadoop will change the way you run your organization.
Leverage All Types of Data, From All Types of Systems. Hadoop can handle all types of data from disparate systems: structured, unstructured, log files, pictures, audio files, communications records, email– just about anything you can think of. Even when different types of data have been stored in unrelated systems, you can dump it all into your Hadoop cluster before you even know how you might take advantage of it in the future.
Scale Beyond Anything You Have Today. The largest social network in the world is built on the same open-source technology as Hadoop, and now exceeds 100 petabytes. It’s unlikely your organization has that much data. As you need more capacity, you just add more commodity servers and Hadoop automatically incorporates the new storage and compute capacity.

2. Hadoop包括哪些组件

Apache Hadoop包括了下面这些组件：

Hadoop Common The common utilities that support the other Hadoop subprojects.
Hadoop Distributed File System(HDFS) A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce A software framework for distributed processing of large data sets on compute clusters.

和Apache Hadoop相关的组件有：

Avro A data serialization system.
Cassandra A scalable multi-master database with no single points of failure.
Chukwa A data collection system for managing large distributed systems.
HBase A scalable, distributed database that supports structured data storage for large tables.
Hive A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout A Scalable machine learning and data mining library.
Pig A high-level data-flow language and execution framework for parallel computation.
ZooKeeper A high-performance coordination service for distributed applications.

3. CDH和Apache Hadoop的关系

CDH Hadoop FAQ https://ccp.cloudera.com/display/SUPPORT/Hadoop+FAQ

What exactly is included in CDH? / Cloudera's Distribution Including Apache Hadoop (CDH) is a certified release of Apache Hadoop. We include some stable patches scheduled to be included in future releases, as well as some patches we have developed for our supported customers, and are in the process of contributing back to Apache.
What license is Cloudera's Distribution Including Apache Hadoop released under? / Just like Hadoop, Cloudera's Distribution Including Apache Hadoop is released under the Apache Public License version 2.
Is Cloudera forking Hadoop? / Absolutely not. Cloudera is committed to the Hadoop project and the principles of the Apache Software License and Foundation. We continue to work actively with current releases of Hadoop and deliver certified releases to the community as appropriate.
Does Cloudera contribute their changes back to Apache? / We do, and will continue to contribute all eligible changes back to Apache. We occasionally release code we know to be stable even if our contribution to Apache is still in progress. Some of our changes are not eligible for contribution, as they capture the Cloudera brand, or link to our tools and documentation, but these do not affect compatibility with core project.

4. CDH产品组件构成

http://www.cloudera.com/content/cloudera/en/products/cdh.html

从这里可以下载CDH4组件 http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDHTarballs/3.25.2013/CDH4-Downloadable-Tarballs/CDH4-Downloadable-Tarballs.html

5. CDH产品组件端口分布和配置

The CDH4 components, and third parties such as Kerberos, use the ports listed in the tables that follow. Before you deploy CDH4, make sure these ports are open on each system.

5.1. Hadoop HDFS

Service	Qualifier	Port	Protocol	Access Requirement	Configuration	Comment
DataNode		50010	TCP	External	dfs.datanode.address	DataNode HTTP server port
DataNode	Secure	1004	TCP	External	dfs.datanode.address
DataNode		50075	TCP	External	dfs.datanode.http.address
DataNode	Secure	1006	TCP	External	dfs.datanode.http.address
DataNode		50020	TCP	External	dfs.datanode.ipc.address
NameNode		8020	TCP	External	fs.default.name or fs.defaultFS	fs.default.name is deprecated (but still works)
NameNode		50070	TCP	External	dfs.http.address or dfs.namenode.http-address	dfs.http.address is deprecated (but still works)
NameNode	Secure	50470	TCP	External	dfs.https.address or dfs.namenode.https-address	dfs.https.address is deprecated (but still works)
Sec NameNode		50090	TCP	Internal	dfs.secondary.http.address or dfs.namenode.secondary.http-address	dfs.secondary.http.address is deprecated (but still works)
Sec NameNode	Secure	50495	TCP	Internal	dfs.secondary.https.address
JournalNode		8485	TCP	Internal	dfs.namenode.shared.edits.dir
JournalNode		8480	TCP	Internal

5.2. Hadoop MRv1

Service	Qualifier	Port	Protocol	Access Requirement	Configuration	Comment
JobTracker		8021	TCP	External	mapred.job.tracker
JobTracker		50030	TCP	External	mapred.job.tracker.http.address
JobTracker	Thrift Plugin	9290	TCP	Internal	jobtracker.thrift.address	Required by Hue and Cloudera Manager Activity Monitor
TaskTracker		50060	TCP	External	mapred.task.tracker.http.address
TaskTracker		0	TCP	Localhost	mapred.task.tracker.report.address	Communicating with child (umbilical)

5.3. Hadoop YARN

Service	Port	Protocol	Configuration
ResourceManager	8032	TCP	yarn.resourcemanager.address
ResourceManager	8030	TCP	yarn.resourcemanager.scheduler.address
ResourceManager	8031	TCP	yarn.resourcemanager.resource-tracker.address
ResourceManager	8033	TCP	yarn.resourcemanager.admin.address
ResourceManager	8088	TCP	yarn.resourcemanager.webapp.address
NodeManager	8040	TCP	yarn.nodemanager.localizer.address
NodeManager	8042	TCP	yarn.nodemanager.webapp.address
NodeManager	8041	TCP	yarn.nodemanager.address
MapReduce JobHistory Server	10020	TCP	mapreduce.jobhistory.address
MapReduce JobHistory Server	19888	TCP	mapreduce.jobhistory.webapp.address

5.4. HBase

Service	Qualifier	Port	Protocol	Access Requirement	Configuration	Comment
Master		60000	TCP	External	hbase.master.port	IPC
Master		60010	TCP	External	hbase.master.info.port	HTTP
RegionServer		60020	TCP	External	hbase.regionserver.port	IPC
RegionServer		60030	TCP	External	hbase.regionserver.info.port	HTTP
HQuorumPeer		2181	TCP		hbase.zookeeper.property.clientPort	HBase-managed ZK mode
HQuorumPeer		2888	TCP		hbase.zookeeper.peerport	HBase-managed ZK mode
HQuorumPeer		3888	TCP		hbase.zookeeper.leaderport	HBase-managed ZK mode
REST	REST Service	8080	TCP	External	hbase.rest.port
ThriftServer	Thrift Server	9090	TCP	External	Pass -p <port> on CLI
	Avro server	9090	TCP	External	Pass –port <port> on CLI

5.5. Zookeeper

Service	Port	Protocol	Access Requirement	Configuration	Comment
Server (with CDH4 and/or Cloudera Manager 4)	2181	TCP	External	clientPort	Client port
Server (with CDH4 only)	2888	TCP	Internal	X in server.N=host:X:Y	Peer
Server (with CDH4 only)	3888	TCP	Internal	Y in server.N=host:X:Y	Peer
Server (with CDH4 and Cloudera Manager 4)	3181	TCP	Internal	X in server.N=host:X:Y	Peer
Server (with CDH4 and Cloudera Manager 4)	4181	TCP	Internal	Y in server.N=host:X:Y	Peer
ZooKeeper FailoverController (ZKFC)	8019	TCP	Internal		Used for HA
ZooKeeper JMX port	9010	TCP	Internal

As JMX port, ZooKeeper will also use another randomly selected port for RMI. In order for Cloudera Manager to monitor ZooKeeper, you must open up all ports when the connection originates from the Cloudera Manager server.

5.6. 其他组件

Hive

Service	Qualifier	Port	Protocol	Access Requirement	Configuration	Comment
Metastore		9083	TCP	External
HiveServer		10000	TCP	External

Sqoop

Service	Qualifier	Port	Protocol	Access Requirement	Configuration	Comment
Metastore		16000	TCP	External	sqoop.metastore.server.port
Sqoop 2 server		12000	TCP	External

Hue

Service	Port	Protocol	Access Requirement
Server	8888	TCP	External
Beeswax Server	8002		Internal
Beeswax Metastore	8003		Internal

Ozzie

Service	Qualifier	Port	Protocol	Access Requirement	Configuration	Comment
Oozie Server		11000	TCP	External	OOZIE_HTTP_PORT in oozie-env.sh	HTTP
Oozie Server		11001	TCP	localhost	OOZIE_ADMIN_PORT in oozie-env.sh	Shutdown port

Ganglia

Service	Qualifier	Port	Protocol	Access Requirement	Configuration	Comment
ganglia-gmond		8649	UDP/TCP	Internal
ganglia-web		80	TCP	External	Via Apache httpd

Kerberos

Service	Qualifier	Port	Protocol	Access Requirement	Configuration	Comment
KRB5 KDC Server	Secure	88	UDP/TCP	External	kdc_ports and kdc_tcp_ports in either the [kdcdefaults] or [realms] sections of kdc.conf	By default only UDP
KRB5 Admin Server	Secure	749	TCP	Internal	kadmind_port in the [realms] section of kdc.conf