Hadoop Overview
- Cloudera http://www.cloudera.com/
- Apache Hadoop http://hadoop.apache.org/
- CDH Downloads https://ccp.cloudera.com/display/SUPPORT/Downloads
- CDH Documentation https://ccp.cloudera.com/display/DOC/Documentation
- CDH Tutorial https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial
1. What Can Hadoop Be Used For?
Why Hadoop? http://www.cloudera.com/why-hadoop/
Simply put, Hadoop can transform the way you store and process data throughout your enterprise. According to analysts, about 80% of the data in the world is unstructured, and until Hadoop, it was essentially unusable in any systematic way. With Hadoop, for the first time you can combine all your data and look at it as one.
- Make All Your Data Profitable. Hadoop enables you to gain insight from all the data you already have; to ingest the data flowing into your systems 24/7 and leverage it to make optimizations that were impossible before; to make decisions based on hard data, not hunches; to look at complete data, not samples; to look at years of transactions, not days or weeks. In short, Hadoop will change the way you run your organization.
- Leverage All Types of Data, From All Types of Systems. Hadoop can handle all types of data from disparate systems: structured, unstructured, log files, pictures, audio files, communications records, email, and just about anything else you can think of. Even when different types of data have been stored in unrelated systems, you can dump it all into your Hadoop cluster before you even know how you might take advantage of it in the future.
- Scale Beyond Anything You Have Today. The largest social network in the world is built on the same open-source technology as Hadoop, and now exceeds 100 petabytes. It’s unlikely your organization has that much data. As you need more capacity, you just add more commodity servers and Hadoop automatically incorporates the new storage and compute capacity.
2. Which Components Does Hadoop Include?
Apache Hadoop includes the following components:
- Hadoop Common The common utilities that support the other Hadoop subprojects.
- Hadoop Distributed File System(HDFS) A distributed file system that provides high-throughput access to application data.
- Hadoop MapReduce A software framework for distributed processing of large data sets on compute clusters.
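
To make the MapReduce description above a little more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python. It does not use Hadoop's API and does no distribution at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data across the network; the sketch only shows how the three phases fit together.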
Components related to Apache Hadoop include:
- Avro A data serialization system.
- Cassandra A scalable multi-master database with no single points of failure.
- Chukwa A data collection system for managing large distributed systems.
- HBase A scalable, distributed database that supports structured data storage for large tables.
- Hive A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout A scalable machine learning and data mining library.
- Pig A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper A high-performance coordination service for distributed applications.
3. How CDH Relates to Apache Hadoop
CDH Hadoop FAQ https://ccp.cloudera.com/display/SUPPORT/Hadoop+FAQ
- What exactly is included in CDH? / Cloudera's Distribution Including Apache Hadoop (CDH) is a certified release of Apache Hadoop. We include some stable patches scheduled to be included in future releases, as well as some patches we have developed for our supported customers, and are in the process of contributing back to Apache.
- What license is Cloudera's Distribution Including Apache Hadoop released under? / Just like Hadoop, Cloudera's Distribution Including Apache Hadoop is released under the Apache License, Version 2.0.
- Is Cloudera forking Hadoop? / Absolutely not. Cloudera is committed to the Hadoop project and the principles of the Apache Software License and Foundation. We continue to work actively with current releases of Hadoop and deliver certified releases to the community as appropriate.
- Does Cloudera contribute their changes back to Apache? / We do, and will continue to contribute all eligible changes back to Apache. We occasionally release code we know to be stable even if our contribution to Apache is still in progress. Some of our changes are not eligible for contribution, as they capture the Cloudera brand, or link to our tools and documentation, but these do not affect compatibility with the core project.
4. CDH Product Components
5. CDH Component Ports and Configuration
The CDH4 components, and third parties such as Kerberos, use the ports listed in the tables that follow. Before you deploy CDH4, make sure these ports are open on each system.
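
One way to verify that a port is reachable is to attempt a TCP connection to it. The sketch below does that with Python's standard `socket` module. The `SAMPLE_PORTS` dictionary is a hypothetical subset picked from the tables in this section, and `localhost` is a placeholder host; UDP ports and firewall rules that filter by source address are not covered.

```python
import socket

# Hypothetical sample: a few TCP ports taken from the tables in this section.
SAMPLE_PORTS = {"NameNode": 8020, "NameNode web UI": 50070, "ZooKeeper": 2181}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports):
    """Map each service name to whether its port accepted a connection."""
    return {name: is_port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, ok in check_ports("localhost", SAMPLE_PORTS).items():
        print(f"{name}: {'open' if ok else 'closed or filtered'}")
```

A successful connection only shows that something is listening and that the path from the checking machine is open, so the check is best run from a host that will actually need the access.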
5.1. Hadoop HDFS
Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
---|---|---|---|---|---|---|
DataNode | | 50010 | TCP | External | dfs.datanode.address | DataNode data transfer port |
DataNode | Secure | 1004 | TCP | External | dfs.datanode.address | |
DataNode | | 50075 | TCP | External | dfs.datanode.http.address | |
DataNode | Secure | 1006 | TCP | External | dfs.datanode.http.address | |
DataNode | | 50020 | TCP | External | dfs.datanode.ipc.address | |
NameNode | | 8020 | TCP | External | fs.default.name or fs.defaultFS | fs.default.name is deprecated (but still works) |
NameNode | | 50070 | TCP | External | dfs.http.address or dfs.namenode.http-address | dfs.http.address is deprecated (but still works) |
NameNode | Secure | 50470 | TCP | External | dfs.https.address or dfs.namenode.https-address | dfs.https.address is deprecated (but still works) |
Secondary NameNode | | 50090 | TCP | Internal | dfs.secondary.http.address or dfs.namenode.secondary.http-address | dfs.secondary.http.address is deprecated (but still works) |
Secondary NameNode | Secure | 50495 | TCP | Internal | dfs.secondary.https.address | |
JournalNode | | 8485 | TCP | Internal | dfs.namenode.shared.edits.dir | |
JournalNode | | 8480 | TCP | Internal | | |
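
Hadoop reads these settings from XML files made of `<property>` elements, each with a `<name>` and a `<value>`. As a rough sketch, the snippet below builds such a fragment for the three non-secure DataNode properties in the table above. The `0.0.0.0` bind address and the helper name `build_configuration` are assumptions for illustration, and the output is not a complete `hdfs-site.xml`.

```python
import xml.etree.ElementTree as ET

# Property names and ports come from the table above; the 0.0.0.0 bind
# address is an assumption for illustration.
DATANODE_ADDRESSES = {
    "dfs.datanode.address": "0.0.0.0:50010",
    "dfs.datanode.http.address": "0.0.0.0:50075",
    "dfs.datanode.ipc.address": "0.0.0.0:50020",
}


def build_configuration(properties):
    """Build a Hadoop-style <configuration> element from a name -> value dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


if __name__ == "__main__":
    print(ET.tostring(build_configuration(DATANODE_ADDRESSES), encoding="unicode"))
```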
5.2. Hadoop MRv1
Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
---|---|---|---|---|---|---|
JobTracker | | 8021 | TCP | External | mapred.job.tracker | |
JobTracker | | 50030 | TCP | External | mapred.job.tracker.http.address | |
JobTracker | Thrift Plugin | 9290 | TCP | Internal | jobtracker.thrift.address | Required by Hue and Cloudera Manager Activity Monitor |
TaskTracker | | 50060 | TCP | External | mapred.task.tracker.http.address | |
TaskTracker | | 0 | TCP | Localhost | mapred.task.tracker.report.address | Communicating with child (umbilical) |
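
The last TaskTracker row lists port 0. In the sockets API, binding to port 0 asks the operating system to pick any free port, so there is no fixed number to open in a firewall; because the access requirement is localhost only, that is normally not a problem. The sketch below shows the behavior with a plain Python socket. It is not Hadoop code, and `bind_ephemeral` is a hypothetical helper name.

```python
import socket


def bind_ephemeral(host="127.0.0.1"):
    """Bind to port 0 and return the socket plus the port the OS picked."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 means "any free port"
    sock.listen(1)
    return sock, sock.getsockname()[1]


if __name__ == "__main__":
    sock, port = bind_ephemeral()
    print(f"requested port 0, OS assigned port {port}")
    sock.close()
```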
5.3. Hadoop YARN
Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
---|---|---|---|---|---|---|
ResourceManager | | 8032 | TCP | | yarn.resourcemanager.address | |
ResourceManager | | 8030 | TCP | | yarn.resourcemanager.scheduler.address | |
ResourceManager | | 8031 | TCP | | yarn.resourcemanager.resource-tracker.address | |
ResourceManager | | 8033 | TCP | | yarn.resourcemanager.admin.address | |
ResourceManager | | 8088 | TCP | | yarn.resourcemanager.webapp.address | |
NodeManager | | 8040 | TCP | | yarn.nodemanager.localizer.address | |
NodeManager | | 8042 | TCP | | yarn.nodemanager.webapp.address | |
NodeManager | | 8041 | TCP | | yarn.nodemanager.address | |
MapReduce JobHistory Server | | 10020 | TCP | | mapreduce.jobhistory.address | |
MapReduce JobHistory Server | | 19888 | TCP | | mapreduce.jobhistory.webapp.address | |
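
If a cluster overrides these defaults, the ports actually in use can be read back from the configuration file. The following is a minimal sketch that extracts the port from every `host:port` value in a Hadoop-style XML string. The sample fragment and the host `rm.example.com` are placeholders, and `extract_ports` is a hypothetical helper, not part of any Hadoop tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical yarn-site.xml fragment; "rm.example.com" is a placeholder host.
SAMPLE_YARN_SITE = """
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>rm.example.com:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>rm.example.com:8088</value>
  </property>
</configuration>
"""


def extract_ports(xml_text):
    """Return {property name: port} for every value shaped like host:port."""
    ports = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        _host, sep, port = value.rpartition(":")
        if sep and port.isdigit():
            ports[name] = int(port)
    return ports


if __name__ == "__main__":
    for name, port in extract_ports(SAMPLE_YARN_SITE).items():
        print(f"{name} -> {port}")
```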
5.4. HBase
Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
---|---|---|---|---|---|---|
Master | | 60000 | TCP | External | hbase.master.port | IPC |
Master | | 60010 | TCP | External | hbase.master.info.port | HTTP |
RegionServer | | 60020 | TCP | External | hbase.regionserver.port | IPC |
RegionServer | | 60030 | TCP | External | hbase.regionserver.info.port | HTTP |
HQuorumPeer | | 2181 | TCP | | hbase.zookeeper.property.clientPort | HBase-managed ZK mode |
HQuorumPeer | | 2888 | TCP | | hbase.zookeeper.peerport | HBase-managed ZK mode |
HQuorumPeer | | 3888 | TCP | | hbase.zookeeper.leaderport | HBase-managed ZK mode |
REST | REST Service | 8080 | TCP | External | hbase.rest.port | |
ThriftServer | Thrift Server | 9090 | TCP | External | Pass -p <port> on CLI | |
Avro server | | 9090 | TCP | External | Pass --port <port> on CLI | |
5.5. ZooKeeper
Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
---|---|---|---|---|---|---|
Server (with CDH4 and/or Cloudera Manager 4) | | 2181 | TCP | External | clientPort | Client port |
Server (with CDH4 only) | | 2888 | TCP | Internal | X in server.N=host:X:Y | Peer |
Server (with CDH4 only) | | 3888 | TCP | Internal | Y in server.N=host:X:Y | Peer |
Server (with CDH4 and Cloudera Manager 4) | | 3181 | TCP | Internal | X in server.N=host:X:Y | Peer |
Server (with CDH4 and Cloudera Manager 4) | | 4181 | TCP | Internal | Y in server.N=host:X:Y | Peer |
ZooKeeper FailoverController (ZKFC) | | 8019 | TCP | Internal | | Used for HA |
ZooKeeper JMX port | | 9010 | TCP | Internal | | |
In addition to the JMX port, ZooKeeper also uses another randomly selected port for RMI. For Cloudera Manager to monitor ZooKeeper, you must open all ports for connections that originate from the Cloudera Manager server.
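
The `server.N=host:X:Y` notation in the table refers to lines in ZooKeeper's `zoo.cfg`, where X is the port followers use to connect to the leader and Y is the leader election port. As a small sketch, the code below pulls both ports out of such lines. The sample file contents and host names are placeholders, and `parse_servers` is a hypothetical helper.

```python
import re

# Hypothetical zoo.cfg contents; host names are placeholders.
SAMPLE_ZOO_CFG = """\
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
"""

# Matches server.N=host:X:Y and captures N, host, X, and Y.
SERVER_LINE = re.compile(r"^server\.(\d+)=([^:\s]+):(\d+):(\d+)")


def parse_servers(cfg_text):
    """Return {N: (host, X, Y)} for each server.N=host:X:Y line."""
    servers = {}
    for line in cfg_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            n, host, x, y = match.groups()
            servers[int(n)] = (host, int(x), int(y))
    return servers


if __name__ == "__main__":
    for n, (host, x, y) in sorted(parse_servers(SAMPLE_ZOO_CFG).items()):
        print(f"server.{n}: host={host} peer port={x} election port={y}")
```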
5.6. Other Components
Hive
Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
---|---|---|---|---|---|---|
Metastore | | 9083 | TCP | External | | |
HiveServer | | 10000 | TCP | External | | |
Sqoop
Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
---|---|---|---|---|---|---|
Metastore | | 16000 | TCP | External | sqoop.metastore.server.port | |
Sqoop 2 server | | 12000 | TCP | External | | |
Hue
Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
---|---|---|---|---|---|---|
Server | | 8888 | TCP | External | | |
Beeswax Server | | 8002 | | Internal | | |
Beeswax Metastore | | 8003 | | Internal | | |
Oozie
Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
---|---|---|---|---|---|---|
Oozie Server | | 11000 | TCP | External | OOZIE_HTTP_PORT in oozie-env.sh | HTTP |
Oozie Server | | 11001 | TCP | localhost | OOZIE_ADMIN_PORT in oozie-env.sh | Shutdown port |
Ganglia
Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
---|---|---|---|---|---|---|
ganglia-gmond | | 8649 | UDP/TCP | Internal | | |
ganglia-web | | 80 | TCP | External | | Via Apache httpd |
Kerberos
Service | Qualifier | Port | Protocol | Access Requirement | Configuration | Comment |
---|---|---|---|---|---|---|
KRB5 KDC Server | Secure | 88 | UDP/TCP | External | kdc_ports and kdc_tcp_ports in either the [kdcdefaults] or [realms] sections of kdc.conf | By default only UDP |
KRB5 Admin Server | Secure | 749 | TCP | Internal | kadmind_port in the [realms] section of kdc.conf | |