Hive Getting Started

https://cwiki.apache.org/confluence/display/Hive/GettingStarted

First set up HDFS and MapReduce (or Spark), then configure Hive. Hive uses HDFS for file storage and MapReduce (or Spark) as its execution engine.

1. Configuration

  • The default configuration file is <install-dir>/conf/hive-default.xml
  • Settings can be overridden in <install-dir>/conf/hive-site.xml
  • The configuration directory can be changed via the HIVE_CONF_DIR environment variable
  • Logging is configured in <install-dir>/conf/hive-log4j.properties
  • Parameters can be set at startup with bin/hive -hiveconf x1=y1 -hiveconf x2=y2
  • Parameters can be changed at runtime with SET mapred.job.tracker=myhost.mycompany.com:50030; (see the sketch after this list)
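
A minimal sketch of both mechanisms; the property values (a local job tracker, a warehouse path) are illustrative assumptions, not required settings:

# pass settings when starting the CLI
bin/hive -hiveconf mapred.job.tracker=localhost:9001 \
         -hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse

# ...or change and inspect them inside a session
hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
hive> SET mapred.job.tracker;    -- prints the current value
hive> SET;                       -- prints all settings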

2. Local Mode

Local mode here means only that the MapReduce job itself runs on the local node; where the data comes from still depends on how hdfs/hbase are configured. You can force MapReduce to run locally with SET mapred.job.tracker=local;.

Since Hive 0.7 there is an automatic switch to local mode. The feature is disabled by default; after enabling it with hive> SET hive.exec.mode.local.auto=true; a job runs in local mode whenever all three of the following conditions hold (see the sketch after this list):

  • The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)
  • The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)
  • The total number of reduce tasks required is 1 or 0.
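
A minimal session sketch; the threshold values shown are just the documented defaults restated explicitly, not tuning advice:

hive> SET hive.exec.mode.local.auto=true;
hive> -- optional: tune the thresholds (the values below are the defaults)
hive> SET hive.exec.mode.local.auto.inputbytes.max=134217728;
hive> SET hive.exec.mode.local.auto.tasks.max=4;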

3. Metadata Store

https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin

The metadata store holds the metadata for databases and tables. The relevant configuration parameters are:

  • javax.jdo.option.ConnectionURL
  • javax.jdo.option.ConnectionDriverName

The default implementation is a local Derby database, stored in ./metastore_db by default. Other backends must be supported by JPOX (a JDO implementation): the metastore can be stored in any database that JPOX supports. The database schema is defined in the JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.
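
For reference, the embedded Derby default corresponds roughly to the following hive-site.xml entries (shown for orientation only; a stock install does not need to set them):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>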

The metadata can also be stored in a remote database. The remote metastore server and its clients communicate via Thrift, and the Thrift server connects to MySQL (or another database) over JDBC.

If you are using MySQL as the datastore for metadata, put the MySQL client libraries in HIVE_HOME/lib before starting the Hive client or the HiveMetastore server. On Ubuntu you can install them with sudo apt-get install libmysql-java; the jars then live under /usr/share/java (see the sketch below).
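
A minimal sketch on Ubuntu; the exact jar name under /usr/share/java is an assumption and may differ between package versions:

sudo apt-get install libmysql-java
# expose the connector to Hive; verify the actual jar name first
ln -s /usr/share/java/mysql-connector-java.jar $HIVE_HOME/lib/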

Server Configuration Parameters

Config Param                          | Config Value                                                            | Comment
javax.jdo.option.ConnectionURL        | jdbc:mysql://<host name>/<database name>?createDatabaseIfNotExist=true  | metadata is stored in a MySQL server
javax.jdo.option.ConnectionDriverName | com.mysql.jdbc.Driver                                                   | MySQL JDBC driver class
javax.jdo.option.ConnectionUserName   | <user name>                                                             | user name for connecting to the MySQL server
javax.jdo.option.ConnectionPassword   | <password>                                                              | password for connecting to the MySQL server
hive.metastore.warehouse.dir          | <base hdfs path>                                                        | default location for Hive tables

Client Configuration Parameters

Config Param                 | Config Value                | Comment
hive.metastore.uris          | thrift://<host_name>:<port> | host and port of the Thrift metastore server
hive.metastore.local         | false                       | false means the metastore is remote
hive.metastore.warehouse.dir | <base hdfs path>            | default location for Hive tables

The Thrift server is started with hive --service metastore and listens on port 9083 by default. The port can be specified with the -p option, or taken from the METASTORE_PORT environment variable (which can be set in hive-env.sh). A startup sketch follows the log output below.

13/03/07 18:06:34 INFO metastore.HiveMetaStore: Started the new metaserver on port [9083]...
13/03/07 18:06:34 INFO metastore.HiveMetaStore: Options.minWorkerThreads = 200
13/03/07 18:06:34 INFO metastore.HiveMetaStore: Options.maxWorkerThreads = 100000
13/03/07 18:06:34 INFO metastore.HiveMetaStore: TCP keepalive = true
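
A minimal startup sketch; backgrounding with & and the alternative port 9084 are illustrative assumptions:

# start the metastore Thrift server on the default port 9083
hive --service metastore &

# or pick the port explicitly
hive --service metastore -p 9084 &

# or via the environment (e.g. set in hive-env.sh)
export METASTORE_PORT=9084
hive --service metastore &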

The configuration file looks like this:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/hivemeta?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description>password to use against metastore database</description>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>

</configuration>

4. Example

By default, fields in the data files are delimited with ctrl-A.
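
A minimal sketch of overriding the default delimiter at table creation time; the table name and the choice of tab as the delimiter are illustrative:

CREATE TABLE kv_tsv (k INT, v STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';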

➜  bin  hadoop fs -copyFromLocal ../examples/files/kv1.txt /tmp/
13/03/07 14:34:40 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
➜  bin  hive
Hive history file=/tmp/dirlt/hive_job_log_dirlt_201303071434_1408198373.txt
hive> DROP TABLE kv;
OK
Time taken: 4.647 seconds
hive> CREATE TABLE kv (k INT,v STRING);
OK
Time taken: 0.201 seconds
hive> LOAD DATA INPATH '/tmp/kv1.txt' OVERWRITE INTO TABLE kv;
Loading data to table default.kv
Moved to trash: hdfs://localhost:9000/home/dirlt/hive/warehouse/kv
OK
Time taken: 0.225 seconds
hive> SELECT * from kv WHERE k = 417;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201303071324_0006, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201303071324_0006
Kill Command = /home/dirlt/utils/hadoop-0.20.2-cdh3u3//bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201303071324_0006
2013-03-07 14:36:14,960 Stage-1 map = 0%,  reduce = 0%
2013-03-07 14:36:16,970 Stage-1 map = 100%,  reduce = 0%
2013-03-07 14:36:17,982 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201303071324_0006
OK
417     val_417
417     val_417
417     val_417
Time taken: 5.787 seconds

The whole flow breaks down into four parts:

  • copy the data to HDFS
  • create the table
  • load the data
  • run a select (this is where the MapReduce job runs)

The example above uses text data. There is also a way to work with Avro data; a sketch follows.
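
A minimal sketch of an Avro-backed table using Hive's AvroSerDe; the table name and schema are illustrative, and the SerDe classes are assumed to be on the classpath:

CREATE TABLE kv_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "example",
  "name": "kv",
  "type": "record",
  "fields": [
    {"name": "k", "type": "int"},
    {"name": "v", "type": "string"}
  ]
}');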