Hive-1.2.0学习笔记(二)Hive数据类型

鲁春利的工作笔记,谁说程序员不能有文艺范?

创新互联建站-专业网站定制、快速模板网站建设、高性价比本溪网站开发、企业建站全套包干低至880元,成熟完善的模板库,直接使用。一站式本溪网站制作公司更省心,省钱,快速模板网站建设找我们,业务覆盖本溪地区。费用合理售后完善,10余年实体公司更值得信赖。


在hive中创建mywork数据库,以后的测试在该数据库中进行,避免每次都使用default数据库。

hive> create database mywork;
OK
Time taken: 0.487 seconds
hive> show databases;
OK
default
mywork
Time taken: 0.614 seconds, Fetched: 2 row(s)
hive> 
hive> use mywork;
OK
Time taken: 0.064 seconds
hive> create table student(id int, name string);
OK
Time taken: 0.519 seconds
hive>

查看Hive在HDFS上的存储

[hadoop@dnode1 ~]$ hdfs dfs -ls -R /user/hive
drwxrw-rw-   - hadoop hadoop          0 2015-12-08 21:37 /user/hive/warehouse
drwxrw-rw-   - hadoop hadoop          0 2015-12-08 21:36 /user/hive/warehouse/mywork.db
drwxrw-rw-   - hadoop hadoop          0 2015-12-08 21:36 /user/hive/warehouse/mywork.db/student
[hadoop@dnode1 ~]$

Hive支持的数据类型如下:

    原生类型:

TINYINT                1字节
SMALLINT               2字节
INT                    4字节
BIGINT                 8字节
BOOLEAN                true/false
FLOAT                  4字节
DOUBLE                 8字节
STRING                 字符串
BINARY (Hive 0.8.0以上才可用)
TIMESTAMP (Hive 0.8.0以上才可用)

    复合类型:

arrays: ARRAY                有序字段,类型必须相同
maps: MAP    无序的键/值对
structs: STRUCT    一组命名的字段
union: UNIONTYPE

    说明:ARRAY数据类型通过下标来获取值,如arrays[0],MAP通过["指定域名称"]访问, STRUCT类型通过点方式访问(如structs.col_name)。  

建表示例:

hive> create table employee (
    > eno int comment 'the no of employee',
    > ename string comment 'name of employee',
    > salary float comment 'salary of employee',
    > subordinates array comment 'employees managed by current employee',
    > deductions map comment 'deductions',
    > address struct comment 'address'
    > ) comment 'This is table of employee info';
OK
Time taken: 0.33 seconds
hive>

在Hive中各列之间,以及复合类型内部使用了不同的分隔符来指定,每行数据对应一条记录。

Hive-1.2.0学习笔记(二)Hive数据类型

在${HIVE_HOME}/data目录下创建文件data_default.txt文件,采用默认分隔符,内容为:

Hive-1.2.0学习笔记(二)Hive数据类型

Hive默认的字段分隔符为ascii码的控制符\001,建表的时候用fields terminated by '\001'。造数据在vi 打开文件里面,用ctrl+v然后再ctrl+a可以输入这个控制符\001(即^A)。按顺序,\002的输入方式为ctrl+v,ctrl+b。以此类推。

说明:

1000                    员工编号
zhangsan                员工姓名
5000.0                  员工工资
lisi^Bwangwu            下属员工
ptax^C200^Bpension^C200                工资扣除金额(如税收等)
shandong^Bheze^Bdingtao^B274106        家庭住址(struct结构只需指定值即可)

加载数据

hive> load data local inpath 'data/data_default.txt' into table employee;
Loading data to table mywork.employee
Table mywork.employee stats: [numFiles=1, numRows=0, totalSize=83, rawDataSize=0]
OK
Time taken: 0.426 seconds
hive> select * from employee;
OK
1000    zhangsan        5000.0  ["lisi","wangwu"]       {"ptax":200.0,"pension":200.0}  {"province":"shandong","city":"heze","street":"dingtao","zip":274106}
Time taken: 0.114 seconds, Fetched: 1 row(s)
hive>
# 对于复合类型数据查询方式如下
hive> select eno, ename, salary, subordinates[0], deductions['ptax'], address.province from employee;
OK
1000    zhangsan        5000.0  lisi    200.0   shandong
Time taken: 0.129 seconds, Fetched: 1 row(s)
hive>

查看HDFS数据结构

[hadoop@dnode1 ~]$ hdfs dfs -ls -R /user/hive/warehouse/
drwxrw-rw-   - hadoop hadoop          0 2015-12-09 00:00 /user/hive/warehouse/mywork.db
drwxrw-rw-   - hadoop hadoop          0 2015-12-09 00:00 /user/hive/warehouse/mywork.db/employee
-rwxrw-rw-   2 hadoop hadoop         83 2015-12-09 00:00 /user/hive/warehouse/mywork.db/employee/data_default.txt
drwxrw-rw-   - hadoop hadoop          0 2015-12-08 23:03 /user/hive/warehouse/mywork.db/student
[hadoop@dnode1 ~]$ hdfs dfs -text /user/hive/warehouse/mywork.db/employee/data_default.txt
1000zhangsan5000.0lisiwangwuptax200pension200shandonghezedingtao274106
[hadoop@dnode1 ~]$

Hive-1.2.0学习笔记(二)Hive数据类型

Hive-1.2.0学习笔记(二)Hive数据类型

自定义分隔符:

hive> create table employee_02 (
    > eno int comment 'the no of employee',
    > ename string comment 'name of employee',
    > salary float comment 'salary of employee',
    > subordinates array comment 'employees managed by current employee',
    > deductions map comment 'deductions',
    > address struct comment 'address'
    > ) comment 'This is table of employee info'
    > row format delimited fields terminated by '\t'
    > collection items terminated by ','
    > map keys terminated by ':'
    > lines terminated by '\n';
OK
Time taken: 0.228 seconds
hive> load data local inpath 'data/data_employee02.txt' into table employee_02;
Loading data to table mywork.employee_02
Table mywork.employee_02 stats: [numFiles=1, totalSize=99]
OK
Time taken: 0.371 seconds
hive> select * from employee_02;
OK
1000    'zhangsan'      5000.0  ["'lisi'","'wangwu'"]   {"'ptax'":200.0,"'pension'":200.0}      {"province":"'shandong'","city":"'heze'","street":"'dingtao'","zip":274106}
Time taken: 0.101 seconds, Fetched: 1 row(s)
hive>

data/employee02.txt文件内容为

[hadoop@nnode data]$ pwd
/usr/local/hive1.2.0/data
[hadoop@nnode data]$ cat data_employee02.txt 
1000    'zhangsan'      5000.0  'lisi','wangwu' 'ptax':200,'pension':200        'shandong','heze','dingtao',274106
[hadoop@nnode data]$

说明:由于在文本文件中包含有单引号,在load到hive的表之后表示方式为属性加双引号,这里的单引号被认为了是属性或值的一部分了,需要注意。

查看详细表定义

# 建表时为默认设置
hive> describe formatted employee;
OK
# col_name              data_type               comment             
                 
eno                     int                     the no of employee  
ename                   string                  name of employee    
salary                  float                   salary of employee  
subordinates            array           employees managed by current employee
deductions              map       deductions          
address                 struct       address             
                 
# Detailed Table Information             
Database:               mywork                   
Owner:                  hadoop                   
CreateTime:             Tue Dec 08 23:10:07 CST 2015     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://cluster/user/hive/warehouse/mywork.db/employee    
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   true                
        comment                 This is table of employee info
        numFiles                1                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               83                  
        transient_lastDdlTime   1449590423          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1                   
Time taken: 0.098 seconds, Fetched: 37 row(s)

# 建表时自定义了分隔符
hive> describe formatted employee_02;
OK
# col_name              data_type               comment             
                 
eno                     int                     the no of employee  
ename                   string                  name of employee
salary                  float                   salary of employee  
subordinates            array           employees managed by current employee
deductions              map       deductions          
address                 struct address
                 
# Detailed Table Information             
Database:               mywork                   
Owner:                  hadoop                   
CreateTime:             Wed Dec 09 00:12:53 CST 2015     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://cluster/user/hive/warehouse/mywork.db/employee_02   
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   true                
        comment                 This is table of employee info
        numFiles                1                   
        totalSize               99                  
        transient_lastDdlTime   1449591260          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        colelction.delim        ,                   
        field.delim             \t                  
        line.delim              \n                  
        mapkey.delim            :                   
        serialization.format    \t                  
Time taken: 0.116 seconds, Fetched: 39 row(s)
hive>

遗留问题:

hive> delete from student;
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
hive>

注意事项:如果sql语句中含有tab格式的内容,则会出现如下问题

Hive-1.2.0学习笔记(二)Hive数据类型


本文名称:Hive-1.2.0学习笔记(二)Hive数据类型
分享路径:http://myzitong.com/article/gspiec.html