Hive Data Format Conversion

19-01-08 banq
    

Apache Hive supports many file formats for storing table data. For more information, see the following links. Let's test some of them and try a few conversions between them.

Again, let's assume our movies.txt file is in the /tests/dir HDFS directory.

[root@sandbox ~]# cat movies.txt
1,scifi,Matrix,USA
2,scifi,Matrix-Reloaded,USA
3,comedy,TheBigBangTheory,USA
4,comedy,MrBean,GBR
5,comedy,CrocodileDundee,AUS
6,comedy,Pelisky,CZE
7,crime,CSI-NewYork,USA
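Each line of movies.txt is a comma-delimited record with four fields. Independently of Hive, the parsing of such records can be sketched in plain Scala (the `Movie` case class here is hypothetical, introduced only to mirror the four-column schema):

```scala
// Hypothetical case class mirroring the four string columns used below.
case class Movie(id: String, genre: String, name: String, country: String)

val lines = Seq(
  "1,scifi,Matrix,USA",
  "4,comedy,MrBean,GBR"
)

// Split each line on the field terminator, as a delimited-text reader would.
val movies = lines.map { line =>
  val Array(id, genre, name, country) = line.split(",")
  Movie(id, genre, name, country)
}

movies.foreach(println)
```

This is the same splitting that Hive's delimited row format performs when reading a TEXTFILE table.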

1. TextFile CSV -> Hive ORC

// CSV Table
hiveContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS CSVMovies (id String, genre String, name String, country String) "+
                " row format delimited "+
                " fields terminated by ','"+
                " STORED AS TEXTFILE "+
                " location '/tests/dir'"
);

// ORC Table
hiveContext.sql("CREATE TABLE IF NOT EXISTS ORCExpMovies(id String, genre String, name String, country String) "+
                " ROW FORMAT DELIMITED "+
                " FIELDS TERMINATED BY ',' "+
                " STORED AS ORC"
);

// CSV -> ORC (most efficient way!)
hiveContext.sql("INSERT OVERWRITE TABLE ORCExpMovies SELECT * from CSVMovies"); 
hiveContext.sql("SELECT * FROM CSVMovies").show();

Zeppelin output:

import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@d8123e4
res3: org.apache.spark.sql.DataFrame = [result: string]
res4: org.apache.spark.sql.DataFrame = [result: string]
res5: org.apache.spark.sql.DataFrame = []
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  4|comedy|          MrBean|    GBR|
|  5|comedy| CrocodileDundee|    AUS|
|  6|comedy|         Pelisky|    CZE|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+

2. Transformation query -> new DataFrame -> import into a new ORC table...

Now let's add the following Scala test:

// Transformation query -> New DataFrame -> Import into new ORC table...
val moviesDF = hiveContext.sql("select * from ORCExpMovies where country ='USA'");
moviesDF.write.mode("overwrite").format("orc").saveAsTable("AuxTable");

// Target ORC table (must exist before the INSERT below)
hiveContext.sql("CREATE TABLE IF NOT EXISTS UsaMovies (id String, genre String, name String, country String) STORED AS ORC");

// ORC -> ORC
hiveContext.sql(" INSERT INTO TABLE UsaMovies SELECT * FROM AuxTable ");
val parquetDF = hiveContext.sql(" SELECT * from UsaMovies ");

parquetDF.show();

Zeppelin output:

import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4d83e5d4
res15: org.apache.spark.sql.DataFrame = [result: string]
res16: org.apache.spark.sql.DataFrame = [result: string]
res17: org.apache.spark.sql.DataFrame = []
res18: org.apache.spark.sql.DataFrame = []
res19: org.apache.spark.sql.DataFrame = [result: string]
moviesDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
res21: org.apache.spark.sql.DataFrame = []
parquetDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+
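The `where country ='USA'` query above is an ordinary relational filter. As a plain-Scala sketch over tuples (not the actual Hive execution plan), it corresponds to:

```scala
// (id, genre, name, country) rows mirroring the ORCExpMovies table above.
val rows = Seq(
  ("1", "scifi",  "Matrix",           "USA"),
  ("2", "scifi",  "Matrix-Reloaded",  "USA"),
  ("3", "comedy", "TheBigBangTheory", "USA"),
  ("4", "comedy", "MrBean",           "GBR"),
  ("5", "comedy", "CrocodileDundee",  "AUS"),
  ("6", "comedy", "Pelisky",          "CZE"),
  ("7", "crime",  "CSI-NewYork",      "USA")
)

// Equivalent of: SELECT * FROM ORCExpMovies WHERE country = 'USA'
val usaMovies = rows.filter { case (_, _, _, country) => country == "USA" }

usaMovies.foreach(println)
```

The four rows that survive the filter match the four USA rows in the Zeppelin output above.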

3. Dataframe with ORC data -> HDFS Parquet -> HDFS Parquet back to Dataframe

// ORC DF -> parquet
parquetDF.write.mode("overwrite").format("parquet").save("/tests/parquetTabule");

// parquet -> Dataframe
var parDF = hiveContext.read.format("parquet").load("/tests/parquetTabule");
parDF.show();

Zeppelin output:

import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@38e3304a
res23: org.apache.spark.sql.DataFrame = [result: string]
res24: org.apache.spark.sql.DataFrame = [result: string]
res25: org.apache.spark.sql.DataFrame = []
res26: org.apache.spark.sql.DataFrame = []
res27: org.apache.spark.sql.DataFrame = [result: string]
moviesDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
res29: org.apache.spark.sql.DataFrame = []
parquetDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
parDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+

It is interesting to look at the contents of the '/tests/parquetTabule' HDFS directory after saving the DataFrame in Parquet format:

[root@sandbox ~]# hdfs dfs -ls /tests/parquetTabule
Found 5 items
-rw-r--r--   1 zeppelin hdfs          0 2018-08-07 21:07 /tests/parquetTabule/_SUCCESS
-rw-r--r--   1 zeppelin hdfs        445 2018-08-07 21:07 /tests/parquetTabule/_common_metadata
-rw-r--r--   1 zeppelin hdfs       1356 2018-08-07 21:07 /tests/parquetTabule/_metadata
-rw-r--r--   1 zeppelin hdfs       1043 2018-08-07 21:07 /tests/parquetTabule/part-r-00000-aac00b4b-17f3-44b1-9581-dbded6a3ad5e.gz.parquet
-rw-r--r--   1 zeppelin hdfs        922 2018-08-07 21:07 /tests/parquetTabule/part-r-00001-aac00b4b-17f3-44b1-9581-dbded6a3ad5e.gz.parquet

Note the compressed data: the part files carry a .gz.parquet suffix.
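The .gz.parquet suffix indicates the part files are gzip-compressed internally. A quick plain-Scala illustration (using `java.util.zip` directly, not Parquet itself) of how well repetitive, column-like data compresses under gzip:

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

// Repetitive text, similar in spirit to a column of repeated country codes.
val raw = ("USA\n" * 1000).getBytes("UTF-8")

// Gzip-compress the bytes into an in-memory buffer.
val bos = new ByteArrayOutputStream()
val gz  = new GZIPOutputStream(bos)
gz.write(raw)
gz.close()

val compressed = bos.toByteArray
println(s"raw: ${raw.length} bytes, gzip: ${compressed.length} bytes")
```

Columnar formats like Parquet and ORC benefit strongly from this: values of the same column sit next to each other, so repetition is high and compression ratios are correspondingly good.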

The complete test code can be found here.