19-01-08
banq
Apache Hive supports many file formats for storing table data; see the links below for more information. Let's test a few of them and try some conversions between them.
Once again, let's assume our movies.txt file is in the /tests/dir HDFS directory.
[root@sandbox ~]# cat movies.txt
1,scifi,Matrix,USA
2,scifi,Matrix-Reloaded,USA
3,comedy,TheBigBangTheory,USA
4,comedy,MrBean,GBR
5,comedy,CrocodileDundee,AUS
6,comedy,Pelisky,CZE
7,crime,CSI-NewYork,USA
1. TextFile CSV -> Hive ORC
// CSV Table
hiveContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS CSVMovies (id String, genre String, name String, country String) " +
  " row format delimited " +
  " fields terminated by ',' " +
  " STORED AS TEXTFILE " +
  " location '/tests/dir'"
);

// ORC Table
hiveContext.sql("CREATE TABLE IF NOT EXISTS ORCExpMovies(id String, genre String, name String, country String) " +
  " ROW FORMAT DELIMITED " +
  " FIELDS TERMINATED BY ',' " +
  " STORED AS ORC"
);

// CSV -> ORC (most efficient way!)
hiveContext.sql("INSERT OVERWRITE TABLE ORCExpMovies SELECT * FROM CSVMovies");
hiveContext.sql("SELECT * FROM CSVMovies").show();
Zeppelin output:
import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@d8123e4
res3: org.apache.spark.sql.DataFrame = [result: string]
res4: org.apache.spark.sql.DataFrame = [result: string]
res5: org.apache.spark.sql.DataFrame = []
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  4|comedy|          MrBean|    GBR|
|  5|comedy| CrocodileDundee|    AUS|
|  6|comedy|         Pelisky|    CZE|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+
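Note that the SELECT above reads back from the CSV table; to confirm the rows actually landed in the ORC-backed table, it can be queried directly. A minimal sketch (the DESCRIBE call is an optional extra to inspect the storage format):

```scala
// Verify the INSERT OVERWRITE: this reads from the ORC table, so the
// output should match the CSVMovies listing above.
hiveContext.sql("SELECT * FROM ORCExpMovies").show()

// Optionally, let Hive report the table's storage/SerDe details:
hiveContext.sql("DESCRIBE FORMATTED ORCExpMovies").show(100, false)
```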
2. Transformation query -> New DataFrame -> Import into a new ORC table...
Now let's add the following Scala test:
// Transformation query -> New Dataframe -> Import to new orc table...
val moviesDF = hiveContext.sql("select * from ORCExpMovies where country = 'USA'");
moviesDF.write.mode("overwrite").format("orc").saveAsTable("AuxTable");

// ORC -> ORC
hiveContext.sql(" INSERT INTO TABLE UsaMovies SELECT * FROM AuxTable ");
val parquetDF = hiveContext.sql(" SELECT * from UsaMovies ");
parquetDF.show();
Zeppelin output:
import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4d83e5d4
res15: org.apache.spark.sql.DataFrame = [result: string]
res16: org.apache.spark.sql.DataFrame = [result: string]
res17: org.apache.spark.sql.DataFrame = []
res18: org.apache.spark.sql.DataFrame = []
res19: org.apache.spark.sql.DataFrame = [result: string]
moviesDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
res21: org.apache.spark.sql.DataFrame = []
parquetDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+
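The INSERT above targets a UsaMovies table that must already exist (the extra res lines in the Zeppelin output suggest earlier DDL statements that are not shown in the snippet). A minimal sketch of the assumed DDL, mirroring the ORCExpMovies definition:

```scala
// Assumed DDL for the UsaMovies target table (not shown in the original
// snippet): same columns as ORCExpMovies, stored as ORC.
hiveContext.sql("CREATE TABLE IF NOT EXISTS UsaMovies(id String, genre String, name String, country String) " +
  " ROW FORMAT DELIMITED " +
  " FIELDS TERMINATED BY ',' " +
  " STORED AS ORC"
)
```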
3. Dataframe with ORC data -> HDFS Parquet -> HDFS Parquet back to Dataframe
// ORC DF -> parquet
parquetDF.write.mode("overwrite").format("parquet").save("/tests/parquetTabule");

// parquet -> Dataframe
var parDF = hiveContext.read.format("parquet").load("/tests/parquetTabule");
parDF.show();
Zeppelin output:
import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@38e3304a
res23: org.apache.spark.sql.DataFrame = [result: string]
res24: org.apache.spark.sql.DataFrame = [result: string]
res25: org.apache.spark.sql.DataFrame = []
res26: org.apache.spark.sql.DataFrame = []
res27: org.apache.spark.sql.DataFrame = [result: string]
moviesDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
res29: org.apache.spark.sql.DataFrame = []
parquetDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
parDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+
After saving the DataFrame in Parquet format, the contents of the '/tests/parquetTabule' HDFS directory are interesting:
[root@sandbox ~]# hdfs dfs -ls /tests/parquetTabule
Found 5 items
-rw-r--r--   1 zeppelin hdfs          0 2018-08-07 21:07 /tests/parquetTabule/_SUCCESS
-rw-r--r--   1 zeppelin hdfs        445 2018-08-07 21:07 /tests/parquetTabule/_common_metadata
-rw-r--r--   1 zeppelin hdfs       1356 2018-08-07 21:07 /tests/parquetTabule/_metadata
-rw-r--r--   1 zeppelin hdfs       1043 2018-08-07 21:07 /tests/parquetTabule/part-r-00000-aac00b4b-17f3-44b1-9581-dbded6a3ad5e.gz.parquet
-rw-r--r--   1 zeppelin hdfs        922 2018-08-07 21:07 /tests/parquetTabule/part-r-00001-aac00b4b-17f3-44b1-9581-dbded6a3ad5e.gz.parquet
Note that the data is compressed: the part files carry the .gz.parquet suffix, i.e. gzip-compressed Parquet.
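The gzip compression seen above comes from Spark's Parquet defaults; the codec is configurable per session. A sketch, assuming the standard spark.sql.parquet.compression.codec property:

```scala
// Switch Parquet output compression to snappy (faster to write, usually
// slightly larger files); supported values include uncompressed, snappy,
// gzip and lzo.
hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy")
parquetDF.write.mode("overwrite").format("parquet").save("/tests/parquetTabule")
```

With snappy selected, the part files would end in .snappy.parquet instead of .gz.parquet.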
The complete test code can be found here.