Big Data Performance Benchmark

  The Big Data Benchmark from UC Berkeley measured the performance of several big data products:

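  • Redshift - Amazon's big data product, built on the ParAccel data warehouse
  • Hive - the well-known Hadoop-based data warehouse system
  • Shark - a Hive SQL-compatible engine built on the Spark computing framework (v0.8 preview, 5/2013)
  • Impala - a Hive-compatible SQL engine with its own MPP-like execution engine (v1.0, 4/2013)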

The experiments ran on EC2 and covered four types of queries. The results are as follows:

1. Scan query

SELECT pageURL, pageRank FROM rankings WHERE pageRank > X

The query was run in three variants that return different result-set sizes:

  • Query 1A - 32,888 results
  • Query 1B - 3,331,851 results
  • Query 1C - 89,974,976 results

The response times were as follows:

Median response time (s)
System          1A      1B      1C
Redshift        2.4     2.5     12.2
Impala - disk   9.9     12      104
Impala - mem    0.75    4.48    108
Shark - disk    11.8    11.9    24.9
Shark - mem     1.1     1.1     3.5
Hive            45      63      70

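With the data held in memory, Shark is the fastest overall (Impala - mem is quicker only on Query 1A). Among the disk-based runs, Redshift is the fastest.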
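The ranking can be checked directly against the table. The following is a minimal sketch, not part of the benchmark itself: the names `SCAN_TIMES`, `QUERIES` and `fastest` are illustrative, and treating every row that is not labelled "- mem" as disk-based is an assumption that follows the grouping used in the text.

```python
# Median response times (s) copied from the scan-query table (Queries 1A, 1B, 1C).
SCAN_TIMES = {
    "Redshift":      (2.4, 2.5, 12.2),
    "Impala - disk": (9.9, 12, 104),
    "Impala - mem":  (0.75, 4.48, 108),
    "Shark - disk":  (11.8, 11.9, 24.9),
    "Shark - mem":   (1.1, 1.1, 3.5),
    "Hive":          (45, 63, 70),
}
QUERIES = ("1A", "1B", "1C")


def fastest(times, include_mem=True):
    """Map each query to the system with the lowest median time.

    Assumption: rows whose name ends in "- mem" are the in-memory runs;
    everything else is treated as disk-based.
    """
    candidates = {
        name: row
        for name, row in times.items()
        if include_mem or not name.endswith("- mem")
    }
    return {
        query: min(candidates, key=lambda name: candidates[name][i])
        for i, query in enumerate(QUERIES)
    }


print(fastest(SCAN_TIMES))                      # all configurations
print(fastest(SCAN_TIMES, include_mem=False))   # disk-based runs only
```

Passing `include_mem=False` restricts the comparison to the disk-based configurations, which is the case where Redshift comes out ahead.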

2. Aggregation query

SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X)
The query was run in three variants that produce different numbers of groups:

  • Query 2A - 2,067,313 groups
  • Query 2B - 31,348,913 groups
  • Query 2C - 253,890,330 groups

The response times were as follows:

Median response time (s)
System          2A      2B      2C
Redshift        28      65      92
Impala - disk   130     216     565
Impala - mem    121     208     557
Shark - disk    210     238     279
Shark - mem     111     141     156
Hive - disk     466     490     552

Redshift and Shark - mem take the top two places.

3. Join query

SELECT sourceIP, totalRevenue, avgPageRank
FROM
(SELECT sourceIP,
AVG(pageRank) as avgPageRank,
SUM(adRevenue) as totalRevenue
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL
AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
GROUP BY UV.sourceIP)
ORDER BY totalRevenue DESC LIMIT 1
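
What the query computes can be sketched on in-memory rows. The Python below is only an illustration of the query's semantics under stated assumptions: the function name `top_revenue_ip`, the tuple layouts and the sample rows are all made up for this example, and `pageURL` is assumed to be unique in `rankings`.

```python
from collections import defaultdict
from datetime import date


def top_revenue_ip(rankings, uservisits, start, end):
    """Emulate the join query on in-memory rows.

    rankings:   iterable of (pageURL, pageRank)
    uservisits: iterable of (sourceIP, destURL, visitDate, adRevenue)
    Returns (sourceIP, totalRevenue, avgPageRank) for the top IP, or None.
    """
    rank_by_url = dict(rankings)  # assumes pageURL is unique in rankings
    revenue = defaultdict(float)
    ranks = defaultdict(list)
    for ip, dest, day, ad_revenue in uservisits:
        # WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN start AND end
        if dest in rank_by_url and start <= day <= end:
            revenue[ip] += ad_revenue              # SUM(adRevenue)
            ranks[ip].append(rank_by_url[dest])    # input to AVG(pageRank)
    if not revenue:
        return None
    # ORDER BY totalRevenue DESC LIMIT 1
    ip = max(revenue, key=revenue.get)
    return ip, revenue[ip], sum(ranks[ip]) / len(ranks[ip])


# Made-up sample rows, for illustration only.
rankings = [("a.com", 10), ("b.com", 30)]
uservisits = [
    ("1.1.1.1", "a.com", date(1980, 2, 1), 1.5),
    ("1.1.1.1", "b.com", date(1980, 3, 1), 2.5),
    ("2.2.2.2", "a.com", date(1980, 2, 1), 3.0),
    ("2.2.2.2", "b.com", date(1990, 1, 1), 9.0),  # outside the date range
]
print(top_revenue_ip(rankings, uservisits, date(1980, 1, 1), date(1980, 4, 1)))
```

Widening the date range (the `X` parameter in the SQL) lets more visits through the filter, which is how the three variants of the query grow in size.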

The query was run in three variants:

  • Query 3A - 485,312 rows
  • Query 3B - 53,332,015 rows
  • Query 3C - 533,287,121 rows

The response times were as follows:

Median response time (s)
System          3A      3B      3C
Redshift        42      47      200
Impala - disk   158     168     345
Impala - mem    74      90      337
Shark - disk    253     277     538
Shark - mem     131     172     447
Hive            423     638     1822

4. UDF query

CREATE TABLE url_counts_partial AS
SELECT TRANSFORM (line)
USING "python /root/url_count.py" as (sourcePage, destPage, cnt)
FROM documents;
CREATE TABLE url_counts_total AS
SELECT SUM(cnt) AS totalCount, destPage
FROM url_counts_partial
GROUP BY destPage;
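
The contents of `/root/url_count.py` are not shown here. Hive's TRANSFORM clause passes input rows to the script on standard input and reads tab-separated output columns from standard output, so a script of this kind might look roughly like the sketch below. Everything in it is an assumption made for illustration: the URL pattern, the helper names, and in particular the rule that the first URL on a line is the source page and every later URL is an outgoing link.

```python
import re
import sys
from collections import Counter

# Hypothetical pattern for absolute http(s) links; the real script may differ.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')


def count_links(line):
    """Return (sourcePage, destPage, cnt) tuples for one input line.

    Assumption: the first URL on the line identifies the source page and
    every later URL on the same line is an outgoing link.
    """
    urls = URL_RE.findall(line)
    if len(urls) < 2:
        return []
    source, dests = urls[0], urls[1:]
    return [(source, dest, n) for dest, n in sorted(Counter(dests).items())]


def main(stdin=sys.stdin, stdout=sys.stdout):
    # TRANSFORM streams rows in on stdin and expects tab-separated columns back.
    for line in stdin:
        for source, dest, n in count_links(line):
            stdout.write(f"{source}\t{dest}\t{n}\n")


# When deployed as a TRANSFORM script, end the file with a call to main().
```

The second statement in the query then sums the per-line `cnt` values for each `destPage`, so the script only has to produce partial counts.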



Median response time (s)
System          Phase 1         Phase 2   Total
Redshift        not supported   -         -
Impala - mem    not supported   -         -
Impala - disk   not supported   -         -
Shark - mem     156             34        189
Shark - disk    583             133       716
Hive            659             358       1017
 

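Overall, Shark - mem is the most well-rounded system and ranks near the top in every test, including the UDF query. Redshift delivers the best raw performance on the aggregation and join queries, but it does not support the UDF query.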

 
