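Big Data Performance Benchmark
The Big Data Benchmark from UC Berkeley measured the performance of several big data products:
- Redshift - Amazon's hosted big data product, based on the ParAccel data warehouse
- Hive - the well-known data warehouse system built on Hadoop
- Shark - a Hive-compatible SQL engine that runs on the Spark computing framework (v0.8 preview, 5/2013)
- Impala - a Hive-compatible SQL engine with its own MPP-like execution engine (v1.0, 4/2013)
The experiment ran on EC2 and tested four types of queries. The results are as follows: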
1. Scan query
    SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
Three variants with different result sizes were tested:
| Query 1A | Query 1B | Query 1C |
|---|---|---|
| 32,888 results | 3,331,851 results | 89,974,976 results |
Median response time (s):
| System | Query 1A | Query 1B | Query 1C |
|---|---|---|---|
| Redshift | 2.4 | 2.5 | 12.2 |
| Impala - disk | 9.9 | 12 | 104 |
| Impala - mem | 0.75 | 4.48 | 108 |
| Shark - disk | 11.8 | 11.9 | 24.9 |
| Shark - mem | 1.1 | 1.1 | 3.5 |
| Hive | 45 | 63 | 70 |
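With data held in memory, Shark is fastest on Queries 1B and 1C, while Impala (in-memory) is fastest on 1A. Among the configurations that read from disk, Redshift is fastest on all three queries.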
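The ranking for the scan query can be checked directly from the median times in the table. The following is a minimal sketch: the numbers are copied from the table, while the helper names `fastest` and `speedup` are illustrative and not part of the benchmark.

```python
# Median response times in seconds, copied from the scan-query table.
SCAN_TIMES = {
    "Redshift":      {"1A": 2.4,  "1B": 2.5,  "1C": 12.2},
    "Impala - disk": {"1A": 9.9,  "1B": 12.0, "1C": 104.0},
    "Impala - mem":  {"1A": 0.75, "1B": 4.48, "1C": 108.0},
    "Shark - disk":  {"1A": 11.8, "1B": 11.9, "1C": 24.9},
    "Shark - mem":   {"1A": 1.1,  "1B": 1.1,  "1C": 3.5},
    "Hive":          {"1A": 45.0, "1B": 63.0, "1C": 70.0},
}


def fastest(times, query, include_mem=True):
    """Return the system with the lowest median time for one query."""
    candidates = {
        name: per_query[query]
        for name, per_query in times.items()
        if include_mem or not name.endswith("- mem")
    }
    return min(candidates, key=candidates.get)


def speedup(times, slow, fast, query):
    """Return how many times faster `fast` is than `slow` on one query."""
    return times[slow][query] / times[fast][query]


for q in ("1A", "1B", "1C"):
    print(q, "overall:", fastest(SCAN_TIMES, q),
          "| without in-memory runs:", fastest(SCAN_TIMES, q, include_mem=False))
```

The same two helpers work for the other result tables if their numbers are entered in the same dictionary shape.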
2. Aggregation query
    SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X)
Three variants with different numbers of groups were tested:
| Query 2A | Query 2B | Query 2C |
|---|---|---|
| 2,067,313 groups | 31,348,913 groups | 253,890,330 groups |
Median response time (s):
| System | Query 2A | Query 2B | Query 2C |
|---|---|---|---|
| Redshift | 28 | 65 | 92 |
| Impala - disk | 130 | 216 | 565 |
| Impala - mem | 121 | 208 | 557 |
| Shark - disk | 210 | 238 | 279 |
| Shark - mem | 111 | 141 | 156 |
| Hive - disk | 466 | 490 | 552 |
Redshift and Shark (in-memory) take first and second place, respectively, on all three variants.
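To make clear what the aggregation query computes, here is a small Python sketch of the same logic: group rows by the first X characters of `sourceIP` and sum `adRevenue` per group. The sample rows are hypothetical and are not benchmark data, and the function name `revenue_by_ip_prefix` is made up for illustration.

```python
from collections import defaultdict


def revenue_by_ip_prefix(visits, x):
    """Mimic SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) ... GROUP BY SUBSTR(sourceIP, 1, X)."""
    totals = defaultdict(float)
    for source_ip, ad_revenue in visits:
        # SUBSTR(s, 1, X) keeps the first X characters, i.e. s[:X] in Python.
        totals[source_ip[:x]] += ad_revenue
    return dict(totals)


# Hypothetical sample rows of (sourceIP, adRevenue).
SAMPLE_VISITS = [
    ("10.0.1.17", 0.5),
    ("10.0.2.33", 1.25),
    ("10.1.9.80", 2.0),
    ("172.16.0.4", 0.25),
]

print(revenue_by_ip_prefix(SAMPLE_VISITS, 2))
print(revenue_by_ip_prefix(SAMPLE_VISITS, 4))
```

In this toy data a longer prefix splits the rows into more groups, which is consistent with the group counts growing from Query 2A to Query 2C.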
3. Join query
    SELECT sourceIP, totalRevenue, avgPageRank
    FROM
      (SELECT sourceIP,
              AVG(pageRank) as avgPageRank,
              SUM(adRevenue) as totalRevenue
       FROM Rankings AS R, UserVisits AS UV
       WHERE R.pageURL = UV.destURL
         AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
       GROUP BY UV.sourceIP)
    ORDER BY totalRevenue DESC LIMIT 1
Three variants with different numbers of rows were tested:
| Query 3A | Query 3B | Query 3C |
|---|---|---|
| 485,312 rows | 53,332,015 rows | 533,287,121 rows |
Median response time (s):
| System | Query 3A | Query 3B | Query 3C |
|---|---|---|---|
| Redshift | 42 | 47 | 200 |
| Impala - disk | 158 | 168 | 345 |
| Impala - mem | 74 | 90 | 337 |
| Shark - disk | 253 | 277 | 538 |
| Shark - mem | 131 | 172 | 447 |
| Hive | 423 | 638 | 1822 |
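The join query can be tried on a tiny dataset with Python's built-in `sqlite3` module. This is only a sketch of the query's semantics, not of how any benchmarked engine executes it: the table contents are hypothetical, the function name `top_revenue_ip` is made up, and the SQL is lightly adapted for SQLite (lower-case `date()`, a bound parameter for X, and an alias `t` on the subquery).

```python
import sqlite3


def top_revenue_ip(rankings, visits, end_date):
    """Return (sourceIP, totalRevenue, avgPageRank) for the top-earning IP."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER)")
    conn.execute(
        "CREATE TABLE uservisits "
        "(sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL)"
    )
    conn.executemany("INSERT INTO rankings VALUES (?, ?)", rankings)
    conn.executemany("INSERT INTO uservisits VALUES (?, ?, ?, ?)", visits)
    row = conn.execute(
        """
        SELECT sourceIP, totalRevenue, avgPageRank
        FROM (SELECT UV.sourceIP AS sourceIP,
                     AVG(R.pageRank) AS avgPageRank,
                     SUM(UV.adRevenue) AS totalRevenue
              FROM rankings AS R, uservisits AS UV
              WHERE R.pageURL = UV.destURL
                AND UV.visitDate BETWEEN date('1980-01-01') AND date(?)
              GROUP BY UV.sourceIP) AS t
        ORDER BY totalRevenue DESC LIMIT 1
        """,
        (end_date,),
    ).fetchone()
    conn.close()
    return row


# Hypothetical sample data, not taken from the benchmark.
SAMPLE_RANKINGS = [("a.com", 10), ("b.com", 30), ("c.com", 50)]
SAMPLE_VISITS = [
    ("1.1.1.1", "a.com", "1980-02-01", 1.0),
    ("1.1.1.1", "b.com", "1980-03-01", 2.0),
    ("2.2.2.2", "c.com", "1980-02-15", 2.5),
    ("2.2.2.2", "c.com", "1985-01-01", 9.0),
]

print(top_revenue_ip(SAMPLE_RANKINGS, SAMPLE_VISITS, "1980-04-01"))
print(top_revenue_ip(SAMPLE_RANKINGS, SAMPLE_VISITS, "1990-01-01"))
```

Moving the end date later lets more visits pass the date filter and enter the join, which is the role X plays in distinguishing Queries 3A, 3B, and 3C.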
4. UDF query
    CREATE TABLE url_counts_partial AS
      SELECT TRANSFORM (line)
      USING "python /root/url_count.py" as (sourcePage, destPage, cnt)
      FROM documents;
    CREATE TABLE url_counts_total AS
      SELECT SUM(cnt) AS totalCount, destPage
      FROM url_counts_partial
      GROUP BY destPage;
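Median response time (s), for the first statement, the second statement, and the two combined:
| System | Phase 1 | Phase 2 | Total |
|---|---|---|---|
| Redshift | not supported | - | - |
| Impala - mem | not supported | - | - |
| Impala - disk | not supported | - | - |
| Shark - mem | 156 | 34 | 189 |
| Shark - disk | 583 | 133 | 716 |
| Hive | 659 | 358 | 1017 |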
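The contents of `/root/url_count.py` are not shown here, so the following is only a hypothetical sketch of what a script of that kind could look like. Hive's `TRANSFORM` streams input rows to the script on standard input and reads tab-separated output rows from standard output; everything else is an assumption made for illustration, in particular the input layout (first token is the page's own URL, the rest is its content), the URL pattern, and the names `count_links` and `run`.

```python
import io
import re
import sys
from collections import Counter

# Assumed, deliberately simple pattern for absolute http(s) links.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def count_links(lines):
    """Yield (sourcePage, destPage, cnt) for each distinct link on each line.

    Assumption: the first whitespace-separated token of a line is the source
    page URL and the remainder is that page's content.
    """
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # skip blank lines and pages without content
        source_page, body = parts
        counts = Counter(URL_PATTERN.findall(body))
        for dest_page, cnt in sorted(counts.items()):
            yield source_page, dest_page, cnt


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write tab-separated rows to stdout."""
    for source_page, dest_page, cnt in count_links(stdin):
        stdout.write(f"{source_page}\t{dest_page}\t{cnt}\n")


# Demonstration on an in-memory stream instead of real standard input.
sample = io.StringIO(
    "http://a.com/ see http://b.com/x and http://b.com/x or http://c.com/\n"
)
run(sample, sys.stdout)
```

Used as a real `TRANSFORM` script, the demonstration lines would be replaced by a plain call to `run()`. The second statement in the query then sums `cnt` per `destPage` across all source pages.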
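Overall, Shark with in-memory data is the most well-rounded performer, ranking near the top on every query type. Redshift is the fastest system on the aggregation and join queries and the fastest disk-based system on the scan query, but it does not support the UDF query.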