SQL的group聚合与嵌套循环子查询的性能分析

SQL经常使用嵌套循环的子查询,这样性能是慢的。

嵌套循环查询SQL如下:


SELECT
first_name, last_name,
(SELECT count(*)
FROM film_actor fa
WHERE fa.actor_id = a.actor_id)
FROM actor a

这会强迫数据库引擎进入一个嵌套循环查询,下面是伪代码演示:


for (Actor a : actor) {
output(
a.first_name,
a.last_name,
film_actor.where(fa -> fa.actor_id = a.actor_id)
.size()
}

对于每条actor记录,对其film_actor字段进行计数,这会产生许多film遍历。

相反,使用bulk聚合反而更好些:


SELECT
first_name, last_name, count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
GROUP BY actor_id, first_name, last_name

这种bulk聚合是搜集所有actor集合和收集所有film_actor集合,然后在内存中通过group融合merge它们,Oracle的执行方式大概如下:


-------------------------------------------------------------------
| Id | Operation | Name | A-Rows |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 200 |
| 1 | HASH GROUP BY | | 200 |
|* 2 | HASH JOIN | | 5462 |
| 3 | TABLE ACCESS FULL | ACTOR | 200 |
| 4 | INDEX FAST FULL SCAN| IDX_FK_FILM_ACTOR_ACTOR | 5462 |
-------------------------------------------------------------------

注意,对于每个actor,有5462行filem_actor,我们join并group和汇聚它们得到最终结果,再看看之前第一个代码嵌套循环查询是如何执行的:

-----------------------------------------------------------------------
| Id | Operation | Name | Starts | A-Rows |
-----------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 200 |
| 1 | SORT AGGREGATE | | 200 | 200 |
|* 2 | INDEX RANGE SCAN| IDX_FK_FILM_ACTOR_ACTOR | 200 | 5462 |
| 3 | TABLE ACCESS FULL| ACTOR | 1 | 200 |
-----------------------------------------------------------------------

列Starts是说明搜集了所有200个actor之后的情况,在在行INDEX RANGE SCAN我们为每个actor开始子查询200次才能得到film_actor总数。

我们比较一下这两种方式哪个更快?使用group的聚合(Bulk aggregation)与使用嵌套循环的子查询哪个更快?

group聚合因为将所有数据加载到内存,所以耗费更多内存,但是算法复杂性低;而嵌套循环查询是直接从索引获取有信息,因此耗费内存少,但是有更高的算法复杂性(二次方程?)。

实际并不是这样(banq:数据库的复杂性和不确定性可见一斑了)

group聚合是线性的,但是按照O(M+N),其中M=actor数量而N=film_actor数量,嵌套循环不是这样二次方程,而是对数方程,是O(M log N),这里的log是对数的意思,我们其实不需要遍历整个索引来获得总数。

越高复杂性意味着不好,但是数据量比较小时并不是这样。

在对group聚合和嵌套循环对比性能如下,嵌套循环性能超过group聚合:


Nested select: +000000000 00:00:01.122000000
Group by : +000000000 00:00:03.191000000

Nested select: +000000000 00:00:01.116000000
Group by : +000000000 00:00:03.104000000

Nested select: +000000000 00:00:01.122000000
Group by : +000000000 00:00:03.228000000

如果我们使用一个衍生表来优化嵌套GROUP BY操作,代码如下:


SELECT
first_name, last_name, c
FROM actor a
JOIN (
SELECT actor_id, count(*) c
FROM film_actor
GROUP BY actor_id
) USING (actor_id)

经过测试,这个衍生表(derived table)性能最好:

Nested select : +000000000 00:00:01.371000000
Group by : +000000000 00:00:03.417000000
Group by in derived table: +000000000 00:00:01.592000000

Nested select : +000000000 00:00:01.236000000
Group by : +000000000 00:00:03.329000000
Group by in derived table: +000000000 00:00:01.595000000

Correlated Subqueries are Evil and Slow. Or are Th

上述测试基于Oracle数据库,如果你觉得每个数据库是个黑盒很难了解,同时因为引入join难以横向扩展,难以分区分表,那么可以考虑在应用层进行集合的联合比较。
见:Node.js CPU密集型的Join操作