SQL经常使用嵌套循环的子查询,这样性能是慢的。
嵌套循环查询SQL如下:
SELECT first_name, last_name, (SELECT count(*) FROM film_actor fa WHERE fa.actor_id = a.actor_id) FROM actor a
|
这会强迫数据库引擎进入一个嵌套循环查询,下面是伪代码演示:
for (Actor a : actor) { output( a.first_name, a.last_name, film_actor.where(fa -> fa.actor_id = a.actor_id) .size() }
|
对于每条actor记录,对其film_actor字段进行计数,这会产生许多film遍历。相反,使用bulk聚合反而更好些:
SELECT first_name, last_name, count(*) FROM actor a JOIN film_actor fa USING (actor_id) GROUP BY actor_id, first_name, last_name
|
这种bulk聚合是搜集所有actor集合和收集所有film_actor集合,然后在内存中通过group融合merge它们,Oracle的执行方式大概如下:
------------------------------------------------------------------- | Id | Operation | Name | A-Rows | ------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 200 | | 1 | HASH GROUP BY | | 200 | |* 2 | HASH JOIN | | 5462 | | 3 | TABLE ACCESS FULL | ACTOR | 200 | | 4 | INDEX FAST FULL SCAN| IDX_FK_FILM_ACTOR_ACTOR | 5462 | -------------------------------------------------------------------
|
注意,对于每个actor,有5462行filem_actor,我们join并group和汇聚它们得到最终结果,再看看之前第一个代码嵌套循环查询是如何执行的:
----------------------------------------------------------------------- | Id | Operation | Name | Starts | A-Rows | ----------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 200 | | 1 | SORT AGGREGATE | | 200 | 200 | |* 2 | INDEX RANGE SCAN| IDX_FK_FILM_ACTOR_ACTOR | 200 | 5462 | | 3 | TABLE ACCESS FULL| ACTOR | 1 | 200 | -----------------------------------------------------------------------
|
列Starts是说明搜集了所有200个actor之后的情况,在在行INDEX RANGE SCAN我们为每个actor开始子查询200次才能得到film_actor总数。我们比较一下这两种方式哪个更快?使用group的聚合(Bulk aggregation)与使用嵌套循环的子查询哪个更快?
group聚合因为将所有数据加载到内存,所以耗费更多内存,但是算法复杂性低;而嵌套循环查询是直接从索引获取有信息,因此耗费内存少,但是有更高的算法复杂性(二次方程?)。
实际并不是这样(banq:数据库的复杂性和不确定性可见一斑了)
group聚合是线性的,但是按照O(M+N),其中M=actor数量而N=film_actor数量,嵌套循环不是这样二次方程,而是对数方程,是O(M log N),这里的log是对数的意思,我们其实不需要遍历整个索引来获得总数。
越高复杂性意味着不好,但是数据量比较小时并不是这样。
在对group聚合和嵌套循环对比性能如下,嵌套循环性能超过group聚合:
Nested select: +000000000 00:00:01.122000000 Group by : +000000000 00:00:03.191000000
Nested select: +000000000 00:00:01.116000000 Group by : +000000000 00:00:03.104000000
Nested select: +000000000 00:00:01.122000000 Group by : +000000000 00:00:03.228000000
|
如果我们使用一个衍生表来优化嵌套GROUP BY操作,代码如下:
SELECT first_name, last_name, c FROM actor a JOIN ( SELECT actor_id, count(*) c FROM film_actor GROUP BY actor_id ) USING (actor_id)
|
经过测试,这个衍生表(derived table)性能最好:
Nested select : +000000000 00:00:01.371000000 Group by : +000000000 00:00:03.417000000 Group by in derived table: +000000000 00:00:01.592000000
Nested select : +000000000 00:00:01.236000000 Group by : +000000000 00:00:03.329000000 Group by in derived table: +000000000 00:00:01.595000000
|
Correlated Subqueries are Evil and Slow. Or are Th
上述测试基于Oracle数据库,如果你觉得每个数据库是个黑盒很难了解,同时因为引入join难以横向扩展,难以分区分表,那么可以考虑在应用层进行集合的联合比较。
见:Node.js CPU密集型的Join操作