選擇率
在Oracle數據庫中,選擇率(Selectivity) 是優化器(CBO,基于成本的優化器)用來評估SQL語句中某個條件(如WHERE子句)過濾數據的比例的關鍵指標。它直接影響優化器選擇執行計劃的策略,例如決定是否使用索引或全表掃描。
選擇率表示滿足某個條件的行數占總行數的比例,
對于等值查詢,參考Oracle的數據字典dba_tab_columns的DENSITY和NUM_DISTINCT字段。
col table_name for a20
col column_name for a20
col LOW_VALUE for a20
col HIGH_VALUE for a20
set line 400
select table_name,column_name,num_distinct,density,NUM_NULLS,LOW_VALUE,HIGH_VALUE
from dba_tab_columns where table_name='EMP';TABLE_NAME COLUMN_NAME NUM_DISTINCT DENSITY NUM_NULLS LOW_VALUE HIGH_VALUE
-------------------- -------------------- ------------ ---------- ---------- -------------------- --------------------
EMP EMPNO 14 .071428571 0 C24A46 C25023
EMP ENAME 14 .071428571 0 4144414D53 57415244
EMP JOB 5 .2 0 414E414C595354 53414C45534D414E
EMP MGR 6 .038461538 1 C24C43 C25003
EMP HIREDATE 13 .076923077 0 77B40C11010101 77BB0517010101
EMP SAL 12 .083333333 0 C209 C233
EMP COMM 4 .25 10 80 C20F
EMP DEPTNO 3 .333333333 0 C10B C11F
對于等值查詢,如果該列沒有空值和直方圖統計信息,選擇率就是DENSITY的值或(1/NUM_DISTINCT);如果有空值,則可選擇率為:(1/NUM_DISTINCT)*(NUM_ROWS-NUM_NULLS)/NUM_ROWS
對于范圍查詢,選擇率的計算方法就在上述基礎上加入最大值和最小值的統計信息,這里就不多做贅述。
選擇率和索引
選擇率影響著一個SQL的執行計劃,準確的來說,選擇率影響表的訪問方式(即全表掃描還是索引掃描)。Oracle的SQL優化器是基于成本的,我們稱為CBO,CBO會依據選擇率來確定對某一數據集的訪問的成本(COST),從而選擇成本最低的訪問方式。
例如,表A有8行數據,在表A上對列col1有索引,列col1上有8個不同值,如果SQLA的謂詞條件為col1的等值查詢,對與SQLA的最優執行計劃,CBO會選擇索引掃描;如果表A對列col2有索引,列col2上只有2個不同值,如果SQLB的謂詞條件為col2的等值查詢,對與SQLB的最優執行計劃,CBO可能會選擇全表掃描,因為索引掃描的尋找葉子塊+回表的成本可能會大于全表掃描的成本。
下面我們做一下選擇率的測試
--創建表和索引
create table tab1(id int,name varchar2(10),gender varchar2(5));
create index idx_id on tab1(id);
create index idx_gender on tab1(gender);
--插入2000條數據,id列從1遞增,name列為隨機的5個字符串,gender列為隨機的‘f’或‘m’。
DECLARE-- 定義記錄類型和集合類型TYPE t_employee IS RECORD (id NUMBER,name VARCHAR2(5),gender CHAR(1));TYPE t_employee_tbl IS TABLE OF t_employee;v_data t_employee_tbl := t_employee_tbl(); -- 初始化集合
BEGIN-- 批量生成測試數據(200行)SELECT LEVEL AS id,-- 生成5位隨機大寫字母和數字組合(若只要字母可改用'X'參數)DBMS_RANDOM.STRING('X', 5) AS name,CASE WHEN DBMS_RANDOM.VALUE < 0.5 THEN 'm' ELSE 'f' END AS genderBULK COLLECT INTO v_dataFROM DUALCONNECT BY LEVEL <= 2000;-- 批量插入數據(使用FORALL提升性能)FORALL i IN 1 .. v_data.COUNTINSERT INTO scott.tab1 (id, name, gender)VALUES (v_data(i).id, v_data(i).name, v_data(i).gender);COMMIT; -- 提交事務
EXCEPTIONWHEN OTHERS THENROLLBACK; -- 異常回滾RAISE;
END;
/
--查看統計信息
ANALYZE TABLE scott.tab1 COMPUTE STATISTICS;
col table_name for a10
col column_name for a10
col LOW_VALUE for a20
col HIGH_VALUE for a20
set line 400
select table_name,column_name,num_distinct,density,NUM_NULLS,LOW_VALUE,HIGH_VALUE
from dba_tab_columns where table_name='TAB1';TABLE_NAME COLUMN_NAM NUM_DISTINCT DENSITY NUM_NULLS LOW_VALUE HIGH_VALUE
---------- ---------- ------------ ---------- ---------- -------------------- --------------------
TAB1 ID 2000 .0005 0 C102 C215
TAB1 NAME 2000 .0005 0 3030463839 5A5A555138
TAB1 GENDER 2 .5 0 66 6D
--分別以id和gener列為謂詞條件查看
SQL> set autotrace traceonly;
SQL> select * from scott.tab1 where id=6;Execution Plan
----------------------------------------------------------
Plan hash value: 4102116554----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 9 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| TAB1 | 1 | 9 | 2 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | IDX_ID | 1 | | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------SQL> select * from scott.tab1 where gender='f';1019 rows selected.Execution Plan
----------------------------------------------------------
Plan hash value: 2211052296--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1000 | 9000 | 3 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TAB1 | 1000 | 9000 | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------
直方圖
上面都是該列上數據分布均勻的情況,如果數據分布不均勻,及時統計信息是最新的,但其執行計劃可能不是最優的。下面我們測試,
假設一個中學6年級有2000名學生,期中考試分為ABCDE五個等級,其中大部分同學的分數都集中在B,那么查詢分數為B的同學信息可能存在執行計劃不優的情況。
--創建表和索引
SQL> create table tab2(id int,name varchar2(10),grade char(1));
SQL> create index grade_idx on tab2(grade);
--插入數據
DECLARE-- 定義記錄類型和集合類型TYPE t_student IS RECORD (id NUMBER,name VARCHAR2(5),grade CHAR(1));TYPE t_student_tbl IS TABLE OF t_student;v_data t_student_tbl := t_student_tbl(); -- 初始化集合
BEGIN-- 批量生成測試數據(300行)SELECT LEVEL AS id,-- 生成5位隨機大寫字母和數字組合(若只要字母可改用'X'參數)DBMS_RANDOM.STRING('X', 5) AS name,CHR(65 + FLOOR(DBMS_RANDOM.VALUE(0,5))) AS grade -- 生成A-EBULK COLLECT INTO v_dataFROM DUALCONNECT BY LEVEL <= 2000;-- 批量插入數據(使用FORALL提升性能)FORALL i IN 1 .. v_data.COUNTINSERT INTO scott.tab2 (id, name, grade)VALUES (v_data(i).id, v_data(i).name, v_data(i).grade);COMMIT; -- 提交事務
EXCEPTIONWHEN OTHERS THENROLLBACK; -- 異常回滾RAISE;
END;
/把id為50-250的學生分數改為B
update tab2 set grade='B' where id>=50 and id <=2500;
commit;#查看統計信息
SQL> ANALYZE TABLE scott.tab2 COMPUTE STATISTICS;Table analyzed.select table_name,column_name,num_distinct,density,NUM_NULLS,LOW_VALUE,HIGH_VALUE2 from dba_tab_columns where table_name='TAB2';TABLE_NAME COLUMN_NAM NUM_DISTINCT DENSITY NUM_NULLS LOW_VALUE HIGH_VALUE
---------- ---------- ------------ ---------- ---------- -------------------- --------------------
TAB2 ID 2000 .0005 0 C102 C215
TAB2 NAME 2000 .0005 0 3031324C58 5A5A543245
TAB2 GRADE 5 .2 0 41 45#查詢分數為B的學生
SQL> set autotrace traceonly statistic;
SQL> select * from scott.tab2 where grade='B';1963 rows selected.Execution Plan
----------------------------------------------------------
Plan hash value: 1237454846-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 400 | 3600 | 3 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| TAB2 | 400 | 3600 | 3 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | GRADE_IDX | 400 | | 1 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------Predicate Information (identified by operation id):
---------------------------------------------------2 - access("GRADE"='B')Statistics
----------------------------------------------------------1 recursive calls0 db block gets272 consistent gets0 physical reads0 redo size59224 bytes sent via SQL*Net to client2037 bytes received via SQL*Net from client132 SQL*Net roundtrips to/from client0 sorts (memory)0 sorts (disk)1963 rows processed
--上面查詢采用的索引掃描,邏輯讀為272,如果強制讓SQL使用全表掃描SQL> select /*+FULL(tab2) */ * from scott.tab2 where grade='B';1963 rows selected.Execution Plan
----------------------------------------------------------
Plan hash value: 2156729920--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 400 | 3600 | 3 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TAB2 | 400 | 3600 | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------Predicate Information (identified by operation id):
---------------------------------------------------1 - filter("GRADE"='B')Statistics
----------------------------------------------------------1 recursive calls0 db block gets138 consistent gets0 physical reads0 redo size54904 bytes sent via SQL*Net to client2037 bytes received via SQL*Net from client132 SQL*Net roundtrips to/from client0 sorts (memory)0 sorts (disk)1963 rows processed--邏輯讀為138,
明明全表掃描的邏輯讀更低,為什么CBO還是使用索引掃描的執行計劃呢?因為統計信息不知道列grade分布不均勻,安裝1/5的選擇率生成的執行計劃,這種情況可以通過收集列的直方圖來解決。
收集之前確定grade沒有直方圖統計信息
select table_name,column_name,num_distinct,density,HISTOGRAMfrom dba_tab_columns where table_name='TAB2';
TABLE_NAME COLUMN_NAM NUM_DISTINCT DENSITY HISTOGRAM
---------- ---------- ------------ ---------- ---------------
TAB2 ID 2000 .0005 NONE
TAB2 NAME 2000 .0005 NONE
TAB2 GRADE 5 .2 NONE
收集grade列直方圖
EXEC DBMS_STATS.GATHER_TABLE_STATS('SCOTT', 'TAB2', METHOD_OPT => 'FOR COLUMNS GRADE SIZE AUTO');
再次查看grade列統計信息
SQL> select table_name,column_name,num_distinct,density,HISTOGRAM2 from dba_tab_columns where table_name='TAB2';TABLE_NAME COLUMN_NAM NUM_DISTINCT DENSITY HISTOGRAM
---------- ---------- ------------ ---------- ---------------
TAB2 ID 2000 .0005 NONE
TAB2 NAME 2000 .0005 NONE
TAB2 GRADE 5 .00025 FREQUENCY
這時我們再次查看分數為B的學生信息
SQL> set autotrace traceonly;
SQL> select * from scott.tab2 where grade='B';1963 rows selected.Execution Plan
----------------------------------------------------------
Plan hash value: 2156729920--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1963 | 19630 | 3 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TAB2 | 1963 | 19630 | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------Predicate Information (identified by operation id):
---------------------------------------------------1 - filter("GRADE"='B')Statistics
----------------------------------------------------------1 recursive calls0 db block gets138 consistent gets0 physical reads0 redo size54904 bytes sent via SQL*Net to client2037 bytes received via SQL*Net from client132 SQL*Net roundtrips to/from client0 sorts (memory)0 sorts (disk)1963 rows processed
收集了直方圖后,選擇了成本更低的執行計劃。