Extended statistic, also known as column group extension, is one of the important statistic improvements introduced with Oracle 11g. While Oracle Cost Based Optimizer is able to get a correct single column selectivity estimation, it is, however, unable to figure out the cardinality of a conjunction of two or more correlated columns present in a query predicate. A column group extension calculated for this conjunction of columns aims to help the CBO figuring out this columns correlation in order to get an accurate estimation. But there are cases where the CBO refuses to use a column group extension. This article aims to show one of those cases via a concrete example.
The scene
Below is the table and its unique index on which I am going to show you when the CBO will not use the column group extension:
create table t_ext_stat ( dvpk_id number(10) not null , vpk_id number(10) not null , layer_code varchar2(1 char) not null , dvpk_day date not null , cre_date date not null , cre_usr varchar2(40 char) not null , mod_date date not null , mod_usr varchar2(40 char) not null ); create unique index t_ext_uk_i on t_ext_stat(vpk_id, layer_code, dvpk_day);
And this is the query I will be using all over the article
select count(1) from t_ext_stat where vpk_id = 63148 and layer_code = 'R'; COUNT(1) ---------- 338
The two columns in the predicate part, layer_code and vpk_id are compared against an equality which makes them candidate for a column group extension; but let’s see first how skew are these two columns starting by the layer_code
The layer_code column has 4 distinct values with two popular ones: R (400,087) and S (380,069) which can be captured via a frequency histogram.
The vpk_id does not present such a noticeable skewness in its data scattering as shown by its representative chart:
It has 4947 distinct values ranging from a vpk_id value of 62866 with 1456 occurrences to a vpk_id value 62972 with a single occurrence
SQL> select vpk_id ,count(1) from t_ext_stat group by vpk_id order by 2 desc; VPK_ID COUNT(1) ---------- ---------- 62866 1456 62953 1456 63528 1456 63526 1456 63518 1456 62947 1456 62850 1456 62849 1456 62851 1456 62954 1456 64362 1452 64538 1424 64483 1358 …. 63207 1 63021 1 62972 1 4947 rows selected.
Extended Statistics and histogram
In order to create a column group extension we need to make a call to the following piece of code:
SQL> SELECT dbms_stats.create_extended_stats (ownname => user ,tabname => 't_ext_stat' ,extension =>'(vpk_id,layer_code)' ) FROM dual;
which will create a virtual column (SYS_STUMVIRBZA6_$QWEX6DE2NGQA1) supporting the two predicate column correlation.
Next, I will gather statistics with histogram for all t_ext_stat table columns including the above newly created virtual one:
BEGIN dbms_stats.gather_table_stats (user ,'t_ext_stat' ,method_opt => 'for all columns size auto' ,cascade => true ,no_invalidate => false ); END; /
And let’s check the collected columns statistics
SQL> SELECT column_name ,num_distinct ,density ,histogram FROM user_tab_col_statistics WHERE table_name = 'T_EXT_STAT' AND column_name in ('VPK_ID','LAYER_CODE','SYS_STUMVIRBZA6_$QWEX6DE2NGQA1'); COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM ------------------------------ ------------ ---------- --------------- SYS_STUMVIRBZA6_$QWEX6DE2NGQA1 4967 .000201329 NONE LAYER_CODE 4 6.2471E-07 FREQUENCY VPK_ID 2862 .000349406 NONE
As expected a skew has been identified on the layer_code column and therefore a frequency histogram has been gathered on it to indicate this skewness. There is nevertheless two remarks which seems to be worth to mention:
- Since one of the column group extension has a histogram why the extension itself has not been identified as a skewed column as well?
- What happens in this particular case where there is no histogram on the extension and a histogram on one of the column forming the extension?
It is easy to answer the first question by looking directly at the column group scattering chart presented below:
Where we can notice that the extension does not present a skewness in its data scattering. In fact the extension has 10,078 distinct values where the most popular value appears 728 times while the less popular appears only once:
SQL> select to_char(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1) extension ,count(1) from t_ext_stat group by SYS_STUMVIRBZA6_$QWEX6DE2NGQA1 order by 2 desc; EXTENSION COUNT(1) --------------------- ---------- 10113707817839868275 728 6437420856234749785 728 6264201076174478674 728 7804673458963442057 728 2433504440213765306 728 6976215179539283979 728 493591537539092624 728 6710977030485345437 1 18158393637293365880 1 5275318825200713603 1 13895660777899711317 1
This is a clear demonstration that it is not because there is a massive skew in one column forming the extension that the resulting column group combination will necessary present a skew. This is particularly true when the other column has a large number of dictint values (> 254)
But you might wonder why one has to care about this absence of histogram in the extension? Christian Antognini has already answered this question in this article where he wrote “be careful of extensions without histograms. They might be bypassed by the query optimizer”. In fact if one of the columns forming the extension has a histogram while the extension itself has no histogram then the Optimizer will not use the extension.
Here below is a demonstration of this claim taken from this current model:
select count(1) from t_ext_stat where vpk_id = 63148 and layer_code = 'R'; COUNT(1) ---------- 338 SQL_ID d26ra17afbfyh, child number 0 ------------------------------------- ------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | ------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 | | 1 | SORT AGGREGATE | | 1 | 1 | 1 | |* 2 | INDEX RANGE SCAN| T_EXT_UK_I | 1 | 142 | 338 | ------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("VPK_ID"=63148 AND "LAYER_CODE"='R')
How can we prove that Oracle didn’t used the extension to compute the 142 estimated rows when accessing the underlying index table? Well by looking at the corresponding 10053 trace file
Access path analysis for T_EXT_STAT *************************************** SINGLE TABLE ACCESS PATH Single Table Cardinality Estimation for T_EXT_STAT[T_EXT_STAT] SPD: Directive valid: dirid = 17542990197075222359, state = 5, flags = 1, loc = 1 {EC(98564)[2, 3]} SPD: Return code in qosdDSDirSetup: EXISTS, estType = TABLE Column (#2): VPK_ID(NUMBER) AvgLen: 5 NDV: 2862 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 62849.000000 Column (#3): NewDensity:0.002043, OldDensity:0.000001 BktCnt:5873.000000, PopBktCnt:5873.000000, PopValCnt:4, NDV:4 Column (#3): LAYER_CODE(VARCHAR2) AvgLen: 2 NDV: 4 Nulls: 0 Density: 0.000000 Histogram: Freq #Bkts: 4 UncompBkts: 5873 EndPtVals: 4 ActualVal: no Column (#9): SYS_STUMVIRBZA6_$QWEX6DE2NGQA1(NUMBER) AvgLen: 12 NDV: 4967 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 1980066.000000 ColGroup (#2, Index) T_EXT_UK_I Col#: 2 3 4 CorStregth: -1.00 ColGroup (#1, VC) SYS_STUMVIRBZA6_$QWEX6DE2NGQA1 Col#: 2 3 CorStregth: 2.30 ColGroup Usage:: PredCnt: 2 Matches Full: Partial: Table: T_EXT_STAT Alias: T_EXT_STAT Card: Original: 803809.000000 Rounded: 142 Computed: 141.74 Non Adjusted: 141.74
If Oracle has used the extension to compute the 142 estimated rows it will have then used the following formula:
E-rows = num_rows(t_ext_stat) * 1/(NDV(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1)) E-rows = 803809 * 1/(4967) = 161.829877
Another clue showing that the Optimizer didn’t used the extension is visible in the above 10053 trace file as well via the following lines:
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
where the Matches Full and Partial information are null.
As mentioned by Christian’s article there is a fix to what seems to be identified as a bug we can set to make Oracle using the extension:
SQL> alter session set "_fix_control"="6972291:ON"; SQL> alter session set events '10053 trace name context forever, level 1'; SQL> select count(1) from t_ext_stat where vpk_id = 63148 and layer_code = 'R'; COUNT(1) ---------- 338 SQL> alter session set events '10053 trace name context off';
Below is the corresponding execution plan (with a new estimation 162) and the part of the 10053 trace file related to the extension
============ Plan Table ============ ---------------------------------------+-----------------------------------+ | Id | Operation | Name | Rows | Bytes | Cost | Time | ---------------------------------------+-----------------------------------+ | 0 | SELECT STATEMENT | | | | 3 | | | 1 | SORT AGGREGATE | | 1 | 7 | | | | 2 | INDEX RANGE SCAN | T_EXT_UK_I| 162 | 1134 | 3 | 00:00:01 | ---------------------------------------+-----------------------------------+ Predicate Information: ---------------------- 2 - access("VPK_ID"=63148 AND "LAYER_CODE"='R') ===================================== Access path analysis for T_EXT_STAT *************************************** SINGLE TABLE ACCESS PATH Single Table Cardinality Estimation for T_EXT_STAT[T_EXT_STAT] SPD: Directive valid: dirid = 17542990197075222359, state = 5, flags = 1, loc = 1 {EC(98564)[2, 3]} SPD: Return code in qosdDSDirSetup: EXISTS, estType = TABLE Column (#2): VPK_ID(NUMBER) AvgLen: 5 NDV: 2899 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 62849.000000 Column (#3): NewDensity:0.001753, OldDensity:0.000001 BktCnt:6275.000000, PopBktCnt:6275.000000, PopValCnt:4, NDV:4 Column (#3): LAYER_CODE(VARCHAR2) AvgLen: 2 NDV: 4 Nulls: 0 Density: 0.000000 Histogram: Freq #Bkts: 4 UncompBkts: 6275 EndPtVals: 4 ActualVal: no Column (#9): SYS_STUMVIRBZA6_$QWEX6DE2NGQA1(NUMBER) AvgLen: 12 NDV: 4985 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 1980066.000000 ColGroup (#2, Index) T_EXT_UK_I Col#: 2 3 4 CorStregth: -1.00 ColGroup (#1, VC) SYS_STUMVIRBZA6_$QWEX6DE2NGQA1 Col#: 2 3 CorStregth: 2.33 ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0002 Table: T_EXT_STAT Alias: T_EXT_STAT Card: Original: 806857.000000 Rounded: 162 Computed: 161.86 Non Adjusted: 161.86
Where we can notice that, this time, the CBO has used the extension to compute its rows estimation as far as 162 comes from the following formula:
E-rows = num_rows(t_ext_stat) * 1/(NDV(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1)) E-rows = 806857* 1/(4985) = 161.856971 --> rounded to 162
But instead of setting the fix I would have preferred to delete histogram from the layer_code column so that both the extension and its column combination will not have histogram:
SQL> exec dbms_stats.gather_table_stats(user ,'t_ext_stat', method_opt => 'for all columns size 1'); SQL> SELECT column_name ,num_distinct ,density ,histogram FROM user_tab_col_statistics WHERE table_name = 'T_EXT_STAT' AND column_name in ('VPK_ID','LAYER_CODE','SYS_STUMVIRBZA6_$QWEX6DE2NGQA1'); COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM ------------------------------ ------------ ---------- ---------- SYS_STUMVIRBZA6_$QWEX6DE2NGQA1 5238 .000190913 NONE LAYER_CODE 4 .25 NONE VPK_ID 2982 .000335345 NONE
In which case the extension would be used as shown below:
------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | ------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 | | 1 | SORT AGGREGATE | | 1 | 1 | 1 | |* 2 | INDEX RANGE SCAN| T_EXT_UK_I | 1 | 154 | 338 | ------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("VPK_ID"=63148 AND "LAYER_CODE"='R') Column (#2): VPK_ID(NUMBER) AvgLen: 5 NDV: 2982 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 62849.000000 Column (#3): LAYER_CODE(VARCHAR2) AvgLen: 2 NDV: 4 Nulls: 0 Density: 0.000000 Column (#9): SYS_STUMVIRBZA6_$QWEX6DE2NGQA1(NUMBER) AvgLen: 12 NDV: 5238 Nulls: 0 Density: 0.000000 ColGroup (#2, Index) T_EXT_UK_I Col#: 2 3 4 CorStregth: -1.00 ColGroup (#1, VC) SYS_STUMVIRBZA6_$QWEX6DE2NGQA1 Col#: 2 3 CorStregth: 2.28 ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0002 Table: T_EXT_STAT Alias: T_EXT_STAT Card: Original: 807515.000000 Rounded: 154 Computed: 154.16 Non Adjusted: 154.16
Where it is clearly shown that the extension has been used
E-rows = num_rows(t_ext_stat) * 1/(NDV(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1)) E-rows = 807515 * 1/(5238) = 154.164758 --> rounded to 154
Notice by the way, that despite the extension has been used, the estimation is not as good as expected (154 instead of 338). An explanation of this discrepancy might come from the very weak correlation strenght that exist between the layer_code and the vpk_id (CorStregth: 2.30) which will be considered in a separate article
The bottom line of this article is : be careful about collecting histogram when you have the intention to use extended statistics. It is not necessary for the extension to present a skew if one of the columns from the combination has a histogram. And in such case Oracle will bypass the extension.