May | 2015 | Mohamed Houri’s Oracle Notes

May 25, 2015

Extended Statistics Part I : histogram effect

Filed under: Statistics — hourim @ 3:37 pm

Extended statistic, also known as column group extension, is one of the important statistic improvements introduced with Oracle 11g. While Oracle Cost Based Optimizer is able to get a correct single column selectivity estimation, it is, however, unable to figure out the cardinality of a conjunction of two or more correlated columns present in a query predicate. A column group extension calculated for this conjunction of columns aims to help the CBO figuring out this columns correlation in order to get an accurate estimation. But there are cases where the CBO refuses to use a column group extension. This article aims to show one of those cases via a concrete example.

The scene

Below is the table and its unique index on which I am going to show you when the CBO will not use the column group extension:

create table t_ext_stat
  ( dvpk_id    number(10) not null
  , vpk_id     number(10) not null
  , layer_code varchar2(1 char) not null
  , dvpk_day   date not null
  , cre_date   date not null
  , cre_usr    varchar2(40 char) not null
  , mod_date   date not null
  , mod_usr    varchar2(40 char) not null
 );

create unique index t_ext_uk_i on t_ext_stat(vpk_id, layer_code, dvpk_day);

And this is the query I will be using all over the article

select
  count(1)
from
  t_ext_stat
where
  vpk_id = 63148
and
  layer_code = 'R';

 COUNT(1)
----------
 338

The two columns in the predicate part, layer_code and vpk_id are compared against an equality which makes them candidate for a column group extension; but let’s see first how skew are these two columns starting by the layer_code

The layer_code column has 4 distinct values with two popular ones: R (400,087) and S (380,069) which can be captured via a frequency histogram.

The vpk_id does not present such a noticeable skewness in its data scattering as shown by its representative chart:
It has 4947 distinct values ranging from a vpk_id value of 62866 with 1456 occurrences to a vpk_id value 62972 with a single occurrence

SQL> select
       vpk_id
      ,count(1)
    from
      t_ext_stat
    group by
      vpk_id
    order by 2 desc;

    VPK_ID   COUNT(1)
---------- ----------
     62866       1456
     62953       1456
     63528       1456
     63526       1456
     63518       1456
     62947       1456
     62850       1456
     62849       1456
     62851       1456
     62954       1456
     64362       1452
     64538       1424
     64483       1358
….
     63207          1
     63021          1
     62972          1

4947 rows selected.

Extended Statistics and histogram

In order to create a column group extension we need to make a call to the following piece of code:

SQL> SELECT
         dbms_stats.create_extended_stats
         (ownname   => user
         ,tabname   => 't_ext_stat'
         ,extension =>'(vpk_id,layer_code)'
         )
    FROM dual;

which will create a virtual column (SYS_STUMVIRBZA6_$QWEX6DE2NGQA1) supporting the two predicate column correlation.

Next, I will gather statistics with histogram for all t_ext_stat table columns including the above newly created virtual one:

BEGIN

dbms_stats.gather_table_stats
 (user
 ,'t_ext_stat'
 ,method_opt => 'for all columns size auto'
 ,cascade => true
 ,no_invalidate => false
 );
END;
/

And let’s check the collected columns statistics

SQL> SELECT
       column_name
      ,num_distinct
      ,density
      ,histogram
    FROM
       user_tab_col_statistics
    WHERE
       table_name = 'T_EXT_STAT'
    AND
      column_name in ('VPK_ID','LAYER_CODE','SYS_STUMVIRBZA6_$QWEX6DE2NGQA1');

COLUMN_NAME                    NUM_DISTINCT    DENSITY HISTOGRAM
------------------------------ ------------ ---------- ---------------
SYS_STUMVIRBZA6_$QWEX6DE2NGQA1         4967 .000201329  NONE
LAYER_CODE                                4  6.2471E-07 FREQUENCY
VPK_ID                                 2862 .000349406  NONE

As expected a skew has been identified on the layer_code column and therefore a frequency histogram has been gathered on it to indicate this skewness. There is nevertheless two remarks which seems to be worth to mention:

Since one of the column group extension has a histogram why the extension itself has not been identified as a skewed column as well?
What happens in this particular case where there is no histogram on the extension and a histogram on one of the column forming the extension?

It is easy to answer the first question by looking directly at the column group scattering chart presented below:

Where we can notice that the extension does not present a skewness in its data scattering. In fact the extension has 10,078 distinct values where the most popular value appears 728 times while the less popular appears only once:

SQL> select
        to_char(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1) extension
       ,count(1)
     from
       t_ext_stat
     group by
       SYS_STUMVIRBZA6_$QWEX6DE2NGQA1
      order by 2 desc;

EXTENSION               COUNT(1)
--------------------- ----------
10113707817839868275         728
6437420856234749785          728
6264201076174478674          728
7804673458963442057          728
2433504440213765306          728
6976215179539283979          728
493591537539092624           728

6710977030485345437            1
18158393637293365880           1
5275318825200713603            1
13895660777899711317           1

This is a clear demonstration that it is not because there is a massive skew in one column forming the extension that the resulting column group combination will necessary present a skew. This is particularly true when the other column has a large number of dictint values (> 254)

But you might wonder why one has to care about this absence of histogram in the extension? Christian Antognini has already answered this question in this article where he wrote “be careful of extensions without histograms. They might be bypassed by the query optimizer”. In fact if one of the columns forming the extension has a histogram while the extension itself has no histogram then the Optimizer will not use the extension.

Here below is a demonstration of this claim taken from this current model:

select
   count(1)
from
   t_ext_stat
where vpk_id = 63148
and layer_code = 'R';

COUNT(1)
----------
338

SQL_ID  d26ra17afbfyh, child number 0
-------------------------------------
-------------------------------------------------------------------
| Id  | Operation         | Name       | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |            |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| T_EXT_UK_I |      1 |    142 |    338 |
-------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("VPK_ID"=63148 AND "LAYER_CODE"='R')

How can we prove that Oracle didn’t used the extension to compute the 142 estimated rows when accessing the underlying index table? Well by looking at the corresponding 10053 trace file

Access path analysis for T_EXT_STAT
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T_EXT_STAT[T_EXT_STAT]
SPD: Directive valid: dirid = 17542990197075222359, state = 5, flags = 1, loc = 1 {EC(98564)[2, 3]}
SPD: Return code in qosdDSDirSetup: EXISTS, estType = TABLE
Column (#2): VPK_ID(NUMBER)
  AvgLen: 5 NDV: 2862 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 62849.000000
Column (#3):
   NewDensity:0.002043, OldDensity:0.000001 BktCnt:5873.000000, PopBktCnt:5873.000000, PopValCnt:4, NDV:4

Column (#3): LAYER_CODE(VARCHAR2)
    AvgLen: 2 NDV: 4 Nulls: 0 Density: 0.000000
    Histogram: Freq  #Bkts: 4  UncompBkts: 5873  EndPtVals: 4  ActualVal: no 

Column (#9): SYS_STUMVIRBZA6_$QWEX6DE2NGQA1(NUMBER)
    AvgLen: 12 NDV: 4967 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 1980066.000000
ColGroup (#2, Index) T_EXT_UK_I
    Col#: 2 3 4    CorStregth: -1.00
ColGroup (#1, VC) SYS_STUMVIRBZA6_$QWEX6DE2NGQA1
    Col#: 2 3    CorStregth: 2.30

ColGroup Usage:: PredCnt: 2  Matches Full:  Partial:
Table: T_EXT_STAT  Alias: T_EXT_STAT
Card: Original: 803809.000000  Rounded: 142  Computed: 141.74  Non Adjusted: 141.74

If Oracle has used the extension to compute the 142 estimated rows it will have then used the following formula:

E-rows = num_rows(t_ext_stat) * 1/(NDV(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1))
E-rows = 803809 * 1/(4967) = 161.829877

Another clue showing that the Optimizer didn’t used the extension is visible in the above 10053 trace file as well via the following lines:

ColGroup Usage:: PredCnt: 2  Matches Full:  Partial:

where the Matches Full and Partial information are null.
As mentioned by Christian’s article there is a fix to what seems to be identified as a bug we can set to make Oracle using the extension:

SQL> alter session set "_fix_control"="6972291:ON";

SQL> alter session set events '10053 trace name context forever, level 1';

SQL> select
      count(1)
    from
     t_ext_stat
    where
      vpk_id = 63148
    and
      layer_code = 'R';

  COUNT(1)
----------
       338

SQL> alter session set events '10053 trace name context off';

Below is the corresponding execution plan (with a new estimation 162) and the part of the 10053 trace file related to the extension

============
Plan Table
============
---------------------------------------+-----------------------------------+
| Id  | Operation          | Name      | Rows  | Bytes | Cost  | Time      |
---------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT   |           |       |       |     3 |           |
| 1   |  SORT AGGREGATE    |           |     1 |     7 |       |           |
| 2   |   INDEX RANGE SCAN | T_EXT_UK_I|   162 |  1134 |     3 |  00:00:01 |
---------------------------------------+-----------------------------------+
Predicate Information:
----------------------
2 - access("VPK_ID"=63148 AND "LAYER_CODE"='R')

=====================================
Access path analysis for T_EXT_STAT
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T_EXT_STAT[T_EXT_STAT]
SPD: Directive valid: dirid = 17542990197075222359, state = 5, flags = 1, loc = 1 {EC(98564)[2, 3]}
SPD: Return code in qosdDSDirSetup: EXISTS, estType = TABLE
  Column (#2): VPK_ID(NUMBER)
    AvgLen: 5 NDV: 2899 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 62849.000000
  Column (#3):
    NewDensity:0.001753, OldDensity:0.000001 BktCnt:6275.000000, PopBktCnt:6275.000000, PopValCnt:4, NDV:4
  Column (#3): LAYER_CODE(VARCHAR2)
    AvgLen: 2 NDV: 4 Nulls: 0 Density: 0.000000
    Histogram: Freq  #Bkts: 4  UncompBkts: 6275  EndPtVals: 4  ActualVal: no
  Column (#9): SYS_STUMVIRBZA6_$QWEX6DE2NGQA1(NUMBER)
    AvgLen: 12 NDV: 4985 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 1980066.000000
  ColGroup (#2, Index) T_EXT_UK_I
    Col#: 2 3 4    CorStregth: -1.00
  ColGroup (#1, VC) SYS_STUMVIRBZA6_$QWEX6DE2NGQA1
    Col#: 2 3    CorStregth: 2.33
  ColGroup Usage:: PredCnt: 2  Matches Full: #1  Partial:  Sel: 0.0002
  Table: T_EXT_STAT  Alias: T_EXT_STAT
    Card: Original: 806857.000000  Rounded: 162  Computed: 161.86  Non Adjusted: 161.86

Where we can notice that, this time, the CBO has used the extension to compute its rows estimation as far as 162 comes from the following formula:

E-rows = num_rows(t_ext_stat) * 1/(NDV(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1))
E-rows = 806857* 1/(4985) = 161.856971 --> rounded to 162

But instead of setting the fix I would have preferred to delete histogram from the layer_code column so that both the extension and its column combination will not have histogram:

SQL> exec dbms_stats.gather_table_stats(user ,'t_ext_stat', method_opt => 'for all columns size 1');

SQL> SELECT
       column_name
      ,num_distinct
      ,density
      ,histogram
    FROM
       user_tab_col_statistics
    WHERE
       table_name = 'T_EXT_STAT'
    AND
      column_name in ('VPK_ID','LAYER_CODE','SYS_STUMVIRBZA6_$QWEX6DE2NGQA1');

COLUMN_NAME                    NUM_DISTINCT    DENSITY HISTOGRAM
------------------------------ ------------ ---------- ----------
SYS_STUMVIRBZA6_$QWEX6DE2NGQA1         5238 .000190913 NONE
LAYER_CODE                                4        .25 NONE
VPK_ID                                 2982 .000335345 NONE

In which case the extension would be used as shown below:

-------------------------------------------------------------------
| Id  | Operation         | Name       | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |            |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| T_EXT_UK_I |      1 |    154 |    338 |
-------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("VPK_ID"=63148 AND "LAYER_CODE"='R')

 Column (#2): VPK_ID(NUMBER)
    AvgLen: 5 NDV: 2982 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 62849.000000
  Column (#3): LAYER_CODE(VARCHAR2)
    AvgLen: 2 NDV: 4 Nulls: 0 Density: 0.000000
  Column (#9): SYS_STUMVIRBZA6_$QWEX6DE2NGQA1(NUMBER)
    AvgLen: 12 NDV: 5238 Nulls: 0 Density: 0.000000
  ColGroup (#2, Index) T_EXT_UK_I
    Col#: 2 3 4    CorStregth: -1.00
  ColGroup (#1, VC) SYS_STUMVIRBZA6_$QWEX6DE2NGQA1
    Col#: 2 3    CorStregth: 2.28
  ColGroup Usage:: PredCnt: 2  Matches Full: #1  Partial:  Sel: 0.0002
  Table: T_EXT_STAT  Alias: T_EXT_STAT
    Card: Original: 807515.000000  Rounded: 154  Computed: 154.16  Non Adjusted: 154.16

Where it is clearly shown that the extension has been used

E-rows = num_rows(t_ext_stat) * 1/(NDV(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1))
E-rows = 807515 * 1/(5238) = 154.164758 --> rounded to 154

Notice by the way, that despite the extension has been used, the estimation is not as good as expected (154 instead of 338). An explanation of this discrepancy might come from the very weak correlation strenght that exist between the layer_code and the vpk_id (CorStregth: 2.30) which will be considered in a separate article

The bottom line of this article is : be careful about collecting histogram when you have the intention to use extended statistics. It is not necessary for the extension to present a skew if one of the columns from the combination has a histogram. And in such case Oracle will bypass the extension.

May 12, 2015

Index Efficiency

Filed under: Index — hourim @ 7:03 am

I used Jonathan Lewis script to locate degenerated indexes –-or indexes that are occupying more space than they should–. Among those indexes I have isolated this one:

16:20:33:TABLE1 - PK_TAB1
Current Leaf blocks: 2,846,555 Target size:1,585,492

According to this SQL script the above index possesses 2.8 million worth of leaf blocks while it should normally occupy half this number of blocks.

The sys_op_lb_id function when applied on this index gives the following average leaf block per index key picture:

ROWS_PER_BLOCK BLOCKS
-------------- ----------
 2              1
 7              1
 27             1
 32             1
 92             1
 94             1
 103            1
 107            1
 108            1
 111            1
 112            800
 113            1,627,529
……
 422            980,894
 423            40
 432            1
 434            1
 448            5496
 449            32803
 450            7
 456            3
 458            1
 466            1
 478            54
 479            200
 487            1
 ----------    -----------
sum             2,979,747

Spot that odd value of 1.6 million leaf blocks (out of a total of 2,9 million) we have to visit to get only 113 index keys. Add to this the other 980,984 leaf blocks we need to visit to get an extra 422 index keys and you might end up by approximatively having to visit the entire index leaf blocks to get only a couple of hundred of index keys. That is a completely degenerated index.

Let’s then rebuild it and check if we will get back the reclaimed space:

SQL> alter index PK_TAB1 rebuild parallel 8;

SQL> alter index PK_TAB1 noparallel;

SQL> break on report skip 1

SQL> compute sum of blocks on report

SQL> select
        rows_per_block,
        count(*) blocks
     from
       (
        select
            /*+
              cursor_sharing_exact
              dynamic_sampling(0)
              no_monitoring
              no_expand
              index_ffs(t1,t1_i1)
              noparallel_index(t,t1_i1)
             */
        sys_op_lbid( &m_ind_id ,'L',t1.rowid) as block_id,
        count(*) as rows_per_block
      from
        TABLE1 t1
      where
        tab_id is not null
      group by
       sys_op_lbid( &m_ind_id ,'L',t1.rowid)
     )
   group by rows_per_block
   order by rows_per_block
   ;
Enter value for m_ind_id: 53213
Enter value for m_ind_id: 53213

ROWS_PER_BLOCK BLOCKS
-------------- ----------
 26            1
 206           1
 208           1
 243           1
 249           1
 272           1
 316           1
 339           1
 422           1,558,800
 423           53
 432           1
 448           5496
 449           32803
 458           1
 478           54
 479           200
 487           1
 ----------------------
sum          1,597,417

Notice the new number of index leaf block we’ve got after rebuilding the index (1,597,417) and compare it with the number predicted by Jonathan Lewis script (1,585,492). That’s really very accurate. The initial estimation is almost 100% accurate. In passing the new index size has been reduced by at factor of 46%.

While rebuilding the index has reduced drastically the number of leaf blocks and the disk space they occupy, that odd value of 1,558,800 leaf blocks we have to visit to get only 422 index keys is still present. This has prompted me to try coalescing the index even though I was not very confident that such a high number of leaf blocks could be merged with adjacent leaf blocks making the index less smashed.

SQL> alter index PK_TAB1 coalesce;

ROWS_PER_BLOCK    BLOCKS
-------------- ----------
           26          1
          206          1
          208          1
          243          1
          249          1
          272          1
          316          1
          339          1
          422          1,558,800
          423          53
          432          1
          448          5496
          449          32803
          458          1
          478          54
          479          200
          487          1
              -------------
sum              1,597,417

Definitely this primary key index has a strange way of being filled up which I have to figure out with the Java developers.

The bottom line of this article is that Jonathan Lewis script locating degenerated index is amazingly precise.

Comments (11)

Mohamed Houri’s Oracle Notes

May 25, 2015

Extended Statistics Part I : histogram effect

May 12, 2015

Index Efficiency

Follow Blog via Email

Categories

Blog Stats

Blogs I Follow

Archives