Mohamed Houri’s Oracle Notes

January 20, 2016

Natural and Adjusted Hybrid Histogram

Filed under: Oracle,Statistics — hourim @ 7:16 pm

I was going to write a comment on this Jonathan Lewis article and finally decided to turn it into a blog post because it was becoming a long comment. In the above-mentioned article a reader was wondering why the bucket size of his modified data set does not obey the minimum bucket size explained by Jonathan. Coincidentally, I am writing a second article on Hybrid histograms for allthingsOracle in which I use my own terminology to define two types of Hybrid histogram: the first type, which I have named "Natural Hybrid", is close to Jonathan's original data set; the second type, which I have named "Adjusted Hybrid", is in the same vein as the reader's modified data set. By "Natural Hybrid" I refer to a data set that doesn't qualify for a TOP-Frequency histogram because the threshold is greater than the naturally non-adjusted TopNRows. By "Adjusted Hybrid" I refer to a data set that initially satisfies the TOP-Frequency threshold but fails to qualify for a TOP-Frequency histogram because Oracle finds, in the middle of the process, that the adjusted TopNRows is no longer greater than the threshold.

Let's explain with an example. First, here's a model that qualifies for a "Natural Hybrid" histogram:

SQL> create table t1 (id number, n1 number);
SQL> start InsT1.sql (see downloadable script at the end)

If we gather a histogram for this data set of 37 distinct values using 20 buckets, we will obtain a HYBRID histogram because the TOP-Frequency threshold amounts to 95 rows while the TOP-20 rows account for only 80 rows, as shown below:

SQL> select round ((20-1)/20 * 100) threshold from dual;

 THRESHOLD
----------
        95
SQL> select
         sum (cnt) TopNRows
    from (select
            n1
           ,count(*) cnt
         from t1
         group by n1
          order by count(*) desc
          )
   where rownum <= 20;

  TOPNROWS
----------
        80

In order to compute the Hybrid histogram information, Oracle will, in this case, use the bucket-frequency method explained by Jonathan Lewis. This method honours the minimum bucket size of 5 (not valid at the very end of the data set though), and the initial number of buckets, 20, is left unchanged. This is what I prefer to label a Natural Hybrid histogram.
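
As a side note, the minimum bucket size Jonathan refers to can be derived directly from the figures at hand (a sketch, assuming the InsT1.sql data set holds 100 rows, consistent with the threshold query shown above):

SQL> select round(100/20) min_bucket_size from dual;

MIN_BUCKET_SIZE
---------------
              5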

Let's now use the reader's model:

create table t_jar (id number, n1 number);
insert into t_jar values (1,5 );
insert into t_jar values (1,5 );
insert into t_jar values (1,7 );
insert into t_jar values (1,7 );
insert into t_jar values (1,7 );
insert into t_jar values (1,7 );
insert into t_jar values (1,10 );
insert into t_jar values (1,12 );
insert into t_jar values (1,15 );
insert into t_jar values (1,15 );
insert into t_jar values (1,15 );
insert into t_jar values (1,20 );
commit;

This new data set of 6 distinct values over 12 rows would normally qualify for a TOP-3 Frequency histogram, as it satisfies the threshold formula:

SQL> select round ((3-1)/3 * 12) threshold from dual;

 THRESHOLD
----------
         8
SQL> select
          sum (cnt) TopNRows
     from (select
             n1
            ,count(*) cnt
          from t_jar
          group by n1
          order by count(*) desc
          )
     where rownum <= 3;

  TOPNROWS
----------
         9

However, Oracle will not accept this at face value. It has to check whether the low and high values are among the TOP-3 distinct values. If one of these values (or both) is not in the TOP-3, Oracle will force it into the histogram, exclude the least repetitive value from the TOP-3, adjust the TopNRows, and check again whether these modifications have altered the data set so that it no longer qualifies for a TOP-Frequency histogram.

Below is a snippet of the corresponding dbms_stats trace file:

SQL> exec dbms_stats.set_global_prefs ('TRACE', to_char (1+16));

SQL> BEGIN
        dbms_stats.gather_table_stats
          (user, 't_jar', method_opt => 'for columns n1 size 3');
     END;
    /
SQL> exec dbms_stats.set_global_prefs('TRACE', null);
DBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  COLNAME
DBMS_STATS:   Y    Y    Y    Y    Y    Y    Y                        Y    N1
DBMS_STATS: Approximate NDV Options
DBMS_STATS: ACL,TOPN,NIL,NIL,RWID,U,U3U
DBMS_STATS: start processing top n values for column N1
DBMS_STATS: topn sql (len: 415):
DBMS_STATS: +++ select /*+
                  no_parallel(t) no_parallel_index(t)
                  dbms_stats cursor_sharing_exact use_weak_name_resl
                  dynamic_sampling(0) no_monitoring
                  xmlindex_sel_idx_tbl no_substrb_pad
                */
                substrb(dump(N1,16,0,64),1,240) val,
                rowidtochar(rowid) rwid
                from "XXXX".T_JAR t
                where rowid in (chartorowid('AAAJ4MAAEAACN4EAAA')
                               ,chartorowid('AAAJ4MAAEAACN4EAAC')
                               ,chartorowid('AAAJ4MAAEAACN4EAAI'))
                order by N1;
DBMS_STATS: remove last bucket: Typ=2 Len=2: c1,10 add: Typ=2 Len=2: c1,15
DBMS_STATS: removal_count: 1 total_nonnull_rows: 12 mnb:  3
DBMS_STATS: adjusted coverage: .667

Here's the data set:

SQL> select *
        from
       (select n1, count(1) cnt
         from t_jar
        group by n1
         order by n1);

        N1        CNT
---------- ----------
         5          2 -> low value
         7          4
        10          1
        12          1
        15          3
        20          1 -> high value
6 rows selected.

And here are the TOP-3 rows of the same data set:

SQL> select
          *
     from (select
             n1
            ,count(*) cnt
          from t_jar
          group by n1
          order by count(*) desc)
     where rownum <= 3;

        N1        CNT
---------- ----------
         7          4
        15          3
         5          2
3 rows selected.

Since the high value is not in the TOP-3, it will be forced into the histogram at the cost of excluding the least repetitive TOP-3 value, which is 5 in this case (frequency = 2). But before doing this, Oracle has to check whether, after this high-value forcing, the data set still qualifies for a TOP-Frequency histogram, using the AdjustedTopNRows:

 AdjustedTopNRows = TopNRows - 2 + 1 = 9 - 2 + 1 = 8

The AdjustedTopNRows is no longer greater than the threshold of 8, which means that Oracle will finally stop collecting the TOP-Frequency histogram and transform what it has already gathered into a HYBRID histogram, as shown in the same trace file (Trying to convert frequency histogram to hybrid):

DBMS_STATS: adjusted coverage: .667
DBMS_STATS: hist_type in exec_get_topn: 2048 ndv:6 mnb:3
DBMS_STATS: Evaluating frequency histogram for col: N1
DBMS_STATS:  number of values = 4, max # of buckects = 3, pct = 100, ssize = 12
DBMS_STATS:  Trying to convert frequency histogram to hybrid
DBMS_STATS:  > cdn 10, popFreq 4, popCnt 1, bktSize 4, bktSzFrc 0
DBMS_STATS:  Evaluating hybrid histogram:  cht.count 3, mnb 3, ssize 10, min_ssize 12, appr_ndv  TRUE, ndv 4, selNdv 1,
selFreq 4,
pct 100, avg_bktsize 3, csr.hreq TRUE, normalize TRUE
DBMS_STATS:   Histogram gathering flags: 7
DBMS_STATS:  Accepting histogram
DBMS_STATS: done_hist in process_topn: TRUE  csr.ccnt: 1
DBMS_STATS: Mark column N1 as top N computed
DBMS_STATS: Skip topn computed column N1 numhist: 0

This is what I have labelled an "Adjusted Hybrid" histogram: it is collected using a method that does not seem to involve the bucket size and which amounts to a top-frequency to hybrid conversion:

SELECT
        (row_number() over(order by ept_nbr)-1) NumBucket
        ,ept_nbr
        ,ept_act_val
        ,rpt_cnt
        ,ept_nbr - (lag(ept_nbr,1,0) over(order by ept_nbr)) "new bucket size"
        ,bucket_size "original bucket_size"
    FROM
        (SELECT
             ah.endpoint_number            ept_nbr
            ,ah.endpoint_actual_value      ept_act_val
            ,lag(ah.endpoint_number,1,0) over(order by ah.endpoint_number) ept_lag
            ,ah.endpoint_repeat_count rpt_cnt
            ,at.sample_size/at.num_buckets bucket_size
         FROM
            user_tab_histograms      ah
           ,user_tab_col_statistics  at
         WHERE ah.table_name  = at.table_name
         AND ah.column_name = at.column_name
         AND ah.table_name  = 'T_JAR'
         AND ah.column_name = 'N1'
       ) ORDER BY ept_nbr;

NUMBUCKET    EPT_NBR EPT_ACT_VA    RPT_CNT new bucket size original bucket_size
---------- ---------- ---------- ---------- --------------- --------------------
         0          2 5                   2               2                    4
         1          6 7                   4               4                    4
         2         10 20                  1               4                    4

In passing, the adjusted coverage mentioned in the above trace file is nothing other than:

AdjustedTopNRows/num_rows = 8/12 = 0.667
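
For completeness, this adjustment arithmetic can be reproduced directly from the data. Here is a sketch (it assumes, as is the case here, that only the high value needs to be forced into the TOP-3):

SQL> with freq as
       (select n1, count(*) cnt from t_jar group by n1)
     ,topn as
       (select * from (select n1, cnt from freq order by cnt desc)
         where rownum <= 3)
     select (select sum(cnt) from topn)
          - (select min(cnt) from topn)
          + (select cnt from freq
              where n1 = (select max(n1) from t_jar)) AdjustedTopNRows
       from dual;

ADJUSTEDTOPNROWS
----------------
               8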

May 25, 2015

Extended Statistics Part I : histogram effect

Filed under: Statistics — hourim @ 3:37 pm

Extended statistics, also known as column group extensions, are one of the important statistics improvements introduced with Oracle 11g. While the Oracle Cost Based Optimizer is able to get a correct single-column selectivity estimation, it is unable to figure out the cardinality of a conjunction of two or more correlated columns present in a query predicate. A column group extension created for this conjunction of columns aims to help the CBO capture the column correlation and thereby get an accurate estimation. But there are cases where the CBO refuses to use a column group extension. This article aims to show one of those cases via a concrete example.

The scene

Below are the table and its unique index with which I am going to show you when the CBO will not use the column group extension:

create table t_ext_stat
  ( dvpk_id    number(10) not null
  , vpk_id     number(10) not null
  , layer_code varchar2(1 char) not null
  , dvpk_day   date not null
  , cre_date   date not null
  , cre_usr    varchar2(40 char) not null
  , mod_date   date not null
  , mod_usr    varchar2(40 char) not null
 );

create unique index t_ext_uk_i on t_ext_stat(vpk_id, layer_code, dvpk_day);

And this is the query I will be using throughout the article:

select
  count(1)
from
  t_ext_stat
where
  vpk_id = 63148
and
  layer_code = 'R';

 COUNT(1)
----------
 338

The two columns in the predicate part, layer_code and vpk_id, are compared with an equality predicate, which makes them candidates for a column group extension; but let's first see how skewed these two columns are, starting with layer_code.

The layer_code column has 4 distinct values with two popular ones, R (400,087) and S (380,069), whose skew can be captured via a frequency histogram.

The vpk_id column does not present such a noticeable skewness in its data scattering, as shown by its representative chart: it has 4947 distinct values, ranging from the value 62866 with 1456 occurrences down to the value 62972 with a single occurrence:

SQL> select
       vpk_id
      ,count(1)
    from
      t_ext_stat
    group by
      vpk_id
    order by 2 desc;

    VPK_ID   COUNT(1)
---------- ----------
     62866       1456
     62953       1456
     63528       1456
     63526       1456
     63518       1456
     62947       1456
     62850       1456
     62849       1456
     62851       1456
     62954       1456
     64362       1452
     64538       1424
     64483       1358
….
     63207          1
     63021          1
     62972          1

4947 rows selected.

Extended Statistics and histogram

In order to create a column group extension we need to make the following call:

SQL> SELECT
         dbms_stats.create_extended_stats
         (ownname   => user
         ,tabname   => 't_ext_stat'
         ,extension =>'(vpk_id,layer_code)'
         )
    FROM dual;

which will create a virtual column (SYS_STUMVIRBZA6_$QWEX6DE2NGQA1) supporting the correlation between the two predicate columns.
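
The newly generated extension can be double-checked in the dictionary via the user_stat_extensions view:

SQL> select extension_name, extension
       from user_stat_extensions
      where table_name = 'T_EXT_STAT';

EXTENSION_NAME                 EXTENSION
------------------------------ -----------------------
SYS_STUMVIRBZA6_$QWEX6DE2NGQA1 ("VPK_ID","LAYER_CODE")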

Next, I will gather statistics, with histograms, for all t_ext_stat table columns, including the newly created virtual one:

BEGIN

dbms_stats.gather_table_stats
 (user
 ,'t_ext_stat'
 ,method_opt => 'for all columns size auto'
 ,cascade => true
 ,no_invalidate => false
 );
END;
/

And let's check the collected column statistics:

SQL> SELECT
       column_name
      ,num_distinct
      ,density
      ,histogram
    FROM
       user_tab_col_statistics
    WHERE
       table_name = 'T_EXT_STAT'
    AND
      column_name in ('VPK_ID','LAYER_CODE','SYS_STUMVIRBZA6_$QWEX6DE2NGQA1');

COLUMN_NAME                    NUM_DISTINCT    DENSITY HISTOGRAM
------------------------------ ------------ ---------- ---------------
SYS_STUMVIRBZA6_$QWEX6DE2NGQA1         4967 .000201329  NONE
LAYER_CODE                                4  6.2471E-07 FREQUENCY
VPK_ID                                 2862 .000349406  NONE

As expected, a skew has been identified on the layer_code column and a frequency histogram has therefore been gathered on it to capture this skewness. There are nevertheless two remarks worth mentioning:

  • Since one of the columns of the extension has a histogram, why has the extension itself not been identified as a skewed column as well?
  • What happens in this particular case, where there is no histogram on the extension but there is one on a column forming the extension?

It is easy to answer the first question by looking directly at the column group scattering chart presented below, where we can notice that the extension does not present a skewness in its data scattering. In fact, the extension has 10,078 distinct values, of which the most popular appears 728 times while the least popular appears only once:

SQL> select
        to_char(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1) extension
       ,count(1)
     from
       t_ext_stat
     group by
       SYS_STUMVIRBZA6_$QWEX6DE2NGQA1
      order by 2 desc;

EXTENSION               COUNT(1)
--------------------- ----------
10113707817839868275         728
6437420856234749785          728
6264201076174478674          728
7804673458963442057          728
2433504440213765306          728
6976215179539283979          728
493591537539092624           728

6710977030485345437            1
18158393637293365880           1
5275318825200713603            1
13895660777899711317           1

This is a clear demonstration that a massive skew in one of the columns forming the extension does not mean that the resulting column group combination will necessarily present a skew. This is particularly true when the other column has a large number of distinct values (> 254).

But you might wonder why one has to care about this absence of a histogram on the extension. Christian Antognini has already answered this question in this article, where he wrote "be careful of extensions without histograms. They might be bypassed by the query optimizer". In fact, if one of the columns forming the extension has a histogram while the extension itself has none, then the Optimizer will not use the extension.

Below is a demonstration of this claim using the current model:

select
   count(1)
from
   t_ext_stat
where vpk_id = 63148
and layer_code = 'R';

COUNT(1)
----------
338

SQL_ID  d26ra17afbfyh, child number 0
-------------------------------------
-------------------------------------------------------------------
| Id  | Operation         | Name       | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |            |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| T_EXT_UK_I |      1 |    142 |    338 |
-------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("VPK_ID"=63148 AND "LAYER_CODE"='R')

How can we prove that Oracle didn't use the extension to compute the 142 estimated rows when accessing the underlying index? By looking at the corresponding 10053 trace file:

Access path analysis for T_EXT_STAT
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T_EXT_STAT[T_EXT_STAT]
SPD: Directive valid: dirid = 17542990197075222359, state = 5, flags = 1, loc = 1 {EC(98564)[2, 3]}
SPD: Return code in qosdDSDirSetup: EXISTS, estType = TABLE
Column (#2): VPK_ID(NUMBER)
  AvgLen: 5 NDV: 2862 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 62849.000000
Column (#3):
   NewDensity:0.002043, OldDensity:0.000001 BktCnt:5873.000000, PopBktCnt:5873.000000, PopValCnt:4, NDV:4

Column (#3): LAYER_CODE(VARCHAR2)
    AvgLen: 2 NDV: 4 Nulls: 0 Density: 0.000000
    Histogram: Freq  #Bkts: 4  UncompBkts: 5873  EndPtVals: 4  ActualVal: no 

Column (#9): SYS_STUMVIRBZA6_$QWEX6DE2NGQA1(NUMBER)
    AvgLen: 12 NDV: 4967 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 1980066.000000
ColGroup (#2, Index) T_EXT_UK_I
    Col#: 2 3 4    CorStregth: -1.00
ColGroup (#1, VC) SYS_STUMVIRBZA6_$QWEX6DE2NGQA1
    Col#: 2 3    CorStregth: 2.30

ColGroup Usage:: PredCnt: 2  Matches Full:  Partial:
Table: T_EXT_STAT  Alias: T_EXT_STAT
Card: Original: 803809.000000  Rounded: 142  Computed: 141.74  Non Adjusted: 141.74

Had Oracle used the extension to compute the estimated rows, it would have applied the following formula:

E-rows = num_rows(t_ext_stat) * 1/(NDV(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1))
E-rows = 803809 * 1/(4967) = 161.829877

that is, 162 rows and not the 142 displayed in the execution plan.

Another clue showing that the Optimizer didn't use the extension is visible in the above 10053 trace file as well, via the following line:

ColGroup Usage:: PredCnt: 2  Matches Full:  Partial:

where the Matches Full and Partial entries are empty.
As mentioned in Christian's article, there is a fix control for what seems to be identified as a bug, which we can set to make Oracle use the extension:

SQL> alter session set "_fix_control"="6972291:ON";

SQL> alter session set events '10053 trace name context forever, level 1';

SQL> select
      count(1)
    from
     t_ext_stat
    where
      vpk_id = 63148
    and
      layer_code = 'R';

  COUNT(1)
----------
       338

SQL> alter session set events '10053 trace name context off';

Below are the corresponding execution plan (with a new estimation of 162) and the part of the 10053 trace file related to the extension:

============
Plan Table
============
---------------------------------------+-----------------------------------+
| Id  | Operation          | Name      | Rows  | Bytes | Cost  | Time      |
---------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT   |           |       |       |     3 |           |
| 1   |  SORT AGGREGATE    |           |     1 |     7 |       |           |
| 2   |   INDEX RANGE SCAN | T_EXT_UK_I|   162 |  1134 |     3 |  00:00:01 |
---------------------------------------+-----------------------------------+
Predicate Information:
----------------------
2 - access("VPK_ID"=63148 AND "LAYER_CODE"='R')

=====================================
Access path analysis for T_EXT_STAT
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T_EXT_STAT[T_EXT_STAT]
SPD: Directive valid: dirid = 17542990197075222359, state = 5, flags = 1, loc = 1 {EC(98564)[2, 3]}
SPD: Return code in qosdDSDirSetup: EXISTS, estType = TABLE
  Column (#2): VPK_ID(NUMBER)
    AvgLen: 5 NDV: 2899 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 62849.000000
  Column (#3):
    NewDensity:0.001753, OldDensity:0.000001 BktCnt:6275.000000, PopBktCnt:6275.000000, PopValCnt:4, NDV:4
  Column (#3): LAYER_CODE(VARCHAR2)
    AvgLen: 2 NDV: 4 Nulls: 0 Density: 0.000000
    Histogram: Freq  #Bkts: 4  UncompBkts: 6275  EndPtVals: 4  ActualVal: no
  Column (#9): SYS_STUMVIRBZA6_$QWEX6DE2NGQA1(NUMBER)
    AvgLen: 12 NDV: 4985 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 1980066.000000
  ColGroup (#2, Index) T_EXT_UK_I
    Col#: 2 3 4    CorStregth: -1.00
  ColGroup (#1, VC) SYS_STUMVIRBZA6_$QWEX6DE2NGQA1
    Col#: 2 3    CorStregth: 2.33
  ColGroup Usage:: PredCnt: 2  Matches Full: #1  Partial:  Sel: 0.0002
  Table: T_EXT_STAT  Alias: T_EXT_STAT
    Card: Original: 806857.000000  Rounded: 162  Computed: 161.86  Non Adjusted: 161.86

Here we can notice that, this time, the CBO has used the extension to compute its rows estimation, since the 162 figure comes from the following formula:

E-rows = num_rows(t_ext_stat) * 1/(NDV(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1))
E-rows = 806857* 1/(4985) = 161.856971 --> rounded to 162

But instead of setting the fix control I would have preferred to delete the histogram from the layer_code column, so that neither the extension nor the columns it is built on have a histogram:

SQL> exec dbms_stats.gather_table_stats(user ,'t_ext_stat', method_opt => 'for all columns size 1');

SQL> SELECT
       column_name
      ,num_distinct
      ,density
      ,histogram
    FROM
       user_tab_col_statistics
    WHERE
       table_name = 'T_EXT_STAT'
    AND
      column_name in ('VPK_ID','LAYER_CODE','SYS_STUMVIRBZA6_$QWEX6DE2NGQA1');

COLUMN_NAME                    NUM_DISTINCT    DENSITY HISTOGRAM
------------------------------ ------------ ---------- ----------
SYS_STUMVIRBZA6_$QWEX6DE2NGQA1         5238 .000190913 NONE
LAYER_CODE                                4        .25 NONE
VPK_ID                                 2982 .000335345 NONE

In which case the extension would be used as shown below:

-------------------------------------------------------------------
| Id  | Operation         | Name       | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |            |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| T_EXT_UK_I |      1 |    154 |    338 |
-------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("VPK_ID"=63148 AND "LAYER_CODE"='R')

 Column (#2): VPK_ID(NUMBER)
    AvgLen: 5 NDV: 2982 Nulls: 0 Density: 0.000000 Min: 0.000000 Max: 62849.000000
  Column (#3): LAYER_CODE(VARCHAR2)
    AvgLen: 2 NDV: 4 Nulls: 0 Density: 0.000000
  Column (#9): SYS_STUMVIRBZA6_$QWEX6DE2NGQA1(NUMBER)
    AvgLen: 12 NDV: 5238 Nulls: 0 Density: 0.000000
  ColGroup (#2, Index) T_EXT_UK_I
    Col#: 2 3 4    CorStregth: -1.00
  ColGroup (#1, VC) SYS_STUMVIRBZA6_$QWEX6DE2NGQA1
    Col#: 2 3    CorStregth: 2.28
  ColGroup Usage:: PredCnt: 2  Matches Full: #1  Partial:  Sel: 0.0002
  Table: T_EXT_STAT  Alias: T_EXT_STAT
    Card: Original: 807515.000000  Rounded: 154  Computed: 154.16  Non Adjusted: 154.16

Here it is clearly shown that the extension has been used:

E-rows = num_rows(t_ext_stat) * 1/(NDV(SYS_STUMVIRBZA6_$QWEX6DE2NGQA1))
E-rows = 807515 * 1/(5238) = 154.164758 --> rounded to 154

Notice, by the way, that although the extension has been used, the estimation is not as good as expected (154 instead of 338). An explanation of this discrepancy might come from the very weak correlation strength that exists between layer_code and vpk_id (CorStregth: 2.30), which will be considered in a separate article.
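
For the record, the CorStregth figure printed in the trace seems to be nothing other than the product of the individual column NDVs divided by the NDV of the column group. This is an assumption on my part, but it matches all three traces shown above; with the first trace's figures, for instance:

SQL> select round((2862 * 4)/4967, 2) CorStrength from dual;

CORSTRENGTH
-----------
        2.3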

The bottom line of this article is: be careful about collecting histograms when you intend to use extended statistics. A skew in one of the columns of the combination does not imply that the extension itself presents a skew; and in such a case, with a histogram on the column but none on the extension, Oracle will bypass the extension.

January 6, 2015

Approximate_ndv

Filed under: Oracle,Statistics — hourim @ 8:17 pm

A couple of months ago I was asked to do a pro-active tuning task on an application running Oracle 11.2.0.3.0 under Linux x86 64-bit, due to go live in production within the next couple of months. I was supplied with a 60-minute AWR pre-production report covering a period during which a critical batch job was running. To my surprise, the following SQL statement popped up at the top of several 'SQL ordered by' sections of the AWR report:

select
  /*+ no_parallel(t) 
      no_parallel_index(t) 
      dbms_stats cursor_sharing_exact 
      use_weak_name_resl 
      dynamic_sampling(0) 
      no_monitoring 
      no_substrb_pad 
   */
    count(*),
    count(distinct "SABR_ID"),
    sum(sys_op_opnsize("SABR_ID")),
    substrb(dump(min("SABR_ID"), 16, 0, 32), 1, 120),
    substrb(dump(max("SABR_ID"), 16, 0, 32), 1, 120),
    count(distinct "TYPE_ID"),
    sum(sys_op_opnsize("TYPE_ID")),
    substrb(dump(min(substrb("TYPE_ID", 1, 32)), 16, 0, 32), 1, 120),
    substrb(dump(max(substrb("TYPE_ID", 1, 32)), 16, 0, 32), 1, 120)
from 
    "XXX"."SOM_ET_ORDER_KS" sample (33.0000000000) t

This is a call issued by the dbms_stats package while collecting statistics for the SOM_ET_ORDER_KS table.
So far so good, right?
Hmmm……
Trained and experienced eyes would already have been caught by two things when observing the above SQL statement:

 count(distinct)
 sample(33.0000000000)

Do you know what those two points mean in an 11gR2 database?
The count(distinct) indicates that the application is collecting statistics with the global preference approximate_ndv set to false, which I immediately checked via this command:

SQL> select dbms_stats.get_prefs ('approximate_ndv') ndv from dual;
NDV
------
FALSE

While the sample(33.0000000000) indicates that the estimate_percent parameter used by this application when collecting statistics is set to 33%:

     (estimate_percent => 33)
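
Had this 33% been set as a global preference rather than passed explicitly on each gathering call, it could have been cross-checked the same way (a sketch):

SQL> select dbms_stats.get_prefs ('estimate_percent') pct from dual;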

Is this the best way of collecting statistics?
I don’t think so.
Let's build a proof of concept. Below is the table on which I will be collecting statistics with and without approximate_ndv:

create table t_ndv
as select
      rownum n1
     ,trunc((rownum-1)/3) n2
     ,trunc(dbms_random.value(1,100)) n3
     ,dbms_random.string('L',dbms_random.value(1,5))||rownum v4
  from dual
  connect by level <=1e6;

There are 1 million rows in this table, with 1 million distinct n1 values and 1 million distinct v4 values. There are, however, 333,334 distinct n2 values and only 99 distinct n3 values, as the quick check below confirms.
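
A quick sanity check of those figures (exact counts, so expect a full scan of the 1 million rows):

SQL> select count(*) rws
           ,count(distinct n1) ndv_n1
           ,count(distinct n2) ndv_n2
           ,count(distinct n3) ndv_n3
      from t_ndv;

       RWS     NDV_N1     NDV_N2     NDV_N3
---------- ---------- ---------- ----------
   1000000    1000000     333334         99
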
Let's check Oracle's behaviour when approximate_ndv is disabled. All the following tests have been done on a 12c Oracle database instance:


BANNER                                                                               
---------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production        
PL/SQL Release 12.1.0.1.0 - Production                                                    
CORE    12.1.0.1.0      Production                                                                
TNS for 64-bit Windows: Version 12.1.0.1.0 - Production                                   
NLSRTL Version 12.1.0.1.0 - Production                                                    
SQL> select dbms_stats.get_prefs ('approximate_ndv') ndv from dual;

NDV
-----
FALSE

SQL> alter session set events '10046 trace name context forever, level 12';

SQL> begin
  2      dbms_stats.gather_table_stats
  3      (user
  4      ,'t_ndv'
  5      ,method_opt => 'FOR ALL COLUMNS SIZE 1'
  6      ,estimate_percent => 57);
  7  end;
 8  /

SQL> alter session set events '10046 trace name context off';                                               

The corresponding TKPROF'd trace file shows a piece of code which is, as expected, similar to the SQL statement I found at the top of the 'SQL ordered by' sections of the AWR report I was asked to analyse:

select /*+  no_parallel(t) 
            no_parallel_index(t) 
            dbms_stats 
            cursor_sharing_exact 
            use_weak_name_resl 
            dynamic_sampling(0) 
            no_monitoring 
            xmlindex_sel_idx_tbl 
           no_substrb_pad  
       */
  count(*)
, count("N1")
, count(distinct "N1")
, sum(sys_op_opnsize("N1"))
, substrb(dump(min("N1"),16,0,64),1,240)
, substrb(dump(max("N1"),16,0,64),1,240)
, count("N2")
, count(distinct "N2"), sum(sys_op_opnsize("N2")) 
 …etc...
from
 "C##MHOURI"."T_NDV" sample (57.0000000000)  t 

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      8.03       8.03          0       3861          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      8.03       8.04          0       3861          0           1

And the collected statistics for the t_ndv table are:

SQL> select num_rows, blocks, avg_row_len, sample_size 
     from user_tables 
     where table_name = 'T_NDV';

  NUM_ROWS     BLOCKS AVG_ROW_LEN SAMPLE_SIZE
---------- ---------- ----------- -----------
   1000586       3924          23      570334

SQL> select
         column_name
        ,num_distinct
        ,sample_size
    from user_tab_columns
    where table_name = 'T_NDV';

COLUMN_NAME          NUM_DISTINCT SAMPLE_SIZE
-------------------- ------------ -----------
V4                        1001467      570836
N3                             99      570836
N2                         333484      570836
N1                        1001467      570836

When I repeated the same experiment with approximate_ndv set to true, here is what I obtained:

SQL> exec dbms_stats.set_global_prefs('approximate_ndv','TRUE');

SQL> exec dbms_stats.delete_table_stats(user ,'t_ndv');

SQL> alter session set events '10046 trace name context forever, level 12';

SQL> begin
  2      dbms_stats.gather_table_stats
  3      (user
  4      ,'t_ndv'
  5      ,method_opt       => 'FOR ALL COLUMNS SIZE 1'
         ,estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE
  6     );
  7  end;
  8  /

SQL> alter session set events '10046 trace name context off';

And finally, here is the corresponding trace file, particularly the part that corresponds to gathering the number of distinct values:

select /*+  full(t)    
            no_parallel(t) 
            no_parallel_index(t) 
            dbms_stats 
            cursor_sharing_exact 
            use_weak_name_resl 
            dynamic_sampling(0) 
            no_monitoring 
            xmlindex_sel_idx_tbl 
            no_substrb_pad  */
  to_char(count("N1")),
  to_char(substrb(dump(min("N1"),16,0,64),1,240)),
  to_char(substrb(dump(max("N1"),16,0,64),1,240)),
  to_char(count("N2")),
  to_char(substrb(dump(min("N2"),16,0,64),1,240)),
  to_char(substrb(dump(max("N2"),16,0,64),1,240)),
  to_char(count("N3")),
  to_char(substrb(dump(min("N3"),16,0,64),1,240)),
  to_char(substrb(dump(max("N3"),16,0,64),1,240)),
  to_char(count("V4")),
  to_char(substrb(dump(min("V4"),16,0,64),1,240)),
  to_char(substrb(dump(max("V4"),16,0,64),1,240)) 
from
 "C##MHOURI"."T_NDV" t  /* NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL*/
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      2.85       2.88          0       3861          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Total        3      2.87       2.88          0       3861          0           1

SQL> select num_rows, blocks, avg_row_len, sample_size 
     from user_tables 
    where table_name = 'T_NDV';

  NUM_ROWS     BLOCKS AVG_ROW_LEN SAMPLE_SIZE
---------- ---------- ----------- -----------
   1000000       3924          23     1000000

SQL> select
        column_name
       ,num_distinct
       ,sample_size
    from user_tab_columns
    where table_name = 'T_NDV';

COLUMN_NAME          NUM_DISTINCT SAMPLE_SIZE
-------------------- ------------ -----------
V4                         990208     1000000
N3                             99     1000000
N2                         337344     1000000
N1                        1000000     1000000

Several important remarks can be made here. First, the elapsed time (which is almost entirely CPU time) has been reduced from 8 seconds to less than 3 seconds. Second, the sample size automatically used by Oracle is 100% (in response to the dbms_stats.auto_sample_size parameter I set during my second statistics gathering attempt), which has the consequence of computing the exact number of rows (num_rows) actually present in the t_ndv table (1,000,000).

In less time and with less CPU consumption, Oracle, with the approximate_ndv preference set to true and estimate_percent set to dbms_stats.auto_sample_size, produced a perfect estimation compared to the time and resources it needed when the same preference was set to false.

When approximate_ndv is enabled, Oracle is fast and accurate. If you compare the SQL code used when this preference is disabled with the corresponding SQL when it is enabled, you will notice that in the latter case (enabled) there is no longer a call to the costly count(distinct) function. There is instead an intriguing comment made of NDV, NIL, NIL entries, as shown below:

from
 "C##MHOURI"."T_NDV" t  /* NDV,NIL,NIL
                          ,NDV,NIL,NIL
                          ,NDV,NIL,NIL
                          ,NDV,NIL,NIL
                         */

which I have intentionally arranged to show you that the number of NDV, NIL, NIL entries coincides with the number of columns (4) of the t_ndv table:

SQL> desc t_ndv
     Name      Type
-------------- ------------------
N1             NUMBER
N2             NUMBER
N3             NUMBER     
V4             VARCHAR2(4000)

And immediately after this piece of code, the trace file shows the following SQL statement:

SELECT /*+ parallel */ TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/pos') + 1) POS, EXTRACTVALUE(VALUE(T), 
  '/select_list_item/value') VAL, TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/nonnulls')) NONNULLS, TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/ndv')) NDV, TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/split')) SPLIT, TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/rsize')) RSIZE, TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/rowcnt')) ROWCNT, TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/topncnt')) TOPNCNT, EXTRACT(VALUE(T), 
  '/select_list_item/topn_values').GETCLOBVAL() TOPN, NULL MINFREQ, NULL 
  MAXFREQ, NULL AVGFREQ, NULL STDDEVFREQ 
FROM
 TABLE(XMLSEQUENCE(EXTRACT(:B1 , '/process_result/select_list_item'))) T 
  ORDER BY TOPNCNT DESC

which means that Oracle internally uses an in-memory XML structure when calculating the number of distinct values of each table column, making this process fast and accurate.

Bottom line: I was not asked to look at the way the local DBAs collect statistics. But, being curious about everything I look at when diagnosing performance issues, my attention was caught by this TOP SQL statement, which one might have dismissed as pure coincidence and left unanalysed, while it allowed me to discuss with the local DBAs and to change the way they collect statistics.

November 12, 2014

Dynamic sampling: unlearning

Filed under: Statistics — hourim @ 3:45 pm

Here is an execution plan that served as the path for the Oracle SQL engine to execute a simple two-table join:

 SQL> SELECT
          count(t1.small_vc)
         ,count(t2.padding)
    FROM
          t1, t2
    WHERE
          t1.id1 = t2.id
        ;

SQL> start xstat

SQL_ID  b5yc4wnt59bxn, child number 0
-------------------------------------
--------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |      1 |        |      1 |
|   1 |  SORT AGGREGATE               |       |      1 |      1 |      1 |
|   2 |   NESTED LOOPS                |       |      1 |        |     10 |
|   3 |    NESTED LOOPS               |       |      1 |     10 |     10 |
|   4 |     TABLE ACCESS FULL         | T2    |      1 |     10 |     10 |
|*  5 |     INDEX UNIQUE SCAN         | T1_PK |     10 |      1 |     10 |
|   6 |    TABLE ACCESS BY INDEX ROWID| T1    |     10 |      1 |     10 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T1"."ID1"="T2"."ID")

Note
-----
   - dynamic sampling used for this statement (level=2)

What does this last Note about dynamic sampling imply?

That the involved tables have stale statistics?
No, it doesn't imply such a conclusion.

That the involved tables have no statistics at all?
No, it doesn't exactly imply that either.

The above Note indicates that at least one of the tables involved in the above query (particularly for the default level 2) has no statistics at all. It doesn't mean that all tables in the join query lack statistics; it means that at least one of the two tables has NOT been ANALYZED, to use the 10053 trace file terminology, as shown below:


***************************************
BASE STATISTICAL INFORMATION
***************************************
Table Stats::
  Table: T2  Alias: T2  (NOT ANALYZED) --- spot this
    #Rows: 327  #Blks:  4  AvgRowLen:  100.00  ChainCnt:  0.00
  Column (#1): ID(  NO STATISTICS (using defaults)
    AvgLen: 13 NDV: 10 Nulls: 0 Density: 0.097859

Index Stats::
  Index: T2_I_FK  Col#: 1
    LVLS: 0  #LB: 1  #DK: 10  LB/K: 1.00  DB/K: 1.00  CLUF: 1.00
***********************
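
Incidentally, there is a more straightforward way than a 10053 trace file to spot which table lacks statistics: query the dictionary directly (a simple sketch):

SQL> select table_name, num_rows, last_analyzed
       from user_tables
      where table_name in ('T1','T2');

Table t2, never analysed, shows NULL in both num_rows and last_analyzed, whereas t1 shows the figures gathered above.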

In fact, what I did not show you is the model I've used, which is:

create table t1
    as select
              rownum                     id1,
              trunc(dbms_random.value(1,1000)) id2,
              lpad(rownum,10,'0')              small_vc,
              rpad('x',1000)                   padding
              from dual
    connect by level <= 1e4;

alter table t1 add constraint t1_pk primary key (id1);

create index t1_i1 on t1(id2);

create table t2
   as select *
      from ( select
               rownum                              as id
              ,mod(t1.id1,5) + mod(rownum,10)* 10  as id1
              ,lpad(rownum,10,'0')                 as small_vc
              ,rpad('x',70)                        as padding
            from t1
            where rownum <= 10
           )
            order by id1;

alter table t2 add constraint t2_fk foreign key(id) references t1(id1);

create index t2_i_fk on t2(id);

And most importantly, the statistics I have collected, for table t1 only:

 BEGIN
  dbms_stats.gather_table_stats(USER
                           ,'t1'
                           ,method_opt => 'FOR ALL COLUMNS SIZE 1'
                           ,estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE
                           ,CASCADE => true);
END;
/

For table t1, in contrast to table t2 on which I haven't collected statistics, the CBO 10053 trace file echoes the following lines, which indicate that the CBO used t1's dictionary statistics when compiling the execution plan:

***********************
Table Stats::
  Table: T1  Alias: T1
    #Rows: 10000  #Blks:  1460  AvgRowLen:  1020.00  ChainCnt:  0.00
  Column (#1): ID1(
    AvgLen: 4 NDV: 10000 Nulls: 0 Density: 0.000100 Min: 1 Max: 10000
Index Stats::
  Index: T1_I1  Col#: 2
    LVLS: 1  #LB: 21  #DK: 999  LB/K: 1.00  DB/K: 9.00  CLUF: 9968.00
  Index: T1_PK  Col#: 1
    LVLS: 1  #LB: 20  #DK: 10000  LB/K: 1.00  DB/K: 1.00  CLUF: 1429.00
Access path analysis for T1
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T1[T1]
  Table: T1  Alias: T1
    Card: Original: 10000.000000  Rounded: 10000  Computed: 10000.00  Non Adjusted: 10000.00
  Access Path: TableScan
    Cost:  447.51  Resp: 447.51  Degree: 0
      Cost_io: 439.00  Cost_cpu: 12297302
      Resp_io: 439.00  Resp_cpu: 12297302
  Best:: AccessPath: TableScan
         Cost: 447.51  Degree: 1  Resp: 447.51  Card: 10000.00  Bytes: 0

Bottom line: when you see a Note about dynamic sampling at level 2 in an execution plan, bear in mind that it doesn't imply that all tables involved in the underlying query lack statistics. It indicates that at least one table has no statistics.

I've also attached to this blog post the corresponding documentation about dynamic (statistics) sampling.
