January | 2015 | Mohamed Houri’s Oracle Notes

January 6, 2015

Approximate_ndv

Filed under: Oracle,Statistics — hourim @ 8:17 pm

A couple of months ago I have been asked to do a pro-active tuning task on an application running Oracle 11.2.0.3.0 under Linux x86 64-bit which is going life in production within the next couple of months. I have been supplied a 60 minutes AWR pre-production report covering a period during which a critical batch job was running. At my surprise the following SQL statement pops up at the top of several SQL Ordered by AWR parts:

select
  /*+ no_parallel(t) 
      no_parallel_index(t) 
      dbms_stats cursor_sharing_exact 
      use_weak_name_resl 
      dynamic_sampling(0) 
      no_monitoring 
      no_substrb_pad 
   */
    count(*),
    count(distinct "SABR_ID"),
    sum(sys_op_opnsize("SABR_ID")),
    substrb(dump(min("SABR_ID"), 16, 0, 32), 1, 120),
    substrb(dump(max("SABR_ID"), 16, 0, 32), 1, 120),
    count(distinct "TYPE_ID"),
    sum(sys_op_opnsize("TYPE_ID")),
    substrb(dump(min(substrb("TYPE_ID", 1, 32)), 16, 0, 32), 1, 120),
    substrb(dump(max(substrb("TYPE_ID", 1, 32)), 16, 0, 32), 1, 120)
from 
    "XXX"."SOM_ET_ORDER_KS" sample (33.0000000000) t

This is a call to dbms_stats package collecting statistics for SOM_ET_ORDER_KS table.
So far so good right?
Hmmm……
Instructed and experimented eyes would have been already caught by two things when observing the above SQL statement:

 count(distinct)
 sample(33,000000)

Do you know what do those two points above mean in an 11gR2 database?
The count (distinct) indicates that the application is collecting statistics under the global preference approximate_ndv set to false which I have immediately checked via this command:

SQL> select dbms_stats.get_prefs ('approximate_ndv') ndv from dual;
NDV
------
FALSE

While the (sample 33) indicates that the estimate_percent parameter used by this application when collecting statistics is set to 33%

     (estimate_percent => 33)

Is this the best way of collecting statistics?
I don’t think so.
Let’s build a proof of concept. Below is the table on which I will be collecting statistics with and without approximate_ndv

create table t_ndv
as select
      rownum n1
     ,trunc((rownum-1)/3) n2
     ,trunc(dbms_random.value(1,100)) n3
     ,dbms_random.string('L',dbms_random.value(1,5))||rownum v4
  from dual
  connect by level <=1e6;

There are 1 million rows in this table with 1 million distinct n1 values and 1 million distinct v4 values. There are however 333,334 n2 distinct values and only 99 distinct n3 values.
Lets check the Oracle behaviour when approximate_ndv is disabled. All the following tests have been done on 12 Oracle database instance:


BANNER                                                                               
---------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production        
PL/SQL Release 12.1.0.1.0 - Production                                                    
CORE    12.1.0.1.0      Production                                                                
TNS for 64-bit Windows: Version 12.1.0.1.0 - Production                                   
NLSRTL Version 12.1.0.1.0 - Production

SQL> select dbms_stats.get_prefs ('approximate_ndv') ndv from dual;

NDV
-----
FALSE

SQL> alter session set events '10046 trace name context forever, level 12';

SQL> begin
  2      dbms_stats.gather_table_stats
  3      (user
  4      ,'t_ndv'
  5      ,method_opt => 'FOR ALL COLUMNS SIZE 1'
  6      ,estimate_percent => 57);
  7  end;
 8  /

SQL> alter session set events '10046 trace name context off';

The corresponding tkprofed trace file shows a piece of code which is, as expected, similar to above SQL I have found at the top of SQL Ordered by parts of the AWR I was asked to analyse

select /*+  no_parallel(t) 
            no_parallel_index(t) 
            dbms_stats 
            cursor_sharing_exact 
            use_weak_name_resl 
            dynamic_sampling(0) 
            no_monitoring 
            xmlindex_sel_idx_tbl 
           no_substrb_pad  
       */
  count(*)
, count("N1")
, count(distinct "N1")
, sum(sys_op_opnsize("N1"))
, substrb(dump(min("N1"),16,0,64),1,240)
, substrb(dump(max("N1"),16,0,64),1,240)
, count("N2")
, count(distinct "N2"), sum(sys_op_opnsize("N2")) 
 …etc...
from
 "C##MHOURI"."T_NDV" sample (57.0000000000)  t 

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      8.03       8.03          0       3861          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      8.03       8.04          0       3861          0           1

And the collected statistics information on t_ndv table is

SQL> select num_rows, blocks, avg_row_len, sample_size 
     from user_tables 
     where table_name = 'T_NDV';

  NUM_ROWS     BLOCKS AVG_ROW_LEN SAMPLE_SIZE
---------- ---------- ----------- -----------
   1000586       3924          23      570334

SQL> select
         column_name
        ,num_distinct
        ,sample_size
    from user_tab_columns
    where table_name = 'T_NDV';

COLUMN_NAME          NUM_DISTINCT SAMPLE_SIZE
-------------------- ------------ -----------
V4                        1001467      570836
N3                             99      570836
N2                         333484      570836
N1                        1001467      570836

When I did the same experiments but with approximate_ndv set to true this is below what I have obtained

SQL> exec dbms_stats.set_global_prefs('approximate_ndv','TRUE');

SQL> exec dbms_stats.delete_table_stats(user ,'t_ndv');

SQL> alter session set events '10046 trace name context forever, level 12';

SQL> begin
  2      dbms_stats.gather_table_stats
  3      (user
  4      ,'t_ndv'
  5      ,method_opt       => 'FOR ALL COLUMNS SIZE 1'
         ,estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE
  6     );
  7  end;
  8  /

SQL> alter session set events '10046 trace name context off';

And finally the corresponding trace file particularly the part that corresponds to gathering number of distinct values

select /*+  full(t)    
            no_parallel(t) 
            no_parallel_index(t) 
            dbms_stats 
            cursor_sharing_exact 
            use_weak_name_resl 
            dynamic_sampling(0) 
            no_monitoring 
            xmlindex_sel_idx_tbl 
            no_substrb_pad  */
  to_char(count("N1")),
  to_char(substrb(dump(min("N1"),16,0,64),1,240)),
  to_char(substrb(dump(max("N1"),16,0,64),1,240)),
  to_char(count("N2")),
  to_char(substrb(dump(min("N2"),16,0,64),1,240)),
  to_char(substrb(dump(max("N2"),16,0,64),1,240)),
  to_char(count("N3")),
  to_char(substrb(dump(min("N3"),16,0,64),1,240)),
  to_char(substrb(dump(max("N3"),16,0,64),1,240)),
  to_char(count("V4")),
  to_char(substrb(dump(min("V4"),16,0,64),1,240)),
  to_char(substrb(dump(max("V4"),16,0,64),1,240)) 
from
 "C##MHOURI"."T_NDV" t  /* NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL*/
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      2.85       2.88          0       3861          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Total        3      2.87       2.88          0       3861          0           1

SQL> select num_rows, blocks, avg_row_len, sample_size 
     from user_tables 
    where table_name = 'T_NDV';

  NUM_ROWS     BLOCKS AVG_ROW_LEN SAMPLE_SIZE
---------- ---------- ----------- -----------
   1000000       3924          23     1000000

SQL> select
        column_name
       ,num_distinct
       ,sample_size
    from user_tab_columns
    where table_name = 'T_NDV';

COLUMN_NAME          NUM_DISTINCT SAMPLE_SIZE
-------------------- ------------ -----------
V4                         990208     1000000
N3                             99     1000000
N2                         337344     1000000
N1                        1000000     1000000

Several important remarks can be emphasized here. First, the elapsed time(which is almost cpu time) has been reduced from 8 seconds to less than 3 seconds. Second, the sample size used automatically by Oracle is 100 (in response to dbms_stats.auto_sample_size parameter I have set during my second statistics gathering attempt) which has the consequence of computing the exact number of rows (num_rows) that are actually present in the t_ndv table(1000000).

With less time and less cpu consumption, Oracle, under approximate_ndv preference set to true and estimate_percent set to dbms_stat.auto_sample_size, produced a perfect estimation when compared to the time and resource consumption it has necessitated when the same property was set to false.

When approximate_ndv is enabled Oracle is fast and accurate. Because if you compare the SQL code used when this property is disabled to the corresponding SQL when it is enabled you will point out that in the later case (enabled) there is no call anymore to the costly count(distinct) function. There is instead a strange call to a couple of NDV, NIL, NIL function as shown below:

from
 "C##MHOURI"."T_NDV" t  /* NDV,NIL,NIL
                          ,NDV,NIL,NIL
                          ,NDV,NIL,NIL
                          ,NDV,NIL,NIL
                         */

which I have intentionnaly arranged to show you that the number of NDV, NIL, NIL coincides with the number of columns (4) of the t_ndv table

SQL> desc t_ndv
     Name      Type
-------------- ------------------
N1             NUMBER
N2             NUMBER
N3             NUMBER     
V4             VARCHAR2(4000)

And immediately after this piece of code the trace file show the following SQL statement

SELECT /*+ parallel */ TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/pos') + 1) POS, EXTRACTVALUE(VALUE(T), 
  '/select_list_item/value') VAL, TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/nonnulls')) NONNULLS, TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/ndv')) NDV, TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/split')) SPLIT, TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/rsize')) RSIZE, TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/rowcnt')) ROWCNT, TO_NUMBER(EXTRACTVALUE(VALUE(T), 
  '/select_list_item/topncnt')) TOPNCNT, EXTRACT(VALUE(T), 
  '/select_list_item/topn_values').GETCLOBVAL() TOPN, NULL MINFREQ, NULL 
  MAXFREQ, NULL AVGFREQ, NULL STDDEVFREQ 
FROM
 TABLE(XMLSEQUENCE(EXTRACT(:B1 , '/process_result/select_list_item'))) T 
  ORDER BY TOPNCNT DESC

which means that Oracle is using internally an in memory xml structue when calculating the number of distinct values of each table column making this process fast and accurate

Bottom Line: I was not asked to look at the manner used by the local DBAs to collect statistics but, being curious at all what I look at when diagnosing performance issues, my attention was kept by this TOP SQL statement one would have considered as a purely coincidence and would have not analysed while it has permitted me to discuss with the local DBAs and to change the way they are collecting statistics.

Mohamed Houri’s Oracle Notes

January 6, 2015

Approximate_ndv

Follow Blog via Email

Categories

Blog Stats

Blogs I Follow

Archives