Mohamed Houri’s Oracle Notes

January 7, 2016

12c TOP-Frequency and the ’’half least popular value’’

Filed under: Oracle — hourim @ 1:27 pm

Asking Oracle 12c to collect histogram for a data set with a number of distinct values (NDV) greater than the supplied number of buckets (N) will probably result in Hybrid histogram. There is indeed a probability that a TOP-Frequency histogram will be preferred to Hybrid histogram. To qualify for this alternative the data set must fulfil a pre-requisite: its TOP N distinct values should occupy a number of rows exceeding a certain threshold. If this pre-requisite is satisfied a kind of frequency histogram is collected for the TOP N values; hence the TOP-Frequency label. The remaining insignificant values will be neglected and not captured into the histogram tables. However, if the extreme values of the data set, that is the low and high values, are among the insignificant values, they will not be neglected. Tow TOP N distinct values will be selected to serve as a victim, excluded from the histogram information and replaced in the TOP N captured values by the low and high values. These victim values are those with the lowest frequency of appearance in the data set. Whether the low (or high) value is naturally present or forced in the TOP N captured values determines the threshold value and the cardinality estimate of the non-popular non-captured values and that of the low (or high) value as well.

Let’s put all what precedes in action.

create table TopFreq3 as
select
    rownum n1
	, case when mod(rownum, 100000)   = 0 then   90
	       when mod(rownum, 10000)    = 0 then   180
		  when mod(rownum, 1000)= 0 then   84
              when mod(rownum, 100)      = 0 then   125
              when mod(rownum,50)        = 2 then   7
              when mod(rownum-1,80)      = 2 then   22 
              when mod(rownum, 10)       = 0 then   19
              when mod(rownum-1,10)      = 5  then   15
              when mod(rownum-1,5)       = 1  then   11
              when trunc((rownum -1/3)) < 5  then   25
              when trunc((rownum -1/5)) < 20  then   33
	  else 42 
        end n2    
from dual
connect by level <= 2e2; 

SELECT n2, count(*)
FROM TopFreq3
GROUP BY n2
ORDER BY n2;

        N2   COUNT(*)
---------- ----------
         7          4 --> low value
        11         36
        15         20
        19         18
        22          3
        25          3
        33          8
        42        106
       125          2 --> high value

9 rows selected.

As shown above the engineered data set has 9 distinct values (over 200 rows) of which the TOP 8 seem to qualify for a TOP-Frequency histogram:

 undefine nb
undefine num_rows

 SELECT
       &&nb numberOfBuckets
      ,TopNRows
      ,round((1-(1/&&nb))* &&num_rows) Threshold
      ,case
        when (TopNRows – round((1-(1/&&nb))* &&num_rows)) > 0 then 'Qualify For TOP Frequency'
      else
         'Do not qualify For TOP Frequency'
      end TOPFreqDecision
   FROM
   (select
         sum (cnt) TopNRows
     from (select
             n2
           , count(*) cnt
         from TopFreq3
         group by n2
         order by count(*) desc
         )
      where rownum <= &&nb
    );
Enter value for nb: 8
Enter value for num_rows: 200

NUMBEROFBUCKETS   TOPNROWS  THRESHOLD TOPFREQDECISION
--------------- ---------- ---------- --------------------------------
              8        198        175 Qualify For TOP Frequency

The TOPNROWS being greater than the THRESHOLD a TOP-FREQUENCY will be preferred to the HYBRID histogram as shown in the following:

begin
  dbms_stats.gather_table_stats
       (user,'TopFreq3'
       ,method_opt =>'for columns n2 size 8');
end;
/

select
    column_name
   ,num_distinct
   ,num_buckets
   ,sample_size
   ,histogram
from
   user_tab_col_statistics
   where table_name = 'TOPFREQ3'
   and column_name  = 'N2';

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
------------ ------------ ----------- ----------- ---------------
N2                      9           8         200 TOP-FREQUENCY

Indeed as expected Oracle has gathered a TOP-FREQUENCY histogram for the TOP 8 distinct values as shown below:

select
       endpoint_actual_value value,
       endpoint_number,
       endpoint_number - (lag(endpoint_number,1,0) 
                         over(order by endpoint_number)) value_count
    from	
       user_tab_histograms
    where
       table_name = 'TOPFREQ3'
    and column_name = 'N2'
    order by
       endpoint_number
    ;

VALUE      ENDPOINT_NUMBER VALUE_COUNT
---------- --------------- -----------
7                        4           4 
11                      40          36
15                      60          20
19                      78          18
25                      81           3
33                      89           8
42                     195         106
125                    196           1 --> high value with a forced frequency 1

8 rows selected.

Notice by the way, how, despite the high value 125 is insignificant (count =2) it has been captured into the histogram with bizarrely a forced frequency of 1. For the sake of clarity let’s reprint below how the n2 column is spread in TopFreq3 table:

        N2   COUNT(*)
---------- ----------
         7          4 --> low value
        11         36
        15         20
        19         18
        22          3 --> first least popular value
        25          3 
        33          8
        42        106
       125          2 --> high value
9 rows selected.

Again notice how, the value 22 which is the first least popular value, has been removed from the TOP 8 to let its place for the insignificant high value 125 with a forced frequency of 1. This is how the TOP-Frequency seems to work. If the low and/or high value of the data set is naturally present in the TOP-N values then there will be no value exclusion-substitution. If however one of these values (or both) is outside the TOP-N values, then it must be included in the histogram information in place of the least popular value.

Let’s now check what cardinality estimates will be calculated for the non-captured values (including the excluded least popular value)

explain plan for select count(1) from TopFreq3 where n2= 22;

select * from table(dbms_xplan.display);

-----------------------------------------------
| Id  | Operation          | Name     | Rows  |
-----------------------------------------------
|   0 | SELECT STATEMENT   |          |     1 |
|   1 |  SORT AGGREGATE    |          |     1 |
|*  2 |   TABLE ACCESS FULL| TOPFREQ3 |     4 |
-----------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N2"=22)

How Oracle did work out this cardinality of 4? In this case Oracle is simply using the following formula:

E-Rows = num_rows * NewDensity

Where NewDensity is:

NewDensity = (sample_size-TopNRows)/(num_distinct-num_buckets)/sample_size
NewDensity = (200-198)/(9-8)/200 = .01

Finally applying the above mentioned cardinality estimation we get this:

E-Rows = num_rows * NewDensity
E-Rows = 200 * 0.01 = 2

But this is not equal to the correct cardinality estimation of 4. In fact, when one of the extreme values are forced into the histogram information the threshold (or TopNRows) are slightly adapted to take into account the exclusion-inclusion operation done between the extreme value and the least popular value:

AdjustedTopNRows = TopNRows – (count of least popular value) + 1 (forced value of 
                                                                  extreme value)

AdjustedTopNRows = 198 – 3 + 1 = 196

Apply this to the above cardinality estimation formula and you will get the correct estimation as shown below:

NewDensity = (200- 196)/(9-8)/200   = .02
E-Rows     = num_rows * NewDensity  = 200 * 0.02 = 4

And this is clearly backed by the corresponding 10053 trace file (spot the NewDensity and the BktCnt):

SINGLE TABLE ACCESS PATH 
  Single Table Cardinality Estimation for TOPFREQ3[TOPFREQ3] 
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Column (#2): 
    NewDensity:0.020000, OldDensity:0.002500 BktCnt:196.000000, PopBktCnt:195.000000, PopValCnt:7, NDV:9
  Column (#2): N2(NUMBER)
    AvgLen: 4 NDV: 9 Nulls: 0 Density: 0.020000 Min: 7.000000 Max: 125.000000
    Histogram: Top-Freq  #Bkts: 196  UncompBkts: 196  EndPtVals: 8  ActualVal: yes
  Table: TOPFREQ3  Alias: TOPFREQ3
    Card: Original: 200.000000  Rounded: 4  Computed: 4.000000  Non Adjusted: 4.000000

Bottom line: this article aims to restrain the enthusiasm of those using the ‘‘half least popular value’’ rule to get the selectivity of a non-popular frequency histogram before they extend it to the non-popular TOP-Frequency. It seems that this is not correct. And all in all, it seems that a non-popular TOP-FREQUENCY value is nothing else than a HYBRID non-popular value.

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

Tony's Oracle Tips

Tony Hasler's light hearted approach to learning about Oracle

Richard Foote's Oracle Blog

Focusing Specifically On Oracle Indexes, Database Administration and Some Great Music

Hatem Mahmoud Oracle's blog

Just another Oracle blog : Database topics and techniques

Mohamed Houri’s Oracle Notes

Qui se conçoit bien s’énonce clairement

Oracle Diagnostician

Performance troubleshooting as exact science

Raheel's Blog

Things I have learnt as Oracle DBA

Coskan's Approach to Oracle

What I learned about Oracle

So Many Oracle Manuals, So Little Time

“Books to the ceiling, Books to the sky, My pile of books is a mile high. How I love them! How I need them! I'll have a long beard by the time I read them”—Lobel, Arnold. Whiskers and Rhymes. William Morrow & Co, 1988.

EU Careers info

Your career in the European Union

Carlos Sierra's Tools and Tips

Tools and Tips for Oracle Performance and SQL Tuning

Oracle Scratchpad

Just another Oracle weblog

OraStory

Dominic Brooks on Oracle Performance, Tuning, Data Quality & Sensible Design ... (Now with added Sets Appeal)

%d bloggers like this: