Asking Oracle 12c to collect a histogram for a data set whose number of distinct values (**NDV**) is greater than the supplied number of buckets (**N**) will probably result in a Hybrid histogram. There is, however, a chance that a **TOP-Frequency** histogram will be preferred to the Hybrid histogram. To qualify for this alternative the data set must fulfil a prerequisite: its **TOP N** distinct values should occupy a number of rows exceeding a certain **threshold**. If this prerequisite is satisfied, a kind of frequency histogram is collected for the TOP N values; hence the TOP-Frequency label. The remaining insignificant values are neglected and not captured into the histogram tables. However, if the extreme values of the data set, that is the **low** and **high** values, are among the insignificant values, they will not be neglected. Instead, TOP N distinct values will be selected to serve as victims: excluded from the histogram information and replaced in the TOP N captured values by the low and high values. These victim values are those with the lowest frequency of appearance in the data set. Whether the low (or high) value is naturally present or forced into the TOP N captured values determines the threshold value, the cardinality estimate of the non-popular non-captured values, and that of the low (or high) value as well.
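The qualification rule can be sketched in a few lines of Python (an illustration of my reading of the behaviour, not Oracle's actual code; the function name and the rounding are my assumptions):

```python
def qualifies_for_top_frequency(top_n_rows, num_rows, num_buckets):
    """Sketch of the TOP-Frequency qualification test: the TOP N distinct
    values must cover more rows than the threshold (1 - 1/N) * num_rows,
    where N is the requested number of buckets."""
    threshold = round((1 - 1 / num_buckets) * num_rows)
    return top_n_rows > threshold

# For the engineered data set used below, the TOP 8 values cover 198 of 200 rows:
print(qualifies_for_top_frequency(198, 200, 8))   # True: 198 > 175
```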

Let’s put all of the above into action.

```sql
create table TopFreq3
as select
    rownum n1
  , case
      when mod(rownum, 100000) = 0 then 90
      when mod(rownum, 10000)  = 0 then 180
      when mod(rownum, 1000)   = 0 then 84
      when mod(rownum, 100)    = 0 then 125
      when mod(rownum, 50)     = 2 then 7
      when mod(rownum-1, 80)   = 2 then 22
      when mod(rownum, 10)     = 0 then 19
      when mod(rownum-1, 10)   = 5 then 15
      when mod(rownum-1, 5)    = 1 then 11
      when trunc((rownum -1/3)) < 5  then 25
      when trunc((rownum -1/5)) < 20 then 33
      else 42
    end n2
from dual
connect by level <= 2e2;

SELECT n2, count(*)
FROM TopFreq3
GROUP BY n2
ORDER BY n2;

        N2   COUNT(*)
---------- ----------
         7          4 --> low value
        11         36
        15         20
        19         18
        22          3
        25          3
        33          8
        42        106
       125          2 --> high value

9 rows selected.
```

As shown above, the engineered data set has 9 distinct values (over 200 rows), of which the TOP 8 seem to qualify for a TOP-Frequency histogram:

```sql
undefine nb
undefine num_rows

SELECT
    &&nb numberOfBuckets
  , TopNRows
  , round((1-(1/&&nb)) * &&num_rows) Threshold
  , case
      when (TopNRows - round((1-(1/&&nb)) * &&num_rows)) > 0
      then 'Qualify For TOP Frequency'
      else 'Do not qualify For TOP Frequency'
    end TOPFreqDecision
FROM
  (select sum(cnt) TopNRows
   from
     (select n2, count(*) cnt
      from TopFreq3
      group by n2
      order by count(*) desc
     )
   where rownum <= &&nb
  );

Enter value for nb: 8
Enter value for num_rows: 200

NUMBEROFBUCKETS   TOPNROWS  THRESHOLD TOPFREQDECISION
--------------- ---------- ---------- --------------------------------
              8        198        175 Qualify For TOP Frequency
```

**TOPNROWS** being greater than the **THRESHOLD**, a **TOP-FREQUENCY** histogram will be preferred to the HYBRID histogram, as shown in the following:

```sql
begin
  dbms_stats.gather_table_stats
    (user
    ,'TopFreq3'
    ,method_opt => 'for columns n2 size 8');
end;
/

select
    column_name
  , num_distinct
  , num_buckets
  , sample_size
  , histogram
from user_tab_col_statistics
where table_name  = 'TOPFREQ3'
and   column_name = 'N2';

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
------------ ------------ ----------- ----------- ---------------
N2                      9           8         200 TOP-FREQUENCY
```

Indeed, as expected, Oracle has gathered a TOP-FREQUENCY histogram for the TOP 8 distinct values, as shown below:

```sql
select
    endpoint_actual_value value
  , endpoint_number
  , endpoint_number - lag(endpoint_number,1,0)
      over(order by endpoint_number) value_count
from user_tab_histograms
where table_name  = 'TOPFREQ3'
and   column_name = 'N2'
order by endpoint_number;

VALUE      ENDPOINT_NUMBER VALUE_COUNT
---------- --------------- -----------
7                        4           4
11                      40          36
15                      60          20
19                      78          18
25                      81           3
33                      89           8
42                     195         106
125                    196           1 --> high value with a forced frequency of 1

8 rows selected.
```

Notice, by the way, how, although the high value 125 is insignificant (count = 2), it has been captured into the histogram, bizarrely with a forced frequency of 1. For the sake of clarity, let’s reprint below how the n2 column is spread in the TopFreq3 table:

```sql
        N2   COUNT(*)
---------- ----------
         7          4 --> low value
        11         36
        15         20
        19         18
        22          3 --> first least popular value
        25          3
        33          8
        42        106
       125          2 --> high value

9 rows selected.
```

Again, notice how the value 22, which is the first least popular value, has been removed from the TOP 8 to make room for the insignificant high value 125 with a forced frequency of 1. This is how TOP-Frequency seems to work: if the low and/or high value of the data set is naturally present in the TOP N values, there is no exclusion-substitution. If, however, one of these values (or both) falls outside the TOP N values, it must be included in the histogram information in place of the least popular value.
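My understanding of this exclusion-substitution mechanism can be sketched in Python (the function name, and notably the tie-breaking between equally unpopular victims, are my assumptions, not documented Oracle behaviour):

```python
def top_frequency_capture(value_counts, num_buckets):
    """Sketch (not Oracle code) of the TOP-Frequency capture: keep the TOP N
    most frequent values; if the low or high value of the data set is not
    among them, evict the least popular captured value and force the
    extreme value in with a frequency of 1."""
    # Sort by descending frequency; ties broken by ascending value (assumption)
    by_freq = sorted(value_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    captured = dict(by_freq[:num_buckets])
    low, high = min(value_counts), max(value_counts)
    for extreme in (low, high):
        if extreme not in captured:
            victim = min(captured, key=lambda v: captured[v])
            del captured[victim]
            captured[extreme] = 1   # forced frequency of 1
    return dict(sorted(captured.items()))

counts = {7: 4, 11: 36, 15: 20, 19: 18, 22: 3, 25: 3, 33: 8, 42: 106, 125: 2}
print(top_frequency_capture(counts, 8))
```

Run against the engineered data set, this reproduces the captured histogram above: 22 is evicted and 125 appears with a forced frequency of 1.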

Let’s now check what cardinality estimate will be calculated for the non-captured values (including the excluded least popular value):

```sql
explain plan for select count(1) from TopFreq3 where n2 = 22;

select * from table(dbms_xplan.display);

-----------------------------------------------
| Id  | Operation          | Name     | Rows  |
-----------------------------------------------
|   0 | SELECT STATEMENT   |          |     1 |
|   1 |  SORT AGGREGATE    |          |     1 |
|*  2 |   TABLE ACCESS FULL| TOPFREQ3 |     4 |
-----------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N2"=22)
```

How did Oracle work out this cardinality of 4? In this case Oracle is simply using the following formula:

```
E-Rows = num_rows * NewDensity
```

Where NewDensity is:

```
NewDensity = (sample_size - TopNRows) / (num_distinct - num_buckets) / sample_size
NewDensity = (200 - 198) / (9 - 8) / 200 = 0.01
```

Finally applying the above mentioned cardinality estimation we get this:

```
E-Rows = num_rows * NewDensity
E-Rows = 200 * 0.01 = 2
```
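As straight arithmetic (a Python sketch of the formula just applied, with the figures from this example):

```python
# Naive NewDensity based on the raw TopNRows of 198
num_rows, sample_size = 200, 200
num_distinct, num_buckets = 9, 8
top_n_rows = 198

new_density = (sample_size - top_n_rows) / (num_distinct - num_buckets) / sample_size
e_rows = num_rows * new_density
print(new_density, e_rows)   # 0.01 2.0 -- not the 4 rows the optimizer shows
```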

**But** this does not equal the correct cardinality estimation of 4. In fact, when one of the extreme values is forced into the histogram information, the threshold (or TopNRows) is slightly adjusted to take into account the exclusion-inclusion operation performed between the extreme value and the least popular value:

```
AdjustedTopNRows = TopNRows - (count of least popular value) + (forced frequency of extreme value)
AdjustedTopNRows = 198 - 3 + 1 = 196
```

Apply this to the above cardinality estimation formula and you will get the correct estimation, as shown below:

```
NewDensity = (200 - 196) / (9 - 8) / 200 = 0.02
E-Rows     = num_rows * NewDensity = 200 * 0.02 = 4
```
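The same adjustment, again as a Python sketch with this example's figures (variable names are mine):

```python
# Adjusted TopNRows: subtract the evicted least popular value's count (3)
# and add back the forced frequency of 1 given to the extreme value
num_rows, sample_size = 200, 200
num_distinct, num_buckets = 9, 8
top_n_rows = 198
least_popular_count, forced_frequency = 3, 1

adjusted_top_n_rows = top_n_rows - least_popular_count + forced_frequency   # 196
new_density = (sample_size - adjusted_top_n_rows) / (num_distinct - num_buckets) / sample_size
e_rows = num_rows * new_density
print(adjusted_top_n_rows, new_density, e_rows)   # 196 0.02 4.0
```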

And this is clearly backed by the corresponding 10053 trace file (spot the NewDensity and the BktCnt):

```
SINGLE TABLE ACCESS PATH
 Single Table Cardinality Estimation for TOPFREQ3[TOPFREQ3]
 SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE

 Column (#2): NewDensity:0.020000, OldDensity:0.002500 BktCnt:196.000000, PopBktCnt:195.000000, PopValCnt:7, NDV:9
 Column (#2): N2(NUMBER)
   AvgLen: 4 NDV: 9 Nulls: 0 Density: 0.020000 Min: 7.000000 Max: 125.000000
   Histogram: Top-Freq #Bkts: 196 UncompBkts: 196 EndPtVals: 8 ActualVal: yes
 Table: TOPFREQ3 Alias: TOPFREQ3
   Card: Original: 200.000000 Rounded: 4 Computed: 4.000000 Non Adjusted: 4.000000
```

**Bottom line**: this article aims to restrain the enthusiasm of those who use the ‘‘half the least popular value’’ rule to get the selectivity of a non-popular value in a frequency histogram before extending it to non-popular TOP-Frequency values. It seems that this is not correct. All in all, a non-popular TOP-FREQUENCY value appears to be nothing other than a HYBRID non-popular value.
