I was going to write a comment in this Jonathan Lewis article and have finally decided to write a blog article because it turned to be a long comment. In the above mentioned article a reader was wondering why the bucket size of his modified data set is not obeying the minimum bucket size explained by Jonathan. Coincidentally I am writing a second article on Hybrid histogram for allthingsOracle where I have used my proper terminology to define two types of Hybrid histogram: a first type, which I have named ‘’Natural Hybrid’, is close to Jonathan’s original data set. And a second type, which I have named ‘’Adjusted Hybrid’’ is of the same vein as the reader modified data set. By ‘’Natural Hybrid’’ type I refer to a data set a data set that doesn’t qualify for a TOP-Frequency histogram because the threshold is greater than the naturally non-adjusted TopNRows. By ‘’Adjusted Hybrid’’ type, I am refering to a data set that initially satisfies the TOP-Frequency threshold but fails to qualify for a TOP-Frequency histogram because Oracle finds at the middle of the process that the Adjusted TopNRows is greater than the threshold .
Let’s explain with example. Firstly here’s a model that qualify for a “Natural Hybrid” histogram:
SQL> create table t1 (id number, n1 number); SQL> start InsT1.sql (see downloadable script at the end)
If we gather histogram for this data set of 37 distinct values using 20 buckets we will obtain a HYBRID histogram because the TOP-Frequency threshold accounts for 95 rows while the TOP-20 rows account for 80 rows as shown below:
SQL> select round ((20-1)/20 * 100) threshold from dual; THRESHOLD ---------- 95 SQL> select sum (cnt) TopNRows from (select n1 ,count(*) cnt from t1 group by n1 order by count(*) desc ) where rownum <= 20; TOPNROWS ---------- 80
In order to compute the Hybrid histogram information, Oracle will, in this case, use the bucket-frequency method explained by Jonathan Lewis. This method uses the minimum bucket size of 5 (not valid at the end of the data set though) and the unchanged initial number of bucket of 20 condition is respected. This is what I prefer labelling a Natural Hybrid histogram.
Let’s now use the reader model
create table t_jar (id number, n1 number); insert into t_jar values (1,5 ); insert into t_jar values (1,5 ); insert into t_jar values (1,7 ); insert into t_jar values (1,7 ); insert into t_jar values (1,7 ); insert into t_jar values (1,7 ); insert into t_jar values (1,10 ); insert into t_jar values (1,12 ); insert into t_jar values (1,15 ); insert into t_jar values (1,15 ); insert into t_jar values (1,15 ); insert into t_jar values (1,20 ); commit;
This new data set of 6 distinct values over 12 rows will normally qualify for a TOP-3 Frequency histogram as it satisfies the threshold formula:
SQL> select round ((3-1)/3 * 12) threshold from dual; THRESHOLD ---------- 8 SQL> select sum (cnt) TopNRows from (select n1 ,count(*) cnt from t_jar group by n1 order by count(*) desc ) where rownum >= 3; TOPNROWS ---------- 9
However, Oracle will not accept this at face value. It has to check if the low and high values are among the TOP-3 distinct values. If one of these values (or both) are not in the TOP-3, oracle will force it into the histogram, exclude the least repetitive value from the TOP-3, adjust the TopNRows and check again whether theses modifications have not altered the data set so that it still qualify or not for a TOP-Frequency histogram.
Here’s below a snippet of a corresponding dbms_stats trace file
SQL> exec dbms_stats.set_global_prefs ('TRACE', to_char (1+16)); SQL> BEGIN dbms_stats.gather_table_stats (user, 't_jar', method_opt =>; 'for columns n1 size 3'); END; / SQL> exec dbms_stats.set_global_prefs('TRACE', null);
DBMS_STATS: NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR COLNAME DBMS_STATS: Y Y Y Y Y Y Y Y N1 DBMS_STATS: Approximate NDV Options DBMS_STATS: ACL,TOPN,NIL,NIL,RWID,U,U3U DBMS_STATS: start processing top n values for column N1 DBMS_STATS: topn sql (len: 415): DBMS_STATS: +++ select /*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl no_substrb_pad */ substrb(N1,16,0,64),1,240) val, rowidtochar(rowid) rwid from "XXXX&".T_JAR& t where rowid in(chartorowid('AAAJ4MAAEAACN4EAAA') ,chartorowid('AAAJ4MAAEAACN4EAAC') ,chartorowid('AAAJ4MAAEAACN4EAAI')) order by N1; DBMS_STATS: remove last bucket: Typ=2 Len=2: c1,10 add: Typ=2 Len=2: c1,15 DBMS_STATS: removal_count: 1 total_nonnull_rows: 12 mnb: 3 DBMS_STATS: adjusted coverage: .667
Here’s the data set
SQL> select * from (select n1, count(1) cnt from t_jar group by n1 order by n1); N1 CNT ---------- ---------- 5 2 -> low value 7 4 10 1 12 1 15 3 20 1 -> high value 6 rows selected.
And here are the TOP-3 rows of the same data set:
SQL> select * from (select n1 ,count(*) cnt from t_jar group by n1 order by count(*) desc) where rownum <= 3; N1 CNT ---------- ---------- 7 4 15 3 5 2 3 rows selected.
Since the high value is not in the TOP-3 it will be forced into the histogram to the cost of the exclusion of the least repetitive TOP-3 values which is 5 in this case (frequency =2). But, before doing this task, Oracle has to check if, after this high value forcing, the data set is still qualifying for a Top-Frequency using the AdjustedTopNRows
AdjustedTopNRows = TopNRows – 2 + 1 = 9-2+1 = 8
The AdjustedTopNRows is not any more greater than the threshold of 8 which signifies that Oracle will finally stop collecting TOP-Frequency and transform what it has already gathered into a HYBRID histogram as shown in the same trace file (Trying to convert frequency histogram to hybrid)
DBMS_STATS: adjusted coverage: .667 DBMS_STATS: hist_type in exec_get_topn: 2048 ndv:6 mnb:3 DBMS_STATS: Evaluating frequency histogram for col: N1 DBMS_STATS: number of values = 4, max # of buckects = 3, pct = 100, ssize = 12 DBMS_STATS: Trying to convert frequency histogram to hybrid DBMS_STATS: > cdn 10, popFreq 4, popCnt 1, bktSize 4, bktSzFrc 0 DBMS_STATS: Evaluating hybrid histogram: cht.count 3, mnb 3, ssize 10, min_ssize 12, appr_ndv TRUE, ndv 4, selNdv 1, selFreq 4, pct 100, avg_bktsize 3, csr.hreq TRUE, normalize TRUE DBMS_STATS: Histogram gathering flags: 7 DBMS_STATS: Accepting histogram DBMS_STATS: done_hist in process_topn: TRUE csr.ccnt: 1 DBMS_STATS: Mark column N1 as top N computed DBMS_STATS: Skip topn computed column N1 numhist: 0
This is what I have labelled an ‘’Adjusted Hybrid’’ histogram which is collected using a method that seems not involving the bucket size and which tends to be a top-frequency to hybrid conversion.
SELECT (row_number() over(order by ept_nbr)-1) NumBucket ,ept_nbr ,ept_act_val ,rpt_cnt ,ept_nbr - (lag(ept_nbr,1,0) over(order by ept_nbr)) "new bucket size" ,bucket_size "original bucket_size" FROM (SELECT ah.endpoint_number ept_nbr ,ah.endpoint_actual_value ept_act_val ,lag(ah.endpoint_number,1,0) over(order by ah.endpoint_number) ept_lag ,ah.endpoint_repeat_count rpt_cnt ,at.sample_size/at.num_buckets bucket_size FROM user_tab_histograms ah ,user_tab_col_statistics at WHERE ah.table_name = at.table_name AND ah.column_name = at.column_name AND ah.table_name = 'T_JAR' AND ah.column_name = 'N1' ) ORDER BY ept_nbr; NUMBUCKET EPT_NBR EPT_ACT_VA RPT_CNT new bucket size original bucket_size ---------- ---------- ---------- ---------- --------------- -------------------- 0 2 5 2 2 4 1 6 7 4 4 4 2 10 20 1 4 4
In passing, the adjusted coverage mentioned in the above trace file is nothing than
AdjustedTopNRows/num_rows = 8/12 = 0.667