Where
there is a high degree of skew in the column distribution, called a non-uniform
distribution of data, histograms should lead to a better estimation of
selectivity. This should produce plans that are more likely to be optimal. That means histograms contains information on
nature of table data.
What is Histogram? Histograms are feature in
CBO and it helps to optimizer to determine how data are skewed(distributed)
with in the column. Histogram is good to create for the column which are
included in the WHERE clause where the column is highly skewed. Histogram helps
to optimizer to decide whether to use an index or full-table scan or help the
optimizer determine the fastest table join order.
What is a Bucket? When Histograms are created
the number of buckets can be specified.It is this number that controls the type
of histogram created.
# Buckets = # of Rows
of information.
When
building histograms the information it stores is interpreted differently depending
on whether the number of buckets requested is less than the number distinct
values or if it is the same.
Specifically, ENDPOINT_NUMBER and
ENDPOINT_VALUE
in dba/user/all_histograms would have different meanings.
Oracle uses two types of histograms for column statistics:
height-balanced histograms and frequency histograms
Frequency histograms
Each value of the column corresponds to a single bucket of
the histogram.
Each bucket contains the number of occurrences of that single
value.
Each value of the column corresponds to a single bucket of
the histogram. This is also called value based histogram. Each bucket contains
the number of occurrences of that single value. Frequency histograms are
automatically created instead of height-balanced histograms when the number of
distinct values is less than or equal to the number of histogram buckets
specified.
data sorted
Y Y
Y Y
Y Y
N Y
N N
NA N
N N
Y NA
NA NA
Results
Bucket1 Y=4
Bucket2 N=3
Bucket3 NA=2
Height-balanced histogram
In a height-balanced histogram, the column values are divided
into bands so that each band contains approximately the same number of rows. The
useful information that the histogram provides is where in the range of values
the endpoints fall.
A histogram is created using DBMS_STATS.GATHER_TABLE_STATS METHOD_OPT => 'FOR COLUMNS SIZE <#
of buckets> <Column Name>‘
Size determines the number of buckets to be created
execute dbms_stats.gather_table_stats(ownname => 'scott',
tabname => 'employee',METHOD_OPT => 'FOR COLUMNS SIZE 10 gender');-for a
particular column
execute dbms_stats.gather_table_stats(ownname => 'scott',
tabname => 'employee',METHOD_OPT => 'FOR ALL COLUMNS SIZE 10’);- for all
columns
execute dbms_stats.gather_table_stats(ownname => 'scott',
tabname => 'employee',METHOD_OPT => 'FOR all indexed columns size 10’);-
for all columns with index
EXAMPLE
-------
Table
TAB1
SQL>
desc tab1
Name Null? Type
------------------------------- -------- ----
A
NUMBER(6)
B
NUMBER(6)
Column
A contains unique values from 1 to 10000.
Column
B contains 10 distinct values. The value '5' occurs 9991 times. Values
'1,
2, 3, 4, 9996, 9997, 9998, 9999, 10000' occur only once.
Test
queries:
(1)
select * from tab1 where b=5;
(2)
select * from tab1 where b=3;
Both
the above queries would use a FULL TABLE SCAN as there is no other
access
method available.
Then
we create an index on column B.
select
lpad(INDEX_NAME,10), lpad(TABLE_NAME,10),
lpad(COLUMN_NAME,10), COLUMN_POSITION,
COLUMN_LENGTH
from
user_ind_columns
where
table_name='TAB1'
SQL>
/
LPAD(INDEX
LPAD(TABLE LPAD(COLUM COLUMN_POSITION COLUMN_LENGTH
----------
---------- ---------- --------------- -------------
TAB1_B
TAB1 B 1 22
Now,
(1)
select * from tab1 where b=5;
(2)
select * from tab1 where b=3;
Both
do an INDEX RANGE SCAN to get the ROWID to do a lookup in the table.
With
an INDEX present, it would preferrable to do an INDEX RANGE SCAN for
query
(2), but a FULL TABLE SCAN for query (1).
ANALYZING
THE TABLE
-------------------
Next,
analyze the table using compute statistics:
SQL>
execute dbms_stats.gather_table_stats(ownname => 'scott', tabname
=> 'tab1')
From
dba_tables:
NUM_ROWS
BLOCKS EMPTY_BLOCKS
AVG_SPACE CHAIN_CNT AVG_ROW_LEN
----------
---------- ------------ ---------- ---------- -----------
10000 64 0 86 0 10
From
dba_tab_columns:
NUM_DISTINCT
LOW HIGH DENSITY
NUM_NULLS NUM_BUCKETS LAST_ANALYZ SAMPLE_SIZE
------------
---- ---- --------- ---------- ----------- ----------- -----------
10000 Full Full .0001 0 1 30-JUN-1999 10000
10 Full Full .1 0 1 30-JUN-1999 10000
SQL>
select lpad(TABLE_NAME,10), lpad(COLUMN_NAME, 10),
2
bucket_number, endpoint_value
3 from
user_histograms
4
where table_name='TAB1';
TABLE_NAME
COLUMN_NAME BUCKET_NUMBER ENDPOINT_VALUE
---------- ----------- ------------- --------------
TAB1 A 0 1
TAB1 A 1 10000
TAB1 B 0 1
TAB1 B 1 10000
SQL> select lpad(TABLE_NAME,10), lpad(COLUMN_NAME,
10),
2 bucket_number, endpoint_value
3 from
user_tab_histograms
4
where table_name='TAB1';
Analyze has created 1
BUCKET for each column. So all values for the column
are in the same
bucket. The BUCKET_NUMBER represents the
BUCKET NUMBER and
ENDPOINT_VALUE
represents the last column value in that bucket.
Now
query (1) and (2) ; both do a FULL TABLE SCAN.
So,
the fact that you have statistics about the table and columns does not
help
the optimizer to distinguish between how many of each value we have.
The
reason it does a FULL TABLE SCAN is because there is a 1 BUCKET histogram
and
any value selected for should be in that bucket.
CREATING
HISTOGRAMS
-------------------
What
you need now is to create histograms so the Optimizer knows how many
values
occur for each column.
Query
(1): select * from tab1 where b=5;
should do a FULL TABLE SCAN and
Query
(2): select * from tab1 where b=3;
should do an INDEX RANGE SCAN
SQL>
execute dbms_stats.gather_table_stats(ownname => 'scott', tabname
=> 'employee',METHOD_OPT => 'FOR COLUMNS SIZE 10 b’);
select
lpad(TABLE_NAME,10), lpad(COLUMN_NAME, 5),
endpoint_number, endpoint_value
from
user_histograms;
TABLE_NAME
COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
TAB1 B 1 1
TAB1 B 2 2
TAB1 B 3 3
TAB1 B 4 4
TAB1 B
9995 5
TAB1 B 9996 9996
TAB1 B 9997 9997
TAB1 B 9998 9998
TAB1 B 9999 9999
TAB1 B 10000 10000
So,
now there are statistics on the table and on the columns.
You
requested 10 buckets (size 10) and there are 10 distinct values.
The ENDPOINT_VALUE
shows the column value and the ENDPOINT_NUMBER
shows the cumulative
number of rows.
ENDPOINT_NUMBER is 1,
hence the number of rows with value 2 is 1.
Another example is
ENDPOINT_VALUE 5. Its ENDPOINT_NUMBER is 9995. The previous
bucket ENDPOINT_NUMBER
is 4, so 9995 - 4 = 9991 rows containing the value 5.
So,
now QUERY (1) does in fact do a Full Table Scan.
SQL>
select * from tab1 where b=5
SQL>
/
Execution
Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS
(Cost=10 Card=9991 Bytes=99910)
1 0
TABLE ACCESS (FULL) OF 'TAB1' (Cost=10 Card=9991 Bytes=99910)
And,
QUERY (2) does do an Index Range Scan.
SQL>
select * from tab1 where b=3
SQL>
/
Execution
Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS
(Cost=6 Card=500 Bytes=5000)
1 0
TABLE ACCESS (BY ROWID) OF 'TAB1' (Cost=6 Card=500 Bytes=5000)
2 1
INDEX (RANGE SCAN) OF 'TAB1_B' (NON-UNIQUE)
This
is fine if you have a low number of distinct values, but there can
be
tables with a huge number of distinct values.
You don't want to
create
a bucket for each value. There would be too much overhead.
In
this case you would request less buckets than distinct values.
CREATING
HISTOGRAMS WITH LESS BUCKETS THAN DISTINCT VALUES
----------------------------------------------------------
SQL>
execute dbms_stats.gather_table_stats(ownname => 'scott', tabname
=> 'employee',METHOD_OPT => 'FOR COLUMNS SIZE 8 b’);
SQL>
select lpad(TABLE_NAME,10), lpad(COLUMN_NAME, 5),
2>
endpoint_number, endpoint_value
3> from user_histograms;
LPAD(TABLE
LPAD( ENDPOINT_NUMBER ENDPOINT_VALUE
----------
----- --------------- --------------
TAB1 B 0 1
TAB1 B 7 5
TAB1 B 8 10000
Here,
Oracle creates the requested number of buckets but puts the same
number
of values into each bucket, where there are more endpoint values
that
are the same for the frequently occurring value.
The
ENDPOINT_NUMBER is the actual bucket number and ENDPOINT_VALUE is
the
endpoint value of the bucket determined by the column value.
From
above, bucket 0 holds the low value for the column. You cannot see
buckets
1 to 6 so as to save space.
But
we have bucket 1 with an endpoint of 5,
bucket 2 with an endpoint of 5,
bucket 3 with an endpoint of 5,
bucket 4 with an endpoint of 5,
bucket 5 with an endpoint of 5,
bucket 6 with an endpoint of 5,
bucket 7 with an endpoint of 5 AND
bucket 8 with an endpoint of 10000
So
bucket 8 contains values between 5 and 10000.
All
buckets contain the same number of values (which is why they are called
height-balanced
histograms), except the last bucket may have less values
then
the other buckets.
If
the data is uniform, you would not use histograms. However, if you request
the
same number of buckets as distinct values, Oracle creates 1 bucket. If
you
request less buckets, Oracle uses an algorithm to balance values into each
bucket
and any values that remain (which have to be less then the number
stored
in each height-balanced bucket) go into the last bucket.
STORING
CHARACTER VALUES IN HISTOGRAMS
--------------------------------------
Character
columns have some exceptional behaviour, in as much as we store
histogram
data for the first 32 bytes of any string.
Any predicates that
contain
strings greater than 32 characters will not use histogram information
and
the selectivity will be 1 / DISTINCT.
Data
in histogram endpoints is normalized to double precision floating point
arithmetic.
EXAMPLE
-------
SQL>
select * from morgan;
A
----------
a
b
c
d
e
e
e
e
The
table contains 5 distinct values. There is one occurance of 'a', 'b', 'c'
and
'd' and 4 of 'e'.
Create
a histogram with 5 buckets.
SQL>
analyze table morgan compute statistics for columns a size 5;
Looking
in user_histograms:
LPAD(TABLE
LPAD( ENDPOINT_NUMBER ENDPOINT_VALUE
----------
----- --------------- --------------
MORGAN
A 1 5.0365E+35
MORGAN
A 2 5.0885E+35
MORGAN
A 3 5.1404E+35
MORGAN
A 4 5.1923E+35
MORGAN
A 8 5.2442E+35
So,
ENDPOINT_VALUE 5.0365E+35 represents a
5.0885E+35 represents b
5.1404E+35
represents c
5.1923E+35 represents d
5.2442E+35
represents e
Then
if you look at the cumulative values for ENDPOINT_NUMBER,
the
corresponding ENDPOINT_VALUE's are
correct.
No comments:
Post a Comment