|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Oracle SQL Tuning and CBO Internal Histograms
Another feature requiring the cost-based optimizer is histograms. Histograms are created for indexed columns that are badly skewed, such as the REGION column in the STATS table. The purpose of a histogram is to help the cost-based optimizer make a better decision about whether or not to use an index and to reduce the number of full table scans. The following query shows that one row contains a region code of NW, while the other seven region codes contain either 125,000 or 249,999 entries. SQL> SQL> SELECT REGION, COUNT(*) 2 FROM STATS 3* GROUP BY REGION; REGION COUNT(*) ------ ---------- E 125000 N 125000 NE 125000 NW 1 S 125000 SE 125000 SW 125000 W 249999 8 rows selected. Without a histogram on the REGION column, Oracle will not use an index on the REGION column. The following code creates a btree index on the REGION column of the STATS table. SQL> CREATE INDEX I_STATS_REGION ON STATS(REGION); Index created. Next, analyze the STATS table because you just added a new index. SQL> ANALYZE TABLE STATS COMPUTE STATISTICS; Table analyzed. Use AUTOTRACE to see the plan. SQL> SET AUTOTRACE TRACEONLY EXPLAIN Notice that without a histogram, the cost-based optimizer would perform a full table scan on a one million-row table to return one row. SQL> SELECT * 2 FROM STATS 3 WHERE REGION = 'NW'; Execution Plan ----------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=583 Card=125000 Bytes=2625000) 1 0 TABLE ACCESS (FULL) OF 'STATS' (Cost=583 Card=125000 Bytes=2625000) Why does Oracle do this? The following two queries reveal
why. In user_indexes, the
DISTINCT_KEYS column shows that Oracle knows there are eight different
values. In user_tables, the
NUM_ROWS column indicates that Oracle knows the number of rows in the
stats table. You can almost see
the light go on in the optimizer. Ah Hah! Because there are one million rows
with eight different values, I’ll just divide 8 into 1,000,000. The
optimizer’s incorrect conclusion is that each of the eight values comprise
125,000 rows of the stats table.
Obviously, the optimizer will not use an index to
retrieve 1/8th of the rows from a one million-row table, and will perform a
full table scan instead. SQL> SELECT DISTINCT_KEYS 2 FROM USER_INDEXES 3 WHERE INDEX_NAME = 'I_STATS_REGION'; DISTINCT_KEYS ------------- 8 SQL> SELECT NUM_ROWS 2 FROM USER_TABLES 3 WHERE TABLE_NAME = 'STATS'; NUM_ROWS ---------- 1000000 The REGION column is in bad need of a histogram. It is better to ask for too many buckets than to ask for too few. Although you request 250 buckets, Oracle creates 7 buckets (see Exhibit 4). SQL> ANALYZE TABLE STATS -- Default 75 Buckets 2 COMPUTE STATISTICS -- Maximum Of 254 Buckets 3 FOR TABLE -- Oracle Uses Only Required Buckets 4* FOR COLUMNS REGION SIZE 250; Table analyzed. SQL> SELECT NUM_DISTINCT, 2 LOW_VALUE, 3 HIGH_VALUE, 4 DENSITY, 5 NUM_BUCKETS 6 FROM USER_TAB_COL_STATISTICS 7* WHERE TABLE_NAME = 'STATS' AND COLUMN_NAME = 'REGION'; NUM_DISTINCT LOW_VALUE HIGH_VALUE DENSITY NUM_BUCKETS ------------ --------- ---------- ---------- -----------
8
45
57 .0000005
7 Exhibit 4.
Histogram for REGION Column Oracle now has a better feel for how the data is distributed in the REGION column by the values in the ENDPOINT_NUMBER column (see Exhibit 5). SQL> SELECT COLUMN_NAME, 2 ENDPOINT_NUMBER, 3 ENDPOINT_VALUE 4 FROM USER_HISTOGRAMS 5 WHERE TABLE_NAME = 'STATS' AND COLUMN_NAME = 'REGION' 6* ORDER ENDPOINT_NUMBER; COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ------------ --------------- -------------- REGION 125000 3.5827E+35 REGION 250000 4.0500E+35 REGION 375000 4.0640E+35 REGION 375001 4.0676E+35 REGION 500001 4.3096E+35 REGION 625001 4.3236E+35 REGION 750001 4.3273E+35 REGION 1000000 4.5173E+35 8 rows selected. Exhibit 5. Data Distribution in the REGION Column With this new histogram information, the query executed above should now cause Oracle to use the index on the REGION column instead of a full table scan, and it does. Notice the cost is only 4 compared to a cost of 583. In addition, this query requires 0.1 seconds to execute using the index, whereas a full table scan requires nearly 10 seconds. SQL> SET AUTOTRACE TRACEONLY EXPLAIN SQL> SELECT * 2 FROM STATS 3 WHERE REGION = 'NW'; Execution Plan ----------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=4 Card=1 Bytes=21) 1 0 TABLE ACCESS (BY INDEX ROWID) OF 'STATS' (Cost=4 Card=1 Bytes=21) 2 1 INDEX (RANGE SCAN) OF 'I_STATS_REGION' (NON-UNIQUE) (Cost=3 Card=1) HistogramsIn some cases, the distribution of values within a column of a table will affect the optimizer's decision to use an index vs. performing a full-table scan. This scenario occurs when the value with a where clause does not have an equivalent distribution of values, making a full-table scan cheaper than index access in some cases. The problem is the optimizer can’t tell when it is using a value with a few rows, and when it is using a value with a large number of rows. A column histogram should only be created when we have data skew exists or suspected. In the real world, that happens rarely, and one of the most common mistakes with the optimizer is the unnecessary introduction of histograms into optimizer statistics. The histograms signals the optimizer that the column is not linearly distributed, and the optimizer will peek into the literal value in the SQL where clause, and compare that value to the histogram buckets in the histogram statistics. While they are used to make a yes-or-no decision about the use of an index to access the table, histograms are most commonly used to predict the size of the intermediate result set from a multi-way table join. For example, assume that we have a five-way table join whose result set will be only 10 rows. Oracle will want to join the tables together in such a way as to make the result set (cardinality) of the first join as small as possible. By carrying less baggage in the intermediate result sets, the query will run faster. To minimize intermediate results, the optimizer attempts to estimate the cardinality of each result set during the parse phase of SQL execution. Having histograms on skewed columns will greatly aid the optimizer in making a proper decision. (Remember, you can create a histogram even if the column does not have an index and does not participate as a join key.) Because a complex schema might have tens of thousands of columns, it is impractical to evaluate each column for skew and thus Oracle provides an automated method for building histograms as part of the dbms_stats utility. By using the method_opt=>'for all columns size skewonly' option of dbms_stats, you can direct Oracle to automatically create histograms for those columns whose values are heavily skewed. As a general rule, histograms are used to predict the cardinality and the number of rows returned in the result set. For example, assume that we have a product_type index and 70% of the values are for the HARDWARE type. Whenever SQL with where product_type='HARDWARE' is specified, a full-table scan is the fastest execution plan, while a query with where product_type='SOFTWARE' would be fastest using index access. Because histograms add additional overhead to the parsing phase of SQL, you should avoid them unless they are required for a faster optimizer execution plan. But, there are several conditions where creating histograms is advised:
So how do you find those columns that are appropriate for histograms? There is a feature in dbms_stats that provides for the ability to automatically look for columns that should have histograms, and create the histograms. Multi-bucket histograms add a huge parsing overhead to SQL statements, and histograms should only be used when the SQL will choose a different execution plan based upon the column value. To aid in intelligent histogram generation, Oracle uses the method_optparameter of dbms_stats. There are also important new options within the method_opt clause, namely skewonlyand auto (and others). method_opt=>'for all columns size skewonly' method_opt=>'for all columns size auto' The first is the "skewonlyoption, which is very time-intensive because it examines the distribution of values for every column within every index. If dbms_stats discovers an index with columns that are unevenly distributed, it will create histograms for that index to aid the cost-based SQL optimizer in making a decision about index vs. full-table scan access. For example, if an index has one column that is in 50% of the rows, a full-table scan is faster than an index scan to retrieve these rows. Histograms are also used with SQL that has bind variables and SQL with cursor_sharing enabled. In these cases, the optimizer determines if the column value could affect the execution plan, and if so, replace the bind variable with a literal and performs a hard parse.
dbms_stats.gather_schema_stats( ownname => 'SCOTT', estimate_percent => dbms_stats.auto_sample_size, method_opt => 'for all columns size skewonly', degree => 7 ); end; /
The auto option is used when monitoring is implemented (alter table xxx monitoring) and creates histograms based upon data distribution and the manner in which the column is accessed by the application (e.g., the workload on the column as determined by monitoring). Using method_opt=>'auto' is similar to using the gather auto in the option parameter of dbms_stats: begin dbms_stats.gather_schema_stats( ownname => 'SCOTT', estimate_percent => dbms_stats.auto_sample_size, method_opt => 'for all columns size auto', degree => 7 ); end; /
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © 1996 - 2011 by
Burleson Enterprises. All rights reserved.
Oracle® is the registered trademark
of Oracle Corporation. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||