Saturday, 17 August 2013

DB2 Indexing

DB2 uses a B-tree index. It can be used-for matching scans when looking for specific value. Pointers chain the leaf nodes in the sequence of key values. Thus the index can be used to find a specific value from where sequential operations may then be performed. This is a non-matching scan. This can be used to locate all customers having Ids between 200 and 500, say (indexed on ID).
First a matching scan will find the key 200. Then on sequential traversal begins similarly requests for all values below a given value will first fix on the start of the chain and go on till the specified end point is reached. Sometimes DB2 can satisfy the query by doing a scan of the index only without going to the data pages at all (matching or non-matching scans without data reference).
Structure of the BTREE Node
The leaf page level must hold enough index entries and associated pointers to point to point to all rows in a table. For unique indexes, the leaf node holds one entry and one pointer. The pointer is a 4 byte RID indicating the row’s position in the data pages. The leaf page in a non-unique index has space for a value and up to 255 pointers. If more than 255 occurrences of a particular value exist then a new leaf node with the value and 255 more pointer is created, and so on. A single index page contains 3660 bytes of usable space. The remaining space may be used for control Information.
Selecting columns to index
Updating a value on which a table is indexed may cause it to be moved from one page to another. When a table is loaded or recovered each index to it must be built or rebuilt. When table spaces are re-organized indexes must be rebuilt. If a table has less than 7 pages the optimizer would generally not use an index even if one exists, because a complete scan will be faster. Possible exceptions to this arise when joins or referential integrity checking are involved. Updates represent most of the cost of indexes. If a table experiences bulk periodic updates, affecting more than about 10 of its rows, the dropping and recreating the indexes may be better. Therefore, we should avoid indexing columns
  1. that have redundant values, unless a clustering index is used
  2. that have a skewed distribution of data
  3. that are frequently updated
  4. that are longer than 30 character long
non-clustering indexes are beneficial on columns that are searched often, or used in joins of small percentages of their rows. Foreign keys are good candidates. Indexing columns on which aggregates are often performed (SUM, COUNT, etc.), is also beneficial as DB2 can satisfy the query without accessing the data pages. The same reasoning applies to MAX, MIN, etc.
Example: Consider the following query,
If there is an index on QTY, then only leaf nodes of the index pages need to be accessed. From that, we can find out the SUM (QTY) without accessing data pages at all.
Clustering indexes should be selected with care since only one is allowed per table. God for columns with high redundancy that are searched over ranges of values. Hence clustered index should not be created on primary key or unique keys. Foreign key columns are usually highly redundant, and are used in joins. They are good candidates for clustering indexes.
MAX and MIN can be satisfied in one index access of descending or ascending index respectively. If a predicate is predicate is specified in a query with an aggregate function, a matching scan can be used to narrow the search, before the aggregate is applied.
Processing all leaf pages will require much fewer I/Os than scanning all data pages since many more leaf nodes than data records will fit into a page. Clustering is useful when all the values need to be processed in sequence.
Without a clustering index, DB2 will need to sort to satisfy queries involving GROUP BY, ORDER BY or DISTINCT. While inserting, if a clustering index exists, it will be used. If no clustering index exists, a non-clustering index may be chosen by the optimizer. This will be the first index created for the table. Initially values are inserted as if this were the chosen clustering index.
Re-organization will however not preserve this clustering. Columns used frequently together can be considered for composite indexing. Querying on trailing portions of a composite index is inefficient.
Columns to Avoid Indexing high redundancy columns are not good for non clustering indexes. This is because of high cost of updating, as well as because they are of little value for retrieval. If a high percentage of data rows contain a particular value, a full scan will be more efficient than going through an index. In general if distinct values will, on the average, appear more than once on each data page, then non clustering index is of little use.
Consider a table A with cardinality of 20,000 and assume these 20,000 rows are uniformly distributed among 100 pages. Consider a column of the table A with cardinality of 100. Number of rows for a value of the column = 20,000/100 = 200 These 200 rows may uniformly get distributed among 100 pages. A page will have 2 rows for a given value of the column.
The, for a equality condition, at the worst case, we need to access each page twice. In this condition, using index is of little use. Suppose if the number of distinct values is 400, then number of rows selected for a value will be 50 and a page has only 0.5 row for a particular equality condition. Sometimes indexes will be needed for locating in frequently occurring data in skewed tables.
Multiple Indexes Useful for satisfying queries involving multiple predicates jointed by AND and OR. The qualifying RIDs obtained through the individual indexes are sorted an merged. The resulting RIDs will be in order and hence reduce the number of page accessed (list pre-fetch)
Value of Clustering If a non-clustering index is used for accessing data, then a given data page may be accessed multiple times. Clustering indexes allow for merge join without sorting the inner and outer tables. Indexes will be used in joins only if the joining columns are of the same data type and length. This kind of mismatch is sometimes the cause of poor performance. Use of functions to achieve compatibility also will not help as this will negate the use of indexes.
Consider an employee table with 10,000 rows located on 200 data pages, Employees distributed uniformly across 20 offices. Consider locating all employees in a particular office (500 rows will be returned). With a non-clustering index the leaf pages will point to the 500 RIDs. Thus 500 I/Os will be required, with some of the 200 data pages being read more than once. In some cases list pre-fetch will be employed. In this the RIDs will be sorted and sequential pre-fetch will be used, in which 32 pages will be obtained in one I/O. each page need be read only once. Thus the number of I/Os will be 200/32.
With a clustering index a sort of the RIDs can be avoided as once the first record in the office is located (with a matching index scan), a sequential scan will give the rest. Thus about 10 I/Os will suffice (over and above those for the index scan). If the distribution were not uniform and say 70-80% of the employees were located in the said office, then use of the index will not be efficient and table space scan will be preferred.
With static SQL where a host language variable gave a relevant variable, the optimizer will not be able to decide not to use the index. With dynamic SQL this would be possible.
Index Cardinality and Composite Indexes DB2 analyses the extent of redundancy and decides whether a table space scan or index scan is preferred. It estimates what percentage of the total number of row will be returned by the SELECT. This is the FILTER FACTOR. With composite indexes the individual filter factors are multiplied. There may not always be complete information available to DB2. The following information is kept by DB2.
FIRSTKEYCARD – Number of distinct values in a single indexed column or in the leading column of a composite index. SYSIBM. SYSINDEXES keeps this information. DB2 uses this to determine redundancy for single column indexes or when only the leading column of a multiple column index is specified.
FULLKEYCARD – Number of distinct values in a single column. Kept in SYSIBM.SYSCOLUMNS. If the column is indexed this will be the same as FIRSTKEYCARD. This can be entered by the user (estimate) or RUNSTATS can be invoked to update it. The optimizer uses it to determine redundancy for composite indexes in some cases.
Impact of Column Ordering in Indexes Ordering of columns in a composite index can be important. Consider the following query:
Consider keeping a composite index on city and weight. If there were far fewer entries with CITY = ‘XYZ’ than parts with weight > 500, then putting city before part will be useful because a matching scan for XYZ, 500 followed by a sequential scan will read about half of all the XYZ entries to satisfy the query. If weight were put first, then a matching scan for 500, XYZ followed by a sequential search of all parts with weight greater than 500 may be needed. Columns with high filtering capability should be put first.
Intersection Tables intersection tables are tables made up of a number of foreign key columns that participate in a composite primary key (example SPJ table). Foreign keys are good candidates for indexing. In the SPJ case do we need indexes on all three? A unique index on SN, PN, JN will anyway exist. Thus this itself can be used for referential integrity checking and for joins on the first column.
Thus one index can be eliminated. About the others, the kind of activity is the guide. If heavy update will take place on the primary key then referential integrity checking will be facilitated if the foreign key index exists. Without it, table space scan will be necessary.
High Redundancy Indexes Updating these requires heavy I/O. this is because the RID’s have to be added to the list, which may cause page splitting and consequently locking of index and loss of concurrency. If several redundant columns are indexed in a table, then update and delete processing can be very slow.
Delete processing is even slower as it involves deleting and shifting of RIDs to the left from each redundant index. Low cardinality columns should ideally not be indexed. If at all needed, some other column can be added to increase the cardinality to improve performance.
Maintaining Clustering DB2 keeps track of which columns are clustered. After RUNSTATS the percentage of clustering is determined. If it is above 95%, then the column is considered to be clustered. Clustering is maintained as long as free pages and free space in pages will allow it.
After that clustering will be disturbed, till reorganization is done again. The percentage of clustering is used by DB2 to determine if an index should be used. The optimizer will go by the value recorded in SYSIBM.SYSINDEXES for CLUSTERRATIO.
If RUNSTATS is not executed, then the optimizer can be misled to make a wrong choice. Periodically re-organization should be done.
To identify the tables which are marked as clustered but are in fact below 80% clustering the administrator my use SQL:
REOGR will not re-cluster an index when multiple tables share a simple table space. Running REORG does not cause automatic rebind. Typically, after a REORG, RUNSTATS should be used so that the optimizer can use current information on future bind operations.
Index Monitoring tow utilities are provided for index monitoring, viz., SYSPLANDEP and EXPLAIN. SYSPLANDEP can be used to find the programs using an index. If very little use is being made the index can be dropped. EXPLAIN command can be used to determine what access path is chosen by the optimizer for a plan. The user can then explore alternate ways of stating quering to make best use of available indexes, or to create new indexes.

Created with Artisteer

No comments:

Post a comment