Db2 Indexing: A Comprehensive Guide to Improving Database Performance
Db2, one of the leading relational database management systems, is widely used for managing large amounts of structured data. With an increasing demand for better performance, it becomes essential to understand the concepts of indexing and how it can improve database performance. In this blog post, we will discuss what indexing is, why it is important, and different types of indexes available in Db2.
Indexing is a database optimization technique that speeds up data retrieval operations by providing a quick way to access specific data in a table. It works by creating a separate structure that contains a mapping of values from the indexed column to the corresponding row in the table. This way, instead of scanning the entire table to find the required data, the database can look up the index and directly retrieve the required information.
There are various types of indexes available in Db2, including the following:
B-Tree Index
Bitmap Index
Hash Index
Column-Organized Index
XML Index
B-Tree index is the most commonly used index type in Db2. It is suitable for use in situations where data is frequently searched, updated, or sorted. Bitmap index, on the other hand, is best suited for data warehousing and decision support systems where data is typically read-intensive. Hash index is useful in scenarios where data is evenly distributed, and the query requires an exact match. Column-Organized index is optimized for column-based data storage and retrieval, while XML index is used to index XML data stored in a column.
In conclusion, indexing is a crucial aspect of database performance optimization and should not be overlooked. By choosing the right type of index and indexing the appropriate columns, you can significantly improve the performance of your Db2 database. Understanding the concepts of indexing and the different types of indexes available in Db2 will help you make informed decisions about your database design and improve its overall performance.
It's important to note that while indexing can greatly improve the performance of a database, it also has some drawbacks. One of the primary disadvantages is that index maintenance can become time-consuming and resource-intensive, especially as the amount of data in the database grows. Additionally, indexes take up disk space and can slow down data insertion and update operations, as the index must be updated every time a row is inserted or modified.
To get the most out of indexing, it's essential to have a thorough understanding of the data being stored and the queries being executed against it. This information can be used to determine which columns should be indexed, and which index type would be most appropriate. It's also important to monitor the performance of the database regularly and make adjustments as needed, including adding or removing indexes, as the data and query patterns change over time.
In conclusion, Db2 indexing is a crucial aspect of database performance optimization. By understanding the concepts of indexing and the different types of indexes available in Db2, you can make informed decisions about your database design and improve its overall performance. While indexing has some disadvantages, the benefits it provides in terms of improved query performance and reduced query execution time make it well worth the effort.
DB2 uses a B-tree index. It can be used-for matching scans when looking for
specific value. Pointers chain the leaf nodes in the sequence of key values.
Thus the index can be used to find a specific value from where sequential
operations may then be performed. This is a non-matching scan. This can be used
to locate all customers having Ids between 200 and 500, say (indexed on ID).
First a matching scan will find the key 200. Then on sequential traversal begins
similarly requests for all values below a given value will first fix on the
start of the chain and go on till the specified end point is reached. Sometimes
DB2 can satisfy the query by doing a scan of the index only without going to the
data pages at all (matching or non-matching scans without data reference).
Structure of the BTREE Node
The leaf page level must hold enough index entries and
associated pointers to point to point to all rows in a table. For unique
indexes, the leaf node holds one entry and one pointer. The pointer is a 4 byte
RID indicating the row’s position in the data pages. The leaf page in a
non-unique index has space for a value and up to 255 pointers. If more than 255
occurrences of a particular value exist then a new leaf node with the value and
255 more pointer is created, and so on. A single index page contains 3660 bytes
of usable space. The remaining space may be used for control Information.
Selecting columns to index
Updating a value on which a table is indexed may cause it to be
moved from one page to another. When a table is loaded or recovered each index
to it must be built or rebuilt. When table spaces are re-organized indexes must
be rebuilt. If a table has less than 7 pages the optimizer would generally not
use an index even if one exists, because a complete scan will be faster.
Possible exceptions to this arise when joins or referential integrity checking
are involved. Updates represent most of the cost of indexes. If a table
experiences bulk periodic updates, affecting more than about 10 of its rows, the
dropping and recreating the indexes may be better. Therefore, we should avoid
indexing columns
- that have redundant values, unless a clustering index is used
- that have a skewed distribution of data
- that are frequently updated
- that are longer than 30 character long
non-clustering indexes are beneficial on columns that are
searched often, or used in joins of small percentages of their rows. Foreign
keys are good candidates. Indexing columns on which aggregates are often
performed (SUM, COUNT, etc.), is also beneficial as DB2 can satisfy the query
without accessing the data pages. The same reasoning applies to MAX, MIN,
etc.
Example: Consider the following query,
SELECT SUM (QTY) FROM SP;
If there is an index on QTY, then only leaf nodes of the index
pages need to be accessed. From that, we can find out the SUM (QTY) without
accessing data pages at all.
Clustering indexes should be selected with care since only one
is allowed per table. God for columns with high redundancy that are searched
over ranges of values. Hence clustered index should not be created on primary
key or unique keys. Foreign key columns are usually highly redundant, and are
used in joins. They are good candidates for clustering indexes.
MAX and MIN can be satisfied in one index access of descending
or ascending index respectively. If a predicate is predicate is specified in a
query with an aggregate function, a matching scan can be used to narrow the
search, before the aggregate is applied.
Processing all leaf pages will require much fewer I/Os than
scanning all data pages since many more leaf nodes than data records will fit
into a page. Clustering is useful when all the values need to be processed in
sequence.
Without a clustering index, DB2 will need to sort to satisfy
queries involving GROUP BY, ORDER BY or DISTINCT. While inserting, if a
clustering index exists, it will be used. If no clustering index exists, a
non-clustering index may be chosen by the optimizer. This will be the first
index created for the table. Initially values are inserted as if this were the
chosen clustering index.
Re-organization will however not preserve this clustering.
Columns used frequently together can be considered for composite indexing.
Querying on trailing portions of a composite index is inefficient.
Columns to Avoid Indexing high redundancy columns are not good
for non clustering indexes. This is because of high cost of updating, as well as
because they are of little value for retrieval. If a high percentage of data
rows contain a particular value, a full scan will be more efficient than going
through an index. In general if distinct values will, on the average, appear
more than once on each data page, then non clustering index is of little use.
Example:
Consider a table A with cardinality of 20,000 and assume these
20,000 rows are uniformly distributed among 100 pages. Consider a column of the
table A with cardinality of 100. Number of rows for a value of the column =
20,000/100 = 200 These 200 rows may uniformly get distributed among 100 pages. A
page will have 2 rows for a given value of the column.
The, for a equality condition, at the worst case, we need to
access each page twice. In this condition, using index is of little use. Suppose
if the number of distinct values is 400, then number of rows selected for a
value will be 50 and a page has only 0.5 row for a particular equality
condition. Sometimes indexes will be needed for locating in frequently occurring
data in skewed tables.
Multiple Indexes Useful for satisfying queries
involving multiple predicates jointed by AND and OR. The qualifying RIDs
obtained through the individual indexes are sorted an merged. The resulting RIDs
will be in order and hence reduce the number of page accessed (list pre-fetch)
Value of Clustering If a non-clustering index
is used for accessing data, then a given data page may be accessed multiple
times. Clustering indexes allow for merge join without sorting the inner and
outer tables. Indexes will be used in joins only if the joining columns are of
the same data type and length. This kind of mismatch is sometimes the cause of
poor performance. Use of functions to achieve compatibility also will not help
as this will negate the use of indexes.
Example:
Consider an employee table with 10,000 rows located on 200 data
pages, Employees distributed uniformly across 20 offices. Consider locating all
employees in a particular office (500 rows will be returned). With a
non-clustering index the leaf pages will point to the 500 RIDs. Thus 500 I/Os
will be required, with some of the 200 data pages being read more than once. In
some cases list pre-fetch will be employed. In this the RIDs will be sorted and
sequential pre-fetch will be used, in which 32 pages will be obtained in one
I/O. each page need be read only once. Thus the number of I/Os will be 200/32.
With a clustering index a sort of the RIDs can be avoided as
once the first record in the office is located (with a matching index scan), a
sequential scan will give the rest. Thus about 10 I/Os will suffice (over and
above those for the index scan). If the distribution were not uniform and say
70-80% of the employees were located in the said office, then use of the index
will not be efficient and table space scan will be preferred.
With static SQL where a host language variable gave a relevant
variable, the optimizer will not be able to decide not to use the index. With
dynamic SQL this would be possible.
Index Cardinality and Composite Indexes DB2 analyses the extent
of redundancy and decides whether a table space scan or index scan is preferred.
It estimates what percentage of the total number of row will be returned by the
SELECT. This is the FILTER FACTOR. With composite indexes the individual filter
factors are multiplied. There may not always be complete information available
to DB2. The following information is kept by DB2.
FIRSTKEYCARD – Number of distinct values in a single indexed
column or in the leading column of a composite index. SYSIBM. SYSINDEXES keeps
this information. DB2 uses this to determine redundancy for single column
indexes or when only the leading column of a multiple column index is specified.
FULLKEYCARD – Number of distinct values in a single column. Kept
in SYSIBM.SYSCOLUMNS. If the column is indexed this will be the same as
FIRSTKEYCARD. This can be entered by the user (estimate) or RUNSTATS can be
invoked to update it. The optimizer uses it to determine redundancy for
composite indexes in some cases.
Impact of Column Ordering in Indexes Ordering of columns in a
composite index can be important. Consider the following query:
SELECT PN, PNAME FROM P WHERE CITY =’XYZ’ AND WEIGHT > 500;
Consider keeping a composite index on city and weight. If there
were far fewer entries with CITY = ‘XYZ’ than parts with weight > 500, then
putting city before part will be useful because a matching scan for XYZ, 500
followed by a sequential scan will read about half of all the XYZ entries to
satisfy the query. If weight were put first, then a matching scan for 500, XYZ
followed by a sequential search of all parts with weight greater than 500 may be
needed. Columns with high filtering capability should be put first.
Intersection Tables intersection tables are
tables made up of a number of foreign key columns that participate in a
composite primary key (example SPJ table). Foreign keys are good candidates for
indexing. In the SPJ case do we need indexes on all three? A unique index on SN,
PN, JN will anyway exist. Thus this itself can be used for referential integrity
checking and for joins on the first column.
Thus one index can be eliminated. About the others, the kind of
activity is the guide. If heavy update will take place on the primary key then
referential integrity checking will be facilitated if the foreign key index
exists. Without it, table space scan will be necessary.
High Redundancy Indexes Updating these requires heavy I/O. this
is because the RID’s have to be added to the list, which may cause page
splitting and consequently locking of index and loss of concurrency. If several
redundant columns are indexed in a table, then update and delete processing can
be very slow.
Delete processing is even slower as it involves deleting and
shifting of RIDs to the left from each redundant index. Low cardinality columns
should ideally not be indexed. If at all needed, some other column can be added
to increase the cardinality to improve performance.
Maintaining Clustering DB2 keeps track of which
columns are clustered. After RUNSTATS the percentage of clustering is
determined. If it is above 95%, then the column is considered to be clustered.
Clustering is maintained as long as free pages and free space in pages will
allow it.
After that clustering will be disturbed, till reorganization is
done again. The percentage of clustering is used by DB2 to determine if an index
should be used. The optimizer will go by the value recorded in SYSIBM.SYSINDEXES
for CLUSTERRATIO.
If RUNSTATS is not executed, then the optimizer can be misled to
make a wrong choice. Periodically re-organization should be done.
To identify the tables which are marked as clustered but are in
fact below 80% clustering the administrator my use SQL:
SELECT NAME, CREATOR, TBNAME, CLUSTERRATIO, CLUSTERING,
CLUSTERED FROM SYSIBM.SYSINDEXES WHERE DBNAME =’ABCDEF’ AND CLUSTERING =’Y’ AND
CLUSTERRATIO < 0.80
REOGR will not re-cluster an index when multiple tables share a
simple table space. Running REORG does not cause automatic rebind. Typically,
after a REORG, RUNSTATS should be used so that the optimizer can use current
information on future bind operations.
Index Monitoring tow utilities are provided for
index monitoring, viz., SYSPLANDEP and EXPLAIN. SYSPLANDEP can be used to find
the programs using an index. If very little use is being made the index can be
dropped. EXPLAIN command can be used to determine what access path is chosen by
the optimizer for a plan. The user can then explore alternate ways of stating
quering to make best use of available indexes, or to create new indexes.
No comments:
Post a Comment