How b-tree database indexes work and how to tell if they are efficient (100' level)
A team member thought we should add an index on a 90 million row table to improve performance. The field on which he wanted to create this index had only four possible values. To which I replied that an index on a low cardinality field wasn't really going to help anything. My boss then asked me why wouldn't it help? I sputtered around for a response but ended up telling him that I'd get back to him with a reasonable explanation.
Now I'm not a DBA by any stretch but I've learned about database optimization and performance on the job from some really bright folks. I didn't have a very good grasp on how indexes worked though. So I did some research on the topic.
There are several kinds of indexes that databases use. Sybase IQ has like 20 different kinds, Oracle and DB2 appear to have two. The main type of index out there is a b-tree; this is the type that most people mean when they say database index.
What is a b-tree?
In a tree, records are stored in locations called leaves. The starting point is called the root. The maximum number of children per node is called the order of the tree. The maximum number of access operations required to reach the desired leaf (data stored on the leaf) is called the depth (level). Oracle indexes are balanced b-trees; the order is the same at every node and the depth is the same for every leaf.
|real tree (in nature)||b-tree|
|grows up||grows down|
The bigger the order, the more leaves and nodes you can put at a certain depth. This means that there are fewer levels to traverse to get to the leaf (which contains the data you want). In the example above and all balanced b-trees, the number of hops to a leaf == depth.
How does a b-tree help with database access?
Most indexes are too large to fit into memory, which means that they are going to be stored on disk. Since I/O is usually the most expensive thing you can do in a computer system, these indexes need to be stored in an I/O efficient way.
A b-tree is a good way to do this. If we make the nodes the size of a physical I/O block, it would take one I/O to move to a lower depth in the tree. In the example below, an index was created on a first name kind of field.
If every level were an I/O it would take 3 I/Os to find Mary (or any other leaf).
How good is the index?
Now back to the original point I was trying to make-- low cardinality fields make bad indexes. Why is this the case? The answer here is really about selectivity.
unique index values selectivity = ----------------------- total number records
A primary key is highly selective. If there are 1000 rows, there will be 1000 unique keys in the index. Eacy unique key will return at most 1 row. The index will be 100% selective (1000/1000).. the best you can get.
Now let's say we have an index on a low cardinality thing like gender. If we had 1000 records, the selectivity is in the database is 2/1000 =.2%. Said in another way, 500 records come back per unique key (1000 records / 2 uniques).
Note: this seems to assume an even distribution of data (e.g. 500 male, 500 female). Things might be different if you had 999 males, and 1 female.
10% selectivity is the minimum selectivity necessary for a b-tree index to be helpful.
What to do about low cardinality columns?
I'm not going to go into it in any detail. You can either use a different kind of index (e.g. Bitmap index) or combine that column with another to make a highly selective composite index.
Oracle way to find low selectivity indexes
Run this query to get an idea of how your indexes are set up.
SELECT INDEX_NAME "NAME", DISTINCT_KEYS / NUM_ROWS * 100 "SELECTIVITY %", NUM_ROWS, DISTINCT_KEYS "DISTINCT", LEAF_BLOCKS, CLUSTERING_FACTOR, BLEVEL "LEVEL", AVG_LEAF_BLOCKS_PER_KEY "ALFBPKEY" FROM DBA_INDEXES WHERE DISTINCT_KEYS / NUM_ROWS < .1 AND NUM_ROWS > 0 ORDER BY "SELECTIVITY %" DESC;
B-tree indexes are created to decrease the amount of I/O required to find and load a set of data. A highly selective index uses least amount of I/O necessary, poorly selective indices are not much better than a table scan.
How to measure index selectivity
Effective indexes for Data Warehouses
Density and selectivity in column indexes
Verifying index statistics
Index clustering factors
Inside Oracle Indexing
tree structure definition