Atli Oddsson wrote:
The subject pretty much sums up the question, "How to know when an index has become unbalanced?"
I know how to rebuild an index and update the statistics, but how can I know the state of the index itself and whether it needs rebuilding?
Example: I have a primary key on an autoincrement column. I'd assume the index tree would get unbalanced rather quickly - right?
Ann W. Harrison answers:
The short answer is that Firebird indexes basically don't go out of balance. If you took a really bad data structures course in the early dark ages of computer science, you may have seen indexes that grew at the bottom and developed some branches that were much longer than others. Professional indexes haven't worked like that in the last 30 years. They grow from the bottom up and all branches are always the same length. Until 1987 or so, InterBase indexes would get unbalanced in a way if you stored keys in ascending order (e.g. generator or timestamp based) and deleted the old rows. All branches would be the same length, but there would be large hollow areas of empty pages on the left hand (low) side. But in 1987 or 1988, we added index recompression, so that doesn't happen any more.
Until Firebird, building an index in ascending order (like generator or timestamp keys) would create an index that was balanced but only half full. When an index page is about to overflow, Firebird splits it. Normally, half the rows go into the new page and half remain in the old page so new entries can be accommodated. But with an ascending key, new entries are always bigger than existing entries, so empty space in the old page is never used and the index is built half full. Now Firebird notices that a split comes at the end of the index, and puts only the newest entry in the new page, so ascending indexes are built very nearly full.
However, random deletions and modification can lead to partly filled indexes. The tool for identifying such problems is gstat. It reports the fill level in index pages - if it is below 70%, you should probably rebuild the index unless the keys are very large - large keys don't pack well into pages.
Atli Oddsson continues:
Considering your answer, is it correct to say that SET STATISTICS requests are now useless?
Helen Borrie answers:
No, SET STATISTICS hasn't anything to do with whether indexes are "balanced" or not (which is physical thing). It recalculates the selectivity of the index, i.e. the distribution of values across the index or, generically, its usefulness (high=good, low=bad). A unique index has the highest selectivity since there is at most one occurrence of its value[s]. If you are looking at the statistics stored in rdb$indices, those with the high numbers are the least selective.
Ann W. Harrison answers:
No - as Helen said, "balance" is a physical attribute of an index and indicates how deep the index is - that is, how many pages must be read to go from the top of the index to the bottom. Statistics are logical and reflect the number of distinct values in the index vs. the total number of entries.
Eons ago, before Jim started writing relational databases, he worked on a query language that ran against ISAM files and CODASYL databases. It included an optimizer so Jim wanted to know how many records were in each file or record type... The CODASYL system kept statistics, the ISAM system didn't. The project lead of the ISAM project didn't want to keep statistics. Jim said they could be approximate. He said, "OK, then thirteen." Later profiling showed that under load, the CODASYL system spent about 10% of its time maintaining statistics or waiting for other transactions to finish updating statistics.
The world is very different now, but Firebird still keeps fuzzy optimization statistics - for cardinality, it uses a rough page count. For selectivity, it uses the selectivity set when the index was created. The latter is OK if you store data then create indexes because subsequent data is likely to follow the rough selectivity of the original data. It's not good at all if you create indexes before you store data. That's when SET STATISTICS is your friend.