101 about Firebird indices

Kjell Rilbe wrote:

Apparently it's not possible to deactivate a primary key index. Seems reasonable. So, how do I rebuild it?

Sean Leyne answers:

Why do you need to rebuild it?

Unlike user indexes, primary key indexes have a fixed selectivity which means that they never become "unbalanced" -- so they never need to be re-indexed.

Ann W. Harrison answers:

Sean, I agree with your conclusion, but have trouble with the reasons. Firebird indexes do not get unbalanced at all. Selectivity has little or nothing to do with index balance.

Primary key indexes - and unique indexes - do not suffer from bad selectivity values because their selectivity is known a priori. Other indexes created on an empty table will have incorrect selectivities after data is loaded and should at a minimum have the selectivity set after a large load or other operation so the stored selectivity matches the actual data.

Now for my contention that Fire indexes don't get unbalanced. The classic Database 101 b-tree adds layers going downward so the path to one record from the top of the tree is say 5 layers while the path to another record is 15 layers. That's why databases don't use simple b-trees.

In the description below, I'm ignoring jump nodes which are important but not significant to the discussion.

A Firebird index starts as a single page containing pairs of values and record identifiers. Eventually, someone wants to add another pair and it doesn't fit on the page. Firebird allocates a new page, copies half the pairs into the new page*, then allocates a second page with two entries, each consisting of a value, a record identifier, and a page number. The page numbers are the original index page and the new index page. The values and record numbers are the first pairs from the lower pages. The the system goes along, happily filling bottom level pages, splitting them as necessary, and adding new pages to the top page. Eventually, the top page fills, splits, and creates another top page with two entries, one for the first half of the former top page and one for the second.

If you think about that for a second, you'll realize that the length from the top to every lowest level entry is exactly the same.

Oh, but what about deleted records. Won't I have an unbalanced index if my key is always increasing with time and I delete the old records. The answer is no, because Firebird does almost exactly the reverse when pages get below 1/3 full - it combines them and removes the pointer from the next level up. That turns out to be very hard and has taken years of Vlad's life to make work under load as pages are added to and removed from indexes.

Kjell Rilbe wrote:

So, when IS it relevant to rebuild a Firebird index, which I understand happens when an index is deactivated and then reactivated?

Ann W. Harrison answers:

Index pages are recombined when twp adjacent pages are each less than (roughly) 30% full. Sorry, I don't remember the exact number and am not going to look it up. That reduces - but does not eliminate - the problem of releasing and index page and instantly needing it back again because the recombined page will only be 60% full. If you've got a table from which records are regularly deleted across the range of the index, gstat may tell you that the index pages are less than 50% full on average. That might be a reason to rebuild the index - unless you expect to create new records with key values all across the range of the index.

In fact, the only reason I can think of for rebuilding an index is that you want a better fill level - whether the source of the problem is deleted records or creating the index before loading a large amount of data in random order relative to the index.

And some people just like doing maintenance and sleep better knowing their indexes have been scrubbed.

Kjell Rilbe wrote:

We're preparing a database for an online system and have imported a large amount of data. The DB file is currently 42 gigabyte. There are two tables with over 110 million records and then some with about 1-2 million records. Most other tables only have a few hundred or a few thousand records. No records are deleted.

Page size is 4096.

Now, after the batch import finishes, I assume the indexes could use with some "conditioning". But what's my best option?

Just set statistics on all of them?
Rebuild all indexes with deactivate/reactivate?
Do a backup/restore cycle with gbak?
Other options?

Would you handle indexes differently depending on selectivity? For example, the "largest" table, with over 110 million records, has a primary key where key values are incremental (not always +1, but always +something). Is it better to leave that index without a rebuild?

Also, what considerations should I make regarding page size? Maybe I should bump up the page size?

Downtime is no problem. More important is to maximize index and query performance before the system is put online.

Ann W. Harrison answers:

If you had asked before you did the load, I would have said, import with indexes off, then create the indexes.

Ad 1. Absolutely.

Ad 2. Will produce dense index pages - which may be good or bad, depending on what you do next. If the database continues to grow significantly across all index ranges, then having some space in the index reduces future splits. If you've got most of the data you expect to have, maximize packing density to reduce the number of index reads.

Ad 3. Overkill.

Ad 4. Don't activate the indexes until the data is stored.

The normal index loading does have an optimization when the page to be split is the last one in the index. Normally, a split puts half the data in each page. If the split is the last page in the index, the new page has only the record that didn't fit. That means that if you load in key order, the index will be created dense.

About page size, it depends on how deep are your indexes.

Kjell Rilbe wrote:

Seems that on the largest table, they all have depth 4. Are there any figures in this report that would warrant a config change or anything, e.g. different page size?

Ann W. Harrison answers:

Yes, I'd double the page size. Every new index access reads the full depth of the index tree, so a four level index is going to be slower than a three level index. With luck, a range retrieval will read across the bottom of the index, but if it conflicts with an index update, it starts again from the top.

Looking at a couple of indexes:

Analyzing database pages ...
Uppgift (567)

    Index IX_PK_Uppgift (0)
        Depth: 4, leaf buckets: 588728, nodes: 131418408
        Average data length: 5.08, total dup: 0, max dup: 0
        Fill distribution:
             0 - 19% = 2968
            20 - 39% = 15
            40 - 59% = 279494
            60 - 79% = 185021
            80 - 99% = 121230

Hmm. That one could certainly be more dense. Is it possible to sort your input data by primary key before you store it? In case it's not clear to everybody a "leaf bucket" is a bottom level index page. Index pages are called buckets for reasons lost in the mists of time. And index trees have the leaves on the bottom. "Old Frothingslosh, the pale stale ale with the foam on the bottom". Maybe I need less coffee.

Index IX_Uppgift_BorttagsuppA5K (8)
    Depth: 4, leaf buckets: 253733, nodes: 136466214
    Average data length: 0.00, total dup: 136444473, max dup: 119722141
    Fill distribution:
         0 - 19% = 127
        20 - 39% = 7685
        40 - 59% = 131314
        60 - 79% = 28000
        80 - 99% = 86607

And this one ... hmm again. Once again, medium fill level, but here the average data length is zero to two decimal places. Which is good, but suggests lots of duplicates.

nodes:     136,466,214 <- that's the number of entries in the index
total dup: 136,444,473 <- that's the number that are identical
max dup:   119,722,141 <- that's the longest chain of a single value.

Amazing that it works at all. Do you know which value has 119 million instances? It was less than 10 years ago when gstat changed its accumulators from 16 to 32 bits.

Index IX_Uppgift_Till├ñggsuppYNC (9)
    Depth: 4, leaf buckets: 196830, nodes: 131804550
    Average data length: 0.02, total dup: 131766961, max dup: 11251
    Fill distribution:
         0 - 19% = 9
        20 - 39% = 1
        40 - 59% = 3217
        60 - 79% = 44
        80 - 99% = 193559

This index looks pretty good. Yes there are a lot of duplicates, but they're distributed pretty evenly.

Sergio wrote:

Hello, I've been reading this FAQ and I undestand that having an index on a boolean field is the worst of the cases. But I've also seen that if you have (as in my case) very little "active" records (MyBoolean = 1) and a LOT of inactive records, is good to have an index to select the active records.

Is that True? Should I create an index on my boolean field?

My paricular case is a calendar with events in some days. I need to select the pending events. They will be always, lets say, less than 100, while the table, as the the years pass will grow a lot with the old (not pending) events...

Ann W. Harrison answers:

In very early versions of Firebird, an index like that would have caused enormous problems during garbage collection. Now there's an algorithm that makes cleaning out old duplicate entries fast enough.

An index will improve performance when you're looking for the 1% of the records that are not like the rest, if you can convince the optimizer to use it. The optimizer may be smart enough to ignore indexes with terrible selectivity, in which case you may need to add a plan to force use of the index. But if you accidentally run a query that looks for the 99% of records that all have the same value (and the optimizer chooses to use that index) you'll get poor performance. You can avoid that by adding a nonsense OR clause - (pending = 0 OR 1 = 2) - that will cause the optimizer to ignore the index.

Besides, adding and dropping an index is pretty trivial. Try it, if you like it, keep it, otherwise get rid of it.

101 about Firebird indices

Kjell Rilbe wrote:

Sean Leyne answers:

Ann W. Harrison answers:

Kjell Rilbe wrote:

Ann W. Harrison answers:

Kjell Rilbe wrote:

Ann W. Harrison answers:

Kjell Rilbe wrote:

Ann W. Harrison answers:

Sergio wrote:

Ann W. Harrison answers:

Related Articles

Author

Reading Time

Published

Category

Tags