unordained wrote:

I'll admit, I'm trying to be lazy. Does anyone have precise documentation on the order and algorithm of careful-write? I've found stuff by Ann and Paul (Beach), but they're kind of vague on some points ... the write-ahead-log is disabled, in favor of another mechanism I -almost- understand (and even then, only barely for data pages -- what about writing to an index?).

If anybody knows where such documentation (block- and byte-level) could be found, I'd really appreciate it. If not, I suppose there's always the source code. Any tips on which parts to look at?

Answer from Ann Harrison:

There never was a write ahead log. At one point, there was an optional after image journal which was slow and required somebody to baby-sit the journal device. Only applications that required two-swing of the ax reliability used the AIJ. During the Borland ages, a guru from Sybase decided that the careful write plus non-deadlocking data structures was theoretically unsound and wrote a write-ahead-log package. It had only two problems. First, it was still one-swing of the ax, and second, he could never make it work.

There is no bits and bytes level to careful write. It's all page level stuff. Reduced to its essence, the algorithm says that the database on disk will always be internally consistent. More practically, that means that when you write a page that references another page, that other page must have been written previously in a state that supports your pointer.

Before you write a page that has a pointer from a record to its back version on another page, that other page has to be on disk. Before you write out a new data page, you must write out a version of a page information page (PIP) that shows the page is in use. The new data page has to be on disk, formatted and happy, before the table's pointer page that references the new page can be written.
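
To make that ordering concrete, here is a minimal sketch in Python; the page names and the careful_write() helper are made up, and the only thing the sketch reflects is the order of the three writes:

    # A bare sketch of the required write order when a new data page joins a
    # table. The page names are invented; only the order reflects the rule.
    disk_log = []

    def careful_write(page):
        disk_log.append(page)          # stand-in for a real page write

    careful_write("PIP - page 246 marked as in use")
    careful_write("data page 246 - formatted and ready to hold records")
    careful_write("pointer page - now lists page 246 as part of the table")

    # A crash between any two of these writes leaves the database consistent;
    # at worst page 246 becomes an orphan (see the end of this post).
    print("\n".join(disk_log))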

Inter-page relationships are handled in the code through a dependency graph maintained by the cache manager. Before a page is written, the cache manager checks the graph and writes out all pages that page depends on. If a change would create a loop in the graph, as many pages are written immediately as are necessary to avoid the loop.
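
That bookkeeping can be pictured with a small toy model. The names below (CacheManager, add_dependency, write_page) are invented for illustration and are not the names used in the Firebird sources; the only behaviour the sketch tries to mirror is "flush a page's prerequisites before the page itself, and break a would-be loop with immediate writes":

    # Toy dependency tracker, loosely modelled on the behaviour described
    # above; not Firebird code.
    class CacheManager:
        def __init__(self):
            self.depends_on = {}  # page -> set of pages that must reach disk first
            self.write_log = []   # order in which pages are "written"

        def _reaches(self, start, target):
            """True if target is reachable from start along depends-on edges."""
            seen, stack = set(), [start]
            while stack:
                page = stack.pop()
                if page == target:
                    return True
                if page not in seen:
                    seen.add(page)
                    stack.extend(self.depends_on.get(page, ()))
            return False

        def add_dependency(self, dependent, prerequisite):
            """Require that prerequisite be written before dependent."""
            if self._reaches(prerequisite, dependent):
                # Recording the edge would close a loop, so write the
                # prerequisite (and whatever it depends on) right now instead.
                self.write_page(prerequisite)
            else:
                self.depends_on.setdefault(dependent, set()).add(prerequisite)

        def write_page(self, page):
            # Flush everything this page depends on, then the page itself.
            for prerequisite in sorted(self.depends_on.pop(page, ())):
                self.write_page(prerequisite)
            self.write_log.append(page)

    # Registering the dependencies from the allocation example above: the PIP
    # before the new data page, and the data page before the pointer page.
    cache = CacheManager()
    cache.add_dependency("data page 246", "PIP")
    cache.add_dependency("pointer page", "data page 246")
    cache.write_page("pointer page")
    print(cache.write_log)  # ['PIP', 'data page 246', 'pointer page']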

The tricky bits are identifying dependencies and avoiding the impossible situation - well, those and keeping the system fast. Identifying dependencies just requires meticulous coding. If you have to put a record back version on a different page from the main version, the page with the pointer has a dependency on the page with the back version. If you allocate a new data page, that data page has a dependency on the PIP that records whether the page is in use, and the pointer page that ties the data page into the table has a dependency on the data page.

The impossible situation is one where pages point to each other in a way that can't be separated. Two pages can point to each other - you can have a primary record version on page 124 with its back version on page 125 and a different record with its primary version on page 125 and a back version on 124. The chances that the cache manager will find a cycle in the dependency graph are high, and one page may need to be written twice in the process of flushing the pages out, but it works.
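
Spelled out with the same page numbers, the sequence of writes that untangles such a cycle could look roughly like this; the write() helper and the state notes are invented, but the order follows the careful-write rule:

    # Hypothetical walk-through of the two-page cycle. R1 has its primary
    # version on page 124 and its back version on page 125; R2 has its
    # primary on 125 and its back version on 124.
    write_log = []

    def write(page, state):
        write_log.append((page, state))

    # Storing R1's back version on 125 makes page 124 depend on page 125.
    # Storing R2's back version on 124 would also make 125 depend on 124 --
    # a loop -- so the cache manager writes pages out immediately instead:
    write(125, "holds R1's back version; R2's primary not yet changed")
    write(124, "holds R1's new primary (its pointer to 125 is now safe) "
               "and R2's back version")

    # Only now is R2's primary on 125 changed to point at its back version
    # on 124, which is already on disk in a state that supports the pointer.
    write(125, "holds R2's new primary pointing at the back version on 124")

    for page, state in write_log:
        print(page, "-", state)
    # Page 125 reaches disk twice -- the "written twice" case mentioned above.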

If, on the other hand, you need a double linked chain of pages - index pages come to mind - then each page depends on the other and neither can be written first. In fact, our index pages are double-linked, but the reverse link (high to low in a descending index) is handled as unreliable. It's used in recombining index pages from which values have been removed, but not for backward data scans. The architecture group is currently discussing ways to make the reverse link reliable enough for retrieving data, but we haven't agreed on a solution.

For those who haven't spent an embarrassing part of their adult lives worrying about on-disk consistency and double-linked lists, let me try to explain. Let's suppose that the leaf level of an index consists of pages 124, 125, and 126 in that order. Each has a pointer to its left and right neighbor. Now you want to add a new entry to the index. It has to be put on page 125, but page 125 is full, so you need to add a new page between 125 and 126.

Let's expand the example slightly. Assume that each index page can hold only four values - instead of the hundreds that it actually holds. So, page 124 holds A, B, D; page 125 holds F, H, J, L; and page 126 holds N, P, R. In my notation below, the page number is followed by the index entries in parentheses. "<=>" is a double link; "<-" and "->" are single backward and forward links. Links that skip over a page are drawn as arcs above the line of index pages: "/----" for a backward link, "----\" for a forward link, and "/----\" for a double link. This structure can be scanned in either direction.

124 (A,B,D) <=> 125 (F,H,J,L) <=> 126 (N,P,R)

You want to store K.

The way the index code handles this sort of problem is as follows (a code sketch of the whole sequence appears after the list):

  1. Read the current PIP (page information page) to find a free page - let's say it's 246

  2. Change the PIP to reflect that page 246 is not available

  3. Set up a dependency so that PIP will be written before the new index page

  4. Format the page as an index page

  5. Copy half the index entries from the page that overflowed onto the new page

  6. Copy the pointer to the next index page from the page that overflowed onto the new page

  7. Make the new page point backward to the page that overflowed

  8. Write the new page

    At this point, page 125 still points forward to 126, which points backward to 125. There are two copies of the index entries for J & L, but that doesn't matter because there's no way to get to page 246 - it's not in the upper index level yet and will be skipped by a scan, regardless of direction.

                       /---------------------------\
    124 (A,B,D) <=> 125 (F,H,J,L) <- 246 (J,K,L) -> 126 (N,P,R)
    
  9. Fix the upper levels so they include the first value of the new page. That change may cause an upper level page to overflow, resulting in the same series of steps at that level, and so on up to the top. If the very top page splits, the index gets a new level.

    Now, the upper level entry for J points to 246 rather than 125. Scans still work because anything that starts lower than J will skip node 246 and anything higher than J will skip 125.

  10. Remove the copied index entries from the page that overflowed and change its forward pointer to point to the new page.

  11. Write that page

    At this point we have a structure that looks like this:

                       /--------------------------
    124 (A,B,D) <=> 125 (F,H) <=> 246 (J,K,L) -> 126 (N,P,R)
    

    A forward scan still works, but a backward scan that starts with N or higher will never see the values J, K, and L. The code that handles recombinations does a lot of sanity checking and quits when it sees a problem. That strategy doesn't work for record retrievals.

  12. Fix the back pointer on the page after the new page to point to the new page.

    Now the structure works again.

    124 (A,B,D) <=> 125 (F,H) <=> 246 (J,K,L) <=> 126 (N,P,R)
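
Pulling the twelve steps together, here is a compressed sketch of the same split using the example's page numbers and keys. The dictionaries and the write() helper are invented, and the PIP is written directly instead of through a dependency, but the order of the page writes follows the steps above:

    # Toy model of the leaf-page split described above (not the Firebird
    # index code). Pages hold sorted keys plus left/right sibling numbers.
    pages = {
        124: {"keys": ["A", "B", "D"], "left": None, "right": 125},
        125: {"keys": ["F", "H", "J", "L"], "left": 124, "right": 126},
        126: {"keys": ["N", "P", "R"], "left": 125, "right": None},
    }
    upper_level = {"A": 124, "F": 125, "N": 126}  # first key on each leaf page
    pip_free = {246}                              # free pages according to the PIP
    write_log = []

    def write(note):
        write_log.append(note)

    def insert_with_split(key, full_no):
        full = pages[full_no]

        # Steps 1-3: take a free page from the PIP; the PIP must reach disk
        # before the new index page (here it is simply written first).
        new_no = pip_free.pop()
        write("PIP (page %d now marked in use)" % new_no)

        # Steps 4-7: format the new page, move the upper half of the keys plus
        # the new key onto it, copy the forward pointer, and point the new
        # page backward at the page that overflowed.
        keep, moved = full["keys"][:2], full["keys"][2:]
        new_keys = sorted(moved + [key])
        pages[new_no] = {"keys": new_keys, "left": full_no, "right": full["right"]}

        # Step 8: write the new page. Nothing points to it yet, so the keys it
        # still shares with the full page are harmless duplicates.
        write("index page %d %s" % (new_no, new_keys))

        # Step 9: add an upper-level entry so lookups for the new page's first
        # key and above land on it. (If the upper-level page overflows, the
        # same split happens one level up; this sketch ignores that.)
        upper_level[new_keys[0]] = new_no

        # Steps 10-11: shrink the page that overflowed, point it forward at
        # the new page, and write it.
        full["keys"] = keep
        full["right"] = new_no
        write("index page %d %s" % (full_no, keep))

        # Step 12: finally repair the back pointer of the page after the new one.
        right_no = pages[new_no]["right"]
        pages[right_no]["left"] = new_no
        write("index page %d (back pointer now -> %d)" % (right_no, new_no))

    insert_with_split("K", 125)
    print("\n".join(write_log))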
    

There are a couple of unavoidable awkward situations that occur during page allocation and release, and result in orphan pages and orphan back versions. Orphans are wasted space but do not affect the integrity of the database.

At the page level, gfix will sometimes report orphan pages after a crash.

If the system crashes after a page has been allocated and the PIP written, but before the pointer page that makes the data page part of the table has been written, that data page becomes an orphan. Note that the data on that page is uncommitted because the change that commits a transaction - writing the transaction inventory page with the transaction marked committed - does not happen until all pages that were created by the transaction have been written.

If the system crashes after a pointer page has been written, removing an empty data page from a table, but before the PIP has been written to reflect that the page is free, that page also becomes an orphan.

If the system crashes in the middle of dropping an index or table, gfix may find lots of orphan pages - a single write releases all the pages that were part of the table or index, and that write must happen before any of the PIPs can be changed.

Back versions must be written before the record that points to them and cannot be removed until after the pointer to them has been cleared. A crash between those steps makes the back version an orphan - it occupies space but is not connected to a record.
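
Those two orderings can be sketched the same way; the names are invented and only the order of the writes matters:

    # Hypothetical sketch of the two orderings that protect back versions.
    log = []
    write = log.append

    # Creating a back version on another page:
    write("page holding the back version")              # the back version first,
    write("page holding the record that points to it")  # then the pointer to it

    # Removing it again later (garbage collection):
    write("page holding the record, pointer cleared")   # clear the pointer first,
    write("page holding the back version, space freed") # then reclaim the space

    # A crash between either pair of writes leaves an orphan back version:
    # wasted space, but never a dangling pointer.
    print("\n".join(log))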

Hope this helps some.
