Morph Firebird into a high performance, distributed, fully redundant database

Both JRDs (DEC Rdb/ELN and Interbase) were designed as elastic database systems for VAXClusters. They are share-nothing engines, communicating exclusively through a distributed lock manager and network-attached disks. The flaw in this architecture, as Oracle learned with early versions of RAC, is that transferring a disk page between server processors requires a disk write followed by a disk read, resulting in unacceptable latency.

The phases listed below are conceptual. Implementation could combine phases or implement them out of order.

Phase 0 (Hybrid classic / superserver)

Classic synchronizes with page locks controlled by the lock manager, allowing multiple processes to share the database disk file. Superserver uses lightweight latches to synchronize page access between threads. Vulcan implemented a hybrid engine in which both mechanisms were active. This is the engine from which subsequent phases evolve.

Phase 1 (Page server)

A network-accessible page server process replaces the physical I/O layer (PIO) in the hybrid engine. The page server maintains its own write-back cache, so a dirty page transfer between two engines requires only network transfers, leaving the dirty page queued in the page server for eventual write to disk.
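As a rough illustration of the write-back idea, here is a minimal sketch. The class name PageServer, its methods, and the dictionary standing in for the database file are all hypothetical and are not taken from the Firebird or Vulcan code base. It shows only that a dirty page handed over by one engine can be served to another without touching disk.

```python
class PageServer:
    """Toy page server with a write-back cache (hypothetical API, not Firebird code)."""

    def __init__(self, disk):
        self.disk = disk        # page number -> bytes; stands in for the database file
        self.cache = {}         # page number -> latest page image
        self.dirty = set()      # pages changed since the last disk write
        self.disk_writes = 0

    def fetch(self, page_no):
        # Serve from cache when possible; fall back to disk on a miss.
        if page_no not in self.cache:
            self.cache[page_no] = self.disk[page_no]
        return self.cache[page_no]

    def store(self, page_no, image):
        # An engine hands over a dirty page: cache it and defer the disk write.
        self.cache[page_no] = image
        self.dirty.add(page_no)

    def flush(self):
        # Write queued dirty pages whenever the server finds it convenient.
        for page_no in sorted(self.dirty):
            self.disk[page_no] = self.cache[page_no]
            self.disk_writes += 1
        self.dirty.clear()


server = PageServer(disk={7: b"old"})
server.store(7, b"new")                   # engine A releases a dirty page
seen_by_b = server.fetch(7)               # engine B receives it with no disk I/O
writes_before_flush = server.disk_writes
server.flush()                            # the eventual write to disk
```

In a real deployment fetch and store would be network requests; the point here is only the ordering of cache and disk activity.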

Phase 2 (Distributed lock manager)

The Firebird lock manager is replaced by a distributed lock manager (home brew or existing). Interbase originally used the VMS distributed lock manager on VMS. All other Interbase lock managers were built to emulate VMS distributed lock manager semantics (not a big deal, as the DLM was designed by Steve Beckhardt specifically for DBMS systems).

Phase 3 (Page delta logging)

Up to this point, page writes require that full dirty pages be sent to the page server to be written. In this phase, the engine sends page changes (deltas) to the page server to be applied to database pages. This reduces the network traffic for a TIP change, for example, from 4K bytes to about a dozen bytes. It also allows batching of many page change records into a single physical block for transfer to the page server. The page server applies the page change records to pages in its cache before writing the page or sending a page image to another engine process. For restartability, it may be useful to post page change records to a non-buffered serial log prior to applying commit records.
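To make the size argument concrete, here is a minimal sketch of a page change record and the routine that applies it. The PageChange type and its wire layout (4-byte page number, 2-byte offset, 2-byte length, then the payload) are assumptions made for illustration, not an existing Firebird format.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096


@dataclass(frozen=True)
class PageChange:
    """One delta against one page (hypothetical record layout)."""
    page_no: int
    offset: int
    data: bytes

    def encoded_size(self):
        # Assumed wire format: 4-byte page number, 2-byte offset,
        # 2-byte length, followed by the changed bytes.
        return 4 + 2 + 2 + len(self.data)


def apply_changes(page, changes):
    """Apply change records, in order, to a mutable page image."""
    for change in changes:
        end = change.offset + len(change.data)
        if end > len(page):
            raise ValueError("change record runs past the end of the page")
        page[change.offset:end] = change.data
    return page


page = bytearray(PAGE_SIZE)
# A transaction state change touching a single byte of a TIP page.
change = PageChange(page_no=1, offset=100, data=b"\x01")
apply_changes(page, [change])
record_bytes = change.encoded_size()
```

Under these assumptions a one-byte TIP update costs 9 bytes on the wire instead of a full 4096-byte page, which is the order of magnitude the text describes.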

Phase 4 (Peer to peer distribution)

Another lock type, Existence, is defined between None and Shared, used to indicate that a process has an otherwise unlocked copy of a page in its cache. When an engine modifies a page (with an Exclusive lock, of course), page change messages are sent to all processes with Existence locks on that page, enabling other engines to maintain up-to-date versions of pages not actively in use. If an engine requires access to such a page, it requires only a page lock, skipping the page transfer. This does break the layering between the engine and distributed lock manager, requiring a custom or semi-custom lock manager.
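The following sketch shows one way the Existence level could sit between None and Shared and how a change might fan out to Existence holders. The compatibility rule (Existence never conflicts with anything) and the PageLockTable class are assumptions for illustration; a real lock manager would queue conflicting requests rather than raise.

```python
from enum import IntEnum


class Lock(IntEnum):
    NONE = 0
    EXISTENCE = 1      # cached copy held, page otherwise unlocked
    SHARED = 2
    EXCLUSIVE = 3


def compatible(held, requested):
    # Assumed rule: None and Existence never conflict; Shared coexists
    # with Shared; Exclusive conflicts with Shared and Exclusive.
    if held <= Lock.EXISTENCE or requested <= Lock.EXISTENCE:
        return True
    return held == Lock.SHARED and requested == Lock.SHARED


class PageLockTable:
    """Toy lock table that also routes page changes (hypothetical)."""

    def __init__(self):
        self.holders = {}    # page_no -> {engine: Lock}
        self.inbox = {}      # engine -> [(page_no, change), ...]

    def set_lock(self, engine, page_no, level):
        holders = self.holders.setdefault(page_no, {})
        for other, held in holders.items():
            if other != engine and not compatible(held, level):
                raise RuntimeError("lock conflict")
        holders[engine] = level
        self.inbox.setdefault(engine, [])

    def post_change(self, engine, page_no, change):
        if self.holders.get(page_no, {}).get(engine) != Lock.EXCLUSIVE:
            raise RuntimeError("modifying a page requires an Exclusive lock")
        for other, held in self.holders[page_no].items():
            if other != engine and held == Lock.EXISTENCE:
                self.inbox[other].append((page_no, change))


table = PageLockTable()
table.set_lock("B", 5, Lock.EXISTENCE)    # B keeps a cached copy of page 5
table.set_lock("C", 5, Lock.NONE)         # C has no copy at all
table.set_lock("A", 5, Lock.EXCLUSIVE)    # A may now modify page 5
table.post_change("A", 5, (100, b"\x01"))
```

Engine B receives the change and can keep its copy current; engine C receives nothing and would need a page transfer if it later wanted the page.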

Phase 5 (Redundant page servers)

With Phase 4 in place, additional page servers can be added, where each page server has automatic Existence locks on all pages. If necessary, a page server may need to read a page before it can apply a page change record. The tricky part is synchronizing a page server entering (or re-entering!) a Firebird cluster, in which case it needs to request page images from another page server prior to applying page change records.

At this point, Firebird is an elastically scalable, fully redundant database system running on commodity servers without the need for exotic hardware. Multi-client servers can be started on as many machines as necessary to meet performance and availability requirements. Redundant page servers can be added to provide arbitrary levels of durability. The gating limits of performance are network bandwidth and disk bandwidth on page servers to eventually write disk pages (unlike legacy Firebird, however, many page changes could be applied to a physical page before it eventually gets written to disk).

And, happily, this extended architecture is fully compatible with a single-process multi-client server running with latches and local disk without a running page server, so it has no impact on the low-end and entry-level instances.

To elaborate somewhat, a very small number of changes would need to be made to the Firebird engine -- the essential architecture is already in place. The changes are mostly to DPM and below. Ironically, the code to generate page change records used to be part of the code base to support long-term journalling. Some of that code may still exist.

The major new piece is a separate component, the page server. By phase 5, it will be necessary to integrate page change and lock manager traffic, so perhaps a better name would be page/lock server. Code historians may find existing references to an ancient page/lock server, circa 1987, which I never completed.

At the end of phase 5, Firebird would run exactly as it does now except:

  1. Pages would be fetched from the page/lock server rather than disk.
  2. Page change records would be queued for delivery to the page/lock server.
  3. The message queue to the page/lock server would be flushed when full or when a page with page changes is released.
  4. The engine would post page changes from other instances to pages in cache with an Existence lock.
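The queueing and flush rules in the list above can be sketched as follows. The ChangeQueue class, the send callable, and the default 1400-byte capacity (chosen to fit inside a typical Ethernet payload) are hypothetical names and values, not part of the existing engine.

```python
class ChangeQueue:
    """Engine-side queue of page change records bound for the page/lock server."""

    def __init__(self, send, capacity_bytes=1400):
        self.send = send                    # ships one batch (hypothetical transport)
        self.capacity_bytes = capacity_bytes
        self.records = []                   # (page_no, record bytes) in posting order
        self.used = 0
        self.pending_pages = set()          # pages with unsent changes

    def post(self, page_no, record):
        if self.used + len(record) > self.capacity_bytes:
            self.flush()                    # rule: flush when the queue is full
        self.records.append((page_no, record))
        self.used += len(record)
        self.pending_pages.add(page_no)

    def release_page(self, page_no):
        if page_no in self.pending_pages:
            self.flush()                    # rule: flush when a changed page is released

    def flush(self):
        if self.records:
            self.send(self.records)
        self.records, self.used = [], 0
        self.pending_pages = set()


sent = []
queue = ChangeQueue(sent.append, capacity_bytes=32)
queue.post(1, b"x" * 12)
queue.post(2, b"y" * 12)
queue.post(3, b"z" * 12)        # would overflow 32 bytes, so pages 1 and 2 ship first
queue.release_page(9)           # page 9 has no queued changes: nothing is sent
batches_after_unrelated_release = len(sent)
queue.release_page(3)           # page 3 has queued changes: flush before unlocking
```

Flushing the whole queue on release, rather than only the records for that page, keeps records in posting order, which is the simplest way to preserve ordering across pages in a sketch like this.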

The really nifty part of the architecture is that pages themselves are never written by an engine. The page/lock server applies page changes, writing its updated pages to disk at its convenience. In the meantime, each engine is applying page changes to pages in cache (locked with Existence), so an engine can acquire a Shared or Exclusive lock on a page simply by requesting (and receiving) the lock, with no page transfer. This works because:

  1. Page change records will always be transmitted before that page is unlocked by an engine.
  2. If another engine requests a lock on that page, it will always receive any pending page change records before the lock grant record.
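The two guarantees above amount to an ordering rule on a single message stream per engine. A minimal sketch, assuming one FIFO channel from the page/lock server to each engine and hypothetical message kinds "change" and "grant":

```python
from collections import deque


class EngineCache:
    """Engine-side end of one FIFO stream from the page/lock server (hypothetical)."""

    def __init__(self, pages):
        self.pages = {no: bytearray(img) for no, img in pages.items()}
        self.inbox = deque()        # messages arrive in the order the server sent them
        self.image_at_grant = {}    # page contents at the moment the lock arrived

    def drain(self):
        while self.inbox:
            kind, page_no, payload = self.inbox.popleft()
            if kind == "change":
                offset, data = payload
                self.pages[page_no][offset:offset + len(data)] = data
            elif kind == "grant":
                self.image_at_grant[page_no] = bytes(self.pages[page_no])


def grant_lock(engine, page_no, pending_changes):
    """Server side: forward pending change records first, then the lock grant."""
    for offset, data in pending_changes:
        engine.inbox.append(("change", page_no, (offset, data)))
    engine.inbox.append(("grant", page_no, None))


engine_b = EngineCache({3: b"aaaa"})                  # stale copy held under Existence
grant_lock(engine_b, 3, pending_changes=[(1, b"ZZ")])
engine_b.drain()
```

Because the grant is queued behind the change records, the cached page is already current by the time the engine sees the lock.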

The net volume of network traffic will be vastly less than current Firebird because a) only page changes are transmitted, and b) page change records are blocked into Ethernet-packet-sized blocks (when possible). Add to that the fact that a gigabit Ethernet round trip takes about 100 microseconds while a disk transfer takes about 6 milliseconds (on a good and lucky day), and the result is many fewer I/Os and much faster I/Os.
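The figures quoted above work out as follows. The 12-byte change record is a stand-in for "about a dozen bytes", and the 1500-byte Ethernet payload is an assumption (standard MTU, ignoring protocol headers).

```python
ETHERNET_ROUND_TRIP_US = 100      # from the text: about 100 microseconds
DISK_TRANSFER_US = 6_000          # from the text: about 6 milliseconds
FULL_PAGE_BYTES = 4096            # 4K page
CHANGE_RECORD_BYTES = 12          # "about a dozen bytes"
ETHERNET_PAYLOAD_BYTES = 1500     # assumption: standard MTU, headers ignored

latency_ratio = DISK_TRANSFER_US / ETHERNET_ROUND_TRIP_US
bytes_ratio = FULL_PAGE_BYTES / CHANGE_RECORD_BYTES
records_per_packet = ETHERNET_PAYLOAD_BYTES // CHANGE_RECORD_BYTES

print(f"network round trip is {latency_ratio:.0f}x faster than a disk transfer")
print(f"a change record is about {bytes_ratio:.0f}x smaller than a full page")
print(f"about {records_per_packet} change records fit in one packet")
```

So each transfer is roughly 60 times faster, each change roughly 340 times smaller, and on the order of a hundred changes can share a single packet.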

The tricky parts all revolve around the issues of multiple page/lock servers and have almost no effect on the Firebird engine, per se.

I leave it to the project to decide whether or not it wants to pursue this architecture. If so, I'm available as an advisor. The design is intended to allow implementation in phases, so the system is never broken. All that is necessary is will and flexibility of mind.