by Paul Beach

Introduction

A common problem when developing new applications is determining the amount of hardware that will be required to run the application, in particular the amount of disk space. The amount of disk space required can also indicate whether multiple files should be used to reduce the load on the disk drives and shorten the time required for backups.

Obviously the simplest way to determine the amount of disk space required is to wait until the application has been developed and installed, and then look at the database size and growth patterns.

The alternative is to estimate the amount of disk space required in advance. This can be done at several phases in a project, from requirements definition or proposal response, right through to the coding phase.

The estimates typically become more accurate with each phase.

Database sizing is the most reliable method of determining disk space requirements for an application. An algorithm which takes into account the physical storage of tables and indexes is used to calculate an estimate of the disk space requirements.

However, the result is only an approximation of the amount of disk space required. Generally a margin of error of at least plus or minus 10% should be expected when an estimate produced by an algorithm (such as the one below) is compared against the real database.

Note that the estimates calculated refer to the space occupied by the database data. You also need to take into account how the database is extended. Currently Firebird on *nix extends the database one page at a time. In the old days on VMS and MPE/XL the database grew by doubling its previous size until the increment reached 16,000 bytes; the file was then extended 16,000 bytes at a time. Nowadays on Windows (since 2.1) the extension of the database file is determined by the configuration parameter DatabaseGrowthIncrement: when the database engine needs to allocate more disk space, it allocates a block that is 1/16th of the space already allocated, but not less than 128 KB and not greater than the DatabaseGrowthIncrement value. The DatabaseGrowthIncrement value can be increased to raise the maximum size of newly-allocated blocks above the default 128 MB, or set to zero to disable preallocation altogether. Note that shadow database files are not preallocated.
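
To make the rule concrete, here is a minimal Python sketch of the preallocation behaviour just described; the function name and the example figures are hypothetical, and the logic is paraphrased from the description above rather than taken from the engine source.

    KB = 1024
    MB = 1024 * KB

    def preallocation_block(allocated_bytes, growth_increment=128 * MB):
        # Next block the engine preallocates (Windows, Firebird 2.1+ as
        # described above): 1/16th of the space already allocated,
        # clamped to the range [128 KB, DatabaseGrowthIncrement].
        # A DatabaseGrowthIncrement of 0 disables preallocation.
        if growth_increment == 0:
            return 0
        block = allocated_bytes // 16
        return max(128 * KB, min(block, growth_increment))

    # Example: a 1 GB database with the default 128 MB limit
    print(preallocation_block(1024 * MB) // MB)   # -> 64 (MB)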

Database sizing helps in choosing an initial database size, which takes into account the initial amount of data and the anticipated size of the data. The algorithm does not take into account the space allocated to the database but not yet used by the data. The space allocated depends upon the number of pages allocated at creation time, if specified, as well as the size of the extensions to the database (see above).

The margin of error in the database sizing algorithm depends upon the level of confidence in the correctness of the information available. The best approach is to do some “what if” analysis to determine the impact of the margin of error. If the estimate is out by 50%, and the database would still fit in the available space, there would be little cause for worry. However, if an error of 15% would require a new disk drive, this should be highlighted as a strong possibility.

In order to size a database the following information is needed.

  • The number of tables in the database
  • The datatypes and sizes of each field in each table
  • The indexes defined on each table
  • The expected average number of records for each table
  • The expected peak number of records for each table (for worst case analysis)
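
As a sketch only, this information might be captured in a structure like the following before applying the algorithm; the table, field and figure choices are hypothetical and purely illustrative.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class IndexSpec:
        avg_key_size: float        # average key size in bytes (compressed where applicable)

    @dataclass
    class TableSpec:
        name: str
        field_sizes: List[int]     # size of each field's datatype, in bytes
        avg_records: int           # expected average number of records
        peak_records: int          # expected peak number of records (worst case)
        indexes: List[IndexSpec] = field(default_factory=list)

    # Example: a hypothetical CUSTOMER table with two indexes
    customer = TableSpec(
        name="CUSTOMER",
        field_sizes=[4, 8, 30, 12],   # e.g. id, balance, name, phone
        avg_records=50_000,
        peak_records=80_000,
        indexes=[IndexSpec(avg_key_size=7), IndexSpec(avg_key_size=12)],
    )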

There are certain factors influencing the database size which cannot always be accurately estimated, for example the amount of data and index compression. Where data and index compression is an issue, the database should be sized both for the anticipated compression and also for no compression in order to obtain an upper and lower bound on the database size.

The sizing formula described in this article is based on an analysis of the way data is stored in a Firebird database. It includes overhead per key, per record and per block, takes into account average fill factors, and allows for data compression. The formula was tested by applying it to a number of databases, some of which already existed (for other purposes) and some of which were created just to check the accuracy of the algorithm. The margin of error was consistently within plus or minus 10%. The results are listed below.

When using the sizing algorithm, consideration should be given to the following:

  1. The amount of key and data compression; no compression provides the worst case analysis, otherwise an estimate of the percentage of compression is necessary. The accuracy of the results is obviously dependent upon the accuracy of the estimated compression size.
  2. The overhead for versioning. A version stores the delta change in a record, so the size of the version is a function of the changes made.
  3. The size of the metadata. In most cases the size of the data should dwarf the size of the metadata. The size of an empty database that is used in the algorithm should be sufficient for small, simple databases. If the database is very complex, with many tables, triggers, stored procedures, user-defined functions and so on, and only relatively small quantities of data, then the database should be prototyped to obtain the size of the metadata.

Tests Performed

The tests were performed on several different databases. In each of the following tests the size of the database is expressed in megabytes.

Fixed length data

                    Estimated   Actual   Difference
    No index        1.92        1.99     -4%
    One index       2.27        2.24     +1%
    Two indexes     3.03        2.81     +8%
    Three indexes   3.71        3.43     +8%

String data with index on a and b

                                        Estimated   Actual   Difference
    Uncompressed string length 10      3.44        3.37     +2%
    Uncompressed string length 40      5.01        4.69     +7%
    Compressed string length 40        3.00        3.00     0%

Compressed string data length 40

String data with index on a and b
                                      Estimated   Actual   Difference
    With index compression            5.71        5.49     +4%
    With no index compression         6.84        6.75     +1%

Fixed length data multiple tables

                    Estimated   Actual   Difference
    Two indexes     2.89        2.82     +3%

TPCB database for 1 branch

                     Estimated   Actual   Difference
    With 1K pages    7.90        7.85     +1%
    With 8K pages    8.75        8.07     +8%

TPCB database for 4 branches

                     Estimated   Actual   Difference
    With 1K pages    30.98       31.73    -2%
    With 8K pages    31.61       30.93    +2%

In each of the above cases a data page fill factor of 70% and an index fill factor of 60% were used. These appear to be reasonable averages; however, the range for both of these values is between 55% and 75%, depending upon the data and the application. For greater accuracy in estimating the database size, create a prototype database, obtain the correct size of the database overhead, and add a few blocks' worth of data to obtain more accurate fill factors. The fill factors can be obtained by running GSTAT.

Sizing Algorithm

[DBO] Size of database overhead for metadata – a function of the page size

1K = 200 Pages
2K = 170 Pages
4K = 150 Pages
8K = 140 Pages

[DB] Size of the database

[T] Size of table

[A] Size of attribute

2 for smallint (short integer)
4 for float
4 for integer (long integer)
8 for double precision
n for character (char and varchar)

For both char and varchar fields, n should be the average compressed length. For Blobs and arrays, estimate the size.

[D] Number of data pages

[I] Number of index pages

[P] page size (1K, 2K, 4K, 8K)

R # records/table

RP # records/page

KP # keys/page

  1. The size of the database in pages consists of the overhead for the metadata plus the sum of the sizes of each of the tables

    [DB] = [DBO] + Sum ([T])

  2. The size of a table is the number of pages of data added to the number of pages for each of the indexes on the table

    [T] = [D] + sum ([I])

    (for all the indexes on the table)

  3. The number of data pages for a table is the number of records divided by the number of records per data page

    [D] = R/RP

    RP = (P * 0.70 - 8) / (18 + sum ([A]))

  4. The number of index pages for an index is the number of records divided by the number of keys per page

    [I] = R/KP

    KP = (P * 0.60 - 20) / (7 + avg compressed key size)

  5. When calculating the compressed key size, remember that 3 bytes are used for compression, so the average compressed key size is the average number of bytes remaining after compression plus 3. To obtain an upper bound, always do the calculation with no compression as well.
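
To pull the steps together, here is a minimal Python sketch of the algorithm; the function and parameter names are my own, while the constants (the 8 and 18 byte data page and record overheads, the 20 and 7 byte index page and key overheads, the [DBO] page counts and the 70%/60% fill factors) come from the definitions and steps above.

    import math

    # [DBO] in pages, as a function of the page size
    DBO_PAGES = {1024: 200, 2048: 170, 4096: 150, 8192: 140}

    def table_pages(page_size, field_sizes, records, index_key_sizes,
                    data_fill=0.70, index_fill=0.60):
        # [T] = [D] + sum([I]) for one table, in pages
        # Step 3: records per data page, then the number of data pages
        rp = (page_size * data_fill - 8) / (18 + sum(field_sizes))
        d = math.ceil(records / rp)
        # Step 4: keys per page, then index pages, for each index; the
        # key sizes should already include the 3 bytes from step 5
        i = 0
        for key_size in index_key_sizes:
            kp = (page_size * index_fill - 20) / (7 + key_size)
            i += math.ceil(records / kp)
        return d + i

    def database_pages(page_size, tables):
        # [DB] = [DBO] + sum([T]); tables is a list of
        # (field_sizes, records, index_key_sizes) tuples
        return DBO_PAGES[page_size] + sum(
            table_pages(page_size, f, r, k) for f, r, k in tables)

    # Example: one table of 100,000 records with 40 bytes of data per
    # record and one index with an average compressed key size of 10
    # bytes (7 bytes after compression plus 3), on 4K pages.
    pages = database_pages(4096, [([4, 4, 32], 100_000, [10])])
    print(pages, "pages =", round(pages * 4096 / (1024 * 1024), 1), "MB")

Running the same calculation with uncompressed key and field sizes gives the upper bound described in step 5.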
