Using KEEPALIVE-sockets to avoid 10054 errors

by Vasiliy Ovchinnikov

Introduction

In systems built on InterBase or Firebird databases that work in real-time or near-real-time modes, the server has to track the status of client connections and forcibly disconnect clients that have become unreachable because the connection was lost.

It is important to promptly release the resources held by such phantom connections, especially on servers with the Classic architecture. If some users connect to the server through an unstable modem connection, the risk of disconnection is rather high.

For instance, a client saves a modified record set, and the connection is lost after UPDATE is executed but before COMMIT.

As a rule, client applications in such situations reconnect to the server. However, the user, who continues working with the same data after receiving the connection failure error, will be unable to save changes and will receive a lock conflict message (“lock conflict on update”) instead. The previous connection, which opened the transaction in which UPDATE was executed but COMMIT was not, still holds these records.

Connection failures may also occur in a local network if the hardware (network cards, hubs, switches) is faulty or badly configured, and/or because of interference in the network. In InterBase and Firebird logs, failures of TCP connections appear as error 10054 in Windows and 104 in Unix; NetBEUI failures appear as errors 108/109.

Methods for controlling hung connections

In InterBase and Firebird, either DUMMY-packets or KEEPALIVE-sockets are used to detect and close such “dead” connections.

In InterBase 5.0 and higher, the DUMMY-packets mechanism is implemented at the application layer, between an InterBase/Firebird server and the gds32/fbclient client library. It is configured in ibconfig/firebird.conf and is not examined in this article.

Note

As previous experience shows, the stability of the dummy-packet mechanism (implemented in InterBase 5.0 and repeatedly corrected in Firebird 1.5.x) strongly depends on the server and client operating systems, TCP stack versions, and many other conditions. As a result, the effectiveness of this mechanism in a real network tends to zero.

KEEPALIVE-sockets are a more interesting mechanism. Implemented in InterBase 6.0 and higher, it is intended for detecting connection failures. KEEPALIVE is enabled by setting the SO_KEEPALIVE socket option when the socket is opened. There is no need to set it manually if you use Firebird 1.5 or higher, since it is implemented in the Firebird server code for both Classic and Superserver.

For InterBase and Firebird versions lower than 1.5 with the Classic architecture, an additional setting is necessary. This setting is described below.

With KEEPALIVE, the operating system TCP stack (instead of the Firebird server) becomes responsible for tracking connection status. However, for this mechanism to be effective, the KEEPALIVE parameters must be adjusted.
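As an illustration of what setting SO_KEEPALIVE means at the socket level, here is a minimal Python sketch. It is not Firebird code: the helper name enable_keepalive is hypothetical, and the sketch only shows the standard socket option being switched on, after which the operating system timers apply to that socket.

```python
import socket


def enable_keepalive(sock):
    """Ask the OS TCP stack to send KEEPALIVE probes on this socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)


# Demonstration on a fresh, unconnected TCP socket.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
# Some platforms report a non-zero value other than 1 here,
# so the check compares against zero.
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("SO_KEEPALIVE enabled:", keepalive_on)
sock.close()
```

The option only switches the mechanism on; how soon and how often probes are sent is still governed by the system-wide parameters.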

KEEPALIVE description

The behavior of KEEPALIVE-sockets is controlled by the parameters presented in the following table.

Parameter            Description
KEEPALIVE_TIME       Time interval after which KEEPALIVE-probes start
KEEPALIVE_INTERVAL   Time interval between KEEPALIVE-probes
KEEPALIVE_PROBES     Number of KEEPALIVE-probes

The TCP stack tracks the moment when packets stop being transmitted between the client and the server by starting the KEEPALIVE timer. As soon as the timer reaches the KEEPALIVE_TIME value, the server TCP stack sends the first KEEPALIVE-probe. A probe is an empty packet with the ACK flag, sent to the client. If everything is all right on the client side, the client TCP stack sends a response packet with the ACK flag, and the server TCP stack resets the KEEPALIVE timer as soon as it receives the response.

If the client does not respond to the probe, the server continues to send probes. Their number equals the KEEPALIVE_PROBES value, and they are sent at KEEPALIVE_INTERVAL intervals. If the client does not respond to the last probe, then after another KEEPALIVE_INTERVAL expires the operating system TCP stack closes the connection, and the server (in this case, an instance of the InterBase or Firebird server) releases all resources used to serve this connection.

Thus, a failed client connection will be closed after the following time interval:

KEEPALIVE_TIME + (KEEPALIVE_PROBES + 1) * KEEPALIVE_INTERVAL
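The formula is easy to evaluate for a candidate set of values. The sketch below simply encodes it as written above; the function name and the sample values are illustrative assumptions, not recommendations.

```python
def keepalive_total_seconds(time_s, probes, interval_s):
    """Seconds until a failed connection is closed, per the formula above."""
    return time_s + (probes + 1) * interval_s


# Illustrative values: 60 s of inactivity, 5 probes, 10 s between probes.
total = keepalive_total_seconds(60, 5, 10)
print(total)  # 120
```

With these sample values, a failed connection would be closed about two minutes after traffic stops.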

By default, the parameter values are rather large, which makes the mechanism ineffective. For example, the default value of KEEPALIVE_TIME is 2 hours in both Linux and Windows. In practice, 1-2 minutes is enough to decide that an inaccessible client should be forcibly disconnected. On the other hand, the default KEEPALIVE settings sometimes cause forced disconnection of connections in Windows networks that stay inactive for these 2 hours (one may, of course, doubt whether applications need such connections, but that is a different matter).

The adjustment of these parameters for the Windows and Linux operating systems is described below.

Setting KEEPALIVE in Linux

KEEPALIVE parameters in Linux can be changed either by directly editing files in the /proc file system or by calling sysctl.

For the first case, the following files should be edited:

/proc/sys/net/ipv4/tcp_keepalive_time
/proc/sys/net/ipv4/tcp_keepalive_intvl
/proc/sys/net/ipv4/tcp_keepalive_probes

For the second case, the following commands should be executed:

sysctl -w net.ipv4.tcp_keepalive_time=value
sysctl -w net.ipv4.tcp_keepalive_intvl=value
sysctl -w net.ipv4.tcp_keepalive_probes=value

Time value is expressed in seconds.
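Before changing anything, it can be useful to see the values currently in effect. The following Python sketch reads the three /proc files listed above; the function name read_keepalive_settings is hypothetical, and files that do not exist (for example, on a non-Linux system) are simply skipped.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/ipv4")
NAMES = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive_settings(proc_dir=PROC_DIR):
    """Return {name: value} for the keepalive files found in proc_dir."""
    settings = {}
    for name in NAMES:
        path = Path(proc_dir) / name
        if path.is_file():  # skipped when the file is absent
            settings[name] = int(path.read_text().strip())
    return settings


for name, value in read_keepalive_settings().items():
    print(f"{name} = {value}")
```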

To have these parameters set automatically when the server restarts, add the following lines to /etc/sysctl.conf:

net.ipv4.tcp_keepalive_intvl = value
net.ipv4.tcp_keepalive_time = value
net.ipv4.tcp_keepalive_probes = value

Replace the word value with the required values.

If you use a version of Firebird Classic lower than 1.5, add the following to /etc/xinetd.d/firebird:

FLAGS=REUSE KEEPALIVE

Adjusting KEEPALIVE in Windows 95/98/ME

Registry key:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\VxD\MSTCP

Detailed information about TCP tuning can be found here:

http://support.microsoft.com/default.aspx?scid=kb;en-us;158474

Parameters:

  • KeepAliveTime = milliseconds
    Type: DWORD (STRING in Windows 98).
    Defines the connection inactivity time in milliseconds.
    When it expires, KEEPALIVE-probes start being sent.
    Default value is 2 hours (7200000).
  • KeepAliveInterval = 32-bit value
    Type: DWORD (STRING in Windows 98).
    Defines the time between KEEPALIVE-probes in milliseconds.
    Once the KeepAliveTime interval expires, a KEEPALIVE-probe
    is sent every KeepAliveInterval milliseconds, up to a maximum
    of MaxDataRetries probes. If no response comes, the connection
    is closed. Default value is 1 second (1000).
  • MaxDataRetries = 32-bit value
    Type: STRING
    Defines the maximum number of KEEPALIVE-probes.
    Default value is 5.

Setting KEEPALIVE in Windows 2000/NT/XP

Registry key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

Detailed information about TCP tuning:

2000/NT: http://support.microsoft.com/kb/120642

XP: http://support.microsoft.com/kb/314053

The MaxDataRetries parameter is replaced by TcpMaxDataRetransmissions.

All other parameters have the same names as in Windows 9x.
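One way to apply these values without editing the registry by hand is the reg add command. The Python sketch below only builds the command lines and does not run them. The function name and the sample values are assumptions, and the sketch assumes all three parameters are stored as REG_DWORD under the key shown above, with times converted from seconds to milliseconds.

```python
TCPIP_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def keepalive_reg_commands(time_s, interval_s, probes):
    """Build `reg add` command lines for the three keepalive parameters."""
    values = {
        "KeepAliveTime": time_s * 1000,          # milliseconds
        "KeepAliveInterval": interval_s * 1000,  # milliseconds
        "TcpMaxDataRetransmissions": probes,     # number of probes
    }
    return [
        f'reg add "{TCPIP_KEY}" /v {name} /t REG_DWORD /d {data} /f'
        for name, data in values.items()
    ]


# Illustrative values: 60 s idle time, 10 s interval, 5 probes.
for command in keepalive_reg_commands(60, 10, 5):
    print(command)
```

Printing the commands first makes it possible to review them before running them in an administrator console.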

Setting KEEPALIVE in Windows (for clients)

This setting is optional, but it may reduce the number of connection failure messages when unreliable communication channels are used. Add the parameter DisableDHCPMediaSense=1 to the registry key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

A description of this parameter can be found here:

http://support.microsoft.com/?scid=kb%3Bru%3B239924&x=13&y=14

Example

Let’s consider the configuration of Firebird SQL Server 1.5.2 Classic Server (CS) under Linux.

  • Make sure that the DUMMY-packets mechanism is disabled in firebird.conf
    (the parameter is commented-out)
    ……………..
    #DummyPacketsInterval=0
    …………….
  • Make sure the /etc/xinetd.d/firebird configuration file exists

    We kept everything unchanged, as it was set up during installation. Nothing needs to be added.

  • Change the TCP stack parameters:

    sysctl -w net.ipv4.tcp_keepalive_time=15
    sysctl -w net.ipv4.tcp_keepalive_intvl=10
    sysctl -w net.ipv4.tcp_keepalive_probes=5
    
  • Connect to any database on the server from any network client

  • Check traffic on the server using any packet filter.

    With the parameters in /proc/sys/net/ipv4/tcp_keepalive_* set as above, the server sends a probe 15 seconds after all traffic on the connection stops. If the client is “alive,” the server receives a response packet. 15 seconds after that, the check repeats, and so on.

  • If a client is physically turned off (or the multiplexer or the modem unexpectedly turns off; anything is possible), the server does not receive a response and begins to send probes at 10-second intervals. If the client does not respond to the fifth probe, then 10 seconds after that the server process terminates and releases its resources and locks. If the client shows signs of life and responds, even if only to the fifth probe, then after another 15-second timeout the server will begin sending probes again. And so on.
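The sequence described in this example can be laid out as a timeline. The Python sketch below follows the article's description (a first probe after KEEPALIVE_TIME, then KEEPALIVE_PROBES further probes, then one more interval before the connection is closed). The function name is hypothetical, and a real TCP stack may count probes slightly differently, so a packet capture remains the authoritative check.

```python
def probe_timeline(time_s, probes, interval_s):
    """Return (probe_times, close_time) in seconds after traffic stops."""
    # First probe at time_s, followed by `probes` more at fixed intervals.
    probe_times = [time_s + i * interval_s for i in range(probes + 1)]
    # The connection is closed one interval after the last unanswered probe.
    close_time = probe_times[-1] + interval_s
    return probe_times, close_time


# Values from the example: time=15, probes=5, interval=10.
times, closed = probe_timeline(15, 5, 10)
print(times)   # [15, 25, 35, 45, 55, 65]
print(closed)  # 75
```

Under these assumptions, a client that disappears is cleaned up 75 seconds after the last packet.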

Guidelines

In conclusion, we would like to give you some advice about how KEEPALIVE values should be selected.

Firstly, determine the necessary value of KEEPALIVE_TIME. The larger the value, the later KEEPALIVE-probes start. If you constantly see 10054/104 errors in the server log and have to remove them manually, it is recommended to increase the KEEPALIVE_TIME value.

Secondly, the values of KEEPALIVE_INTERVAL and KEEPALIVE_PROBES should meet your needs for the timely release of connections that are already hung. If your users connect to the server through unreliable channels, you will probably want to increase the number of probes and the interval between them, to give the user a chance to detect the failure and reconnect to the server. If clients use a DSL connection to the Internet, or access the SQL server through a local network, the interval between KEEPALIVE-probes can be decreased.

General recommendations: if clients report many errors when saving results because of lock conflicts for no apparent reason (i.e. there are no concurrent connections working with the same data), then you need to speed up the system’s reaction to hung connections. In practice, the KEEPALIVE_TIME value can be 1 minute or more. You should estimate for yourself how long the longest transaction takes, so that the network is not overloaded by KEEPALIVE-checks of normally working connections that have started long transactions. The KEEPALIVE_INTERVAL value should be 10 seconds or more, and the KEEPALIVE_PROBES value should be 5 probes or more. When many users work simultaneously, remember that checking too frequently may considerably increase network traffic.

Also remember that if your users actively change shared data, lock conflict errors will occur in the course of normal operation. In this case, client applications need to handle lock conflict errors correctly. At the same time, the application should be designed to minimize the occurrence of such errors.

Example configurations

Finally, here are some example configurations. Downtime is the time during which users will be unable to update data that was modified by the transaction opened by the hung connection. Total time is the time after which the hung connection will be closed.

  • Clients use modem connections; most transactions in the system are short; downtime is limited to 3 minutes:

    KEEPALIVE_TIME 1 minute
    KEEPALIVE_PROBES 3
    KEEPALIVE_INTERVAL 30 seconds
    TOTAL 3 minutes
    
  • Clients use LAN connections; most transactions in the system are short; downtime is limited to 2 minutes:

    KEEPALIVE_TIME 30 sec
    KEEPALIVE_PROBES 5
    KEEPALIVE_INTERVAL 10 sec
    TOTAL 90 seconds
    
  • Clients use any connections; downtime is not regulated:

    KEEPALIVE_TIME 12 minutes
    KEEPALIVE_PROBES 7
    KEEPALIVE_INTERVAL 15 sec
    TOTAL 14 minutes
    

We hope that these examples are enough for you to adjust the TCP stack KEEPALIVE mechanism correctly.