Using KEEPALIVE-sockets to avoid 10054 errors
by Vasiliy Ovchinnikov
In systems built on InterBase or Firebird databases that are intended to work in real-time or near-real-time mode, the server must track the status of client connections and forcibly disconnect a client that has become inaccessible because its connection was lost.
It is important to promptly release the resources held by such phantom connections, especially on servers with the Classic architecture. If some users connect to the server through an unstable modem connection, the risk of disconnection is rather high.
For instance, a client saves a modified record set, and the connection is lost after UPDATE has been executed but before COMMIT.
As a rule, client applications reconnect to the server in such situations. However, the user, who continues working with the same data after receiving the connection failure message, will be unable to save the changes and will receive a lock conflict message (“lock conflict on update”). The previous connection, whose transaction executed the UPDATE but not the COMMIT, still holds these records.
Connection failures may occur in a local network too, if the hardware (network cards, hubs, switches) is faulty or poorly configured, and/or because of interference in the network. In InterBase and Firebird logs, failures of TCP connections appear as error 10054 on Windows and 104 on Unix; NetBEUI failures appear as errors 108/109.
Methods of controlling hung connections
InterBase and Firebird use the mechanisms of DUMMY-packets or KEEPALIVE-sockets to track and close such “dead” connections.
The DUMMY-packets mechanism, available in InterBase 5.0 and higher, is implemented at the application layer between an InterBase/Firebird server and the gds32/fbclient client library. It is enabled in ibconfig/firebird.conf and is not examined in this article.
Previous experience shows that the stability of the dummy-packet mechanism (implemented in InterBase 5.0 and repeatedly corrected in Firebird 1.5.x) strongly depends on the server and client operating systems, TCP stack versions, and many other conditions. In other words, the effectiveness of this mechanism in a real network tends to zero.
KEEPALIVE-sockets are a more interesting mechanism. Implemented in InterBase 6.0 and higher, it is intended for tracking connection failures. KEEPALIVE is enabled by setting the SO_KEEPALIVE socket option when the socket is opened. There is no need to set it manually if you use Firebird 1.5 or higher, since it is implemented in the Firebird server code, both for Classic and for SuperServer.
For InterBase and for Firebird versions lower than 1.5 with the Classic architecture, an additional setting is necessary. This setting is described below.
With KEEPALIVE, the operating system TCP stack (instead of the Firebird server) becomes responsible for tracking connection status. However, for this mechanism to be effective, one must adjust the KEEPALIVE parameters.
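To illustrate what enabling SO_KEEPALIVE at socket opening means, here is a minimal Python sketch. It is not the Firebird server code; the helper name open_keepalive_socket is hypothetical, and the sketch only shows the standard socket option being switched on and read back.

```python
import socket


def open_keepalive_socket() -> socket.socket:
    """Create a TCP socket with SO_KEEPALIVE switched on (illustrative helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the operating system TCP stack to probe this connection when it is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


sock = open_keepalive_socket()
# getsockopt returns a non-zero integer when the option is enabled.
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print(keepalive_enabled)
```

Once the option is set, the timing of the probes is governed by the operating system parameters discussed in the rest of the article, not by the application.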
The behavior of KEEPALIVE-sockets is controlled by the parameters presented in the following table.
|KEEPALIVE_TIME|Time interval, on expiry of which KEEPALIVE-probes start|
|KEEPALIVE_INTERVAL|Time interval between KEEPALIVE-probes|
|KEEPALIVE_PROBES|Number of KEEPALIVE-probes|
The TCP stack detects the moment when packets stop flowing between the client and the server and starts the KEEPALIVE timer. As soon as the timer reaches KEEPALIVE_TIME, the server TCP stack sends the first KEEPALIVE probe. A probe is an empty packet with the ACK flag, sent to the client. If everything is all right on the client side, the client TCP stack sends a response packet with the ACK flag, and the server TCP stack resets the KEEPALIVE timer as soon as it receives the response.
If the client does not respond to the probe, the server continues to send probes. Their number equals the KEEPALIVE_PROBES value, and they are sent at intervals of KEEPALIVE_INTERVAL. If the client does not respond to the last probe, then after another KEEPALIVE_INTERVAL expires the operating system TCP stack closes the connection, and the server (in this case, an instance of the InterBase or Firebird server) releases all resources held by this connection.
Thus, a failed client connection will be closed after the following time interval:
KEEPALIVE_TIME + (KEEPALIVE_PROBES + 1) * KEEPALIVE_INTERVAL
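The formula is easy to evaluate in a few lines of Python. The sketch below simply encodes the expression given above; the function name time_to_close is hypothetical. The sample values (7200 s, 75 s, 9 probes) are an assumption: they are the stock settings commonly found on Linux systems, while the article itself only states the 2-hour KEEPALIVE_TIME.

```python
def time_to_close(keepalive_time: int, keepalive_interval: int, keepalive_probes: int) -> int:
    """Seconds from the last traffic until a dead connection is closed,
    using the article's formula TIME + (PROBES + 1) * INTERVAL."""
    return keepalive_time + (keepalive_probes + 1) * keepalive_interval


# Assumed stock Linux values: 7200 s idle time, 75 s interval, 9 probes.
default_total = time_to_close(7200, 75, 9)
print(default_total)  # 7950 seconds, i.e. more than two hours
```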
By default, the parameter values are rather large, which makes the mechanism ineffective. For example, the default KEEPALIVE_TIME is 2 hours, both in Linux and in Windows. In practice, 1-2 minutes is enough to decide that an inaccessible client should be forcibly disconnected. On the other hand, the default KEEPALIVE settings sometimes cause forced disconnection of connections in Windows networks that stay inactive for these 2 hours (one may of course doubt whether applications need such connections, but that is a different matter).
The adjustment of these parameters for the Linux and Windows operating systems is described below.
Setting KEEPALIVE in Linux
KEEPALIVE parameters in Linux can be changed either by directly editing files in the /proc file system or by calling sysctl.
In the first case, the following files should be edited:
/proc/sys/net/ipv4/tcp_keepalive_time
/proc/sys/net/ipv4/tcp_keepalive_intvl
/proc/sys/net/ipv4/tcp_keepalive_probes
In the second case, the following commands should be executed:
sysctl -w net.ipv4.tcp_keepalive_time=value
sysctl -w net.ipv4.tcp_keepalive_intvl=value
sysctl -w net.ipv4.tcp_keepalive_probes=value
Time value is expressed in seconds.
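Before changing anything, it can be useful to see what the system currently uses. The following Python sketch reads the three files named above; the function name read_keepalive_settings and its base_dir parameter are hypothetical and exist only so the function can be pointed at another directory. On a system without these files it simply returns an empty dictionary.

```python
from pathlib import Path

KEEPALIVE_FILES = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive_settings(base_dir: str = "/proc/sys/net/ipv4") -> dict:
    """Return the integer value of each keepalive file found under base_dir."""
    settings = {}
    for name in KEEPALIVE_FILES:
        path = Path(base_dir) / name
        if path.is_file():
            # Each file holds a single integer followed by a newline.
            settings[name] = int(path.read_text().strip())
    return settings


if __name__ == "__main__":
    for name, value in read_keepalive_settings().items():
        print(f"{name} = {value}")
```

Reading these files normally does not require root privileges, whereas writing to them or calling sysctl -w does.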
To set these parameters automatically when the server restarts, add the following lines to the sysctl configuration file (typically /etc/sysctl.conf):
net.ipv4.tcp_keepalive_intvl = value
net.ipv4.tcp_keepalive_time = value
net.ipv4.tcp_keepalive_probes = value
Substitute value with the necessary values.
If you use a version of Firebird Classic lower than 1.5, then the following should be added to /etc/xinetd.d/firebird:
Adjusting KEEPALIVE in Windows 95/98/ME
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\VxD\MSTCP
Everything about adjustment of TCP can be found here:
- KeepAliveTime = milliseconds. Type: DWORD (for Windows 98, type STRING). Defines the connection inactivity time in milliseconds. When it expires, KEEPALIVE-probes start. Default value is 2 hours (7200000).
- KeepAliveInterval = 32-bit value. Type: DWORD (for Windows 98, type STRING). Defines the time between KEEPALIVE-probes (in milliseconds). As soon as the specified KeepAliveTime interval expires, KEEPALIVE-probes are sent every KeepAliveInterval milliseconds, up to a maximum of MaxDataRetries probes. If no response comes, the connection is closed. Default value is 1 second (1000).
- MaxDataRetries = 32-bit value. Type: STRING. Defines the maximum number of KEEPALIVE-probes. Default value is 5.
Setting KEEPALIVE in Windows 2000/NT/XP
Everything about TCP adjustment:
2000/NT: http://support.microsoft.com/kb/120642
The MaxDataRetries parameter is replaced by TCPMaxDataRetransmissions.
All other parameters have the same names as in Windows 9x.
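Because the Windows parameters are expressed in milliseconds while the Linux ones use seconds, a small conversion helper avoids unit mistakes. The Python sketch below only computes the numbers and does not touch the registry; the function name windows_keepalive_values is hypothetical, and the value names are taken from the text above.

```python
def windows_keepalive_values(time_s: int, interval_s: int, probes: int) -> dict:
    """Map keepalive settings given in seconds to the registry value names
    used in the article for Windows 2000/NT/XP."""
    return {
        "KeepAliveTime": time_s * 1000,          # milliseconds
        "KeepAliveInterval": interval_s * 1000,  # milliseconds
        "TCPMaxDataRetransmissions": probes,     # plain count, no unit
    }


# The defaults quoted above: 2 hours, 1 second, 5 probes.
values = windows_keepalive_values(2 * 60 * 60, 1, 5)
print(values)
# {'KeepAliveTime': 7200000, 'KeepAliveInterval': 1000, 'TCPMaxDataRetransmissions': 5}
```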
Setting KEEPALIVE in Windows (for clients)
This setting is optional, but it may reduce the number of connection failure messages when unreliable communication channels are used. Add the following to the registry branch:
parameter DisableDHCPMediaSense=1. See a description of this parameter here:
As an example, let us consider the adjustment of Firebird SQL Server 1.5.2 CS under Linux.
- Make sure that the DUMMY-packets mechanism is disabled in firebird.conf (the parameter is commented out):
…
#DummyPacketInterval=0
…
- Make sure that the /etc/xinetd.d/firebird configuration file exists. We kept everything unchanged, as it was written during installation. Nothing needs to be added.
- Change the TCP stack parameters:
sysctl -w net.ipv4.tcp_keepalive_time=15
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=5
- Connect to any database on the server from any network client.
- Check traffic on the server using any packet filter.
With the parameters in /proc/sys/net/ipv4/tcp_keepalive_* set as above, the server sends a probe 15 seconds after all traffic in the channel stops. If the client is “alive,” the server receives a response packet. After another 15 seconds the check repeats, and so on.
If a client is physically turned off (or a multiplexer or modem unexpectedly turns off; anything is possible), the server does not receive a response and begins to send probes at 10-second intervals. If the client does not respond to the fifth probe, then 10 seconds later the server process terminates and releases its resources and locks. If the client shows signs of life and responds to a probe, even if only to the fifth one, then after another 15-second time-out the server starts probing again. And so on.
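The sequence of events for a client that never answers can be laid out as a simple timeline. The Python sketch below follows the description in this article (a first probe at KEEPALIVE_TIME, then KEEPALIVE_PROBES further probes, then one more interval before the connection is closed); the function name probe_timeline is hypothetical, and a real TCP stack may count probes slightly differently.

```python
def probe_timeline(keepalive_time: int, keepalive_interval: int, keepalive_probes: int):
    """Return (probe_times, close_time) in seconds after the last packet,
    for a client that never responds, as described in the article."""
    # First probe at keepalive_time, then keepalive_probes more at fixed spacing.
    probe_times = [keepalive_time + i * keepalive_interval
                   for i in range(keepalive_probes + 1)]
    # The connection is closed one more interval after the last probe.
    close_time = probe_times[-1] + keepalive_interval
    return probe_times, close_time


probe_times, close_time = probe_timeline(15, 10, 5)
print(probe_times)  # [15, 25, 35, 45, 55, 65]
print(close_time)   # 75
```

With the values used in this example, the hung connection is therefore released 75 seconds after the last packet, which matches 15 + (5 + 1) * 10.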
In conclusion, we would like to give you some advice about how KEEPALIVE values should be selected.
Firstly, determine the necessary value of KEEPALIVE_TIME. The larger the value, the later KEEPALIVE-probes start. If you constantly see 10054/104 errors in the server log and have to remove the hung connections manually, it is recommended to reduce the KEEPALIVE_TIME value.
Secondly, the values of KEEPALIVE_INTERVAL and KEEPALIVE_PROBES should meet your needs for timely release of hung connections. If your users connect to the server through unreliable channels, you will probably want to increase the number of probes and the interval between them, in order to give the user a chance to detect the failure and reconnect to the server. If clients use a DSL connection to the Internet, or access the SQL server through a local network, the interval between KEEPALIVE-probes can be decreased.
General recommendations: if clients report many errors when saving results because of lock conflicts for no apparent reason (i.e. there are no concurrent connections working with the same data), you need to make the system release hung connections faster. In practice, the KEEPALIVE_TIME value should be 1 minute or more. You should estimate for yourself how long the longest transaction runs, so that the network is not overloaded by KEEPALIVE-checks of normally working connections that have started long transactions. The KEEPALIVE_INTERVAL value should be 10 seconds or more, and the KEEPALIVE_PROBES value should be 5 probes or more. When many users work simultaneously, remember that checking too frequently may considerably increase network traffic.
Also remember that if your users actively change common data, lock conflicts will occur even under optimal conditions. In this case, the client applications need correct handling of lock conflict errors. At the same time, the application should be designed to minimize the occurrence of such errors.
Examples of typical configurations
Finally, here are some examples of typical configurations. Downtime is the time during which users will be unable to update data that had been updated by the transaction opened by the hung connection. Total time is the time after which the hung connection will be closed.
Clients use modem connections; most transactions in the system are short; downtime is limited to 3 minutes:
|KEEPALIVE_TIME|1 minute|
|KEEPALIVE_PROBES|3|
|KEEPALIVE_INTERVAL|30 seconds|
|TOTAL|3 minutes|
Clients use a LAN connection; most transactions in the system are short; downtime is limited to 2 minutes:
|KEEPALIVE_TIME|30 seconds|
|KEEPALIVE_PROBES|5|
|KEEPALIVE_INTERVAL|10 seconds|
|TOTAL|90 seconds|
Clients use any connections; downtime is not regulated:
|KEEPALIVE_TIME|12 minutes|
|KEEPALIVE_PROBES|7|
|KEEPALIVE_INTERVAL|15 seconds|
|TOTAL|14 minutes|
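The TOTAL values in these three examples can be checked against the formula KEEPALIVE_TIME + (KEEPALIVE_PROBES + 1) * KEEPALIVE_INTERVAL given earlier in the article. The Python sketch below does the arithmetic in seconds; the function name and the dictionary keys are hypothetical labels chosen for illustration.

```python
def time_to_close(keepalive_time: int, keepalive_interval: int, keepalive_probes: int) -> int:
    """Article's formula: TIME + (PROBES + 1) * INTERVAL, all times in seconds."""
    return keepalive_time + (keepalive_probes + 1) * keepalive_interval


# (KEEPALIVE_TIME, KEEPALIVE_INTERVAL, KEEPALIVE_PROBES) for each example above.
examples = {
    "modem": (60, 30, 3),
    "lan": (30, 10, 5),
    "any": (12 * 60, 15, 7),
}

totals = {name: time_to_close(*params) for name, params in examples.items()}
print(totals)  # {'modem': 180, 'lan': 90, 'any': 840}
```

The results correspond to 3 minutes, 90 seconds and 14 minutes, so the first two configurations stay within their stated downtime limits of 3 and 2 minutes.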
We hope that these examples are enough for correct adjustment of the TCP stack KEEPALIVE mechanism.