Understanding character sets

In response to markd's question about character sets, Helen Borrie answers:

I've been putting together a bit of an FAQ about character set anomalies with Firebird 2.1 and above. The intention was for it to go into the Fb 2.1.1 release notes but I didn't get it into a suitable state in time. Here's what's in it so far:

Q. I'm getting this error when querying some of my character data in Firebird 2.1, whereas the same data in Firebird 1.5 and 2.0.x caused no problems

(Error code 335544849) "Malformed string"

A. Strings are written to the database using the defined character set, which will be either the default character set defined for the database or, if present, the character set defined for the column they are written to. For string data to be transliterated correctly for both writing and reading, the connection character set must be the same as the destination character set.

A string is "well-formed" if every byte represents unambiguously a character or part of a character in the character set defined for the data. The character encoding conventions adopted for Firebird and enforced in v.2.1 will detect anomalies in strings and throw the "Malformed string" exception.

If legacy data was written to a non-ASCII character set with NONE specified as the connection character set (the default if it is not specified) then the data would have been written as raw bytes. Under the enforced well-formedness conditions of Firebird 2.1, querying that data causes Firebird to detect the anomalies and throw the exception.

Q. How does transliteration work?

A. Data coming through the connection behaves just like a column in the database that has been defined to use a particular character set. Thus, the assignment of any constant value coming through the connection to a database column is virtually the same as the assignment of a column value to a another column within the database.

The NONE character set is just raw bytes. OCTETS -- which is not available as a connection character set -- is also just raw bytes. The difference between NONE and OCTETS is only in the character that is used to represent a "blank": for OCTETS it is ASCII Null (0x00, CHR(0)) while for NONE it is ASCII Space (0x20, CHR(32)).

Assignment of a string, whether from a connection "column" or a database column, occurs as follows:

from a column value of character set NONE or OCTETS to a column that is defined with a character set that is not NONE or OCTETS undergoes a well-formedness check but is not transliterated. Firebird 2.1 will reject the value if it is not well-formed for the destination character set.
from a specific character set to another specific character set, if the string is well-formed then it is transliterated to the destination character set. The slight difference here is that, if the source of the string is data already stored in the database, it is assumed to be well-formed; whereas data sourced from a "connection column", i.e., a client, is always validated.

Q. I have a problem inserting the "eszet" character 'ß' in a database created from this script written with a text editor:

create database 'c:\fb21_malformed_string.fdb'
  user 'sysdba' password 'masterkey'
  DEFAULT CHARACTER SET UTF8;

create table t1 (
  id integer not null primary key,
  c1 varchar(20)
);

commit;

insert into t1 (id, c1) values (1, 'ß');

commit;

In an isql session I try to run the script:

C:\Program Files\Firebird\Firebird_210_RC2\bin>isql
Use CONNECT or CREATE DATABASE to specify a database
SQL> input c:\fb21_malformed_string.sql;
Statement failed, SQLCODE = -104
Malformed string
After line 8 in file c:\fb21_malformed_string.sql
SQL> show database;
Database: c:\fb21_malformed_string.fdb
         Owner: SYSDBA
PAGE_SIZE 4096
Number of DB pages allocated = 196
Sweep interval = 20000
Forced Writes are ON
Transaction - oldest = 6
Transaction - oldest active = 7
Transaction - oldest snapshot = 7
Transaction - Next = 8
ODS = 11.1
Default Character set: UTF8
SQL> exit;

If the same database has been created with ISO8859_1 as its default character set, it works. Am I doing something wrong with regard to UFT8?

A. Any string data in an input script must be in the character set of the target column. In this case, your keyboard input for creating the script would be in whatever character set your computer is set up for. As you are on Windows, your script editor would most likely have saved the script in ANSI encoding. When the engine came to process the INSERT statement, it recognised the 'ß' entry as a malformed string because its encoding was not UTF8.

You would have had a similar problem trying pass this character interactively in your isql session, since you didn't start isql with the -ch switch, nor set the session's character with SET NAMES. Since the default character set for a connection is NONE, any input from your keyboard (probably ISO8859_1 or WIN1252), where the "eszet" character is single-byte, could not be well-formed UTF8, where it is two bytes with completely different encoding. Constants in DML scripts may not be the only source of malformed strings. Don't overlook the possibility that you might encounter an encoding problem with text in DDL scripts. For example, arguments for the COMMENT statement, embedded comments and text constants in PSQL modules, text constants in CHECK constraints, and so on, will all need to be well-formed now.

Tip

Not all third-party script tools for Firebird provide a capability to save files in UTF8 encoding. On Windows, the text editor notepad.exe does give that option. To be certain, open a new notepad file and save it in UTF8 before creating any text.

Q. My ODS 10 database had UTF8 as its default character set. I backed up my database under FB 2.0 and restored it under Fb 2.1. I ran the metadata script as directed in the release notes and I can connect to the database. However, I am getting the "Malformed string" error when I try to select from tables containing columns that are text BLOBs. That didn't happen under Fb 2.0. What's going on here?

A. In V2.0, if one used NONE or UTF8 as the connection character set and sent non-UTF8 data to a UTF8 BLOB, the data was stored regardless. That was possible when UNICODE_FSS was standing in as the placeholder for the proper UTF8 implementation to come in Firebird 2.1. Firebird 2.1 doesn't accept malformed UTF8 blobs any more. Now, malformed text cannot be stored or read.

To fix the problem, you need to identify the problem data. This may be simple or complicated, depending on how widespread the use of mismatched client/destination character sets has been. Once you know what your target columns are, they can be converted, first to character set NONE and then to the correct character set. In Firebird 2.1 it is possible to cast a BLOB, so this conversion can be done in one or a few steps, in a similar way to how the metadata corrections were done. An update would be along the lines of

update ...
 set blob_field = CAST(
         CAST(blob_field as blob sub_type text character set none)
         as blob sub_type text character set correct_charset)

You can refer to the scripts in your ../misc/metadata directory as a reference.

Understanding character sets

In response to markd's question about character sets, Helen Borrie answers:

Related Articles

Author

Reading Time

Published

Category

Tags