You are here:

Character length semantics

Some of the character data types require their length to be predefined during the declaration, e.g. CHAR(n) or VARCHAR(n). You must remember that in multi-byte locales 'n' stands for the number of bytes rather than for the number of characters like it does in the default English U.S. locale. Thus, when declaring variables in multi-byte locales, you must remember that each character will most probably take up more than 1 byte.

For example, Cyrillic letters take up 2 bytes each and a variable declared as CHAR(10) will be able to contain only 5 Cyrillic letters, whereas in the default English U.S. locale it would be able to contain 10 letters.

Other built-in functions and SQL operators also calculate the length of variables by bytes rather than by letters in multi-byte environment. So the length() function will return 10 for a variable declared as CHAR(10), even though in a Cyrillic locale it can accommodate only 5 letters, and in Chinese locale characters can take up to 4 bytes each. If a program attempts to write more bytes than can fit into such a variable, the excess bytes will be discarded.

However, the byte or character semantics may differ depending on the database, e.g. if variables are declared LIKE database columns. Different database engines use different length semantics and definitions for CHAR, VARCHAR, NCHAR and NVARCHAR variables. You should take it into account when developing an application.

Some engines use character length semantics, some - byte length, others can use both. For example, SQL Server uses byte length for the size and supports non-UCS-2 character sets in CHAR, VARCHAR and TEXT columns, and USC-2 characters and char length semantics are applied for NCHAR, NVARCHAR, NTEXT columns.

The list of database engines and their character data type length semantics with comments is given below:

Database engine	Comments	Length semantics
IBM DB2	The character set is specified by the CODSET of CREATE DATABASE.	BLS
Informix	The character set is specified in DB_LOCALE environment variable.	BLS
Microsoft SQL server	Bytes semantics is used for CHAR/VARCHAR columns. The character set used for data storage is specified by database collation. NCHAR/VCHAR sizes are stored in characters; UCS-2 is used to store the data.	BLS/CLS
MySQL	Configuration parameter specifies the server character set used to store data.	CLS
Oracle	The usage of Byte or Character Length semantics can be specified for the whole database or on column level. Database character set is applied for CHAR/VARCHAR columns; NCHAR/NVARCAHR sizes are specified in local character set.	BLS/CLS
PostgreSQL	The database character set is specified by WITH ENCODING of CREATE DATABASE.	CLS

SQL statements and functions are processed using the byte length semantic. For example, the LENGTH() function in Informix returns an integer identifying the value length in bytes, and in Oracle one needs to use LENGTHB() function to get the bytes value, because the LENGTH() function returns the length in characters.

It is advised to use byte length semantic based character data types for databases. This can be important in cases, when the application uses DEFINE LIKE statement which is based on the database schema.

Complex sub-string expressions

When byte length semantics is used, all the character positions correspond to the byte positions. In a multi-byte environment, byte and character positions need special attention; otherwise, invalid data will be returned and processed. For example, you should specify only the first byte of a multi-byte character in a sub-string operator [x, y], otherwise errors may occur.