Character Length Semantics

Some of the character data types require their length to be predefined during the declaration, e.g. CHAR(n) or VARCHAR(n). You must remember that in multi-byte locales 'n' stands for the number of bytes rather than for the number of characters like it does in the default English U.S. locale. Thus, when declaring variables in multi-byte locales you must remember that each character will most probably take up more than 1 byte.

For example, Cyrilic letters take up 2 bytes each and a variable declared as CHAR(10) will be able to contain only 5 Cyrilic letters, whereas in the default English U.S. locale it would be able to contain 10 letters.

 

Other built-in functions and SQL operators also calculate the length of variables by bytes rather than by letters in multi-byte environment. So the length() function will return 10 for a variable declared as CHAR(10), even though in a Cyrilic locale it can accommodate only 5 letters and in Chinese locale characters can take up to 4 bytes each. If a program attempts to write more bytes than can fit into such a variable, the excess bytes will be discarded.

 

However, the byte or character semantics may differ depending on the database, e.g. if variables are declared LIKE database columns. Different database engines use different length semantics and definitions for CHAR, VARCHAR, NCHAR and NVARCHAR variables. You should take it into account when developing an application.

Some engines use character length semantics, some - byte length, the others can use both. For example, SQL Server uses byte length for the size and supports non-UCS-2 character sets in CHAR, VARCHAR and TEXT columns, and USC-2 characters and char length semantics are applied for NCHAR, NVARCHAR, NTEXT columns.

 

The list of database engines and their character data type length semantics with comments is given below:

 

Database Engine

Comments

Length Semantics

IBM DB2

The character set is specified by the CODSET of CREATE DATABASE

BLS

Informix

The character set is specified in DB_LOCALE environment variable

BLS

Microsoft SQL server

Bytes semantics is used for CHAR/VARCHAR columns. The character set used for data storage is specified by database collation. NCHAR/VCHAR sizes are stored in characters; UCS-2 is used to store the data.

BLS/CLS

MySQL

Configuration parameter specifies the server character set used to store data

CLS

Oracle

The usage of Byte or Character Length semantics can be specified for the whole database or on column level. Database character set is applied for CHAR/VARCHAR columns; NCHAR/NVARCAHR sizes are specified in local character set.

BLS/CLS

PostgreSQL

The database character set is specified by WITH ENCODING of CREATE DATABASE

CLS

 

SQL statements and functions are processed using the byte length semantic. For example, the LENGTH() function in Informix returns an integer identifying the value length in bytes, and in Oracle one needs to use LENGTHB() function to get the bytes value, because the LENGTH() function returns the length in characters.

It is advised to use byte length semantic based character data types for databases. This can be important in cases, when the application uses DEFINE LIKE statement which is based on the database schema.

Complex Substring Expressions

When byte length semantics is used, all the character positions correspond to the byte positions. In a multi-byte environment, byte and character positions need special attention; otherwise, invalid data will be returned and processed. For example, you should specify only the first byte of a multi-bite character in a substring operator [x, y], otherwise errors may occur.