cpp/language/charset

This page describes several character sets specified by the C++ standard.

{{rrev|since=c++23|

Translation character set
The translation character set consists of the following elements:
 * each abstract character assigned a code point in the Unicode codespace, and
 * a distinct character for each Unicode scalar value not assigned to an abstract character.

The translation character set is a superset of the basic character set and the basic literal character set (see below). }}

Basic character set
The basic character set consists of the following 96 characters:

Basic literal character set
The basic literal character set consists of all characters of the basic character set, plus the following control characters:

Execution character set
The execution character set and the execution wide-character set are supersets of the basic literal character set. The encodings of the execution character sets and the sets of additional elements (if any) are locale-specific. Each element of the execution wide-character set must be representable as a distinct code unit.
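Because the encoding of the execution character set is locale-specific, the maximum number of `char` code units needed for one character can change when the locale changes. The following is a minimal sketch of how this could be observed at run time; the helper name `max_code_units_per_char` is hypothetical, and which locale names are available besides `"C"` depends on the system.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: reports MB_CUR_MAX for the named locale, i.e. the
// maximum number of bytes in one multibyte character of that locale's
// execution encoding. Returns 0 if the locale cannot be selected.
std::size_t max_code_units_per_char(const char* locale_name) {
    // Remember the current LC_CTYPE locale so it can be restored afterwards.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";

    std::size_t result = 0;
    if (std::setlocale(LC_CTYPE, locale_name)) {
        result = MB_CUR_MAX;                      // value depends on the locale
        std::setlocale(LC_CTYPE, saved.c_str());  // restore the previous locale
    }
    return result;
}
```

Calling `max_code_units_per_char("C")` yields at least 1; a UTF-8 locale, if one is installed, would typically report a larger value.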

Code unit and literal encoding
A code unit is an integer value of character type. Characters in a character literal (other than a multicharacter or non-encodable character literal) or in a string literal are encoded as a sequence of one or more code units, as determined by the encoding prefix; this is termed the respective literal encoding.

A literal encoding or a locale-specific encoding of one of the execution character sets encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element. A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set. The encodings of the execution character sets can be unrelated to any literal encoding.
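The single-code-unit guarantee can be spot-checked for the ordinary literal encoding. The sketch below tests only a sample of basic literal character set elements, not the whole set; `distinct_non_negative` and `sample` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if every code unit in [s, s + n) is non-negative
// and no two code units have the same value.
bool distinct_non_negative(const char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (s[i] < 0) return false;
        for (std::size_t j = i + 1; j < n; ++j)
            if (s[i] == s[j]) return false;
    }
    return true;
}

// A sample of basic literal character set elements: letters, digits,
// punctuation, and control characters. The terminating null (U+0000)
// is also an element of the set.
const char sample[] =
    "abcXYZ0189 _{}[]#()<>%:;.?*+-/^&|~!=,\\\"'\t\v\f\n\a\b\r";
```

On a conforming implementation, `distinct_non_negative(sample, sizeof(sample))` is expected to return `true`; passing `sizeof(sample)` rather than `sizeof(sample) - 1` includes the terminating null in the check.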

The ordinary literal encoding is the encoding applied to an ordinary character or string literal. The wide literal encoding is the encoding applied to a wide character or string literal.
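As an illustration, the code unit type follows from the encoding prefix: `char` for ordinary literals and `wchar_t` for wide literals. The sketch below uses a hypothetical helper, `code_unit_count`, to count the code units of a string literal, and relies only on the guarantee that basic literal character set elements take one code unit each.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Number of code units in a string literal, excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) {
    return N - 1;
}

// The literal's type identifies the code unit type of its encoding.
static_assert(std::is_same<decltype('a'), char>::value,
              "ordinary character literals have type char");
static_assert(std::is_same<decltype(L'a'), wchar_t>::value,
              "wide character literals have type wchar_t");

// Basic literal character set elements are one code unit each
// in both the ordinary and the wide literal encoding.
static_assert(code_unit_count("abc") == 3, "three ordinary code units");
static_assert(code_unit_count(L"abc") == 3, "three wide code units");
```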

The U+0000 NULL character is encoded as the value 0. No other element of the translation character set is encoded with a code unit of value 0. The code unit value of each decimal digit character after the digit 0 (U+0030) is one greater than the value of the previous digit. The ordinary and wide literal encodings are otherwise implementation-defined.
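The consecutive-digit guarantee is what makes the common `c - '0'` idiom portable across literal encodings. A minimal sketch follows; the helper names `digit_value` and `wdigit_value` are illustrative.

```cpp
#include <cassert>

// Converts a decimal digit character to its numeric value, or -1 if the
// character is not a digit. Relies only on '0'..'9' being consecutive.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

// The same guarantee applies to the wide literal encoding.
constexpr int wdigit_value(wchar_t c) {
    return (c >= L'0' && c <= L'9') ? static_cast<int>(c - L'0') : -1;
}

static_assert('\0' == 0, "U+0000 NULL is encoded as the value 0");
static_assert(L'\0' == 0, "the same holds for the wide literal encoding");
static_assert('9' - '0' == 9, "decimal digits are consecutive");
static_assert(digit_value('7') == 7, "digit maps to its value");
static_assert(wdigit_value(L'3') == 3, "wide digit maps to its value");
```

No comparable guarantee exists for letters: `c - 'a'` is not portable, because the code unit values of letters need not be consecutive (EBCDIC is one such encoding).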

For a UTF-8, UTF-16, or UTF-32 literal, the UCS scalar value corresponding to each character of the translation character set is encoded as specified in ISO/IEC 10646 for the respective UCS encoding form.
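Because the UTF literal encodings are fixed by ISO/IEC 10646, the number of code units per character can be checked at compile time. The sketch below uses a hypothetical helper, `utf_units`, and two example characters, U+00E9 and U+1F600.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
// Works for char, char8_t (C++20), char16_t, and char32_t arrays alike.
template <class CharT, std::size_t N>
constexpr std::size_t utf_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00E9 LATIN SMALL LETTER E WITH ACUTE
static_assert(utf_units(u8"\u00E9") == 2, "two UTF-8 code units");
static_assert(utf_units(u"\u00E9") == 1, "one UTF-16 code unit");
static_assert(utf_units(U"\u00E9") == 1, "one UTF-32 code unit");

// U+1F600 GRINNING FACE, outside the Basic Multilingual Plane
static_assert(utf_units(u8"\U0001F600") == 4, "four UTF-8 code units");
static_assert(utf_units(u"\U0001F600") == 2, "UTF-16 surrogate pair");
static_assert(utf_units(U"\U0001F600") == 1, "one UTF-32 code unit");

// The surrogate pair for U+1F600 is D83D DE00.
static_assert(u"\U0001F600"[0] == 0xD83D && u"\U0001F600"[1] == 0xDE00,
              "high and low surrogate values");
```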