5 Lexical conventions [lex]

5.3 Characters [lex.char]


5.3.1 Character sets [lex.charset]

The translation character set consists of the following elements:

  • each abstract character assigned a code point in the Unicode codespace as specified in the Unicode Standard, and
  • a distinct character for each Unicode scalar value not assigned to an abstract character.

[Note 1:

Unicode code points are integers in the range [0, 10FFFF] (hexadecimal).

A surrogate code point is a value in the range [D800, DFFF] (hexadecimal).

A Unicode scalar value is any code point that is not a surrogate code point.

— end note]
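The ranges in Note 1 can be written down directly as predicates. The following is a minimal sketch; the helper names `is_code_point`, `is_surrogate`, and `is_scalar_value` are hypothetical and are not defined by this document.

```cpp
#include <cstdint>

// Hypothetical helpers; only the numeric ranges come from Note 1.
constexpr bool is_code_point(std::uint32_t v) {
    return v <= 0x10FFFF;                    // [0, 10FFFF]
}

constexpr bool is_surrogate(std::uint32_t v) {
    return v >= 0xD800 && v <= 0xDFFF;       // [D800, DFFF]
}

constexpr bool is_scalar_value(std::uint32_t v) {
    // A scalar value is any code point that is not a surrogate.
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_scalar_value(0x0041), "U+0041 is a scalar value");
static_assert(!is_scalar_value(0xDC00), "surrogates are excluded");
static_assert(!is_scalar_value(0x110000), "outside the codespace");
```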

The basic character set is a subset of the translation character set, consisting of 99 characters as specified in Table 1.

In this document, glyphs are often used to identify elements of the basic character set.

[Note 2:

Unicode short names are given only as a means of identifying the character; the numerical value has no other meaning in this context.

— end note]

Table 1 — Basic character set [tab:lex.charset.basic]

character                                      glyph
U+0009            character tabulation
U+000b            line tabulation
U+000c            form feed
U+0020            space
U+000a            line feed                    new-line
U+0021            exclamation mark             !
U+0022            quotation mark               "
U+0023            number sign                  #
U+0024            dollar sign                  $
U+0025            percent sign                 %
U+0026            ampersand                    &
U+0027            apostrophe                   '
U+0028            left parenthesis             (
U+0029            right parenthesis            )
U+002a            asterisk                     *
U+002b            plus sign                    +
U+002c            comma                        ,
U+002d            hyphen-minus                 -
U+002e            full stop                    .
U+002f            solidus                      /
U+0030 .. U+0039  digit zero .. nine           0 1 2 3 4 5 6 7 8 9
U+003a            colon                        :
U+003b            semicolon                    ;
U+003c            less-than sign               <
U+003d            equals sign                  =
U+003e            greater-than sign            >
U+003f            question mark                ?
U+0040            commercial at                @
U+0041 .. U+005a  latin capital letter a .. z  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
U+005b            left square bracket          [
U+005c            reverse solidus              \
U+005d            right square bracket         ]
U+005e            circumflex accent            ^
U+005f            low line                     _
U+0060            grave accent                 `
U+0061 .. U+007a  latin small letter a .. z    a b c d e f g h i j k l m n o p q r s t u v w x y z
U+007b            left curly bracket           {
U+007c            vertical line                |
U+007d            right curly bracket          }
U+007e            tilde                        ~

The basic literal character set consists of all characters of the basic character set, plus the control characters specified in Table 2.

Table 2 — Additional control characters in the basic literal character set [tab:lex.charset.literal]

U+0000            null
U+0007            alert
U+0008            backspace
U+000d            carriage return

The ordinary literal encoding is the encoding applied to an ordinary character or string literal.

The wide literal encoding is the encoding applied to a wide character or string literal.

A literal encoding or a locale-specific encoding of one of the execution character sets ([character.seq]) encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element.

[Note 3:

A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set.

— end note]
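The single-code-unit requirement can be checked at compile time for the ordinary literal encoding. The sketch below is illustrative only: the array `basic_literal_chars` and the function `distinct_and_non_negative` are hypothetical names, and the check covers only the encoding chosen by the implementation that compiles it.

```cpp
#include <cstddef>

// All 99 characters of Table 1, followed by alert, backspace, and
// carriage return from Table 2. The terminating null supplies U+0000.
constexpr char basic_literal_chars[] =
    "\t\v\f \n!\"#$%&'()*+,-./0123456789:;<=>?@"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`"
    "abcdefghijklmnopqrstuvwxyz{|}~"
    "\a\b\r";

// Returns true if the first n code units are pairwise distinct and
// none of them is negative.
constexpr bool distinct_and_non_negative(const char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (s[i] < 0) return false;
        for (std::size_t j = i + 1; j < n; ++j) {
            if (s[i] == s[j]) return false;
        }
    }
    return true;
}

// 99 + 4 elements, each occupying exactly one code unit.
static_assert(sizeof(basic_literal_chars) == 103, "one code unit per element");
static_assert(distinct_and_non_negative(basic_literal_chars,
                                        sizeof(basic_literal_chars)),
              "code units are distinct and non-negative");
```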

The U+0000 null character is encoded as the value 0.

No other element of the translation character set is encoded with a code unit of value 0.

The code unit value of each decimal digit character after the digit 0 (U+0030) is one greater than the value of the previous.
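Because the digit code units are consecutive, subtracting the code unit of `0` converts a digit character to its numeric value in any conforming literal encoding. A minimal sketch, with `digit_value` as a hypothetical helper name:

```cpp
// Hypothetical helper: numeric value of a decimal digit character,
// or -1 if c is not a decimal digit.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

// Each digit is one greater than the previous one.
static_assert('1' == '0' + 1, "digits are consecutive");
static_assert('9' - '0' == 9, "digits are consecutive");

// The null character is encoded as the value 0 in both encodings.
static_assert('\0' == 0 && L'\0' == 0, "null is encoded as 0");
```

No comparable guarantee is stated here for letters, so `c - 'a'` is not portable in the same way.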

The ordinary and wide literal encodings are otherwise implementation-defined.

For a UTF-8, UTF-16, or UTF-32 literal, the implementation shall encode the Unicode scalar value corresponding to each character of the translation character set as specified in the Unicode Standard for the respective Unicode encoding form.
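For illustration, the UTF-8 encoding form specified by the Unicode Standard can be sketched as follows. The type `Utf8` and the function `encode_utf8` are hypothetical names used only for this example; an implementation performs the equivalent transformation when it processes a UTF-8 literal.

```cpp
#include <cstdint>

// Hypothetical result type: up to four code units and their count.
struct Utf8 {
    unsigned char bytes[4];
    int length;  // 0 if the input is not a Unicode scalar value
};

constexpr Utf8 encode_utf8(std::uint32_t v) {
    Utf8 out{{0, 0, 0, 0}, 0};
    // Reject surrogates and values outside the codespace.
    if ((v >= 0xD800 && v <= 0xDFFF) || v > 0x10FFFF) return out;
    if (v < 0x80) {
        out.bytes[0] = static_cast<unsigned char>(v);
        out.length = 1;
    } else if (v < 0x800) {
        out.bytes[0] = static_cast<unsigned char>(0xC0 | (v >> 6));
        out.bytes[1] = static_cast<unsigned char>(0x80 | (v & 0x3F));
        out.length = 2;
    } else if (v < 0x10000) {
        out.bytes[0] = static_cast<unsigned char>(0xE0 | (v >> 12));
        out.bytes[1] = static_cast<unsigned char>(0x80 | ((v >> 6) & 0x3F));
        out.bytes[2] = static_cast<unsigned char>(0x80 | (v & 0x3F));
        out.length = 3;
    } else {
        out.bytes[0] = static_cast<unsigned char>(0xF0 | (v >> 18));
        out.bytes[1] = static_cast<unsigned char>(0x80 | ((v >> 12) & 0x3F));
        out.bytes[2] = static_cast<unsigned char>(0x80 | ((v >> 6) & 0x3F));
        out.bytes[3] = static_cast<unsigned char>(0x80 | (v & 0x3F));
        out.length = 4;
    }
    return out;
}

// U+00E9 encodes as C3 A9, matching the first code unit of the literal.
static_assert(encode_utf8(0x00E9).length == 2, "two code units");
static_assert(encode_utf8(0x00E9).bytes[0] ==
                  static_cast<unsigned char>(u8"\u00E9"[0]),
              "agrees with the UTF-8 literal");
```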

5.3.2 Universal character names [lex.universal.char]

n-char:
    any member of the translation character set except the U+007d right curly bracket or new-line character

The universal-character-name construct provides a way to name any element in the translation character set using just the basic character set.

A universal-character-name of the form \u hex-quad, \U hex-quad hex-quad, or \u{simple-hexadecimal-digit-sequence} designates the character in the translation character set whose Unicode scalar value is the hexadecimal number represented by the sequence of hexadecimal-digits in the universal-character-name.

The program is ill-formed if that number is not a Unicode scalar value.
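As an illustration, the following sketch uses the \u hex-quad and \U hex-quad hex-quad forms in UTF-16 and UTF-32 character literals; the variable names are hypothetical. The braced form is omitted because it needs a compiler that implements it. A name whose number falls in the surrogate range, such as D800, would make the program ill-formed.

```cpp
// \u followed by one hex-quad.
constexpr char32_t e_acute = U'\u00E9';
constexpr char16_t euro_sign = u'\u20AC';

// \U followed by two hex-quads, for values above FFFF.
constexpr char32_t grinning_face = U'\U0001F600';

// In a UTF-32 literal the code unit equals the Unicode scalar value.
static_assert(e_acute == 0x00E9, "U+00E9");
static_assert(euro_sign == 0x20AC, "U+20AC");
static_assert(grinning_face == 0x1F600, "U+1F600");
```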

A universal-character-name that is a named-universal-character designates the corresponding character in the Unicode Standard (chapter 4.8 Name) if the n-char-sequence is equal to its character name or to one of its character name aliases of type “control”, “correction”, or “alternate”; otherwise, the program is ill-formed.

[Note 1:

These aliases are listed in the Unicode Character Database's NameAliases.txt.

None of these names or aliases have leading or trailing spaces.

— end note]