5 Lexical conventions [lex]

5.5 Preprocessing tokens [lex.pptoken]

A preprocessing token is the minimal lexical element of the language in translation phases 3 through 5.

The categories of preprocessing token are: header names, placeholder tokens produced by preprocessing import and module directives (import-keyword, module-keyword, and export-keyword), identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single non-whitespace characters that do not lexically match the other preprocessing token categories.

If a U+0027 apostrophe, a U+0022 quotation mark, or any character not in the basic character set matches the last category, the program is ill-formed.

Preprocessing tokens can be separated by whitespace; this consists of comments ([lex.comment]), or whitespace characters (U+0020 space, U+0009 character tabulation, new-line, U+000b line tabulation, and U+000c form feed), or both.

[Note 1:

As described in [cpp], in certain circumstances during translation phase 4, whitespace (or the absence thereof) serves as more than preprocessing token separation.

Whitespace can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.

— end note]
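As an illustrative sketch of the first sentence of the note (the macro names below are hypothetical and not taken from the standard), whitespace before the opening parenthesis in a #define decides whether the macro is object-like or function-like, and the # operator reduces each whitespace run inside its argument to a single space:

```cpp
// Hypothetical macro names, chosen only for this illustration.

// No whitespace between the name and '(' : function-like macro with parameter x.
#define TWICE(x) ((x) + (x))

// Whitespace between the name and '(' : object-like macro whose replacement
// list is the token sequence (x) ((x) + (x)). It is defined but not used below.
#define NOT_A_CALL (x) ((x) + (x))

// The # operator spells its argument with leading and trailing whitespace
// removed and each interior whitespace run reduced to one space.
#define STR(x) #x

constexpr int doubled = TWICE(21);
constexpr char spelled[] = STR(  a   +   b  );
constexpr char tight[] = STR(a+b);

static_assert(doubled == 42, "TWICE is function-like");
static_assert(sizeof(spelled) == 6, "a + b plus the terminator");
static_assert(sizeof(tight) == 4, "a+b plus the terminator");
```

The two stringified spellings differ only because the argument contained whitespace between its preprocessing tokens in one case and none in the other.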

Each preprocessing token that is converted to a token ([lex.token]) shall have the lexical form of a keyword, an identifier, a literal, or an operator or punctuator.

The import-keyword is produced by preprocessing an import directive ([cpp.import]), the module-keyword is produced by preprocessing a module directive ([cpp.module]), and the export-keyword is produced by preprocessing either of the previous two directives.

[Note 2:

None has any observable spelling.

— end note]

If the input stream has been parsed into preprocessing tokens up to a given character:

  • If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal.

    Between the initial and final double quote characters of the raw string, any transformations performed in phase 2 (line splicing) are reverted; this reversion is applied before any d-char, r-char, or delimiting parenthesis is identified.

    The raw string literal is defined as the shortest sequence of characters that matches the raw-string pattern encoding-prefix(opt) R raw-string.

  • Otherwise, if the next three characters are <:: and the subsequent character is neither : nor >, the < is treated as a preprocessing token by itself and not as the first character of the alternative token <:.

  • Otherwise, if the next three characters are [:: and the subsequent character is not :, or if the next three characters are [:>, the [ is treated as a preprocessing token by itself and not as the first character of the preprocessing token [:.

    [Note 3:

    The tokens [: and :] cannot be composed from digraphs.

    — end note]

  • Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail, except that

    • a string-literal token is never formed when a header-name token can be formed, and
    • a header-name ([lex.header]) is only formed
      • immediately after the include, embed, or import preprocessing token in a #include ([cpp.include]), #embed ([cpp.embed]), or import ([cpp.import]) directive, respectively, or
      • immediately after a preprocessing token sequence of __has_include or __has_embed immediately followed by ( in a #if, #elif, or #embed directive ([cpp.cond], [cpp.embed]).
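The <:: and [:: exceptions in the list above can be sketched as follows. This is a minimal illustration under the assumption of a hypothetical namespace demo and class template Box, neither of which comes from the standard; it is not the only way these rules surface in real code.

```cpp
#include <type_traits>

// Hypothetical names used only for this illustration.
namespace demo {
struct Widget {};
constexpr int index = 1;
}  // namespace demo

template <class T>
struct Box {
    using type = T;
};

// '<' is followed by '::' and then 'd' (neither ':' nor '>'), so the '<' stands
// alone: this is Box < ::demo::Widget >, not Box [ :demo::Widget >.
using GlobalBox = Box<::demo::Widget>;
static_assert(std::is_same<GlobalBox::type, demo::Widget>::value, "");

// Where the digraphs are actually intended, '<:' and ':>' still mean '[' and ']'.
constexpr int table<:3:> = {10, 20, 30};

// '[' is followed by '::' and then 'd' (not ':'), so the '[' stands alone.
constexpr int picked = table[::demo::index];
static_assert(picked == 20, "");
```

Without the first exception, a template argument that starts with the global scope operator would need a space after the opening angle bracket.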

[Example 1:

  #define R "x"
  const char* s = R"y";           // ill-formed raw string, not "x" "y"

— end example]
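For contrast with Example 1, the following sketch shows well-formed variants. The variable names are hypothetical; the macro R mirrors the one in the example. It illustrates that R" always begins a raw string literal, that whitespace after R yields an ordinary identifier instead, and that a line splice inside a raw string is reverted.

```cpp
// Mirrors the macro from Example 1; it has no effect on raw string literals.
#define R "x"

// R" starts a raw string literal, so the macro R is not expanded here.
constexpr char raw[] = R"(y)";

// With whitespace after R, the identifier R is its own preprocessing token,
// is macro-expanded to "x", and is then concatenated with "y".
constexpr char joined[] = R "y";

// Inside a raw string the phase 2 line splice is reverted, so the backslash
// and the new-line both remain part of the literal.
constexpr char raw_splice[] = R"(a\
b)";

// In an ordinary string literal the splice stays applied and removes both.
constexpr char plain_splice[] = "a\
b";

#undef R  // keep the short macro name from leaking into later code

static_assert(sizeof(raw) == 2, "y plus the terminator");
static_assert(sizeof(joined) == 3, "xy plus the terminator");
static_assert(sizeof(raw_splice) == 5, "a, backslash, new-line, b, terminator");
static_assert(sizeof(plain_splice) == 3, "ab plus the terminator");
```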

[Example 2:

The program fragment 0xe+foo is parsed as a preprocessing number token (one that is not a valid integer-literal or floating-point-literal token), even though a parse as three preprocessing tokens 0xe, +, and foo can produce a valid expression (for example, if foo is a macro defined as 1).

Similarly, the program fragment 1E1 is parsed as a preprocessing number (one that is a valid floating-point-literal token), whether or not E is a macro name.

— end example]
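A compilable companion to Example 2, offered as a sketch: the macros foo and E follow the example, while the variable names are hypothetical. The ill-formed spelling is left in a comment so that the block still compiles.

```cpp
// Mirrors Example 2: foo and E are object-like macros.
#define foo 1
#define E 5

// With whitespace there are three preprocessing tokens (0xe, +, foo),
// so foo is macro-expanded and the sum is 14 + 1.
constexpr int spaced = 0xe + foo;

// 1E1 is a single preprocessing number, so the macro E is never consulted;
// the token is the floating-point literal 1E1.
constexpr double ten = 1E1;

// Uncommenting the next line makes the program ill-formed: 0xe+foo is one
// preprocessing number that is not a valid literal.
// constexpr int fused = 0xe+foo;

#undef foo  // keep the short macro names from leaking into later code
#undef E

static_assert(spaced == 15, "");
static_assert(ten == 10.0, "");
```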

[Example 3:

The program fragment x+++++y is parsed as x ++ ++ + y, which, if x and y have integral types, violates a constraint on increment operators, even though the parse x ++ + ++ y can yield a correct expression.

— end example]
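The longest-match rule behind Example 3 can be sketched with a toy scanner. Everything here is an assumption made for illustration: the operator table is deliberately tiny, identifiers are limited to letters and underscores, and the function names are hypothetical. It is not how any particular compiler implements phase 3, and it requires C++17 for std::string_view.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical, deliberately tiny operator table for this sketch.
constexpr std::string_view kOperators[] = {"+", "++", "+="};

constexpr bool is_identifier_char(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

// Returns the length of the longest token that starts at the front of `rest`.
constexpr std::size_t longest_token(std::string_view rest) {
    if (rest.empty()) return 0;
    if (is_identifier_char(rest[0])) {
        std::size_t n = 1;
        while (n < rest.size() && is_identifier_char(rest[n])) ++n;
        return n;
    }
    std::size_t best = 1;  // fallback: a single character
    for (std::string_view op : kOperators) {
        if (op.size() > best && rest.substr(0, op.size()) == op) best = op.size();
    }
    return best;
}

// Scanning x+++++y from left to right yields x, ++, ++, +, y.
static_assert(longest_token("x+++++y") == 1);
static_assert(longest_token("+++++y") == 2);
static_assert(longest_token("+++y") == 2);
static_assert(longest_token("+y") == 1);

// Whitespace selects the other split, x ++ + ++ y, which is a valid expression.
constexpr int spaced_sum(int x, int y) {
    return x++ + ++y;
}
static_assert(spaced_sum(1, 2) == 4);
```

The scanner never backtracks: once ++ has been taken twice, only a single + remains before y, which is why the fused spelling cannot be read as x ++ + ++ y.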