cpp/regex/ecmascript

This page describes the regular expression grammar that is used when std::basic_regex is constructed with syntax_option_type set to ECMAScript (the default). See syntax_option_type for the other supported regular expression grammars.

The ECMAScript 3 regular expression grammar in C++ is the ECMA-262 grammar with the C++-specific modifications described below.

Overview
The modified regular expression grammar is mostly the ECMAScript RegExp grammar with a POSIX-type expansion on locales under ClassAtom. Some clarifications on equality checks and number parsing are made. Many of the examples here can also be tried with the equivalent JavaScript RegExp in a browser console.

The "normative references" in the standard specify ECMAScript 3. We link to the ECMAScript 5.1 spec here because it is a version with only minor changes from ECMAScript 3, and it also has an HTML version. See the MDN Guide on JavaScript RegExp for an overview of the dialect features.

Alternatives
A regular expression pattern is a sequence of one or more Alternatives, separated by the disjunction operator | (in other words, the disjunction operator has the lowest precedence).

Pattern ::
 * Disjunction

Disjunction ::
 * Alternative
 * Alternative | Disjunction

The pattern first tries to skip the Disjunction and match the left Alternative followed by the rest of the regular expression (after the Disjunction).

If it fails, it tries to skip the left Alternative and match the right Disjunction (followed by the rest of the regular expression).

If the left Alternative, the right Disjunction, and the remainder of the regular expression all have choice points, all choices in the remainder of the expression are tried before moving on to the next choice in the left Alternative. If choices in the left Alternative are exhausted, the right Disjunction is tried instead of the left Alternative.

Any capturing parentheses inside a skipped Alternative produce empty submatches.

Terms
Each Alternative is either empty or is a sequence of Terms (with no separators between the Terms)

Alternative ::
 * [empty]
 * Alternative Term

An empty Alternative always matches and does not consume any input.

Consecutive Terms try to simultaneously match consecutive portions of the input.

If the left Alternative, the right Term, and the remainder of the regular expression all have choice points, all choices in the remainder of the expression are tried before moving on to the next choice in the right Term, and all choices in the right Term are tried before moving on to the next choice in the left Alternative.

Quantifiers

Each Term is either an Assertion (see below), or an Atom (see below), or an Atom immediately followed by a Quantifier.

Term ::
 * Assertion
 * Atom
 * Atom Quantifier

Each Quantifier is either a greedy quantifier (which consists of just one QuantifierPrefix) or a non-greedy quantifier (which consists of one QuantifierPrefix followed by the question mark ?).

Quantifier ::
 * QuantifierPrefix
 * QuantifierPrefix ?

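Each QuantifierPrefix determines two numbers: the minimum number of repetitions and the maximum number of repetitions, as follows:

 * the prefix * : minimum zero, maximum unbounded
 * the prefix + : minimum one, maximum unbounded
 * the prefix ? : minimum zero, maximum one
 * { DecimalDigits } : minimum and maximum both equal to the value of DecimalDigits
 * { DecimalDigits , } : minimum equal to the value of DecimalDigits, maximum unbounded
 * { DecimalDigits , DecimalDigits } : minimum equal to the value of the DecimalDigits before the comma, maximum equal to the value of the DecimalDigits after the comma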

The values of the individual DecimalDigits are obtained by calling std::regex_traits::value on each of the digits.

An Atom followed by a Quantifier is repeated the number of times specified by the Quantifier. A Quantifier can be non-greedy, in which case the Atom pattern is repeated as few times as possible while still matching the remainder of the regular expression, or it can be greedy, in which case the Atom pattern is repeated as many times as possible while still matching the remainder of the regular expression.

The Atom pattern is what is repeated, not the input that it matches, so different repetitions of the Atom can match different input substrings.

If the Atom and the remainder of the regular expression all have choice points, the Atom is first matched as many (or as few, if non-greedy) times as possible. All choices in the remainder of the regular expression are tried before moving on to the next choice in the last repetition of Atom. All choices in the last (nth) repetition of Atom are tried before moving on to the next choice in the next-to-last (n–1)st repetition of Atom; at which point it may turn out that more or fewer repetitions of Atom are now possible; these are exhausted (again, starting with either as few or as many as possible) before moving on to the next choice in the (n-1)st repetition of Atom and so on.

The Atom's captures are cleared each time it is repeated.

Assertions
Assertions match conditions, rather than substrings of the input string. They never consume any characters from the input. Each Assertion is one of the following

Assertion ::
 * ^
 * $
 * \ b
 * \ B
 * ( ? = Disjunction )
 * ( ? ! Disjunction )

The assertion ^ (beginning of line) matches
 * The position that immediately follows a LineTerminator character
 * The beginning of the input (unless std::regex_constants::match_not_bol is enabled)

The assertion $ (end of line) matches
 * The position of a LineTerminator character
 * The end of the input (unless std::regex_constants::match_not_eol is enabled)

In the two assertions above and in the Atom . below, LineTerminator is one of the following four characters: U+000A (\n or line feed), U+000D (\r or carriage return), U+2028 (line separator), or U+2029 (paragraph separator).

The assertion \b (word boundary) matches
 * The beginning of a word (current character is a letter, digit, or underscore, and the previous character is not)
 * The end of a word (current character is not a letter, digit, or underscore, and the previous character is one of those)
 * The beginning of input if the first character is a letter, digit, or underscore (unless std::regex_constants::match_not_bow is enabled)
 * The end of input if the last character is a letter, digit, or underscore (unless std::regex_constants::match_not_eow is enabled)

The assertion \B (negative word boundary) matches everything EXCEPT the following
 * The beginning of a word (current character is a letter, digit, or underscore, and the previous character is not one of those or does not exist)
 * The end of a word (current character is not a letter, digit, or underscore (or the matcher is at the end of input), and the previous character is one of those)

The assertion ( ? = Disjunction ) (zero-width positive lookahead) matches if Disjunction would match the input at the current position.

The assertion ( ? ! Disjunction ) (zero-width negative lookahead) matches if Disjunction would NOT match the input at the current position.

For both Lookahead assertions, when matching the Disjunction, the position is not advanced before matching the remainder of the regular expression. Also, if Disjunction can match at the current position in several ways, only the first one is tried.

ECMAScript forbids backtracking into the lookahead Disjunctions, which affects the behavior of backreferences into a positive lookahead from the remainder of the regular expression. Backreferences into the negative lookahead from the rest of the regular expression are always undefined (since the lookahead Disjunction must fail to proceed).

Note: Lookahead assertions may be used to create logical AND between multiple regular expressions.

Atoms
An Atom can be one of the following:

Atom ::
 * PatternCharacter
 * .
 * \ AtomEscape
 * CharacterClass
 * ( Disjunction )
 * ( ? : Disjunction )

where AtomEscape ::
 * DecimalEscape
 * CharacterEscape
 * CharacterClassEscape

Different kinds of atoms evaluate differently.

Sub-expressions
The Atom ( Disjunction ) is a marked subexpression: it executes the Disjunction and stores a copy of the input substring that was consumed by Disjunction in the submatch array at the index that corresponds to the number of times the left open parenthesis ( of marked subexpressions has been encountered in the entire regular expression at this point.

Besides being returned in the std::match_results, the captured submatches are accessible as backreferences (\1, \2, ...) and can be referenced in regular expressions. Note that std::regex_replace uses $ instead of \ for backreferences ($1, $2, ...) in the same manner as String.prototype.replace (ECMA-262, part 15.5.4.11).

The Atom ( ? : Disjunction ) (non-marking subexpression) simply evaluates the Disjunction and does not store its results in the submatch. This is a purely lexical grouping.

Backreferences
DecimalEscape ::
 * DecimalIntegerLiteral [lookahead ∉ DecimalDigit]

If \ is followed by a decimal number N whose first digit is not 0, then the escape sequence is considered to be a backreference. The value N is obtained by calling std::regex_traits::value on each of the digits and combining their results using base-10 arithmetic. It is an error if N is greater than the total number of left capturing parentheses in the entire regular expression.

When a backreference \N appears as an Atom, it matches the same substring as what is currently stored in the Nth element of the submatch array.

The decimal escape \0 is NOT a backreference: it is a character escape that represents the NUL character. It cannot be followed by a decimal digit.

As above, note that std::regex_replace uses $ instead of \ for backreferences ($1, $2, ...).

Single character matches
The Atom . matches and consumes any one character from the input string except for LineTerminator (U+000D, U+000A, U+2029, or U+2028).

The Atom PatternCharacter, where PatternCharacter is any SourceCharacter EXCEPT the characters ^ $ \ . * + ? ( ) [ ] { } |, matches and consumes one character from the input if it is equal to this PatternCharacter.

The equality for this and all other single character matches is defined as follows:
 * If std::regex_constants::icase is set, the characters are equal if the return values of std::regex_traits::translate_nocase are equal.
 * Otherwise, if std::regex_constants::collate is set, the characters are equal if the return values of std::regex_traits::translate are equal.
 * Otherwise, the characters are equal if operator== returns true.

Each Atom that consists of the escape character \ followed by CharacterEscape, as well as the special DecimalEscape \0, matches and consumes one character from the input if it is equal to the character represented by the CharacterEscape. The following character escape sequences are recognized:

CharacterEscape ::
 * ControlEscape
 * c ControlLetter
 * HexEscapeSequence
 * UnicodeEscapeSequence
 * IdentityEscape

Here, ControlEscape is one of the following five characters:
 * f (form feed)
 * n (line feed)
 * r (carriage return)
 * t (horizontal tab)
 * v (vertical tab)

ControlLetter is any lowercase or uppercase ASCII letter, and this character escape matches the character whose code unit equals the remainder of dividing the value of the code unit of ControlLetter by 32. For example, \cD and \cd both match code unit 0x04 (EOT) because 'D' is 0x44 and 0x44 % 32 == 4, and 'd' is 0x64 and 0x64 % 32 == 4.

HexEscapeSequence is the letter x followed by exactly two HexDigits (where HexDigit is one of 0123456789abcdefABCDEF). This character escape matches the character whose code unit equals the numeric value of the two-digit hexadecimal number.

UnicodeEscapeSequence is the letter u followed by exactly four HexDigits. This character escape matches the character whose code unit equals the numeric value of this four-digit hexadecimal number. If the value does not fit in this std::basic_regex's CharT, std::regex_error is thrown.

IdentityEscape can be any non-alphanumeric character: for example, another backslash. It matches the character as-is.

Character classes
An Atom can represent a character class, that is, it will match and consume one character if it belongs to one of the predefined groups of characters.

A character class can be introduced through a character class escape:

Atom ::
 * CharacterClassEscape

or directly

Atom ::
 * CharacterClass

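The character class escapes are shorthands for some of the common character classes, as follows:

 * \d is the same as [[:digit:]]
 * \D is the same as [^[:digit:]]
 * \s is the same as [[:space:]]
 * \S is the same as [^[:space:]]
 * \w is the same as [_[:alnum:]]
 * \W is the same as [^_[:alnum:]]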

A CharacterClass is a bracket-enclosed sequence of ClassRanges, optionally beginning with the negation operator ^. If it begins with ^, this Atom matches any character that is NOT in the set of characters represented by the union of all ClassRanges. Otherwise, this Atom matches any character that IS in the set of the characters represented by the union of all ClassRanges.

CharacterClass ::
 * [ [lookahead ∉ {^}] ClassRanges ]
 * [ ^ ClassRanges ]

ClassRanges ::
 * [empty]
 * NonemptyClassRanges

NonemptyClassRanges ::
 * ClassAtom
 * ClassAtom NonemptyClassRangesNoDash
 * ClassAtom - ClassAtom ClassRanges

If a non-empty class range has the form ClassAtom - ClassAtom, it matches any character from a range defined as follows:

The first ClassAtom must match a single collating element c1 and the second ClassAtom must match a single collating element c2. To test if the input character c is matched by this range, the following steps are taken:
 * If std::regex_constants::collate is not on, the character is matched by direct comparison of code points: c is matched if c1 <= c && c <= c2
 * Otherwise (if std::regex_constants::collate is enabled):
   * If std::regex_constants::icase is enabled, all three characters (c, c1, and c2) are passed to std::regex_traits::translate_nocase
   * Otherwise (if std::regex_constants::icase is not set), all three characters (c, c1, and c2) are passed to std::regex_traits::translate
   * The results are compared using std::regex_traits::transform and the character c is matched if transformed c1 <= transformed c && transformed c <= transformed c2

The character - is treated literally if it is
 * the first or last character of ClassRanges
 * the beginning or end ClassAtom of a dash-separated range specification
 * immediately after a dash-separated range specification
 * escaped with a backslash as a CharacterEscape

NonemptyClassRangesNoDash ::
 * ClassAtom
 * ClassAtomNoDash NonemptyClassRangesNoDash
 * ClassAtomNoDash - ClassAtom ClassRanges

ClassAtom ::
 * -
 * ClassAtomNoDash
 * ClassAtomExClass
 * ClassAtomCollatingElement
 * ClassAtomEquivalence

ClassAtomNoDash ::
 * SourceCharacter but not one of \ or ] or -
 * \ ClassEscape

Each ClassAtomNoDash represents a single character -- either SourceCharacter as-is or escaped as follows:

ClassEscape ::
 * DecimalEscape
 * b
 * CharacterEscape
 * CharacterClassEscape

The special ClassEscape \b produces a character set that matches the code unit U+0008 (backspace). Outside of CharacterClass, it is the word-boundary Assertion.

The use of \B and the use of any backreference (DecimalEscape other than zero) inside a CharacterClass is an error.

The characters - and ] may need to be escaped in some situations in order to be treated as atoms. Other characters that have special meaning outside of CharacterClass, such as * or ?, do not need to be escaped.

POSIX-based character classes
These character classes are an extension to the ECMAScript grammar, and are equivalent to character classes found in the POSIX regular expressions.

ClassAtomExClass ::
 * [: ClassName :]

Represents all characters that are members of the named character class ClassName. The name is valid only if std::regex_traits::lookup_classname returns non-zero for this name. As described in std::regex_traits::lookup_classname, the following names are guaranteed to be recognized: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit, d, s, w. Additional names may be provided by system-supplied locales (such as jdigit or jkanji in Japanese) or implemented as a user-defined extension.

ClassAtomCollatingElement ::
 * [. ClassName .]

Represents the named collating element, which may represent a single character or a sequence of characters that collates as a single unit under the imbued locale, such as [.tilde.] or [.ch.] in Czech. The name is valid only if std::regex_traits::lookup_collatename is not an empty string.

When using std::regex_constants::collate, collating elements can always be used as end points of a range (e.g. [[.dz.]-g] in Hungarian).

ClassAtomEquivalence ::
 * [= ClassName =]

Represents all characters that are members of the same equivalence class as the named collating element, that is, all characters whose primary collation key is the same as that for collating element ClassName. The name is valid only if std::regex_traits::lookup_collatename for that name is not an empty string and if the value returned by std::regex_traits::transform_primary for the result of the call to std::regex_traits::lookup_collatename is not an empty string.

A primary sort key is one that ignores case, accentuation, or locale-specific tailorings; so for example [[=a=]] matches a as well as its uppercase and accented variants, such as A, á, or Ä.

ClassName ::
 * ClassNameCharacter
 * ClassNameCharacter ClassName

ClassNameCharacter ::
 * SourceCharacter but not one of . or = or :