wxRegEx Class—wxWidgets Regular Expressions
wxWidgets 3.1.5 and prior used the exact same regular expression engine that Henry Spencer developed for Tcl 8.2. This means that wxWidgets 3.1.5 and prior supported the same three regular expression flavors: Tcl Advanced Regular Expressions (AREs), POSIX Extended Regular Expressions (EREs), and POSIX Basic Regular Expressions (BREs). Unlike in Tcl, EREs were the default rather than the far more powerful AREs. You could select the ARE flavor with wxRE_ADVANCED and the BRE flavor with wxRE_BASIC. When reading the regular expressions tutorial on this website, if you are still using wxWidgets 3.1.5 or prior, follow what is written for Tcl when using wxRE_ADVANCED. Follow what is written for POSIX BRE when using wxRE_BASIC, or what is written for POSIX ERE when using wxRE_EXTENDED or none of these three flags.
wxWidgets 3.1.6 switched its regular expression engine to PCRE2 10.37. wxWidgets 3.3.0 upgraded to PCRE2 10.45. There are significant differences between Tcl’s regex engine and PCRE2. The claims made by the wxRegEx class reference in the wxWidgets 3.1.6 documentation that the differences are slight and that wxRE_EXTENDED and wxRE_ADVANCED are now synonyms are not true. When migrating code from wxWidgets 3.1.5 or prior to wxWidgets 3.1.6 or later, you should make sure to test all your regexes again.
The wxRegEx class actually tries to mask some of those differences depending on which flavor flag your code uses. If you set wxRE_EXTENDED (or no flag at all because it’s still the default) then wxRegEx uses the exact same regex flavor as PCRE2 10.37 or 10.45. You can then follow what is said about PCRE2 in the regex tutorial. This is what you should do in all new code.
If you set wxRE_ADVANCED then wxRegEx modifies your regex before passing it to the underlying PCRE2 engine. It converts Tcl word boundaries to PCRE2 word boundaries throughout the regex. If the regex starts with a mode modifier then it converts the Tcl modifier letters to PCRE2 modifier letters. This means that with wxRE_ADVANCED, some modifier letters have a different meaning in a mode modifier at the start of the regex than in a mode modifier in the middle of a regex or in a modifier span.
If you set wxRE_BASIC then wxRegEx manipulates the backslashes in your regular expression to convert the POSIX BRE to PCRE2. But it doesn’t disable PCRE2 syntax. The result is that wxRE_BASIC produces a new regex flavor that is a mashup of BRE-style quantifiers and grouping with nearly the full set of features of PCRE2. New code should not use wxRE_BASIC.
wxWidgets Replacement Text Syntax
The wxRegEx::Replace() method uses the same syntax for the replacement text as Tcl’s regsub command. When reading the replacement text tutorial and reference, follow what is said about Tcl. The move to PCRE2 has not changed this.
The wxRegEx Class
To use the wxWidgets regex engine, you need to instantiate the wxRegEx class. The class has two constructors. wxRegEx() creates an empty regex object. Before you can use the object, you have to call wxRegEx::Compile(). wxRegEx::IsValid() will return false until you do.
wxRegEx(const wxString& expr, int flags = wxRE_EXTENDED) creates a wxRegEx object with a compiled regular expression. The constructor will always create the object, even if your regular expression is invalid. Check wxRegEx::IsValid() to determine if the regular expression was compiled successfully.
bool wxRegEx::Compile(const wxString& pattern, int flags = wxRE_EXTENDED) compiles a regular expression. You can call this method on any wxRegEx object, including one that already holds a compiled regular expression. Doing so will simply replace the regular expression held by the wxRegEx object. Pass your regular expression as a string as the first parameter. The second parameter allows you to set certain matching options.
The flags argument can be set to 0 or omitted to use the default modes. You can set multiple modes by combining the flags with bitwise or.
You can include one of wxRE_EXTENDED, wxRE_ADVANCED, or wxRE_BASIC to select the regex flavor. The default is wxRE_EXTENDED. wxRE_ICASE makes the regular expression case insensitive. The default is case sensitive.
wxRE_NOSUB tells wxRegEx not to retrieve capturing groups after the match. This flag is not equivalent to PCRE2_NO_AUTO_CAPTURE. It does not turn capturing groups into non-capturing groups. It does not change how the regex works. It only means you won’t be able to use backreferences in the replacement text, or query the part of the regex matched by each capturing group. If you won’t be using these anyway, setting the wxRE_NOSUB flag slightly improves performance.
Setting the wxRE_NEWLINE flag is equivalent to using “newline-sensitive matching” or (?n) with Tcl’s regex engine. It is equivalent to setting PCRE2_MULTILINE or using (?m) with PCRE2. In this mode, the dot does not match line breaks and the caret and dollar match before and after line breaks. Omitting the wxRE_NEWLINE flag is equivalent to using “non-newline-sensitive matching” or (?s) with Tcl’s regex engine. It is equivalent to setting PCRE2_DOTALL or using (?s) with PCRE2. In this mode, the dot matches line breaks and the caret and dollar only match at the start and end of the string.
The wxRE_NOTEMPTY flag is new in wxWidgets 3.1.6. It tells PCRE2 to skip zero-length matches by backtracking.
wxRegEx Status Functions
wxRegEx::IsValid() returns true when the wxRegEx object holds a compiled regular expression.
wxRegEx::GetMatchCount() is rather poorly named. It does not return the number of matches found by Matches(). In fact, you can call GetMatchCount() right after Compile(), before you call Matches(). GetMatchCount() returns the number of capturing groups in your regular expression, plus one for the overall regex match. You can use this to determine the number of backreferences you can use in the replacement text, and the highest index you can pass to GetMatch(), which is one less than the returned value. If your regex has no capturing groups, GetMatchCount() returns 1. In that case, \0 is the only valid backreference you can use in the replacement text.
GetMatchCount() returns 0 in case of an error. This will happen if the wxRegEx object does not hold a compiled regular expression, or if you compiled it with wxRE_NOSUB.
Finding and Extracting Matches
If you want to test whether a regex matches a string, or extract the substring matched by the regex, you first need to call the wxRegEx::Matches() method. It has 3 variants, allowing you to pass either a const wxChar* or a wxString as the subject string. When passing a wxChar pointer, you can specify the length as a third parameter. If you don’t, wxStrLen() will be called to compute the length. If you plan to loop over all regex matches in a string, you should call wxStrLen() yourself outside the loop and pass the result to wxRegEx::Matches().
bool wxRegEx::Matches(const wxChar* text, int flags = 0) const
bool wxRegEx::Matches(const wxChar* text, int flags, size_t len) const
bool wxRegEx::Matches(const wxString& text, int flags = 0) const
Matches() returns true if the regex matches all or part of the subject string that you passed in the text parameter. Add anchors to your regex if you want to test whether the regex matches the whole subject string.
Do not confuse the flags parameter with the one you pass to the Compile() method or the wxRegEx() constructor. All the flavor and matching mode options can only be set when compiling the regex.
The Matches() method allows only two flags: wxRE_NOTBOL and wxRE_NOTEOL. If you set wxRE_NOTBOL, then ^ and \A will not match at the start of the string. They will still match after embedded newlines if you turned on that matching mode. Likewise, specifying wxRE_NOTEOL tells $ and \Z not to match at the end of the string.
wxRE_NOTBOL is commonly used to implement a “find next” routine. The wxRegEx class does not provide such a function. To find the second match in the string, you’ll need to call wxRegEx::Matches() and pass it the part of the original subject string after the first match. Pass the wxRE_NOTBOL flag to indicate that you’ve cut off the start of the string you’re passing.
wxRE_NOTEOL can be useful if you’re processing a large set of data, and you want to apply the regex before you’ve read the whole data. Pass wxRE_NOTEOL while calling wxRegEx::Matches() as long as you haven’t read the entire string yet. Pass both wxRE_NOTBOL and wxRE_NOTEOL when doing a “find next” on incomplete data.
After a call to Matches() returns true, and you compiled your regex without the wxRE_NOSUB flag, you can call GetMatch() to get details about the overall regex match, and the parts of the string matched by the capturing groups in your regex.
bool wxRegEx::GetMatch(size_t* start, size_t* len, size_t index = 0) const retrieves the starting position of the match in the subject string, and the number of characters in the match.
wxString wxRegEx::GetMatch(const wxString& text, size_t index = 0) const returns the text that was matched.
For both calls, set the index parameter to zero (or omit it) to get the overall regex match. Set 1 <= index < GetMatchCount() to get the match of a capturing group in your regular expression. To determine the number of a group, count the opening brackets in your regular expression from left to right.
Searching and Replacing
The wxRegEx class offers three methods to do a search-and-replace. Replace() is the method that does the actual work. You can use ReplaceAll() and ReplaceFirst() as more readable ways to specify the 3rd parameter to Replace().
int wxRegEx::ReplaceAll(wxString* text, const wxString& replacement) const replaces all regex matches in text with replacement.
int wxRegEx::ReplaceFirst(wxString* text, const wxString& replacement) const replaces the first match of the regular expression in text with replacement.
int wxRegEx::Replace(wxString* text, const wxString& replacement, size_t maxMatches = 0) const allows you to specify how many replacements will be made. Passing 0 for maxMatches or omitting it does the same as ReplaceAll(). Setting it to 1 does the same as ReplaceFirst(). Pass a number greater than 1 to replace only the first maxMatches matches. If text contains fewer matches than you’ve asked for, then all matches will be replaced, without triggering an error.
All three calls return the actual number of replacements made. They return zero if the regex failed to match the subject text. A return value of -1 indicates an error. The replacements are made directly to the wxString that you pass as the first parameter.
wxWidgets uses the same syntax as Tcl for the replacement text. This is true even for wxWidgets 3.1.6 and later that use PCRE2 as their regex engine. You can use \0 as a placeholder for the whole regex match, and \1 through \9 for the text matched by one of the first nine capturing groups. You can also use & as a synonym of \0. Note that there’s no backslash in front of the ampersand. & is substituted with the whole regex match, while \& is substituted with a literal ampersand. Use \\ to insert a literal backslash. You only need to escape backslashes if they’re followed by a digit, to prevent the combination from being seen as a backreference. When specifying the replacement text as a literal string in C++ code, you need to double up all the backslashes, as the C++ compiler also treats backslashes as escape characters. So if you want to replace the match with the first backreference followed by the text &co, you’ll need to code that in C++ as _T("\\1\\&co").
