Character matching


Any ordinary character matches itself. A string of characters matches that string.

. Matches any one character (except newline)
   
[group] Any character in the group
[^group] Any character not in the group
[first-last] Any character between first and last, inclusive
[group-[group]] Subtracts one character class from another
   
\p{name}, \P{name} Any character in the Unicode category or block, any character not in the Unicode category or block
\t A tab character
\w, \W Any word character, any non-word character
\s, \S Any whitespace character, any non-whitespace character
\d, \D Any (decimal) digit, any non-digit
\b, \B Boundary between \w and \W, non-boundary
^, $ Start, end of input line
   
a|b Matches a or b
cat|dog Matches exactly "cat" or "dog"

 

Character matching examples:

(c|d)og Matches "cog" or "dog"
[cd]og Matches "cog" or "dog"
[a-zA-Z] Matches any English letter, capital or lower-case
[^lo]og Matches lots of things, including "dog" and "&og", but not "log" or "oog"
[c-f]og Matches "cog", "dog", "eog", or "fog"
[b-h-[d-g]]og The first part is [bcdefgh], with [defg] removed, so this is the same as [bch]og and matches "bog", "cog", or "hog"
\brat\b The \b means a word break, so this matches "rat" but not "brat" or "rats"
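
The examples above can be tried out with a short script. This is a minimal sketch in Python, not part of the tool itself. Python's re module is an assumption here and has no [b-h-[d-g]] subtraction syntax, so that one example is emulated with a negative lookahead. The word list and the helper name full_matches are made up for illustration.

```python
import re

words = ["cog", "dog", "log", "oog", "&og", "bog", "hog", "fog", "eog"]

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

alternation = full_matches(r"(c|d)og", words)
char_class = full_matches(r"[cd]og", words)
negated = full_matches(r"[^lo]og", words)
char_range = full_matches(r"[c-f]og", words)

# Assumption: Python's re cannot subtract character classes, so a negative
# lookahead for [d-g] in front of [b-h] stands in for [b-h-[d-g]].
subtracted = full_matches(r"(?![d-g])[b-h]og", words)

# \b keeps "rat" from matching inside "brat" or "rats".
whole_word = re.findall(r"\brat\b", "a rat, a brat, and two rats")

print(alternation, char_class, negated, char_range, subtracted, whole_word)
```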


Quantifiers

exp Matches exp exactly once
exp? Matches exp 0 or 1 times
exp+ Matches exp one or more times
exp* Matches exp zero or more times
exp{n} Matches exp n times in a row
exp{min,max} Matches exp between min and max times
exp{min,} Matches exp at least min times

Append a ? to a quantifier to make it a lazy match (as few as possible)
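
To illustrate the difference between a greedy and a lazy quantifier, here is a small sketch in Python (an assumption; the lazy ? suffix behaves the same way there). The sample text is made up.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ takes as much as it can, so one match spans the whole string.
greedy = re.findall(r"<.+>", text)

# Lazy: .+? takes as little as it can, so each tag is matched on its own.
lazy = re.findall(r"<.+?>", text)

print(greedy)
print(lazy)
```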

Quantifier examples:

Rat+ Matches "Rat", "Ratt", "Rattt", and so forth
shop(pe)? Matches "shop" and "shoppe"
bar\s*stool Matches "barstool", "bar stool", "bar     stool", and any amount of space in between
\d{3} Matches any 3 digits in a row

(\d{3}[-.\s])?\d{3}[-.\s]\d{4}

Matches, in order:

  1. An optional first three digits, followed by a period, a dash, or whitespace
  2. Three digits followed by a period, a dash, or whitespace
  3. Four digits

This matches "123-6789" and "123 456.7890" but not the whole of "(123) 456.7890" (without anchors, only the "456.7890" part of that string matches)
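
The phone number pattern can be checked with a few lines of Python (an assumption; this pattern uses only syntax that Python's re shares with the syntax described here). fullmatch tests the whole string, while search looks for the pattern anywhere inside it.

```python
import re

phone = re.compile(r"(\d{3}[-.\s])?\d{3}[-.\s]\d{4}")

samples = ["123-6789", "123 456.7890", "(123) 456.7890"]

# True when the entire string is a phone number according to the pattern.
whole = {s: bool(phone.fullmatch(s)) for s in samples}

# An unanchored search still finds the part after the area code.
partial = phone.search("(123) 456.7890")

print(whole)
print(partial.group())
```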

 


Grouping/Subexpressions

exp1|exp2 Matches exp1 or exp2
(exp) Matches exp and captures to a numbered group
(?<name>exp) Matches exp and captures to named group
(?'name'exp) Alternate syntax for named capture groups
(?:exp) Subexpression that does not capture to a group
\1 to \n Expression that matches the value of numbered group
\k<name> Expression that matches the value of named group

Subexpression examples:

(Mr|Ms|Miss|Mrs)\.?\b

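Matches any of these prefixes with or without a period after them. Because of the trailing \b, the period is part of the match only when a word character follows it directly (as in "Mr.Smith"); in "Mr. Smith" the match is "Mr".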

(?<prefix>\b(Mr|Ms|Miss|Mrs)\.?\b)

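Captures the matched prefix ("Mr", "Ms", etc.) into a group named "prefix". The period is part of the capture only when a word character directly follows it, because of the trailing \b.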

Note that the match will fail if no prefix is found. To make the prefix optional, put a ? after the closing parenthesis.

\bc(?'rhyme'[aeiou][^aeiou]{1,2}) in the h\k'rhyme'\b

Matches a c, then a vowel, then one or two non-vowels, and calls the vowel and non-vowels "rhyme". It then uses the value of "rhyme" later in the match.

This will match "cat in the hat", "cog in the hog" and "cost in the host" (which doesn't actually rhyme)
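
A sketch of the same named groups in Python, for experimenting. Python is an assumption here and spells the syntax differently: (?P<name>exp) for a named capture and (?P=name) for the backreference, instead of (?'name'exp) and \k'name'. The sample sentences are made up.

```python
import re

# Python spelling of \bc(?'rhyme'[aeiou][^aeiou]{1,2}) in the h\k'rhyme'\b
rhyme = re.compile(r"\bc(?P<rhyme>[aeiou][^aeiou]{1,2}) in the h(?P=rhyme)\b")

captured = {}
for phrase in ["cat in the hat", "cog in the hog", "cost in the host", "cat in the hut"]:
    m = rhyme.search(phrase)
    # Store the value of the "rhyme" group, or None when there is no match.
    captured[phrase] = m.group("rhyme") if m else None

# Named capture around the prefix alternation.
title = re.compile(r"(?P<prefix>\b(Mr|Ms|Miss|Mrs)\.?\b)")
prefix = title.search("Dear Mrs Jones").group("prefix")

print(captured)
print(prefix)
```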


Conditionals

(?(exp)expyes|expno) If exp matches, matches expyes. If exp does not match, matches expno.

The test of exp does not change the match location, so expyes is tried from the same position where exp was tested. expyes should therefore also match the text that exp matched, or else the yes branch can never succeed.
(?(name)expyes|expno) If the named capture group has a match, matches expyes, otherwise matches expno.

Conditional examples:

These examples have whitespace inserted into them for readability. They should be run with the /IPW flag.

(?(\d{2}-) \d{2}-\d{7}  | \d{3}-\d{2}-\d{4} )

The conditional tests if the next characters match "\d{2}-", or two digits followed by a dash. If so, it matches "\d{2}-\d{7}" (a US Employer Identification Number). If the next characters are not \d{2}-, it matches against \d{3}-\d{2}-\d{4} (a US Social Security Number)

\d+  ((?'feet'ft)|m) \s* \d+ (?(feet)in|cm)

This matches some digits followed by either "ft" or "m", then any spaces and more digits. Finally, if it matched "ft" into capture group "feet", it matches "in", otherwise it matches "cm"

When matched against...

5ft 6in Matches. The "ft" is captured into the group "feet". After the space and digits, the subsequent conditional matches "in"
5m 9cm Matches. Since the group "feet" didn't capture anything, the conditional matches "cm"
5ft 8cm Does not match. The "ft" causes the conditional to expect inches, not centimeters
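
Both conditional examples can be approximated in Python, with two caveats that are assumptions of this sketch rather than features of the tool. Python's re supports (?(name)expyes|expno) for a named group but cannot test an arbitrary expression, so the first example is rewritten with a positive and a negative lookahead selecting the branch. The re.VERBOSE flag plays the role of ignoring the readability whitespace.

```python
import re

# Group conditional: "in" is required after "ft", "cm" after "m".
height = re.compile(
    r"\d+ ((?P<feet>ft)|m) \s* \d+ (?(feet)in|cm)",
    re.VERBOSE,  # ignore the whitespace inserted for readability
)
heights = {s: bool(height.fullmatch(s)) for s in ["5ft 6in", "5m 9cm", "5ft 8cm"]}

# Expression conditional, emulated: the lookaheads pick exactly one branch,
# so a string starting with two digits and a dash never falls through to
# the second form.
tax_id = re.compile(r"(?=\d{2}-)\d{2}-\d{7}|(?!\d{2}-)\d{3}-\d{2}-\d{4}")
ids = {s: bool(tax_id.fullmatch(s)) for s in ["12-3456789", "123-45-6789", "12-345-6789"]}

print(heights)
print(ids)
```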



Lookaround assertions


"Zero-width atomic lookaround assertions"
These match if a given expression is directly ahead or behind the match point (or, in the case of negative assertions, if the expression is not ahead or behind), but they do not move the match point.

(?=exp) Positive lookahead
(?!exp) Negative lookahead
(?<=exp) Positive lookbehind
(?<!exp) Negative lookbehind

These are commonly used in replacements, to match the text to replace only when it precedes or follows some other expression.

Assertion examples:

\d+(?=\)) Matches a group of digits followed by a closing parenthesis, but the parenthesis is not part of the match
(?<=\()\d+ Matches a group of digits preceded by an opening parenthesis, but the parenthesis is not part of the match
(?<=\()\d+(?=\)) Combined, these match the numbers inside a pair of parentheses
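
The three assertion examples, plus a replacement, sketched in Python (an assumption; the lookaround syntax is the same there). The sample text is made up so that each pattern gives a different result.

```python
import re

text = "(12 apples) and 34) and (56)"

# Digits followed by ")": the ")" itself is not part of the match.
before_close = re.findall(r"\d+(?=\))", text)

# Digits preceded by "(": the "(" itself is not part of the match.
after_open = re.findall(r"(?<=\()\d+", text)

# Digits with "(" directly before and ")" directly after.
inside = re.findall(r"(?<=\()\d+(?=\))", text)

# Replacement touches only the digits; the parentheses are left alone.
masked = re.sub(r"(?<=\()\d+(?=\))", "#", text)

print(before_close, after_open, inside)
print(masked)
```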


Balancing

(?<end-start>exp) Matches exp, deletes the most-recent value for the capture group "start", and assigns the interval between where start matched and where exp matched to named capture group "end"
(?'end-start'exp) Alternate syntax for balancing groups

Balancing groups are too complex to explain in command-line help text, and are not usually useful without code behind them, but a general example of the syntax (in this case, nesting on '<' and '>') is:
^[^<>]*(((?'Open'<)[^<>]*)+((?'Close-Open'>)[^<>]*)+)*(?(Open)(?!))$

If you use balancing groups, you probably want to use the /Format parameter with the %gval operator.
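
To show what the example pattern checks, here is a sketch in Python. Python's re module has no balancing groups, so this is a different technique, a plain depth counter, and not the regex itself. The function name is_balanced is made up. The counter plays the role of the "Open" capture stack: each '<' pushes, each '>' pops, and the final (?(Open)(?!)) test corresponds to requiring that nothing is left open.

```python
def is_balanced(text: str) -> bool:
    """Return True when every '<' in text has a matching later '>'."""
    depth = 0
    for ch in text:
        if ch == "<":
            depth += 1          # like pushing a capture onto "Open"
        elif ch == ">":
            depth -= 1          # like 'Close-Open' popping one "Open"
            if depth < 0:
                return False    # a '>' with nothing open cannot match
    return depth == 0           # anything still open fails the match


print(is_balanced("<a<b>c>"))
print(is_balanced("<a<b>"))
```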


Reference information

Unicode categories for \p{} and \P{}

L, Lu, Ll, Lt, Lm, Lo Letter: Any, Upper, Lower, Titlecase, Modifier, Other
M, Mn, Mc, Me Mark: Any, Nonspacing, Spacing combining, Enclosing
N, Nd, Nl, No Number: Any, Decimal digit, Letter, Other
P, Pc, Pd, Ps, Pe, Pi, Pf, Po Punctuation: Any, Connector, Dash, Open, Close, Initial quote, Final quote, Other
S, Sm, Sc, Sk, So Symbol: Any, Math, Currency, Modifier, Other
Z, Zs, Zl, Zp Separator: Any, Space, Line, Paragraph
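
To see which category a given character falls in, the standard unicodedata module in Python reports the same two-letter codes. This is a sketch under assumptions: Python's re module does not itself support \p{}, the helper name in_category is made up, and it handles only categories, not Unicode blocks.

```python
import unicodedata

samples = ["A", "a", "5", "_", "-", "(", ")", "+", "$", " "]

# Two-letter general category for each sample character.
categories = {ch: unicodedata.category(ch) for ch in samples}

def in_category(ch, name):
    # Rough stand-in for \p{name}: a one-letter name such as "L" covers
    # the whole family (Lu, Ll, ...), a two-letter name must match exactly.
    return unicodedata.category(ch).startswith(name)

print(categories)
print(in_category("A", "L"), in_category("A", "Ll"))
```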

Last edited Sep 19, 2012 at 9:11 PM by SethMorris, version 3
