Perl Regular Expressions

Regular expressions - mainly PCRE (Perl Compatible Regular Expressions)

2010-04-12: Moved 'Regular Expressions' (regex) to its own page.

I have listed only some things here, as this is not meant to be very comprehensive. There are LOTS of online references - this is just a simple 'cheat sheet'.

Meta-characters - These characters have special meanings

char(s)   description                                                        example     finds
^         start of target; in multi-line mode, also after each new-line      ^abc        abc, abcdefg, abc123, but not babc
$         end of string; in multi-line mode, also before each new-line       abc$        abc, endabc, 123abc, but not abcd
.         any character except new-line, unless single-line (dot-all) mode   a.c         abc, aac, acc, etc
|         alternatives                                                       bill|ted    bill or ted
{...}     explicit quantifier (count) notation                               ab{2}c      abbc
[...]     explicit class, set of characters to match                         a[bB]c      abc and aBc
(...)     logical grouping of part of an expression                          (abc){2}    abcabc
*         0 or more of previous expression                                   ab*c        ac, abc, abbc, abbbc
+         1 or more of previous expression                                   ab+c        abc, abbc, abbbc
?         0 or 1 of previous; after another quantifier, minimal matching     ab?c        ac, abc
\         preceding one of the above makes it literal                        a\*b        a*b

Thus 'ordinary characters' are anything other than ^ $ . | { } [ ] ( ) * + ? \
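
The examples in the meta-character table can be checked mechanically. What follows is only a sketch, written with Python's built-in `re` module on the assumption that its syntax for these constructs matches Perl's; the `EXAMPLES` list and the `found` helper are names made up for this illustration.

```python
import re

# (pattern, strings the pattern should find, strings it should not)
EXAMPLES = [
    (r"^abc",     ["abc", "abcdefg", "abc123"], ["babc"]),
    (r"abc$",     ["abc", "endabc", "123abc"],  ["abcd"]),
    (r"a.c",      ["abc", "aac", "acc"],        ["ac", "a\nc"]),
    (r"bill|ted", ["bill", "ted"],              ["bob"]),
    (r"ab{2}c",   ["abbc"],                     ["abc", "abbbc"]),
    (r"a[bB]c",   ["abc", "aBc"],               ["adc"]),
    (r"(abc){2}", ["abcabc"],                   ["abc"]),
    (r"ab*c",     ["ac", "abc", "abbbc"],       ["adc"]),
    (r"ab+c",     ["abc", "abbc"],              ["ac"]),
    (r"ab?c",     ["ac", "abc"],                ["abbc"]),
    (r"a\*b",     ["a*b"],                      ["ab", "aab"]),
]

def found(pattern, text, flags=0):
    """Return True if pattern is found anywhere in text."""
    return re.search(pattern, text, flags) is not None

for pattern, hits, misses in EXAMPLES:
    assert all(found(pattern, s) for s in hits), pattern
    assert not any(found(pattern, s) for s in misses), pattern
    print(f"{pattern:10} ok")

# '?' after another quantifier asks for the shortest match instead of the longest.
greedy = re.search(r"a.*c", "abcabc").group()
minimal = re.search(r"a.*?c", "abcabc").group()
print(greedy, minimal)

# Mode flags: multi-line changes ^ and $, single-line (dot-all) changes '.'.
line_starts = re.findall(r"^abc", "abc\nabc", re.MULTILINE)
dot_all_hit = found(r"a.c", "a\nc", re.DOTALL)
```

In Perl itself the two modes are the /m and /s modifiers; `re.MULTILINE` and `re.DOTALL` are the Python spellings.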

The backslash '\' not only converts the above meta-characters to their literal meaning; followed by one of the characters below, this 'escape' character also forms a sequence with a special meaning, such as a character class or an anchor ...

Some Character Classes

char   description
\w     alphanumeric, including _
\s     white space
\S     non-white space
\d     numeric (digit)
\A     beginning of the string
\Z     end of string
\b     word boundaries
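
A short sketch of the shorthand classes and anchors, again using Python's `re` module as a stand-in for Perl. The sample strings are invented for the illustration. One difference worth knowing: Python's \Z matches only at the very end of the string, whereas Perl's \Z also matches just before a final new-line.

```python
import re

text = "Order 42: 3 red_apples, 10 pears"

words = re.findall(r"\w+", text)    # runs of letters, digits and _
digits = re.findall(r"\d+", text)   # runs of digits
spaces = re.findall(r"\s", text)    # individual white-space characters
chunks = re.findall(r"\S+", text)   # runs of non-white-space

print(words)
print(digits)
print(len(spaces), chunks)

# \b only matches where a word character meets a non-word character,
# so 'red' inside 'redder', 'scared' or 'red_apples' does not count.
whole_red = re.findall(r"\bred\b", "red redder scared red_apples")

# \A is always the start of the string; ^ moves with multi-line mode.
two_lines = "one\ntwo"
caret_hits = re.findall(r"^\w+", two_lines, re.MULTILINE)
anchor_hits = re.findall(r"\A\w+", two_lines, re.MULTILINE)
print(whole_red, caret_hits, anchor_hits)
```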



\n, \r, \f, \t etc, have their usual meaning, namely LF (0x0a), CR (0x0d), FF (0x0c), and TAB (0x09). Others, depending on the implementation, are \a (bell/alarm 0x07), \b (backspace 0x08, but only inside a character class; elsewhere it is a word boundary), \v (vertical tab 0x0b), \e (escape 0x1b), \040 (ASCII character as OCTAL), \x20 (ASCII character using hexadecimal notation - 2 digits), \cC (control-C), and \u0020 (Unicode character using hexadecimal notation) ... but as stated, this varies with implementation.
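
As a rough check of the escapes listed above, here is a sketch using Python's `re` module. It covers only the escapes that module accepts; \e and \cC are left out because Python's `re` does not support them, which is a small reminder that the set really does vary with implementation. The `CODES` table is a name made up for this example.

```python
import re

# Escape sequences and the character codes they stand for.
CODES = {
    r"\n": 0x0A,  # LF
    r"\r": 0x0D,  # CR
    r"\f": 0x0C,  # FF
    r"\t": 0x09,  # TAB
    r"\a": 0x07,  # bell/alarm
    r"\v": 0x0B,  # vertical tab
}

for escape, code in CODES.items():
    # Used as a pattern, each escape should match exactly that one character.
    assert re.fullmatch(escape, chr(code)), escape

# Three ways of writing a space (0x20): hex, octal and Unicode notation.
space_forms = [r"\x20", r"\040", r"\u0020"]
all_match_space = all(re.fullmatch(p, " ") for p in space_forms)

# \b is a backspace only inside a character class ...
backspace_in_class = re.fullmatch(r"[\b]", "\x08") is not None
# ... elsewhere it is a word boundary.
boundary_hit = re.search(r"\bcat\b", "a cat sat") is not None
boundary_miss = re.search(r"\bcat\b", "concatenate") is None

print(all_match_space, backspace_in_class, boundary_hit, boundary_miss)
```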

more on character classes

class          description
[abc]          matches any single character included in the specified set of characters
[^abc]         matches any single character not in the specified set of characters
[a-z]          use of a hyphen (-) allows specification of a contiguous character range
\p{name}       matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z, IsGreek, IsBoxDrawing.
\P{name}       matches text not included in groups and block ranges specified in {name}
[a-zA-Z_0-9]   is equivalent to \w shown above
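
The set, negated set and range forms can be tried out as below. This is a sketch in Python's `re` module with made-up sample strings. The \p{name} and \P{name} forms are not shown because Python's built-in `re` does not support them. The \w comparison is limited to ASCII characters, since with Unicode text \w usually matches more than the explicit class.

```python
import re

# A set matches any one of the listed characters.
vowels = re.findall(r"[aeiou]", "regex")

# A leading ^ negates the set.
non_vowels = re.findall(r"[^aeiou]", "regex")

# Hyphens give contiguous ranges; several ranges can share one class.
hex_numbers = re.findall(r"0x[0-9a-fA-F]+", "0x1F and 0xbeef, not 0xZZ")

print(vowels, non_vowels, hex_numbers)

# For the 128 ASCII characters, \w and [a-zA-Z_0-9] agree on every one.
ascii_chars = [chr(i) for i in range(128)]
same_as_w = all(
    bool(re.fullmatch(r"\w", c)) == bool(re.fullmatch(r"[a-zA-Z_0-9]", c))
    for c in ascii_chars
)
print(same_as_w)
```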

Other References

Other Regular Expression (regex) references found - some sites found using 'regular expression' search in Yahoo!

Some examples of regex found -

from :
email address (match case-insensitively, e.g. with /i, as the classes list only A-Z): \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
and : 
or : ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
or : ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)$
and the massive:  RFC 2822 : (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
practical implementation of RFC 2822 : [a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
allow any two-letter country code top level domain : [a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b
and : 
match the opening and closing pair of any HTML tag : <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
IP Address: \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
To reject out-of-range values such as 999.999.999.999 : \b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
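
A few of the patterns listed above, tried out in a sketch with Python's `re` module. The patterns are copied from the list; the constant names and the sample strings are invented for the illustration, and `re.IGNORECASE` stands in for the case-insensitive matching these A-Z patterns assume.

```python
import re

# Anchored e-mail pattern from the list above.
EMAIL = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)

# Opening and closing pair of an HTML tag; \1 refers back to the tag name.
TAG = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", re.IGNORECASE)

# Loose IP pattern: any four groups of 1 to 3 digits.
LOOSE_IP = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

# Strict IP pattern: each octet limited to 0-255, built from one piece.
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
STRICT_IP = re.compile(r"\b" + r"\.".join([OCTET] * 4) + r"\b")

for address in ("user@example.com", "User.Name+tag@Example.ORG",
                "user@example", "user@example.museum"):
    print(address, bool(EMAIL.match(address)))

tag = TAG.search('<b class="x">bold</b> and <i>it</i>')
print(tag.group(1), tag.group(2))

for ip in ("192.168.0.1", "255.255.255.255", "999.999.999.999"):
    print(ip, bool(LOOSE_IP.search(ip)), bool(STRICT_IP.search(ip)))
```

Note that the {2,4} limit in the simple e-mail pattern turns away longer top-level domains such as museum, which is presumably why the longer variants list them by name.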


from :
Y2K Date: (4 digit year) : ($line1 =~ m/[0-9]{2}[\/-][0-9]{2}[\/-][0-9]{4}/)
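
One detail of a date pattern like this is easy to miss: inside a character class, | is an ordinary character and not an alternation, so a separator class written as [\/|-] accepts a vertical bar as well as / and -. A small sketch in Python's `re` module shows the difference; the two pattern names and the sample strings are made up for the illustration.

```python
import re

# Separator class that includes '|' as a literal character.
WITH_BAR = re.compile(r"[0-9]{2}[\/|-][0-9]{2}[\/|-][0-9]{4}")

# Separator class limited to '/' and '-'.
SLASH_OR_DASH = re.compile(r"[0-9]{2}[/-][0-9]{2}[/-][0-9]{4}")

for sample in ("12/04/2010", "12-04-2010", "12|04|2010", "12/04/10"):
    print(sample,
          bool(WITH_BAR.search(sample)),
          bool(SLASH_OR_DASH.search(sample)))
```

Neither pattern checks that the day and month are in range; both only check the shape of the date.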

from :
Uniform Resource Identifier (URI) breakdown :
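my $uri = "";
# assumed pattern: the generic URI regex from RFC 3986, Appendix B (the pattern itself is missing here)
print "$1, $2, $3, $4, $5, $6, $7, $8, $9" if $uri =~
    m{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?};

$1 = http:
$2 = http (the scheme)
$3 = //
$4 = (the authority)
$5 = /pub/ietf/uri/ (the path)
$6 =
$7 = (the query)
$8 = #Related
$9 = Related (the fragment)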
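
The nine-group breakdown above looks like the generic URI pattern given in RFC 3986, Appendix B. Assuming that is the pattern meant, here is a sketch in Python's `re` module. The `split_uri` helper and the www.example.com URIs are hypothetical and only there for illustration.

```python
import re

# Generic URI pattern as given in RFC 3986, Appendix B (assumed to be the one meant).
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Split a URI into its five components; absent components come back as None."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

# Hypothetical URI, not the one from the original example.
parts = split_uri("http://www.example.com/pub/docs/?q=regex#Related")
for name, value in parts.items():
    print(f"{name:10} {value}")

# With no '?' in the URI, the query groups do not take part in the match.
no_query = split_uri("http://www.example.com/pub/docs/#Related")
print(no_query["query"])
```

Groups 1, 3, 6 and 8 are the same components with their delimiters attached (http:, //authority, ?query, #fragment), which is why the list above shows both forms.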


