Perl Regular Expressions


regular expressions - mainly pcre (Perl Compatible Regular Expressions)

2010-04-12: Moved 'Regular Expressions' (regex) to its own page.

I have listed only some things here, as this is not meant to be very comprehensive. There are LOTS of online references - the simple 'cheat sheet' at http://regexlib.com/ is one example.

Meta-characters - These characters have special meanings

char(s)  description                                                     example   finds
^        start of target; in multi-line mode (/m), after each new-line   ^abc      abc, abcdefg, abc123, but not babc
$        end of string; in multi-line mode (/m), before each new-line    abc$      abc, endabc, 123abc, but not abcd
.        any character except new-line, unless in single-line mode (/s)  a.c       abc, aac, acc, etc
|        alternatives                                                    bill|ted  bill or ted
{...}    explicit quantifier (count) notation                            ab{2}c    abbc
[...]    explicit class, set of characters to match                      a[bB]c    abc and aBc
(...)    logical grouping of part of an expression                       (abc){2}  abcabc
*        0 or more of previous expression                                ab*c      ac, abc, abbc, abbbc
+        1 or more of previous expression                                ab+c      abc, abbc, abbbc
?        0 or 1 of previous; also minimal (non-greedy) matching          ab?c      ac, abc
\        preceding one of the above makes it literal                     a\*b      a*b

Thus 'ordinary characters' are anything other than ^ $ . | { } [ ] ( ) * + ? \
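To make the table concrete, here is a minimal Perl sketch that runs some of the example patterns against sample strings. The helper name count_failures and the test strings are my own choices for illustration. I added ^ and $ anchors to several patterns so that the whole string has to match. Note that in Perl it is the /s modifier that lets . match a new-line, while /m changes where ^ and $ can match.

```perl
use strict;
use warnings;

# Each case is [ pattern, input, should it match? ].
my @cases = (
    [ qr/^abc/,         'abc123',  1 ],
    [ qr/^abc/,         'babc',    0 ],
    [ qr/^abc/,         "x\nabc",  0 ],   # without /m, ^ is only the very start
    [ qr/^abc/m,        "x\nabc",  1 ],   # with /m, ^ also matches after a new-line
    [ qr/abc$/,         'endabc',  1 ],
    [ qr/abc$/,         'abcd',    0 ],
    [ qr/a.c/,          "a\nc",    0 ],   # . does not match a new-line ...
    [ qr/a.c/s,         "a\nc",    1 ],   # ... unless /s is given
    [ qr/^(bill|ted)$/, 'ted',     1 ],
    [ qr/^ab{2}c$/,     'abbc',    1 ],
    [ qr/^a[bB]c$/,     'aBc',     1 ],
    [ qr/^(abc){2}$/,   'abcabc',  1 ],
    [ qr/^ab*c$/,       'ac',      1 ],
    [ qr/^ab+c$/,       'ac',      0 ],
    [ qr/^ab?c$/,       'abbc',    0 ],
    [ qr/^a\*b$/,       'a*b',     1 ],
);

# Hypothetical helper: counts the cases whose result differs from the expectation.
sub count_failures {
    my $failures = 0;
    for my $case (@_) {
        my ($re, $text, $want) = @$case;
        my $got = $text =~ $re ? 1 : 0;
        $failures++ if $got != $want;
    }
    return $failures;
}

print "failures: ", count_failures(@cases), "\n";

# '?' after a quantifier switches it from greedy to minimal matching.
my ($greedy) = 'a1b2b' =~ /(a.*b)/;
my ($lazy)   = 'a1b2b' =~ /(a.*?b)/;
print "greedy: $greedy, lazy: $lazy\n";
```

If every row behaves as the table says, the failure count is zero.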

The backslash, '\', not only converts the above meta-characters to their literal meaning. When this 'escape' character is followed by one of the letters below, the pair takes on a special meaning of its own, such as a character class ...

Some Character Classes

char  description                  char  description
\w    alphanumeric, including _    \W    non-alphanumeric (and not _)
\s    white space                  \S    non-white space
\d    numeric (digit)              \D    non-numeric
\A    beginning of the string      \Z    end of string (or before a final new-line)
\b    word boundaries              \B    non-boundaries
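A short Perl sketch of these escapes in use. The sample strings and variable names are mine, chosen only for illustration.

```perl
use strict;
use warnings;

my $text = "order_7 shipped 2010-04-12\n";

my @words  = $text =~ /\w+/g;        # runs of word characters (letters, digits, _)
my @digits = $text =~ /\d+/g;        # runs of digits
my @fields = split /\s+/, $text;     # split on white space

print "words:  @words\n";
print "digits: @digits\n";
print "fields: @fields\n";

# \b only matches at the edge of a word, \B only away from an edge.
my $cat_count  = () = 'cat concat cat.' =~ /\bcat\b/g;
my $inner_cat  = 'concat' =~ /\Bcat/ ? 1 : 0;
my $inner_cat2 = 'cat'    =~ /\Bcat/ ? 1 : 0;

# \A is always the very start of the string, even when /m lets ^ match
# after a new-line. \Z matches at the end, or just before a final new-line.
my $caret_m = "one\ntwo" =~ /^two/m  ? 1 : 0;
my $a_m     = "one\ntwo" =~ /\Atwo/m ? 1 : 0;
my $z_end   = "two\n"    =~ /two\Z/  ? 1 : 0;

print "cat as a whole word: $cat_count\n";
```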

\n, \r, \f, \t etc, have their usual meaning, namely LF (0x0a), CR (0x0d), FF (0x0c), and TAB (0x09). Others, depending on the implementation, are \a (bell/alarm 0x07), \b (backspace 0x08 - in Perl only inside a character class, since elsewhere \b is a word boundary), \v (vertical tab 0x0b), \e (escape 0x1b), \040 (ASCII character as OCTAL), \x20 (ASCII character using hexadecimal notation - 2 digits), \cC (control-C), and \u0020 (Unicode character using hexadecimal notation) ... but as stated, this varies with implementation ... see - http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/cpconcharacterescapes.asp - for more ...
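The character codes can be checked from Perl itself with ord. This is only a sketch, assuming an ASCII-based platform, and the helper name wrong_codes is hypothetical. Perl does not use the \u0020 form for a code point (in Perl \u upper-cases the next character). The Perl spelling is \x{0020}, which is what the last check uses.

```perl
use strict;
use warnings;

# [ label, the character itself, expected code ]
my @escapes = (
    [ '\n',   "\n",   0x0a ],
    [ '\r',   "\r",   0x0d ],
    [ '\f',   "\f",   0x0c ],
    [ '\t',   "\t",   0x09 ],
    [ '\a',   "\a",   0x07 ],
    [ '\e',   "\e",   0x1b ],
    [ '\040', "\040", 0x20 ],
    [ '\x20', "\x20", 0x20 ],
    [ '\cC',  "\cC",  0x03 ],
);

# Hypothetical helper: returns the labels whose code is not the expected one.
sub wrong_codes {
    return map { $_->[0] } grep { ord($_->[1]) != $_->[2] } @_;
}

printf "%-5s 0x%02x\n", $_->[0], ord($_->[1]) for @escapes;

# In a pattern, \b is a backspace only inside [...]; elsewhere it is a word boundary.
my $bs_in_class = "a\x08b" =~ /a[\b]b/ ? 1 : 0;
my $bs_outside  = "a\x08b" =~ /a\bb/   ? 1 : 0;

# Three spellings of a space inside a pattern.
my $space_forms = (' ' =~ /\A\040\Z/ && ' ' =~ /\A\x20\Z/ && ' ' =~ /\A\x{0020}\Z/) ? 1 : 0;
```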

more on character classes

class         description
[aeiou]       matches any single character included in the specified set of characters
[^aeiou]      matches any single character not in the specified set of characters
[0-9a-fA-F]   use of a hyphen (-) allows specification of a contiguous character range
\p{name}      matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z, IsGreek, IsBoxDrawing.
\P{name}      matches any character not included in the groups and block ranges specified in {name}
[a-zA-Z_0-9]  is equivalent to \w shown above (for ASCII text)

See - http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/cpconcharacterclasses.asp for more.
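A small Perl sketch of bracket classes and Unicode properties. The sample strings are my own. The names in the table come from the MSDN (.NET) page, so for the Greek check I assume the Perl script name \p{Greek} rather than IsGreek.

```perl
use strict;
use warnings;

my $phrase = 'regular expression';

# Count characters in the set, and characters not in it (the space counts too).
my $vowels     = () = $phrase =~ /[aeiou]/g;
my $non_vowels = () = $phrase =~ /[^aeiou]/g;
print "vowels: $vowels, others: $non_vowels\n";

# A hyphen gives a range, so this class is "one hexadecimal digit".
my $is_hex  = 'DEADbeef' =~ /\A[0-9a-fA-F]+\Z/ ? 1 : 0;
my $not_hex = 'xyz'      =~ /\A[0-9a-fA-F]+\Z/ ? 1 : 0;

# \p{Ll} is a lower-case letter, \P{Nd} is anything that is not a decimal digit.
my @lower      = 'aBc'   =~ /\p{Ll}/g;
my @non_digits = 'a1b22' =~ /\P{Nd}/g;
print "lower: @lower, non-digits: @non_digits\n";

# U+03BB is the Greek small letter lambda.
my $greek = "\x{3bb}" =~ /\p{Greek}/ ? 1 : 0;
```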



Other References

Other Regular Expression (regex) references found - some sites found using 'regular expression' search in Yahoo!



Examples

Some examples of regex found -

from : http://www.regular-expressions.info/
email address: \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
and : http://www.regular-expressions.info/email.html 
or : ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
or : ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)$
and the massive:  RFC 2822 : (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
practical implementation of RFC 2822 : [a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
allow any two-letter country code top level domain : [a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b
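The shorter email patterns above list only upper-case A-Z, so they need case-insensitive matching (/i in Perl) to be useful. Here is a sketch of how the first two might be used. The variable names and sample addresses are mine, and the @ is written as \@ so that Perl cannot mistake it for the start of an array name.

```perl
use strict;
use warnings;

# Anchored form: the whole string must be one address.
my $email_re = qr/^[A-Z0-9._%+-]+\@[A-Z0-9.-]+\.[A-Z]{2,4}$/i;

# \b form: finds addresses inside longer text.
my $find_re  = qr/\b[A-Z0-9._%+-]+\@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i;

my $ok_plain  = 'bill@example.com'    =~ $email_re ? 1 : 0;
my $ok_upper  = 'TED@EXAMPLE.ORG'     =~ $email_re ? 1 : 0;
my $no_at     = 'bill.example.com'    =~ $email_re ? 1 : 0;
# {2,4} is too short for a six-letter top level domain such as museum,
# which is why the longer variant lists the generic domains by name.
my $ok_museum = 'user@example.museum' =~ $email_re ? 1 : 0;

my @found = 'mail bill@example.com or ted@example.org today' =~ /$find_re/g;
print "found: @found\n";
```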
and : http://www.regular-expressions.info/examples.html 
match the opening and closing pair of a specific HTML tag : <TAG\b[^>]*>(.*?)</TAG>
match the opening and closing pair of any HTML tag : <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
IP Address: \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
But to reject, say, 999.999.999.999 : \b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
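A sketch of the HTML tag and IP address patterns in Perl. The tag pattern needs /i because it only lists A-Z. For the IP address I build the long pattern from one $octet piece, which is my own rearrangement - it uses a non-capturing group (?:...) where the original captures each octet. The sample strings are made up.

```perl
use strict;
use warnings;

# \1 refers back to whatever the first group matched, so the closing tag
# has to have the same name as the opening tag.
my $tag_re = qr{<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>}i;

my ($tag, $body) = '<b class="x">bold</b> and <i>italic</i>' =~ $tag_re;
my $mismatched   = '<b>bold</i>' =~ $tag_re ? 1 : 0;
print "tag: $tag, body: $body\n";

# Simple form: any 1 to 3 digits per part, so 999.999.999.999 gets through.
my $loose_ip = qr/\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/;

# Stricter form: each part must be in the range 0 to 255.
my $octet     = qr/(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/;
my $strict_ip = qr/\b$octet\.$octet\.$octet\.$octet\b/;

for my $ip ('192.168.1.1', '999.999.999.999') {
    printf "%-16s loose: %s  strict: %s\n", $ip,
        ($ip =~ $loose_ip  ? 'yes' : 'no'),
        ($ip =~ $strict_ip ? 'yes' : 'no');
}
```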

 

from : http://www.blazonry.com/perl/regexp_exs.php
Y2K Date: (4 digit year) : ($line1 =~ m/[0-9]{2}[\/-][0-9]{2}[\/-][0-9]{4}/)
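One thing to watch in a date pattern like this: inside [...] the | is an ordinary character, not 'or'. So a class written as [\/|-] also accepts a literal | as the separator, while [\/-] accepts only / and -. A minimal sketch, with variable names of my own. The last pattern is an optional extra that uses a back-reference so both separators must be the same.

```perl
use strict;
use warnings;

my $with_bar = qr/[0-9]{2}[\/|-][0-9]{2}[\/|-][0-9]{4}/;    # | is literal here
my $date_re  = qr/[0-9]{2}[\/-][0-9]{2}[\/-][0-9]{4}/;      # only / or -
my $same_sep = qr/[0-9]{2}([\/-])[0-9]{2}\1[0-9]{4}/;       # same separator twice

for my $line1 ('12/04/2010', '12-04-2010', '12|04|2010', '12/04-2010', '12/04/10') {
    printf "%-12s with_bar: %-3s date_re: %-3s same_sep: %s\n", $line1,
        ($line1 =~ $with_bar ? 'yes' : 'no'),
        ($line1 =~ $date_re  ? 'yes' : 'no'),
        ($line1 =~ $same_sep ? 'yes' : 'no');
}
```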

from : http://www.wilsonmar.com/1regex.htm
Uniform Resource Identifier (URI) breakdown :
my $uri = "http://www.ics.uci.edu/pub/ietf/uri/#Related";
print "$1, $2, $3, $4, $5, $6, $7, $8, $9" if $uri =~
m{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?};

$1 = http:
$2 = http (the scheme)
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu (the authority)
$5 = /pub/ietf/uri/ (the path)
$6 =
$7 = (the query)
$8 = #Related
$9 = Related (the fragment)

 


