POSIX character classes

Navigation: TextEd > Regular expressions >

Perl supports the POSIX notation for character classes. This uses names enclosed by [: and :] within the enclosing square brackets. PCRE also supports this notation. For example,

[01[:alpha:]%]

matches "0", "1", any alphabetic character, or "%". The supported class names are:

alnum letters and digits

alpha letters

ascii character codes 0 - 127

blank space or tab only

cntrl control characters

digit decimal digits (same as \d)

graph printing characters, excluding space

lower lower case letters

print printing characters, including space

punct printing characters, excluding letters and digits and space

space white space (the same as \s from PCRE 8.34)

upper upper case letters

word "word" characters (same as \w)

xdigit hexadecimal digits

The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and space (32). If locale-specific matching is taking place, the list of space characters may be different; there may be fewer or more of them. "Space" used to be different to \s, which did not include VT, for Perl compatibility. However, Perl changed at release 5.18, and PCRE followed at release 8.34. "Space" and \s now match the same set of characters.

The name "word" is a Perl extension, and "blank" is a GNU extension from Perl 5.8. Another Perl extension is negation, which is indicated by a ^ character after the colon. For example,

[12[:^digit:]]

matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not supported, and an error is given if they are encountered.

By default, characters with values greater than 128 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:

[:alnum:] becomes \p{Xan}

[:alpha:] becomes \p{L}

[:blank:] becomes \h

[:digit:] becomes \p{Nd}

[:lower:] becomes \p{Ll}

[:space:] becomes \p{Xps}

[:upper:] becomes \p{Lu}

[:word:] becomes \p{Xwd}

Negated versions, such as [:^alpha:] use \P instead of \p. Three other POSIX classes are handled specially in UCP mode:

[:graph:] This matches characters that have glyphs that mark the page when printed. In Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf properties, except for:

U+061C Arabic Letter Mark

U+180E Mongolian Vowel Separator

U+2066 - U+2069 Various "isolate"s

[:print:] This matches the same characters as [:graph:] plus space characters that are not controls, that is, characters with the Zs property.

[:punct:] This matches all characters that have the Unicode P (punctuation) property, plus those characters whose code points are less than 128 that have the S (Symbol) property.

The other POSIX classes are unchanged, and match only characters with code points less than 128.

Philip Hazel

University Computing Service

Cambridge CB2 3QH, England.

Last updated: 12 November 2013