Regular Expressions
Regular expressions are a way of specifying text search or match criteria that allow alternative strings and repeated sequences inside the pattern. They can be used
+ to determine whether a string follows specific rules,
+ to find a string within text that fits a specified pattern, or
+ to split text into lexical elements (tokens).
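These three uses can be sketched briefly. The sketch assumes Python's standard re module; the text does not name a particular program, so the patterns and sample strings are only illustrations.

```python
import re

# 1. Does a whole string follow a rule? Here the rule is "one or more digits".
follows_rule = re.fullmatch(r"[0-9]+", "2024") is not None

# 2. Find a string within text that fits a pattern: the first run of digits.
found = re.search(r"[0-9]+", "order 66 shipped")
first_number = found.group(0) if found else None

# 3. Split text into tokens: whole numbers, or single non-space symbols.
tokens = re.findall(r"[0-9]+|[^\s0-9]", "12 + 3*(45)")

print(follows_rule)   # True
print(first_number)   # 66
print(tokens)         # ['12', '+', '3', '*', '(', '45', ')']
```

The same pattern notation serves all three jobs; only the function that applies the pattern changes.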
First some definitions:
A (character) string is a finite sequence of characters; the number of characters in the string is called the length of the string.
A (formal) language is a set of strings in which all the characters in all the strings are members of a finite set (often called the alphabet of the language). Note that although the length of each string is finite and the number of distinct characters in all the strings in the language is finite, the language may contain an infinite number of strings.
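To make these definitions concrete, here is a small sketch in Python. The two-character alphabet and the language chosen (any number of a's followed by a single b) are hypothetical examples, not taken from the text. Every string is finite and the alphabet is finite, yet the generator never runs out of strings.

```python
from itertools import count, islice

ALPHABET = {"a", "b"}  # hypothetical two-character alphabet

def in_language(s):
    """True if s is zero or more 'a' characters followed by exactly one 'b'."""
    return set(s) <= ALPHABET and s.endswith("b") and s.count("b") == 1

def language():
    """Yield the strings of the language in order of length: b, ab, aab, ..."""
    for n in count(0):
        yield "a" * n + "b"

# Take only the first five strings; the generator itself is unbounded.
first_five = list(islice(language(), 5))
print(first_five)                    # ['b', 'ab', 'aab', 'aaab', 'aaaab']
print([len(s) for s in first_five])  # [1, 2, 3, 4, 5]
```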
Now it starts getting messy. A regular expression describes the strings in a language (though only some languages, the regular languages, can be described this way). A regular expression contains
+ characters from the alphabet of the language, and
+ syntactic elements, which can include
- repetition, often specified by an asterisk (*) or a plus sign (+),
- alternative (or choice), often specified by a vertical bar (|),
- grouping, surrounding groups by pairs of symbols like parentheses, ( and ), or square
brackets, [ and ], or by the same symbol before and after the group, like apostrophes (')
and quotation marks ("); any program that allows the full range of regular expressions must
provide for repetition, alternatives, and grouping,
- the null string, often represented in textbooks by an upper case Greek lambda (Λ, which looks
like a peaked roof or the front of a tent); a caret (^) will be used here,
- classes of characters, such as letter, digit, whitespace,
- other symbols provided for in the program that processes the regular expression.
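The elements listed above can be tried out in a short sketch, assuming Python's standard re module as the program that processes the expressions. The helper name matches is made up for this illustration. Note that Python's notation differs from the one used here in one respect: in Python a caret marks the start of the text, not the null string, so the null string is written below as an empty alternative.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the whole of text fits pattern."""
    return re.fullmatch(pattern, text) is not None

# Repetition: * allows zero or more copies, + requires at least one.
print(matches(r"ab*", "a"), matches(r"ab+", "a"))              # True False

# Alternative: either side of the vertical bar may match.
print(matches(r"cat|dog", "dog"))                              # True

# Grouping: parentheses make + apply to "ab" as a unit, not just to "b".
print(matches(r"(ab)+", "ababab"), matches(r"ab+", "ababab"))  # True False

# Null string: the empty alternative after | lets the group match nothing.
print(matches(r"a(b|)", "a"), matches(r"a(b|)", "ab"))         # True True

# Character classes: a letter, then a digit, then whitespace.
print(matches(r"[A-Za-z]\d\s", "x7 "))                         # True
```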