Thursday, February 16, 2012

Regular Expressions

Regular expressions are a way of specifying text search or match criteria that allow alternate strings and repeated sequences inside the template.  This can be used
  +  to determine whether a string follows specific rules,
  +  to find a string within text that fits a specified pattern, or
  +  to split text into lexical elements (tokens).

First some definitions:

A (character) string is a finite sequence of characters, the number of characters in the string is called the length of the string.

A (formal) language is a set of strings in which all the characters in all the strings are members of a finite set (often called the alphabet of the language).  Note that although the length each string is finite and the number of distinct characters in all the strings in the language is finite, the language may contain an infinite number of strings.

Now it starts getting messy.  A regular expression describes the strings in a language (for some languages).  A regular expression contains
  +  characters from the alphabet of the language, and
  +  Syntactic elements, which can include
       -  repetition, often specified by an asterisk (*) or a plus sign (+),
       -  alternative (or choice), often specified by a vertical bar (|),
       -  grouping, surrounding groups by pairs of symbols like parentheses,  ( and ) or square
          brackets ( [ and ] ), or by the same symbol before and after the group, like apostrophes ('),
          and quotation marks ("); any program that allows the full range of regular expressions must
          provide for repetition, alternatives, and grouping,
       -  the null string, often represented in textbooks by an upper case Greek lambda (which looks
          like a peaked roof or the front of a tent), a caret (^) will be used here,
       -  classes of characters, such as letter, digit, whitespace,
       -  other symbols provided for in the program that processes the regular expression.
Note that the syntax for regular expressions presented here is not the syntax used by Unix (or Linux) commands. It is a variant of that used in some textbooks on formal languages.

A language that can be described by a regular expression is a regular (or type 3) language.

Whew!  Just one more thing (as TV and movie detectives are fond of saying): a regular expression that is used to specify part of a text string (a substring) has to be able to represent the beginning and end of the text string.  Here a dollar sign ($) will represent the beginning of the test string and a number sign (#) will represent the end of the text string.

Now for specific situations (in these examples lower case letters are characters of the language and upper case letters are classes of characters):

The regular expression for a language that contains a single string is that string, that is, if the language is {abcdefg}, then the regular expression is abcdefg.

The regular expression for a language that contains a finite number of strings with no characters in common is the strings in the language separated by vertical bars, that is, if the language is {abc, def, ghi}, then the regular expression is abc|def|ghi.

The regular expression for a language that contains strings with a single character repeated zero or more times (that is the null string and strings containing only the character) is the character followed by an asterisk, that is if the language is {^, a, aa, aaa, aaaa, ... }, then the regular expression is a*.

A language that contains a sequence of characters repeated one or more times can be described by the regular expression consisting of a group with the sequence of characters, and a plus sign following the group, that is, if the language is {ab, abab, ababab, abababab, ... }, then the regular expression can be (ab)+.  The regular expression ab(ab)* describes the same language.

Some programs allow the character strings in regular expressions to be enclosed in apostrophes or quotation marks; this is useful if spaces or syntactic characters are part of the string.  The language {abc, d f, ghi} (with a space in the second string) can be described by abc|"d f"|ghi.

Some programs allow classes of characters to be used when a choice among these characters is wanted.  If the class of digits is represented by D, the language containing strings of one or more digits can be described by D+.

Some programs use square brackets to indicate optional parts of a string; for example, if the language is {a, ab, aa, aab, aaa, aab, aaaa, aaaab, ... }, that is, one or more a's possibly followed by one b, it can be described by a*[b].

These elements can be combined to form complex regular expressions; for example, ["+"|"-"]D+ describes a language consisting of one or more digits optionally preceded by a plus or minus sign.

A common use of regular expressions is searching for strings in text.  The regular expression $abc|lmnop|xyz# will find the string abc at the beginning of the text, lmnop anywhere in the text, or xyz at the end of the text.

Double whew!  This has barely scratched the surface of what can be done with regular expressions.  Some programs allow you to match any string other than the ones specified -- here tilde (~) will be used to indicate "match anything but this" -- so that ~("+"|"-"|D) will match anything other than a plus sign, a minus sign, or a digit.  Some programs allow you to specify lower and upper bounds on the number of repetitions, so that you can specify at least 5 but no more than 10 repetitions of an element.  Some programs have extended regular expressions (so they are no longer regular) so that you can search for a string that matches a string found earlier in the search (for example, a string, zero or more characters, and the same string).

0 Comments:

Post a Comment

Please enter your comment here. Comments wil be reviewed before being published.

Subscribe to Post Comments [Atom]

<< Home