Handout 14

Regular Expressions*


Regular expressions are used to match patterns of characters. They appear in several places in Unix, including utilities such as grep and awk.

The simplest Regular Expressions

If you want to match a pattern of characters, the simplest way to express it is to write the characters out literally. For example, if you want to match the characters main, then you can use the regular expression:

main

Blanks are treated just like any other character in a regular expression and regular expressions are case-sensitive (uppercase and lowercase letters are different).
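The two rules above can be checked with a small sketch. Python's re module is used here only as a convenient stand-in for a Unix utility; that choice and the sample strings are assumptions made for illustration.

```python
import re

# A literal pattern: every character simply matches itself.
pattern = re.compile("main")

found_lower = pattern.search("int main(void)") is not None   # lowercase text
found_upper = pattern.search("int MAIN(void)") is not None   # uppercase text

# A blank in the pattern must be matched by a blank in the text.
blank_pattern = re.compile("int main")
found_with_blank = blank_pattern.search("int main(void)") is not None
found_without_blank = blank_pattern.search("intmain(void)") is not None
```

Only the lowercase text and the text containing the blank are matched, since case and blanks are both significant.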

Metacharacters

Certain characters have special meanings when used in regular expressions. These are:

  \  ^  $  .  [  ]  *  +  ?  (  )  |

If you need to match a pattern that includes some of these characters, then you can force the character NOT to be treated as a metacharacter by placing the \ character in front of it.

For example, to match the characters $main, you must use the regular expression:

\$main
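A minimal sketch of the difference the backslash makes, again assuming Python's re module as a stand-in for a Unix tool (the sample text is made up):

```python
import re

text = "call $main now"

# Escaped: \$ stands for a literal dollar sign.
escaped = re.search(r"\$main", text)

# Unescaped: $ means "end of line", and nothing can follow the end
# of the line, so this pattern cannot match the text.
unescaped = re.search(r"$main", text)

matched_text = escaped.group(0) if escaped else None
```

Here the escaped pattern finds $main in the text, while the unescaped one finds nothing.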

Using the . , ^, $, and * metacharacters

In a regular expression, the . (period or dot) character matches any single character. If you want to match the character patterns:

message1
message2
message8
messageX

You can simply use the regular expression:

message.
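As a rough illustration, the sketch below tries the pattern against the four names above plus one extra name (message, with nothing after it) that was added here as an assumption. It uses Python's re module in place of a Unix utility.

```python
import re

names = ["message1", "message2", "message8", "messageX", "message"]

# fullmatch succeeds only if the whole string fits the pattern,
# so the dot has to be matched by exactly one character.
matches = [name for name in names if re.fullmatch(r"message.", name)]
```

The bare name message is not matched, because the dot needs one character to stand in for.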

If you want to match a character pattern but only if it occurs at the beginning of a line, then place the ^ character before the pattern. For example, to match the word apple but only if it appears at the beginning of the line, you would use the regular expression:

^apple

If you want to match a character pattern but only if it occurs at the end of a line, then place a $ character after the pattern. For example, to match the word event but only if it occurs at the end of a line, you would use the regular expression:

event$

Note: The regular expression ^$ will match any empty line, that is, a line with no characters at all (not even blanks).
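The anchors can be tried out on a short list of lines, one string per line, much as a line-oriented tool would see them. This is a sketch in Python's re module; the sample lines are assumptions.

```python
import re

lines = ["apple pie", "green apple", "main event", "", "   "]

# ^ ties the pattern to the start of the line, $ to the end.
starts_with_apple = [line for line in lines if re.search(r"^apple", line)]
ends_with_event = [line for line in lines if re.search(r"event$", line)]

# ^$ leaves no room for any character, so a line holding only
# spaces is not matched.
empty_lines = [line for line in lines if re.search(r"^$", line)]
```

Only "apple pie" starts with apple, only "main event" ends with event, and only the line with no characters at all is matched by ^$.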

The * character is used to match zero or more occurrences of the preceding character or regular expression.

For example, the regular expression:

file2*

will match any of the following:

file file2 file22
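A short sketch of the same example, assuming Python's re module as the matching engine. The extra candidate file3 is an assumption added to show a name that does not fit.

```python
import re

candidates = ["file", "file2", "file22", "file3"]

# 2* allows zero or more 2 characters after "file"; fullmatch
# insists that the whole name fits the pattern.
whole_matches = [c for c in candidates if re.fullmatch(r"file2*", c)]
```

Note that a tool which searches within a line would still report a line containing file3, since the leading file on its own satisfies the pattern; the whole-string test is used here to make the difference visible.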

Using [ ] to match sets of characters

If you wanted to match the patterns

LwindowMargin
RwindowMargin

but not the patterns

TwindowMargin
BwindowMargin

the tools we have so far do not help. The pattern .windowMargin would also match the unwanted patterns, so the only option would be to list each acceptable pattern separately. The left and right bracket characters allow us to handle this case. These brackets are used to enclose the definition of a set of characters, any one of which may be matched at that position in a regular expression.

For example, to match either of the letters L or R, we can use the regular expression:

[LR]

We can now write the regular expression needed for the example above:

[LR]windowMargin

The brackets can be used anywhere in a regular expression. For example, to match any pattern that starts with the characters icon, followed by one of the digits 1, 2, or 3, and then the characters file, we could use the regular expression:

icon[123]file

There are some shortcuts that we can use with the brackets. We can specify a range of characters (successive ASCII codes) by using the - symbol. For example, if the numerical part of the previous example could be any digit from 0 to 9, then the appropriate regular expression would be:

icon[0-9]file

You can have several components within the brackets. For example,

[a-z123]

will match any lowercase letter or the digits 1, 2, or 3.
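The bracket examples in this section can be checked together. The sketch below uses Python's re module as a stand-in for a Unix tool; the extra test strings (icon7file, iconAfile, and the characters in "aZ2q9") are assumptions chosen for illustration.

```python
import re

margins = ["LwindowMargin", "RwindowMargin", "TwindowMargin", "BwindowMargin"]
left_right = [m for m in margins if re.fullmatch(r"[LR]windowMargin", m)]

icons = ["icon1file", "icon3file", "icon7file", "iconAfile"]
# An explicit set of digits versus a range of digits.
one_to_three = [i for i in icons if re.fullmatch(r"icon[123]file", i)]
any_digit = [i for i in icons if re.fullmatch(r"icon[0-9]file", i)]

# A range and individual characters combined in one set,
# tested against one character at a time.
mixed = [ch for ch in "aZ2q9" if re.fullmatch(r"[a-z123]", ch)]
```

In each case the brackets stand for exactly one character drawn from the set.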

There is one more shortcut that you can use. Inside the brackets (but not outside), when ^ is the first character, it is interpreted to mean that you want the complement of the characters that follow it within the brackets, that is, any character except the ones shown.

For example, if you want to match any single character that is not a digit, then you can use:

[^0-9]
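Keep in mind that the bracketed set still stands for a single character. The sketch below, which assumes Python's re module and made-up sample strings, shows what that means in practice and how the set can be combined with ^, $, and * to describe a whole line free of digits.

```python
import re

non_digit = re.compile(r"[^0-9]")

# Succeeds as soon as one non-digit character is found anywhere.
found_in_mixed = non_digit.search("12a45") is not None
found_in_digits = non_digit.search("12345") is not None

# For a whole line without digits, anchor the pattern and repeat the set.
# The first ^ is the start-of-line anchor, the second is the complement.
digit_free = re.compile(r"^[^0-9]*$")
line_without_digits = digit_free.search("abc def") is not None
line_with_digit = digit_free.search("abc 4") is not None
```

The string 12a45 is matched by [^0-9] even though it contains digits, because the a alone satisfies the pattern.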

Extensions

Some Unix utilities allow extensions to what we have seen so far. For example, parentheses can be used to group several characters into a subexpression, so if you want to allow zero or more repetitions of the pattern a4, you could use the regular expression:

(a4)*

Likewise, in awk, you can use + to mean match the previous character or regular (sub)expression one or more times (instead of zero or more times).

Also, you can use ? to mean match the previous character or regular (sub)expression zero or one time only.

Finally, you can use the | character as a logical "or" for matching either of the regular (sub)expressions on either side.
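Python's re module happens to support grouping, +, ?, and | with the same syntax, so it can serve as a rough stand-in for trying out these extensions. The sketch below makes that assumption; the sample strings are invented, and a given Unix utility may or may not accept each operator.

```python
import re

# (a4)* : zero or more repetitions of the group "a4"
group_star = [s for s in ["", "a4", "a4a4", "a44"] if re.fullmatch(r"(a4)*", s)]

# 2+ : one or more 2 characters (compare with 2*, which also allows none)
plus = [s for s in ["file", "file2", "file22"] if re.fullmatch(r"file2+", s)]

# u? : the u may appear once or not at all
optional = [s for s in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", s)]

# cat|dog : either alternative is acceptable
either = [s for s in ["cat", "dog", "cow"] if re.fullmatch(r"cat|dog", s)]
```

Note that (a4)* accepts the empty string, while a44 is rejected because the star repeats the whole group rather than just the 4.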

Some useful regular expressions

Here are some useful regular expressions that you may wish to use:

[A-Za-z][A-Za-z]*

The above will match any string of one or more letters (uppercase or lowercase).

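[+-][0-9][0-9]*

The above will match any integer with a preceding + or -. The - is placed last inside the brackets so that it is taken literally rather than as a range; in many utilities a backslash inside brackets is itself an ordinary character, so it is best left out.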

.*

This will match any string of characters, including the empty string.
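The three expressions in this section can be exercised as follows. This is a sketch that assumes Python's re module as the engine, with invented sample text; the signed-integer pattern is written with the - last inside the brackets, where it is taken literally.

```python
import re

word = re.compile(r"[A-Za-z][A-Za-z]*")
signed_int = re.compile(r"[+-][0-9][0-9]*")
anything = re.compile(r".*")

# findall returns every non-overlapping match, left to right.
words = word.findall("abc 123 Def4")
signed = signed_int.findall("x = -42, y = +7, z = 9")

# .* is satisfied even when there is nothing to match.
empty_ok = anything.fullmatch("") is not None
```

The first pattern picks out abc and Def, and the second picks out -42 and +7 but skips the unsigned 9.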

Summary of regular expression (re) rules

Note: Not all re rules work with all grep utilities. See the man pages when in doubt.


*portions from here