Extending Grep


At this point grep and egrep depart from one another. egrep stands for extended grep. The POSIX 1003.2 standard defined a set of regular expression characters, called modern, extended, or full regular expressions. The regular expressions I cited earlier are frequently called older or basic regular expressions. There is some overlap between the two, and recent versions of grep can be made to behave like egrep by using the -E option.

The egrep utility uses extended regular expressions, with a useful one being the plus (+) character, which works like the asterisk (*) but means "one or more" rather than "zero or more." Using egrep in the above example with a + instead of an * would cause the search to exclude "ct" because it doesn't contain one or more vowels.

$ egrep 'c[aeiou]+t' somewords.txt
cat
coat
coot
cot
cout
cut
$
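To see the difference between * and + side by side, here is a minimal sketch. It builds its own small word list (a stand-in I made up, not the somewords.txt used in this article) and uses grep -E, which the paragraph above describes as equivalent to egrep.

```shell
# Hypothetical sample data; the article's somewords.txt is not reproduced here.
words=$(mktemp)
printf '%s\n' ct cat coat cut > "$words"

# Basic RE with *: zero or more vowels, so "ct" matches too.
star_matches=$(grep 'c[aeiou]*t' "$words")

# Extended RE with +: at least one vowel is required, so "ct" drops out.
plus_matches=$(grep -E 'c[aeiou]+t' "$words")

printf 'with *:\n%s\n\nwith +:\n%s\n' "$star_matches" "$plus_matches"
rm -f "$words"
```

The only line that differs between the two result lists is "ct."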

If you use grep to achieve the same results, the search pattern becomes clumsier. The next example asks for "c," followed by any vowel, followed by zero or more occurrences of any vowel, followed by "t."

$ grep 'c[aeiou][aeiou]*t' somewords.txt
cat
coat
coot
cot
cout
cut
$

The egrep utility also adds a question mark (?), meaning zero or one occurrence, as another version of multiple occurrence matching.

* = zero or more occurrences
+ = one or more occurrences
? = zero or one occurrence
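As a quick illustration of the question mark, the sketch below looks for both spellings of one word. The word list is invented for the example, and grep -E stands in for egrep.

```shell
# Hypothetical sample data for illustration only.
words=$(mktemp)
printf '%s\n' color colour colouur > "$words"

# u? means zero or one "u", so "color" and "colour" match
# but "colouur" (two u's) does not.
optional_u=$(grep -E 'colou?r' "$words")

printf '%s\n' "$optional_u"
rm -f "$words"
```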

The vertical bar (|) creates an "or" condition between two possible search patterns. In the following example, egrep searches for "c," followed by one or more vowels, followed by "t," or for "p" followed by one or more vowels, followed by "l." Because the search string doesn't specify that the word must end after the closing "t" or "l," this example has matched "paula" and "paella," as well as words that end in "l."

$ egrep 'c[aeiou]+t|p[aeiou]+l' somewords.txt
cat
coat
coot
cot
cut
cet
cit
pal
paella
paul
paula
peal
peel
pool
$
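If you want only whole words, one approach is to anchor each branch with ^ and $. This is a sketch that assumes one word per line; the word list is a stand-in I made up, and grep -E is used in place of egrep.

```shell
# Hypothetical sample data, one word per line.
words=$(mktemp)
printf '%s\n' cat paul paula paella pool > "$words"

# Unanchored: any line containing a match is printed.
unanchored=$(grep -E 'c[aeiou]+t|p[aeiou]+l' "$words")

# Anchored: each branch must span the whole line.
anchored=$(grep -E '^c[aeiou]+t$|^p[aeiou]+l$' "$words")

printf 'unanchored:\n%s\n\nanchored:\n%s\n' "$unanchored" "$anchored"
rm -f "$words"
```

With the anchors in place, "paula" and "paella" no longer appear in the results.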

You can fudge this with grep by entering multiple search patterns and inserting newlines in between the patterns. This can be used with egrep and fgrep as well, but I'm introducing it here simply to highlight the difficulty of imitating egrep with grep when it would be simpler to use egrep.

In the following example, the first part of the command is entered on one line, and then Enter is pressed while the single quotes are still open. The shell prompts for additional input and continues to accept lines until the closing quote appears. Each individual line represents a separate search string to grep. This trick is useful with any version of grep.

$ grep 'c[aeiou][aeiou]*t
> p[aeiou][aeiou]*l' somewords.txt
cat
coat
coot
cot
cut
cet
cit
pal
paella
paul
paula
peal
peel
pool
$
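In a script, the same pair of patterns can also be given with the -e option, once per pattern, which avoids the embedded newline. The sketch below runs both styles against a small invented word list so you can compare them.

```shell
# Hypothetical sample data for illustration only.
words=$(mktemp)
printf '%s\n' cat dog pool > "$words"

# Style 1: a newline inside the quotes separates the two patterns.
newline_style=$(grep 'c[aeiou][aeiou]*t
p[aeiou][aeiou]*l' "$words")

# Style 2: each pattern gets its own -e option.
dash_e_style=$(grep -e 'c[aeiou][aeiou]*t' -e 'p[aeiou][aeiou]*l' "$words")

printf 'newline style:\n%s\n\n-e style:\n%s\n' "$newline_style" "$dash_e_style"
rm -f "$words"
```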

With egrep, simple parentheses can be used to group sections of a search pattern together. In the following example, the search pattern will match any of the words shown in the result list. The parentheses group "[Ss]ome" and "[Aa]ny" as alternatives, either of which must be followed by "one."

$ egrep '([Ss]ome|[Aa]ny)one' somewords.txt
someone
Someone
anyone
Anyone
$
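To show why the parentheses matter, this sketch runs the pattern with and without them. The word list is invented for the example, and grep -E stands in for egrep.

```shell
# Hypothetical sample data for illustration only.
words=$(mktemp)
printf '%s\n' someone Anyone something one > "$words"

# Grouped: "one" must follow either alternative.
grouped=$(grep -E '([Ss]ome|[Aa]ny)one' "$words")

# Ungrouped: the bar splits the whole pattern, so this means
# "[Ss]ome" OR "[Aa]nyone", and "something" matches as well.
ungrouped=$(grep -E '[Ss]ome|[Aa]nyone' "$words")

printf 'grouped:\n%s\n\nungrouped:\n%s\n' "$grouped" "$ungrouped"
rm -f "$words"
```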

A single character can be modified by a bound, which consists of one or two comma-separated numbers, with the first number specifying the minimum number and the second specifying the maximum. egrep uses curly braces ({}) to specify a bound, while grep uses back-slashed curly braces (\{\}). These examples, which match strings of characters, should clarify what I mean:

egrep         grep            meaning
[a-z]{2,4}    [a-z]\{2,4\}    Two through four characters
[a-z]{4}      [a-z]\{4\}      Exactly four characters
[a-z]{4,}     [a-z]\{4,\}     Four or more characters
[a-z]{,4}     [a-z]\{,4\}     Zero through four characters
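The bounds in the table can be tried out with a sketch like the following. The patterns are anchored with ^ and $ so that the counts apply to the whole line; the data is invented, and grep -E stands in for egrep.

```shell
# Hypothetical sample data: lines of 1, 2, 4, and 6 lowercase letters.
words=$(mktemp)
printf '%s\n' a ab abcd abcdef > "$words"

# Two through four letters, in extended and then basic syntax.
two_to_four=$(grep -E '^[a-z]{2,4}$' "$words")
two_to_four_bre=$(grep '^[a-z]\{2,4\}$' "$words")

# Exactly four letters, and four or more letters.
exactly_four=$(grep -E '^[a-z]{4}$' "$words")
four_or_more=$(grep -E '^[a-z]{4,}$' "$words")

printf '{2,4}: %s\n' "$(echo $two_to_four)"
printf '{4}:   %s\n' "$(echo $exactly_four)"
printf '{4,}:  %s\n' "$(echo $four_or_more)"
rm -f "$words"
```

Without the anchors, every line containing at least the minimum number of letters would match, because the bound only has to be satisfied somewhere in the line.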

Finally, the escape or backslash (\) removes the special meaning of a character and reverts it to a standard character. Some simple examples are illustrated below. Note that the backslash itself has a special meaning, so when you want to search for it, it must be escaped (\\).

character    matches
.            Any character
\.           A period
$            End of line
\$           A dollar sign
*            Zero or more occurrences of the preceding expression
\*           An asterisk
\            Nothing -- is an escape character
\\           A backslash
|            Create an "or" branch between two expressions
\|           A vertical bar
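A few rows of this table can be checked with the sketch below, which uses invented data. It also shows grep -F (the fgrep behavior mentioned earlier), which treats the whole pattern as a fixed string so that nothing needs escaping.

```shell
# Hypothetical sample data for illustration only.
data=$(mktemp)
printf '%s\n' '3.14' '3x14' 'cost $5' > "$data"

# An unescaped period matches any character, so both lines match.
any_char=$(grep '3.14' "$data")

# An escaped period matches only a literal period.
literal_dot=$(grep '3\.14' "$data")

# An escaped dollar sign matches a literal "$" instead of end of line.
literal_dollar=$(grep '\$5' "$data")

# -F disables regular expressions entirely; no escaping is needed.
fixed_string=$(grep -F '3.14' "$data")

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' \
    "$any_char" "$literal_dot" "$literal_dollar" "$fixed_string"
rm -f "$data"
```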

As a rule, if you escape a character that doesn't need to be escaped, the escape is ignored and the character is treated as if you had entered it on its own. If you place \a in a search pattern, it's typically the same as a, because the letter didn't need to be escaped in the first place. POSIX leaves this case undefined, though, so it's safer not to rely on it.

It can be hard to remember all of the grep and egrep characters that have a special meaning, and regular expressions are unfortunately far from regular. You have already seen that curly braces can be escaped in grep and, when escaped, acquire a special meaning. The same is true for parentheses and angle brackets. The following characters have special meanings in grep or egrep:

In egrep:

| ^ $ . * + ? ( ) [ { } \

In grep:

^ $ . * \( \) [ \{ \} \

Because regular expressions are used by vi, ex, sed, and ed, it's worth mentioning that these four editors use the following special characters:

^ $ . * \( \) [ \ \< \>

As you can see, you need to be aware of the version of grep with which you're working before you use the backslash indiscriminately.

The last collection of grep or egrep search pattern options is in fact a simple shorthand for describing a class of characters.

[:alpha:]    Any alphabetic character
[:lower:]    Any lowercase character
[:upper:]    Any uppercase character
[:digit:]    Any digit
[:alnum:]    Any alphanumeric character (alphabetic or digit)
[:space:]    Any white space character (space, tab, vertical tab)
[:graph:]    Any printable character, except space
[:print:]    Any printable character, including the space
[:punct:]    Any punctuation (i.e., a printable character that ...
[:cntrl:]    Any nonprintable character

You may use these inside a range option. The class name includes the left and right brackets, so these must be doubled inside a range, as in the following example, which searches for any string of 10 digits. Note the apparently doubled brackets. Actually, this is an option of [:digit:] inside the square brackets for a range. This could also be written [0-9].

$ egrep '[[:digit:]]{10}' somenumbers.txt
1234554321
$
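The sketch below repeats the ten-digit search against invented data, compares it with the [0-9] spelling, and shows that a class can share its brackets with other characters. Again, grep -E stands in for egrep.

```shell
# Hypothetical sample data for illustration only.
data=$(mktemp)
printf '%s\n' 'id 1234554321' 'id 12345' 'no digits here' > "$data"

# Ten digits in a row, written with the class and with a range.
class_matches=$(grep -E '[[:digit:]]{10}' "$data")
range_matches=$(grep -E '[0-9]{10}' "$data")

# A class combined with a literal space inside one set of brackets:
# lines made up only of lowercase letters and spaces.
lower_and_space=$(grep -E '^[[:lower:] ]+$' "$data")

printf '%s\n%s\n%s\n' "$class_matches" "$range_matches" "$lower_and_space"
rm -f "$data"
```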

The following listing offers some example search patterns that print each matching line prefixed with its line number. Pattern 1 searches for phone numbers: an opening parenthesis, followed by three digits, followed by a closing parenthesis, followed by three digits, a hyphen, and four digits.

Pattern 2 searches for zip codes -- five digits followed by zero or one hyphen, followed by zero to four digits -- either with or without the following hyphen and four digit extension.

Pattern 3 searches for lines containing P.O. Box number addresses by using a case-independent search for "p," followed by zero or one period, then zero or more spaces, "o," zero or one period, one or more spaces, and finally "box" or "drop." This should match most of the styles of data entry for a P.O. Box, including "PO Box," "PO BOX," "P.O. Box," "P O Box," "P. O. Drop," and so on.

Pattern 4 matches the word "cat" by searching for it where it's preceded by a beginning of line or one or more spaces, and followed by one or more spaces or an end of line. This search will not match "concatenate."

1. egrep -n '\([0-9]{3}\)[0-9]{3}-[0-9]{4}' somenumbers.txt
2. egrep -n '[0-9]{5}-?[0-9]{0,4}' somenumbers.txt
3. egrep -in 'p\.? *o\.? +(box|drop)' someaddresses.txt
4. egrep -n '(^| +)cat( +|$)' sometext.txt
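To try the four patterns without the data files, the sketch below pipes a few invented lines into each one. Two things here are my own choices rather than part of the listing: grep -E replaces egrep, and pattern 3 is written with an optional period after the "o" (o\.?) so that it matches "PO Box" as the description says it should. The hyphens are written unescaped.

```shell
# All input lines below are hypothetical test data.

# Pattern 1: phone numbers in the form (nnn)nnn-nnnn.
phones=$(printf '%s\n' '(212)555-1234' '212-555-1234' |
    grep -E -n '\([0-9]{3}\)[0-9]{3}-[0-9]{4}')

# Pattern 2: five-digit zip codes, with or without a four-digit extension.
zips=$(printf '%s\n' '10001' '10001-1234' '1234' |
    grep -E -n '[0-9]{5}-?[0-9]{0,4}')

# Pattern 3: P.O. Box styles, case-independent.
boxes=$(printf '%s\n' 'PO Box 12' 'P.O. Box 12' 'P. O. Drop 7' '12 Main St' |
    grep -E -i -n 'p\.? *o\.? +(box|drop)')

# Pattern 4: "cat" as a separate word.
cats=$(printf '%s\n' 'cat' 'the cat sat' 'concatenate' |
    grep -E -n '(^| +)cat( +|$)')

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' "$phones" "$zips" "$boxes" "$cats"
```

Each result line begins with the line number and a colon, which is what the -n option adds.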