Handout 22

sed*


Introduction

SED is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ED), SED works by making only one pass over the input(s), and is consequently more efficient. But it is SED's ability to filter text in a pipeline which particularly distinguishes it from other types of editors.

Sed regular expressions

The sed regular expressions are essentially the same as the grep regular expressions. They are summarized below.
^ matches the beginning of the line
$ matches the end of the line
. Matches any single character
(character)* Match arbitrarily many (possibly 0) occurrences of (character)
(character)\? Match 0 or 1 instance of (character). The \? form is a GNU sed extension; a portable equivalent is (character)\{0,1\}
[abcdef] Match any character enclosed in [] (in this instance, a b c d e or f) ranges of characters such as [a-z] are permitted.
[^abcdef] Match any character NOT enclosed in [] (in this instance, any character other than a b c d e or f)
(character)\{m,n\} Match m-n repetitions of (character)
(character)\{m,\} Match m or more repetitions of (character)
(character)\{,n\} Match n or fewer (possibly 0) repetitions of (character). This form is a GNU extension; \{0,n\} is the portable way to write it
(character)\{n\} Match exactly n repetitions of (character)
\(expression\) Group operator.
\n Backreference - matches nth group
expression1\|expression2 Matches expression1 or expression2. Works with GNU sed, but this feature might not work with other versions of sed.
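A few of the operators above can be tried directly on one-line inputs. The following is only a small sketch; the sample strings and the variable names r1, r2 and r3 are made up for illustration.

```shell
# \{2,\} : two or more repetitions of the preceding character
r1=$(printf 'caaat\n' | sed 's/a\{2,\}/A/')
# ^ with a negated set : strip a leading run of non-digits
r2=$(printf 'abc123def\n' | sed 's/^[^0-9]*//')
# \(..\) and \1 : the same two characters twice in a row
r3=$(printf 'xyabab\n' | sed 's/\(ab\)\1/[\1]/')
printf '%s\n' "$r1" "$r2" "$r3"
```

This prints cAt, 123def and xy[ab], one per line.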

How it Works: A Brief Introduction

Sed works as follows: it reads from the standard input, one line at a time. For each line, it executes a series of editing commands, and then the line is written to STDOUT. As an example of how this works, we use the s command; s means "substitute", that is, search and replace. The format is

s/regular-expression/replacement text/{flags}

flags - g means "replace all matches"

>cat file
I have three dogs and two cats
>sed -e 's/dog/cat/g' -e 's/cat/elephant/g' file
I have three elephants and two elephants
>

Firstly, sed read in the line of the file and executed

s/dog/cat/g

which produced the following text:

I have three cats and two cats

and then the second command was performed on the edited line and the result was

I have three elephants and two elephants

We actually have a name for the "current text": it is called the pattern space. So a precise definition of what sed does is as follows:

sed reads one line of input into the pattern space, performs a sequence of editing commands on the pattern space, then writes the pattern space to STDOUT, and repeats this for each line of input.
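Because every command works on the pattern space as the previous command left it, the order of the commands matters. A small sketch using the same line as above (the variable names are only for illustration):

```shell
line='I have three dogs and two cats'
# Original order: dog -> cat, then every cat -> elephant
first=$(printf '%s\n' "$line" | sed -e 's/dog/cat/g' -e 's/cat/elephant/g')
# Reversed order: the cats are renamed before the dogs become cats
second=$(printf '%s\n' "$line" | sed -e 's/cat/elephant/g' -e 's/dog/cat/g')
printf '%s\n' "$first" "$second"
```

The first command line gives "I have three elephants and two elephants", the second gives "I have three cats and two elephants".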

Getting Started: Substitute and delete Commands

sed is usually used as follows:

>sed -e 'command1' -e 'command2' -e 'command3' file
>{shell command}|sed -e 'command1' -e 'command2'
>sed -f sedscript.sed file
>{shell command}|sed -f sedscript.sed

So sed can read from a file or STDIN, and the commands can be specified in a file or on the command line.

Note that if the commands are read from a file, trailing whitespace can be fatal: it can cause scripts to fail for no apparent reason.
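The two ways of supplying commands can be compared side by side. This sketch assumes the mktemp utility is available to create a throwaway script file; the file and variable names are hypothetical.

```shell
# Write the two commands to a temporary script file, one per line,
# with no trailing whitespace.
script=$(mktemp)
printf '%s\n' 's/dog/cat/g' 's/cat/elephant/g' > "$script"

# Same commands, once on the command line and once from the file
from_args=$(printf 'dog and cat\n' | sed -e 's/dog/cat/g' -e 's/cat/elephant/g')
from_file=$(printf 'dog and cat\n' | sed -f "$script")

rm -f "$script"
printf '%s\n' "$from_args" "$from_file"
```

Both forms should print the same line, "elephant and elephant".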

The Substitute Command

The format for the substitute command is as follows:

[address1[ ,address2]]s/pattern/replacement/[flags]

The flags can be any of the following:
n replace nth instance of pattern with replacement
g replace all instances of pattern with replacement
p write pattern space to STDOUT if a successful substitution takes place
w file Write the pattern space to file if a successful substitution takes place

If no flags are specified, only the first match on the line is replaced. Note that we will almost always use the s command with either the g flag or no flag at all.
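The effect of the flags is easiest to see on a line with several matches. In this sketch the input strings are made up, and the -n option (which is not discussed above) is used to suppress sed's automatic printing so that only the output of the p flag is seen.

```shell
in='a-a-a-a'
none=$(printf '%s\n' "$in" | sed 's/a/X/')      # no flag: first match only
second=$(printf '%s\n' "$in" | sed 's/a/X/2')   # 2: second match only
all=$(printf '%s\n' "$in" | sed 's/a/X/g')      # g: every match
# With -n, only lines on which the substitution succeeded are written
changed=$(printf 'cat\ndog\n' | sed -n 's/cat/CAT/p')
printf '%s\n' "$none" "$second" "$all" "$changed"
```

This prints X-a-a-a, a-X-a-a, X-X-X-X and CAT.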

If one address is given, then the substitution is applied only to lines matching that address. An address can be either a regular expression enclosed by forward slashes /regex/ , or a line number. The $ symbol can be used in place of a line number to denote the last line.
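The three kinds of single address can be sketched on a three-line input (the input and the variable names are made up for illustration):

```shell
input=$(printf 'alpha\nbeta\ngamma\n')
by_number=$(printf '%s\n' "$input" | sed '2s/a/A/g')    # line 2 only
by_regex=$(printf '%s\n' "$input" | sed '/mm/s/a/A/g')  # lines matching mm
last=$(printf '%s\n' "$input" | sed '$s/a/A/g')         # last line only
printf '%s\n' "$by_number" "$by_regex" "$last"
```

Only "beta" changes in the first case (to betA); in the other two cases only "gamma" changes (to gAmmA).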

If two addresses are given, separated by a comma, then the substitution is applied to all lines from the line matching the first address to the line matching the second, inclusive.

This requires some clarification in the case where both addresses are patterns, as there is some ambiguity here. More precisely, the substitution is applied to all lines from the first match of address1 to the first match of address2 after it, and then again to all lines from the next match of address1 following that match of address2 to the next match of address2, and so on. If address2 never matches again, the range runs to the end of the input. Don't worry if this seems very confusing (it is); the examples will clarify this.

The Delete Command

The delete command is very simple in its syntax. It goes like this:

[address1[ , address2 ] ]d

And it deletes the content of the pattern space. All following commands are skipped (after all, there's very little you can do with an empty pattern space), and a new line is read into the pattern space.

Example 1

>cat file

http://pegasus.rutgers.edu/

>sed -e 's@http://www.foo.com@http://www.bar.net@' file

http://pegasus.rutgers.edu/

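Note that we used a different delimiter, @, for the substitution command. (The line in this file does not contain http://www.foo.com, so it passes through unchanged; the point here is the delimiter.) Sed accepts any single character other than a backslash or a newline as the delimiter for the s command, for example @ % , ; or : and these alternative delimiters are good for substitutions which include strings such as filenames, as they make your sed code much more readable.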
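For comparison, here is the same substitution applied to a line that does contain the search URL, written once with / as the delimiter (so every slash in the URLs has to be escaped) and once with @. The input line is a made-up example.

```shell
url='see http://www.foo.com/index.html'
# With / as the delimiter, each / inside the URLs needs a backslash
with_slash=$(printf '%s\n' "$url" | sed 's/http:\/\/www\.foo\.com/http:\/\/www.bar.net/')
# With @ as the delimiter, the URLs can be written as they are
with_at=$(printf '%s\n' "$url" | sed 's@http://www\.foo\.com@http://www.bar.net@')
printf '%s\n' "$with_slash" "$with_at"
```

Both commands print "see http://www.bar.net/index.html"; the second is simply easier to read.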

Example 2

>cat file

the black cat was chased by the brown dog

>sed -e 's/black/white/g' file

the white cat was chased by the brown dog

That was pretty straightforward. Now we move on to something more interesting.

Example 3

>cat file

the black cat was chased by the brown dog.
the black cat was not chased by the brown dog.

>sed -e '/not/s/black/white/g' file

the black cat was chased by the brown dog.
the white cat was not chased by the brown dog.

In this instance, the substitution is only applied to lines matching the regular expression not. Hence it is not applied to the first line.

Example 4

>cat file

line 1 (one)
line 2 (two)
line 3 (three)

Example 4a

>sed -e '1,2d' file

line 3 (three)

Example 4b

>sed -e '3d' file
line 1 (one)
line 2 (two)

Example 4c

>sed -e '1,2s/line/LINE/' file

LINE 1 (one)
LINE 2 (two)
line 3 (three)

Example 4d

>sed -e '/^line.*one/s/line/LINE/' -e '/line/d' file

LINE 1 (one)

4a : This was pretty simple: we just deleted lines 1 to 2.
4b : This was also pretty simple. We deleted line 3.
4c : In this example, we performed a substitution on lines 1-2.
4d : Now this is more interesting, and deserves some explanation. Firstly, it is clear that lines 2 and 3 get deleted. But let's look closely at what happens to line 1.
First, line 1 is read into the pattern space. It matches the regular expression ^line.*one, so the substitution is carried out, and the resulting pattern space looks like this:

LINE 1 (one)

So now the second command is executed, but since the pattern space no longer matches the regular expression line (matching is case sensitive), the delete command is not executed.

Example 5

>cat file

hello
this text is wiped out
Wiped out
hello (also wiped out)
WiPed out TOO!
goodbye
(1) This text is not deleted
(2) neither is this ... ( goodbye )
(3) neither is this
hello
but this is
and so is this
and unless we find another g**dbye
every line to the end of the file gets deleted

>sed -e '/hello/,/goodbye/d' file

(1) This text is not deleted
(2) neither is this ... ( goodbye )
(3) neither is this

This illustrates how the addressing works when two pattern addresses are specified. sed finds the first line matching the expression "hello" and deletes every line read into the pattern space up to and including the first line matching "goodbye". It doesn't apply the delete command to any more lines until it comes across the expression "hello" again. Since the expression "goodbye" is not on any subsequent line, the delete command is applied to all remaining lines.


Some More Commands

Backreferences in Sed

One of the nice things about backreferencing in sed is that you can use it not just in the search text, but in the replacement text.
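A small sketch of both uses follows: a backreference in the replacement text, and one in the search text. The input lines and variable names are made up for illustration.

```shell
# In the replacement: \1 is the first group, \2 the second,
# so this swaps the two comma-separated fields
swapped=$(printf 'Smith,John\n' | sed 's/\([^,]*\),\(.*\)/\2 \1/')
# In the search text: \1 must repeat whatever the group matched,
# so this removes a doubled word at the start of the line
doubled=$(printf 'the the cat\n' | sed 's/^\([a-z]*\) \1 /\1 /')
printf '%s\n' "$swapped" "$doubled"
```

This prints "John Smith" and "the cat".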

The quit command

The quit or q command is very simple. It simply quits: the current pattern space is written to STDOUT as usual, then no more lines are read into the pattern space and the program terminates, producing no further output.
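Combined with an address, q gives a simple way to stop early. A minimal sketch on a made-up four-line input:

```shell
# Quit at line 2: lines 1 and 2 are printed, the rest is never read
head2=$(printf 'one\ntwo\nthree\nfour\n' | sed '2q')
# Quit at the first line matching a pattern (that line is still printed)
upto=$(printf 'one\ntwo\nthree\nfour\n' | sed '/thr/q')
printf '%s\n' "$head2" "$upto"
```

The first command prints one and two; the second prints one, two and three.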

Subroutines

We now introduce the concept of subroutines in sed:

In sed, curly braces { } are used to group commands, so that one address (or address range) applies to all of them. They are used as follows:
[address1[,address2]]{
commands
}
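As a sketch of grouping, the following applies two substitutions only to the lines between BEGIN and END. The marker words and the input are made up for illustration; the closing brace is placed on a line of its own, which every version of sed accepts.

```shell
out=$(printf 'black dog\nBEGIN\nblack dog\nEND\nblack dog\n' | sed '/BEGIN/,/END/{
s/black/white/
s/dog/cat/
}')
printf '%s\n' "$out"
```

Only the middle "black dog" line becomes "white cat"; the lines outside the range are left alone.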

Example: Find First Word From a List in a File

This example makes very good use of all the concepts outlined above.

For this, we use a shell script, since we need to state the one long string X several times (otherwise, we'd have to repeat a somewhat lengthy expression each time). Notice that we use double quotes. This is so that the shell expands $X to the value of the variable X (which would not happen if we used single quotes). Also notice the $1 on the end. The syntax to run this script is script search_filename, where script is whatever you decided to call it and search_filename is the file you are trying to search. $1 is the name the shell gives to the first command line argument.

#!/bin/sh
# Alternation with \| is a GNU sed extension.
X='word1\|word2\|word3\|word4\|word5'
sed -e "
/$X/!d
/$X/{
	s/\($X\).*/\1/
	s/.*\($X\)/\1/
	q
	}" "$1"

An important note: it is tempting to think of this:

s/\($X\).*/\1/
s/.*\($X\)/\1/

as redundant, and to try and shorten it with this:

s/.*\($X\).*/\1/

This is unlikely to work. Why? Suppose we have a line

word1 word2 word3

We cannot rely on $X matching word1 rather than word2 or word3 (in practice the leading .* takes as much of the line as it can, leaving $X to match word3), so when we quote it (\1), we don't get the word we want.

The correct implementation avoids such problems by relying on the following:

the * operator is greedy. That is, when there is ambiguity as to what (expression)* can match, it tries to match as much as possible.

So in the example, s/\($X\).*/\1/ , .* tries to swallow as much of the line as possible. In particular, if the line looks like

word1 word2 word3

then we can be sure that .* matches " word2 word3" and hence $X matches word1.
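The difference between the two versions can be checked on a sample line. This sketch assumes GNU sed (for \|); the sample line and the variable names are made up, and the word list is shortened to three words.

```shell
X='word1\|word2\|word3'   # \| is a GNU sed extension
line='say word1 then word2 then word3 end'
# Two substitutions, as in the script: keeps the FIRST listed word on the line
two_step=$(printf '%s\n' "$line" | sed -e "s/\($X\).*/\1/" -e "s/.*\($X\)/\1/")
# The tempting one-liner: the greedy leading .* leaves the LAST word for \1
one_step=$(printf '%s\n' "$line" | sed -e "s/.*\($X\).*/\1/")
printf '%s\n' "$two_step" "$one_step"
```

With GNU sed the two-step version prints word1, while the shortened version prints word3.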

Pattern Matching Across More than 1 Line

Yes, this is something that a lot of people want to do (whether they realise it or not), as s/pattern/replacement/ on its own does not work if the text to be matched spans more than one line.

Example

Suppose we want to replace every instance of Microsoft Windows 95 with Linux (I mean, just replace the text !). Our first attempt is this:

s/Microsoft Windows 95/Linux/g

Unfortunately, the script fails if our file looks like this:

Microsoft
Windows 95

since neither line matches the pattern Microsoft Windows 95.

So we need to do better. We need the "multiline next" or N command.

The next command N appends the next line to the pattern space.

So our second attempt is this:

N
N
s/Microsoft[ \t\n]*Windows[ \t\n]*95/Linux/g

Now note that we have made reference to \t and \n. These are the tab and end of line characters respectively (GNU sed understands both escapes, including inside [ ]; other versions of sed may not). The end of line character only appears in multiline patterns. In multiline patterns, it should also be noted that ^ and $ match the beginning and end of the pattern space, not of each line within it.
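The effect of N, and of \n in a pattern, can be seen in a minimal sketch that joins lines in pairs. The input is made up, and it deliberately has an even number of lines so that N always finds a next line.

```shell
# N appends the next line, so the pattern space holds "a\nb", then "c\nd";
# the s command then replaces the embedded newline with a space
joined=$(printf 'a\nb\nc\nd\n' | sed 'N
s/\n/ /')
printf '%s\n' "$joined"
```

This prints "a b" and "c d".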

The above is a start, but it breaks if we have a file that looks like this:

Foo
Microsoft 
Windows 
95

Why does it break ? Let's look at what the script does.

  1. First, it reads the line "Foo" into the pattern space.
  2. It sees the N command and appends line 2 to the pattern space. The pattern space now looks like:

    Foo\nMicrosoft

  3. Executing the second N command , it reads line 3 into the pattern space. At this stage, the pattern space looks like this:

    Foo\nMicrosoft\nWindows

  4. Now the script runs the substitute command.

    Foo\nMicrosoft\nWindows

    This doesn't match the search pattern, so no substitution is performed.
  5. Since the end of the script is reached, the contents of the pattern space are written to STDOUT, and the script starts again from its first command.
  6. The last line of the file "95" is read into pattern space.

    This is the main error in the script: once the end of the script is reached, the first line that *has not been read into the pattern space already* is read. It is NOT true that the Nth iteration of the script reads from the Nth line of the file.

    The first of the two N commands that follow fails, because there is no next line to read, and the script exits. Most versions of sed exit at this point without writing '95' to STDOUT (GNU sed prints the pattern space before exiting).

So there are two things to be learned from this: each pass through the script starts with the first line that has not yet been read, and an N command that finds no next line ends the script.
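One possible way around both problems (a sketch, not the approach this handout goes on to develop) is to collect the whole input in the pattern space before substituting. It uses a label (:a), the branch command b and the address $! ("not the last line"), none of which are covered above, and the POSIX class [[:space:]] in place of [ \t\n]. For very large files, holding everything in the pattern space may not be practical.

```shell
# :a      define a label named a
# $!{..}  on every line except the last: append the next line (N),
#         then branch back to the label (ba)
# The s command therefore runs once, on the whole input.
fixed=$(printf 'Foo\nMicrosoft \nWindows \n95\n' |
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
        -e 's/Microsoft[[:space:]]*Windows[[:space:]]*95/Linux/g')
printf '%s\n' "$fixed"
```

For the four-line file above this prints Foo followed by Linux. Because N is only run when a next line exists, it never fails at the end of the file.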

More on Using sed

cat inputfile | sed -f a.sed > outputfile

where `a.sed' is a file that has sed commands in it. Or if you'd rather you can run sed in command mode:

sed commands tend to look like those in vi. Here are some examples:

   COMMAND                      FUNCTIONALITY
s/string1/string2/              substitutes string2 for the first occurrence of
                                string1 in each line

s/string1/string2/g             substitutes string2 for string1 everywhere
                                in each line

2 s/limits.*/hello/             in line 2, looks for the string 'limits' followed
                                by anything (.* matches any string), and replaces
                                it with 'hello'

2,4 s/junk/try/                 substitutes 'try' for the first occurrence of 'junk'
                                in lines 2 through 4 only

8 a\                            appends the string 'set c=818.'
set c=818.                      after line 8

8 a\                            appends the string 'set i1=(i+c)/c - 1'
set i1=(i+c)\/c - 1             after line 8. In appended text a \ is removed and
                                the character after it is taken literally, so \/
                                prints /; a literal \ must be written as \\

8 i\                            inserts the string 'set c=818.'
set c=818.                      before line 8
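The a and i commands can be tried on a two-line input. In this sketch the input and the added text are made up; each command is followed by a backslash and a newline, with the text on the next line.

```shell
out=$(printf 'one\ntwo\n' | sed '1a\
after one
2i\
before two')
printf '%s\n' "$out"
```

The output is four lines: one, after one, before two, two.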


sed Examples

1. STRIP ALL LEADING BLANKS FROM EVERY LINE OF A FILE

   sed  's/^ *//' testfile.1 > testfile.new

2. DELETE EVERY LINE THAT BEGINS WITH A DOT

sed '/^\./d' testfile.1 > testfile.new

3. REPLACE ALL STRINGS OF BLANKS BY SINGLE BLANKS.

sed 's/  */ /g' testfile.1 > testfile.new

4. DO #3 FOR EVERY FILE IN A COLLECTION AND PLACE THE RESULTS IN A SUBDIRECTORY:

mkdir lower
for FILE in testfile.*
do
sed 's/  */ /g' "$FILE" > "lower/$FILE"
done

5. Suppose you have 50 files, and they all need to have 'junk' replaced by 'try' in lines 2 through 10. You could edit all 50 files individually (what a pain), or do it quickly with sed. If the files are named file1, file2, etc. you could do the following:

vi a.sed, and create a file that looks like:

2,10 s/junk/try/

Back at the unix prompt type:
ls file*

You get:
file1
file10
file11
...
file50

Now type:
ls file* | sed 's/.*/cat & | sed -f a.sed > tmp; mv -f tmp &/' > doit

which is of the form:
inputlines | sed 'sed-command' > outputfile

In the replacement text, & stands for whatever the pattern matched (here .* matches
the whole line, so & is the file name), and so sed substitutes

cat & | sed -f a.sed > tmp; mv -f tmp &

for the line, with each & replaced by the file name.

So the file 'doit' looks like:
cat file1 | sed -f a.sed > tmp; mv -f tmp file1
cat file10 | sed -f a.sed > tmp; mv -f tmp file10
cat file11 | sed -f a.sed > tmp; mv -f tmp file11
...
cat file50 | sed -f a.sed > tmp; mv -f tmp file50
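The behaviour of & can be checked without any files, by feeding two made-up names to the same command. The second command shows that \& gives a literal ampersand.

```shell
# Each & in the replacement is replaced by the matched text (the whole line)
cmds=$(printf 'file1\nfile2\n' | sed 's/.*/cat & | sed -f a.sed > tmp; mv -f tmp &/')
# \& stands for a literal & character
escaped=$(printf 'file1\n' | sed 's/.*/\& is &/')
printf '%s\n' "$cmds" "$escaped"
```

The first command only prints the two generated command lines (nothing is executed); the second prints "& is file1".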

Execute these commands by typing: 
cat doit | csh

You can do all of this in one line by typing:
ls file* | sed 's/.*/cat & | sed -f a.sed > tmp; mv -f tmp &/' | csh
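The same job can also be written as an ordinary shell loop, which avoids generating commands and piping them to another shell. This is only a sketch: it works in a scratch directory created with mktemp -d (assumed to be available), and the two sample files and their contents are made up.

```shell
dir=$(mktemp -d)
printf 'junk\njunk\njunk\n' > "$dir/file1"
printf 'junk\njunk\n' > "$dir/file2"
printf '2,10 s/junk/try/\n' > "$dir/a.sed"

for f in "$dir"/file*
do
    # Replace the original only if sed succeeded
    sed -f "$dir/a.sed" "$f" > "$dir/tmp" && mv -f "$dir/tmp" "$f"
done

result=$(cat "$dir/file1")
rm -rf "$dir"
printf '%s\n' "$result"
```

For file1 this prints junk, try, try: line 1 is outside the range 2,10 and is left alone.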

Quick Reference Card

----------------------------------------------------------------------------------------
                             SED -- STREAM EDITOR

      sed 'command;command;command' somefile                see results
      sed 'command;command;command' somefile > newfile      save results
      .... | sed 'command;command;command' | ....           use in a pipeline
      sed -f sedcommands somefile > newfile                 commands are in file
                                                            somewhere else

   Commands:

      s   -- substitute strings
      y   -- translate characters
      d   -- delete lines
      !d  -- extract (keep) lines
      p   -- print lines

MOST COMMON PATTERN:   substitute every occurrence of a string "old" with "new"

      sed 's/old/new/g' somefile > newfile
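A short sketch of the other commands listed above, on a made-up three-line input. The -n option used with p suppresses sed's automatic printing, so only the lines selected by p appear.

```shell
input=$(printf 'apple\nbanana\ncherry\n')
upper=$(printf '%s\n' "$input" | sed 'y/abc/ABC/')   # y: a->A, b->B, c->C
kept=$(printf '%s\n' "$input" | sed '/an/!d')        # !d: keep lines matching an
printed=$(printf '%s\n' "$input" | sed -n '/rr/p')   # p: print lines matching rr
printf '%s\n' "$upper" "$kept" "$printed"
```

This prints Apple, BAnAnA, Cherry, then banana, then cherry.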
      
The patterns in these commands use regular expressions for pattern matching:

      *       0 or more occurrences of previous character
      .       any character except newline
      [...]   any single occurrence of character in set
      [a-z]   any single occurrence of character in range
      [^...]  any single occurrence of character NOT in set (or range)
      ^       beginning of line
      $       end of line
      \       escapes the following character
      \{n,m\} range of occurrences, n and m are integers
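      +       one or more occurrences of previous character
      ?       0 or 1 occurrence of preceding regular expression
      |       match either r.e. on left or r.e. on right
      ()      groups regular expressions
              (+ ? | and () are extended regular expression operators:
              use them with sed -E, or write \+ \? \| \( \) in GNU sed)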

   EXAMPLES:
     /^M.*/   Line begins with capital M, 0 or more chars follow
     /..*/    At least 1 character long
     /^$/     The empty line
     ab|cd    Either ab or cd

Examples:

     sed 's/^Unix/UNIX(TM)/'  ...    The caret (^) matches the beginning of a
                                     line, so this only changes Unix if it is at
                                     the beginning of a line

     sed 's/:$//' ...                The dollar sign ($) matches the end of a 
                                     line, so this removes a colon only if it is
                                     at the end of a line.


*portions from here and here and here.