sed scratch pad -- A thread of sed examples

s243a · #1 Post by **s243a** » Sun 29 Dec 2019, 06:41

I find sed very difficult to grasp. This thread is meant to demonstrate how to do things in sed for which one might not be able to find examples online.

My first post is an example on how to call an external function in sed. Here is my example:

Code: Select all

echo a | sed -ne 's/\(.*\)/echo a\1/' -e 'e' -e 'p'

or alternatively:

Code: Select all

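echo a | sed -ne '
# Replace "a" with "echo aa"
s/\(.*\)/echo a\1/
# Execute the pattern space as a command and replace it with the output
e
# Print the result
p
'

The -n option is needed to keep sed from auto-printing. Otherwise sed would print each line that it reads. The comments sit on their own lines because GNU sed treats any text that follows "e" on the same line as the command to run.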

* The 's' denotes string substitution.
* The brackets "\(...\)" capture the text that matches the regular expression inside them. In our case the regular expression is .* which matches any string (here 'a'). The captured text can be retrieved with the back reference "\1". The backslash in front of each bracket isn't necessary if you use extended regular expressions. However, with extended regular expressions more escaping of special characters may be required, since characters such as ( ) + ? | { become special and must be escaped to match literally.
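
To make the difference between the two regular-expression flavours concrete, here is a minimal sketch. The inputs and the square brackets in the replacement are made up for illustration; -E is the spelling accepted by current GNU and BSD sed, while older GNU sed documents it as -r.

```shell
# Basic regular expressions (the default): a capture group is written \( ... \).
echo a | sed 's/\(.*\)/[\1]/'

# Extended regular expressions: a capture group is written ( ... ).
echo a | sed -E 's/(.*)/[\1]/'

# With -E a bare parenthesis is special, so a literal one needs a backslash.
echo 'f(x)' | sed -E 's/\(x\)/(y)/'
```

The first two commands should both print [a], and the third should print f(y).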

Next we execute the external command, which is the result of the substitution. In our case that command is echo aa. The "e" command tells sed to run the pattern space as a shell command and replace the pattern space with the command's output.

Finally we print the result. The 'p' command is used to print the result.

The output is "aa"
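
The three steps can be watched one at a time. This is only a sketch and it assumes GNU sed, since "e" is a GNU extension:

```shell
# Step 1 and 3 only: substitute and print. The pattern space now holds
# the text of a shell command, nothing has been executed yet.
echo a | sed -n -e 's/\(.*\)/echo a\1/' -e 'p'

# Steps 1, 2 and 3: "e" runs the pattern space as a command and replaces
# the pattern space with that command's output before "p" prints it.
echo a | sed -n -e 's/\(.*\)/echo a\1/' -e 'e' -e 'p'
```

The first command should print "echo aa" and the second "aa".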

MochiMoppel · #2 Post by **MochiMoppel** » Sun 29 Dec 2019, 07:43

s243a wrote:My first post is an example on how to call an external function in sed. Here is my example:
Code: Select all
echo a | sed -ne 's/\(.*\)/echo a\1/' -e 'e' -e 'p'

Alternatively

Code: Select all

echo a | sed -nr 's/(.*)/echo a\1/ep'

or even shorter

Code: Select all

echo a | sed 's/.*/echo a&/e'

Beware that the e command is a GNU extension and most likely will only work with GNU sed. Does not work with busybox sed.
In my experience calling shell commands from within sed is very slow. Should probably be used only when no other alternatives exist.
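
For this particular toy example an alternative does exist: the same output can be produced by the substitution alone, without starting a shell for every line. A small sketch (the two-line input is made up):

```shell
# "&" stands for the whole matched text, so this prefixes each line with "a"
# without running any external command. Works in any POSIX sed.
echo a | sed 's/.*/a&/'

# The same substitution applied to several lines.
printf 'a\nb\n' | sed 's/.*/a&/'
```

The first command should print "aa", the second "aa" and "ab" on separate lines.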

s243a · #3 Post by **s243a** » Mon 30 Dec 2019, 00:37

MochiMoppel wrote:
s243a wrote:My first post is an example on how to call an external function in sed. Here is my example:
Code: Select all
echo a | sed -ne 's/\(.*\)/echo a\1/' -e 'e' -e 'p'
Alternatively
Code: Select all
echo a | sed -nr 's/(.*)/echo a\1/ep'
or even shorter
Code: Select all
echo a | sed 's/.*/echo a&/e'
Beware that the e command is a GNU extension and most likely will only work with GNU sed. Does not work with busybox sed.
In my experience calling shell commands from within sed is very slow. Should probably be used only when no other alternatives exist.

Thanks for the tips and warnings. I want to hone my sed skills because sed is used a lot. This means learning both standard sed and its extensions.

s243a · #4 Post by **s243a** » Mon 30 Dec 2019, 00:47

I'm borrowing the next one, which simply numbers the lines of a file:

sed '/./=' test | sed '/./N; s/\n/ /'

http://tuxthink.blogspot.com/2012/01/ad ... -file.html

I spent a fair bit of time trying to google better ways of doing this, and while it can probably be done without a pipe, the code to do so is probably more complex. The tricky thing about this problem is that the "=" command prints the line number but inserts a newline character after it.

In the above example the "/./" means "match any non-empty line". Said match (pun unintended) wouldn't be necessary if we wanted to match every line.

This syntax is pattern (e.g. /./ ) command ( "=" ). When the pattern matches the command is executed (i.e. print the line number followed by a new line character).

The input file (i.e. test) is:

Hi
How are
You.

The first sed command in the pipe outputs:

1
Hi
2
How are
3
You.

The second sed command reads two lines and then removes the newline character between them. The second line is read with the "N" command, which appends the next line to the pattern space. When sed prints, it automatically inserts a newline character at the end of the output, unless the "-z" option is used, in which case the null character (i.e. $'\0') is used instead of the newline character.

Many of the sed man pages don't mention that you can use the null character as the line separator. One way to perhaps do this in a single sed script is to use the "-z" option, but then there will be hidden null characters in the output.

As a final note, the fact that we need two sed commands to do this means that some other utility is probably preferable for this application. However, there may be times where one has good reason to pipe sed to sed, in which case this example might be a good starting point.
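
As one possible "other utility", awk can number the non-empty lines in a single process. The sketch below compares it with the two-sed pipeline on the three-line sample; the function names are made up for illustration and both read standard input.

```shell
# The pipeline from this post, reading stdin instead of the file "test".
number_with_sed() {
  sed '/./=' | sed '/./N; s/\n/ /'
}

# awk alternative: prefix every non-empty line with its line number (NR).
number_with_awk() {
  awk '/./ { $0 = NR " " $0 } { print }'
}

printf 'Hi\nHow are\nYou.\n' | number_with_sed
printf 'Hi\nHow are\nYou.\n' | number_with_awk
```

Both calls should print "1 Hi", "2 How are" and "3 You." on separate lines.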

rockedge · #5 Post by **rockedge** » Mon 30 Dec 2019, 01:02

thanks guys for the sed tips....I'm beginning to use it more often

MochiMoppel · #6 Post by **MochiMoppel** » Mon 30 Dec 2019, 02:04

s243a wrote:I'm borrowing the next one, which simply numbers the lines of a file

1) The code numbers only non-empty lines of a file. Intentionally?
2) The linked page shows the output having periods after the line numbers, which is not the output of this code
3) Not a mistake but still bad: Naming a file 'test' can lead to nasty errors since test is also the name of a shell command.

s243a wrote:In the above example the "/./" means "match any line".

No, it means "match any line containing at least one character"
For matching any line the code could have used "/^/" or simply no match pattern at all:

Code: Select all

sed = filename | sed 'N;s/\n/ /'
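
The difference between the two variants only shows up when the input has an empty line. A small sketch with a made-up three-line input whose second line is empty (function names are hypothetical, both read stdin):

```shell
# Variant with /./ : numbers only lines that contain at least one character.
number_nonempty() {
  sed '/./=' | sed '/./N; s/\n/ /'
}

# Variant without an address: numbers every line, including empty ones.
number_all() {
  sed = | sed 'N;s/\n/ /'
}

printf 'Hi\n\nYou.\n' | number_nonempty
printf 'Hi\n\nYou.\n' | number_all
```

The first call should leave the empty line unnumbered ("1 Hi", an empty line, "3 You."). The second should print "1 Hi", then "2 " with nothing after the space, then "3 You.".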

s243a wrote:When sed prints, it automatically inserts a newline character at the end of the output, unless the "-z" option is used, in which case the null character (i.e. $'\0') is used instead of the newline character.

???
It never adds a new line character at the end of the output and I doubt that the -z option would add a null character. Have you tried this?

s243a wrote:Many of the sed man pages don't mention that you can use the null character as the line separator. One way to perhaps do this in a single sed script is to use the "-z" option, but then there will be hidden null characters in the output.

I assume that one reason for not mentioning this option is the fact that it's relatively new. My GNU sed version 4.2.1 knows nothing about it. My understanding is that it treats null characters in the input like it would treat linefeeds without this option. It would treat "real" linefeeds as normal characters. Neither null characters nor linefeeds would be stripped or changed for the output, unless explicitly changed by the code.
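
This reading of -z can be checked with a short experiment. It is only a sketch and assumes a GNU sed recent enough to know -z; tr is used purely to make the null characters visible as newlines.

```shell
# Null characters separate the records: "^" matches at the start of each one.
printf 'a\0b\0' | sed -z 's/^/x/' | tr '\0' '\n'

# A linefeed inside a record is ordinary data and can be matched with \n.
printf 'a\nb\0' | sed -z 's/\n/+/' | tr '\0' '\n'
```

The first command should print "xa" and "xb" on separate lines, the second "a+b".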

s243a · #7 Post by **s243a** » Mon 30 Dec 2019, 21:39

MochiMoppel wrote:
s243a wrote:I'm borrowing the next one, which simply numbers the lines of a file
1) The code numbers only non-empty lines of a file. Intentionally?
2) The linked page shows the output having periods after the line numbers, which is not the output of this code
3) Not a mistake but still bad: Naming a file 'test' can lead to nasty errors since test is also the name of a shell command.

s243a wrote:In the above example the "/./" means "match any line".
No, it means "match any line containing at least one character"

Yes. I realized this after reading point "1" above. I suppose it is cleaner to not number empty lines.

For matching any line the code could have used "/^/" or simply no match pattern at all:

Agreed.

s243a · #8 Post by **s243a** » Mon 30 Dec 2019, 22:02

MochiMoppel wrote:
s243a wrote:When sed prints, it automatically inserts a newline character at the end of the output, unless the "-z" option is used, in which case the null character (i.e. $'\0') is used instead of the newline character.
???
It never adds a new line character at the end of the output and I doubt that the -z option would add a null character. Have you tried this?

s243a wrote:Many of the sed man pages don't mention that you can use the null character as the line separator. One way to perhaps do this in a single sed script is to use the "-z" option, but then there will be hidden null characters in the output.
I assume that one reason for not mentioning this option is the fact that it's relatively new. My GNU sed version 4.2.1 knows nothing about it. My understanding is that it treats null characters in the input like it would treat linefeeds without this option. It would treat "real" linefeeds as normal characters. Neither null characters nor linefeeds would be stripped or changed for the output, unless explicitly changed by the code.

We'll look into how sed actually works here later, but for now consider the following:

Code: Select all

[root@dpupbuster64 ~] $ { echo -n a; printf '\0'; echo -n b; } | sed -z p | tr '\0' '\n'; echo ""
a
a
b
b
[root@dpupbuster64 ~] $ { echo -n a; printf '\0'; echo -n b; } | sed -z p; echo ""
aabb
[root@dpupbuster64 ~] $ { echo -n a; printf '\0'; echo -n b; } | sed -nz p; echo ""
ab
[root@dpupbuster64 ~] $ { echo -n a; printf '\0'; echo -n b; } | sed -nz p | tr '\0' '\n'; echo ""
a
b
[root@dpupbuster64 ~] $ { echo -n a; printf '\0'; echo -n b; } | sed -zne 's/\(.*\)/c\1/;p' | tr '\0' '\n'; echo ""
ca
cb

I need some time to ponder this and part of pondering it is figuring out how to properly test it.

Note that I had to use the printf function because apparently in bash you can't store a null character in a variable (or even a string?).

BTW on dpup buster64 we have "sed (GNU sed) 4.7"

Edit: so considering the above here is how we can do it in a single sed command:

Code: Select all

[root@dpupbuster64 ~] $ { echo -n a; printf '\0'; echo -n b; } | sed -zne '=' -rne 's/([^0-9]+.*)/\1\n/;p'
1a
2b

Of course there are hidden null characters here.

Code: Select all

[root@dpupbuster64 ~] $ { echo -n a; printf '\0'; echo -n b; } | sed -zne '=' -rne 's/([^0-9]+.*)/\1\n/;p' | tr '\0' '.'
1.a
.2.b

MochiMoppel · #9 Post by **MochiMoppel** » Tue 31 Dec 2019, 08:59

s243a wrote:Note that I had to use the printf function

Instead of
{ echo -n a; printf '\0'; echo -n b; }
try
echo -ne "a\x00b"

s243a · #10 Post by **s243a** » Tue 31 Dec 2019, 16:03

MochiMoppel wrote:
s243a wrote:Note that I had to use the printf function

Instead of
{ echo -n a; printf '\0'; echo -n b; }
try
echo -ne "a\x00b"

That also worked.

Code: Select all

# echo -ne "a\x00b" | sed -zne '=' -rne 's/([^0-9]+.*)/\1\n/;p' | tr '\0' '.'
1.a
.2.b

Thanks for the tip.

Do you have any documentation on those kinds of codes with the echo command?

s243a · #11 Post by **s243a** » Tue 31 Dec 2019, 18:47

The next example, I will also borrow:

Code: Select all

sed -e '/./{H;$!d;}' -e 'x;/Administration/!d' thegeekstuff.txt

https://www.thegeekstuff.com/2009/12/un ... perations/

This example prints the paragraphs that contain the word Administration. The two ways to solve this problem that are apparent to me are as follows:
1. Either use the hold space or alternatively
2. Use Loops.

The above example uses approach #1. I will also try this in another post using approach #2.

The reason that we use the "hold space" here is that when sed reads the next line of input [1], as part of a new cycle, the previous line that is in pattern space is replaced by the line of text read in the next cycle (see execution cycle). The two ways around this -- as noted above -- are to either append the previous line to hold space before starting the next cycle, or alternatively use the "N" command to append the next line of text (as a new line), into pattern space.

So anyway recall that /./ matches non-blank lines. If there is a match, we use the "H" command to append the line we just read from standard input (which is currently in pattern space) to the hold space. After this you'll notice "$!d", which means: if we are not at the last line, delete the pattern space and start the next cycle. See "Relations between d, p, and !" at:
https://www.grymoire.com/Unix/Sed.html

Anyway, I'm not really sure of the point of doing this since in the next command (i.e. 'x') we replace the pattern space with the contents of the hold space, which will effectively delete the previous pattern space anyway. The final action in the script is:

Code: Select all

/Administration/!d

which means: delete the paragraph if it doesn't contain the word "Administration".
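
Both approaches can be tried on a small input. The sketch below uses a made-up three-paragraph sample; para_hold is the command from this post reading stdin, and para_loop is one possible N-based version written for this illustration, not something taken from the linked article.

```shell
# Made-up sample: three paragraphs separated by empty lines.
sample() {
  printf '%s\n' 'Linux' '  Administration' '' \
                'Database' '  Mysql' '' \
                'Security' '  Firewall'
}

# Approach 1: collect each paragraph in the hold space.
# "d" ends the cycle at once, so "x" only runs on an empty line or on the
# last line, which is when a whole paragraph is waiting in the hold space.
para_hold() {
  sed -e '/./{H;$!d;}' -e 'x;/Administration/!d'
}

# Approach 2: collect each paragraph in the pattern space with an N loop.
# Keep appending lines until the pattern space ends with an empty line or
# the input runs out, then print it if it contains the word.
para_loop() {
  sed -n ':a
/\n$/!{
  $!{
    N
    ba
  }
}
/Administration/p'
}

sample | para_hold
sample | para_loop
```

Both should print only the first paragraph. The hold-space version prints an empty line before it, because H always inserts a newline before the appended text, while the loop version prints the empty line after it.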

Notes
---------------------
1. We call it the "next line of input" but the lines can be separated either by a new line character, or in the case of the "-z" option a null character. The -z option is only available in newer versions of sed.

MochiMoppel · #12 Post by **MochiMoppel** » Wed 01 Jan 2020, 02:40

s243a wrote:Do you have any documentation on those kinds of codes with the echo command?

Not sure what you mean by "those kinds of codes". You'll find a good starting point right at your fingertips:

Code: Select all

help echo

I recommend staying away from octal codes and always using hex codes, since with hex the syntax in bash echo, busybox echo and bash printf is the same. And unless you know what you are doing you should avoid using abbreviated codes like '\0'. You'll always be safe when you use 3 digits for octal and 2 digits for hex.
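
The danger of abbreviated codes can be made visible with od. The sketch below deliberately uses octal rather than hex, on the assumption that it should also run in shells whose printf only understands the octal escapes that POSIX requires; the byte values are chosen for illustration.

```shell
# Three octal digits: unambiguous. Bytes written: 61 00 62 ("a", NUL, "b").
printf 'a\000b' | od -An -tx1

# Abbreviated "\0" followed by digits: printf reads "\012" as ONE escape
# (a linefeed) and then a literal "3", not a NUL followed by "123".
printf '\0123' | od -An -tx1
```

The first od line should show 61 00 62 and the second 0a 33.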

MochiMoppel · #13 Post by **MochiMoppel** » Tue 14 Jan 2020, 09:47

Looks like another abandoned thread

I'll give it a try anyway since I don't know where to ask.
The challenge is to remove all comments from a XML/HTML document, using only sed.

Example text:

Code: Select all

<JWM>
	<Tray  autohide="false" insert="right" x="0" y="-1" border="1" height="28" >
		<!-- Additional TrayButton attribute: label -->
		<TrayButton label="Menu" icon="logo-mini.png" border="true">root:3</TrayButton>
border="true">exec:urxvt</TrayButton>
		<Pager/>
		<!-- Additional TaskList attribute: maxwidth -->
		<TaskList maxwidth="200"/>
		<Dock/>
		<!-- Additional Swallow attribute: height -->
	<!--	<Swallow name="blinky">
			blinkydelayed -bg "#DCDAD5"
		</Swallow> -->
	<!--	<Swallow name="xtmix-launcher">
			xtmix -launch
		</Swallow> -->
	<!--	<Swallow name="asapm">
			asapmshell -u 4
		</Swallow> -->
	<!--	<Swallow name="freememapplet" width="34">
			freememappletshell
		</Swallow> -->
		<Swallow name="xload" width="32">
			xload -nolabel -bg "#888888" -fg red -hl white
		</Swallow>
		<Clock format="%H:%M">minixcal</Clock>
	</Tray>
</JWM>

The problem is that these comments can be multiline. My rough idea is to let sed move lines to the hold buffer from the point where a '<!--' is detected until a '-->' is detected, then load the hold buffer into the pattern space and remove the comment, clear the hold buffer and continue with the next cycle. This may not be the right way, and I'm not even close to achieving the goal. Does anybody know how to do this?

s243a · #14 Post by **s243a** » Tue 14 Jan 2020, 13:48

MochiMoppel wrote:Looks like another abandoned thread

I'll give it a try anyway since I don't know where to ask.
The challenge is to remove all comments from a XML/HTML document, using only sed.

Example text:
Code: Select all
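<JWM>
	<Tray  autohide="false" insert="right" x="0" y="-1" border="1" height="28" >
		<!-- Additional TrayButton attribute: label -->
		<TrayButton label="Menu" icon="logo-mini.png" border="true">root:3</TrayButton>
border="true">exec:urxvt</TrayButton>
		<Pager/>
		<!-- Additional TaskList attribute: maxwidth -->
		<TaskList maxwidth="200"/>
		<Dock/>
		<!-- Additional Swallow attribute: height -->
	<!--	<Swallow name="blinky">
			blinkydelayed -bg "#DCDAD5"
		</Swallow> -->
	<!--	<Swallow name="xtmix-launcher">
			xtmix -launch
		</Swallow> -->
	<!--	<Swallow name="asapm">
			asapmshell -u 4
		</Swallow> -->
	<!--	<Swallow name="freememapplet" width="34">
			freememappletshell
		</Swallow> -->
		<Swallow name="xload" width="32">
			xload -nolabel -bg "#888888" -fg red -hl white
		</Swallow>
		<Clock format="%H:%M">minixcal</Clock>
	</Tray>
</JWM>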
The problem is that these comments can be multiline. My rough idea is to let sed move lines to the hold buffer from the point where a '<!--' is detected until a '-->' is detected, then load the hold buffer into the pattern space and remove the comment, clear the hold buffer and continue with the next cycle. This may not be the right way, and I'm not even close to achieving the goal. Does anybody know how to do this?

I have to go to work so something like:

Code: Select all

#If we don't yet have a terminating comment just append to the hold space and start the next cycle. 
/.*-->.*/!{
  H #Append pattern space to hold space
  d #Delete pattern space and start next cycle 
  }
#If we have a closing comment append data to hold space and copy the hold space to the pattern space to see if we can match both an opening and closing comment in pattern space. 
/.*-->.*/ { 
    H #Append new data to hold space  
    x #Exchange hold space with pattern space
    h #Copy pattern space to hold space
  }
#If this block matches the previous block has already been executed and this block will be executed next. 
/.*<!--.* -->.*./ { 
    s/<!--.* -->// #Delete comment
    p #Print pattern space
    s/.*//g #delete pattern space
    x #exchange pattern space with hold space
    d #delete pattern space and start next cycle.
  }

I might test this later. We'll see.

6502coder · #15 Post by **6502coder** » Tue 14 Jan 2020, 16:24

Isn't this essentially the same as the problem of using sed to remove comments from a C program, for which Googling turns up a bunch of suggestions? I haven't looked into this carefully, just making an observation.

MochiMoppel · #16 Post by **MochiMoppel** » Tue 14 Jan 2020, 16:53

Yes, essentially the same, but the suggestions I've seen so far are either not for sed, are crap or a combination of both.

Keef · #17 Post by **Keef** » Tue 14 Jan 2020, 18:42

Is this any good?
https://stackoverflow.com/questions/405 ... ing-regexp

Using your example, the output I get is:

Code: Select all

# cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'
<JWM>
   <Tray  autohide="false" insert="right" x="0" y="-1" border="1" height="28" >
      
      <TrayButton label="Menu" icon="logo-mini.png" border="true">root:3</TrayButton>
border="true">exec:urxvt</TrayButton>
      <Pager/>
      
      <TaskList maxwidth="200"/>
      <Dock/>
      
   
   
   
   
      <Swallow name="xload" width="32">
         xload -nolabel -bg "#888888" -fg red -hl white
      </Swallow>
      <Clock format="%H:%M">minixcal</Clock>
   </Tray>
</JWM>
#

sc0ttman · #18 Post by **sc0ttman** » Tue 14 Jan 2020, 20:30

I'd love to get this working:

An HTML minifier...

This thing nearly does the job, except that it minifies stuff inside <pre> tags...

I would love love love to fix that!!

Code: Select all

function minify_html {
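  # temp fix to IFS, just in case the HTML file names contain spaces
  OLD_IFS=$IFS
  # newline only: the closing quote must start the line, or the indentation
  # spaces become separators too
  IFS="
"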
  for html_file in $html_files
  do
    :
    # dont minify HTML until we can skip contents of <pre>..</pre>
    #sed ':a;N;$!ba;/<div class="highlight"><pre>\.*<\/pre><\/div>/! s@>\s*<@><@g' $html_file > ${html_file//.html/.minhtml}
    #mv ${html_file//.html/.minhtml} ${html_file}
  done
  IFS=$OLD_IFS
}
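
One way to leave <pre> blocks alone is a range address that branches past the minifying commands. The following is only a sketch under simplifying assumptions: the opening tag is written exactly as <pre>, the opening and closing tags sit on different lines, and lines are trimmed rather than joined. The function name and sample are made up.

```shell
# Made-up sample with a <pre> block whose spacing must survive.
sample_html() {
  printf '%s\n' '<div>' '   <p> hi </p>   <p>x</p>' '' \
                '<pre>' '  keep   this' '</pre>' '</div>'
}

minify_outside_pre() {
  # Lines from <pre> to </pre>: "b" jumps to the end of the script, so they
  # are printed untouched. All other lines are trimmed, whitespace between
  # tags is removed, and lines that end up empty are deleted.
  sed -e '/<pre>/,/<\/pre>/b' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//' \
      -e 's/>[[:space:]]*</></g' \
      -e '/^$/d'
}

sample_html | minify_outside_pre
```

On the sample this should keep "  keep   this" exactly as it is while the paragraph line becomes "<p> hi </p><p>x</p>". Matching the longer <div class="highlight"><pre> opening from the function above would need the first address adjusted accordingly.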

s243a · #19 Post by **s243a** » Wed 15 Jan 2020, 04:05

Keef wrote:Is this any good?
https://stackoverflow.com/questions/405 ... ing-regexp

Using your example, the output I get is:

Code: Select all

# cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'
<JWM>
   <Tray  autohide="false" insert="right" x="0" y="-1" border="1" height="28" >
      
      <TrayButton label="Menu" icon="logo-mini.png" border="true">root:3</TrayButton>
border="true">exec:urxvt</TrayButton>
      <Pager/>
      
      <TaskList maxwidth="200"/>
      <Dock/>
      
   
   
   
   
      <Swallow name="xload" width="32">
         xload -nolabel -bg "#888888" -fg red -hl white
      </Swallow>
      <Clock format="%H:%M">minixcal</Clock>
   </Tray>
</JWM>
#

I get the same output with:

Code: Select all

#Match the last line
$,/.*/ {    
    H #Append new data to hold space 
    x #Exchange hold space with pattern space
    s/<!--.*-->//g #Delete comment
    p #Print pattern space
  }    
#If we don't yet have a terminating comment just append to the hold space and start the next cycle.
/.*-->.*/! {
  H #Append pattern space to hold space
  d #Delete pattern space and start next cycle
  }
#If we have a closing comment append data to hold space and copy the hold space to the pattern space to see if we can match both an opening and closing comment in pattern space.
/.*-->.*/ {
    H #Append new data to hold space 
    x #Exchange hold space with pattern space
    h #Copy pattern space to hold space
  }
#If this block matches the previous block has already been executed and this block will be executed next.
/.*<!--.*-->.*/ {
    s/<!--.*-->//g #Delete comment
    p #Print pattern space
    s/.*//g #delete pattern space
    x #exchange pattern space with hold space
    d #delete pattern space and start next cycle.
  }

Test program:
https://pastebin.com/tNttFjyT

jamesbond · #20 Post by **jamesbond** » Wed 15 Jan 2020, 05:29

MochiMoppel wrote:The challenge is to remove all comments from a XML/HTML document, using only sed.

Challenge accepted.

This removes the comments and cleans up stray newlines.

Code: Select all

sed -n 'H;x;s/<!--.*-->//;x;${x;s/\n//;s/\n[ \n\t]*\n/\n/g;p}' test.html

If you only want to remove the comments and don't worry about how it looks, this will do.

Code: Select all

sed -n 'H;x;s/<!--.*-->//;x;${x;p}' test.html

Confirmed to work with gnu sed and busybox sed.
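
To see what the two commands do without a test.html at hand, they can be fed a small made-up document on stdin. The function names and the sample are hypothetical; the sed scripts are the ones from this post, and the tidy version was only traced here against GNU sed.

```shell
# Made-up sample with a single-line and a two-line comment.
sample_xml() {
  printf '%s\n' '<a>' '  <!-- one -->' '  <b/>' \
                '  <!-- two' '       lines -->' '  <c/>' '</a>'
}

# Every line is appended to the hold space, and after each append any
# complete comment collected so far is removed there. The result is
# printed once, on the last line.
strip_comments() {
  sed -n 'H;x;s/<!--.*-->//;x;${x;p}'
}

# Same, plus removal of the leading newline and of whitespace-only lines.
strip_comments_tidy() {
  sed -n 'H;x;s/<!--.*-->//;x;${x;s/\n//;s/\n[ \n\t]*\n/\n/g;p}'
}

sample_xml | strip_comments
sample_xml | strip_comments_tidy
```

The tidy version should print just the four lines "<a>", "  <b/>", "  <c/>" and "</a>". Because .* is greedy, two comments on the same line would probably also take the text between them with them, so this looks safest for files with at most one comment per line.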
