sed scratch pad -- A thread of sed examples

step · #21 Post by **step** » Wed 15 Jan 2020, 06:50

Just a reminder to also test a Windows-created HTML file, for which \r\n is the line termination sequence. (I didn't but I remember being scorched about this before).
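
One way to act on this reminder is to normalise the line endings before the real sed script runs. This is only a sketch: it assumes GNU sed (which accepts `\r` in a regex), and the sample text and variable names are made up for illustration.

```shell
# Hypothetical two-line sample with Windows (CRLF) line endings.
crlf_sample=$(printf '<p>1</p>\r\n<p>2</p>\r\n')

# Strip the carriage return at the end of each line.
# Assumption: GNU sed, where \r in a regex means CR.
unix_text=$(printf '%s\n' "$crlf_sample" | sed 's/\r$//')

printf '%s\n' "$unix_text"
```

After this step the text can be fed to any of the scripts in this thread as if it had been created on Linux.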

s243a · #22 Post by **s243a** » Wed 15 Jan 2020, 06:59

sc0ttman wrote:I'd love to get this working:

An HTML minifier...

This thing nearly does the job, except that it minifies stuff inside <pre> tags...

I would love love love to fix that!!
Code: Select all
function minify_html {
  # temp fix to IFS, just in case the html files contain spaces
  OLD_IFS=$IFS
  IFS="
  "
  for html_file in $html_files
  do
    :
    # dont minify HTML until we can skip contents of <pre>..</pre>
    #sed ':a;N;$!ba;/<div class="highlight"><pre>\.*<\/pre><\/div>/! s@>\s*<@><@g' $html_file > ${html_file//.html/.minhtml}
    #mv ${html_file//.html/.minhtml} ${html_file}
  done
  IFS=$OLD_IFS
}

I'll think about the general problem more later, but for now I notice that in "<pre>\.*<\/pre>" you are escaping the period. I think what you actually want is "<pre>.*<\/pre>" (notice that the period is not escaped): even with basic regular expressions the period keeps its special meaning, and in this case we want that special meaning (match any character), so we must not escape it.

In GNU sed, the only difference between basic and extended regular expressions is in the behavior of a few special characters: ‘?’, ‘+’, parentheses, braces (‘{}’), and ‘|’.

With basic (BRE) syntax, these characters do not have special meaning unless prefixed with a backslash (‘\’); While with extended (ERE) syntax it is reversed: these characters are special unless they are prefixed with backslash (‘\’).

https://www.gnu.org/software/sed/manual ... BRE-vs-ERE

You can test this with something like:

# echo abc | sed 's/a.c//'
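
A slightly longer version of the same test makes both points visible at once. It is a sketch with made-up inputs and variable names; `-E` is used for extended syntax and should behave like `-r` in GNU sed.

```shell
# An unescaped period matches any character, so both words are replaced.
unescaped=$(echo 'abc a.c' | sed 's/a.c/X/g')

# An escaped period matches only a literal period.
escaped=$(echo 'abc a.c' | sed 's/a\.c/X/g')

# Grouping is where BRE and ERE differ: \( \) in BRE, ( ) with -E.
bre_group=$(echo 'ababc' | sed 's/\(ab\)*c/X/')
ere_group=$(echo 'ababc' | sed -E 's/(ab)*c/X/')

printf '%s\n' "$unescaped" "$escaped" "$bre_group" "$ere_group"
```

The first two results differ (`X X` versus `abc X`), while the two grouping commands give the same result with different spellings.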

BTW, why do we need the "div" tags in the above expression?

sc0ttman · #23 Post by **sc0ttman** » Wed 15 Jan 2020, 07:10

s243a wrote:BTW, why do we need the "div" tags in the above expression?

mdsh/pygments generates divs around pre tags.. this minifier is for mdsh.

recobayu · #24 Post by **recobayu** » Wed 15 Jan 2020, 08:44

MochiMoppel wrote:Looks like another abandoned thread

I'll give it a try anyway since I don't know where to ask.
The challenge is to remove all comments from a XML/HTML document, using only sed.

Example text:
Code: Select all
<JWM>
	<Tray  autohide="false" insert="right" x="0" y="-1" border="1" height="28" >
		
		<TrayButton label="Menu" icon="logo-mini.png" border="true">root:3</TrayButton>
border="true">exec:urxvt</TrayButton>
		<Pager/>
		
		<TaskList maxwidth="200"/>
		<Dock/>
		
	
	
	
	
		<Swallow name="xload" width="32">
			xload -nolabel -bg "#888888" -fg red -hl white
		</Swallow>
		<Clock format="%H:%M">minixcal</Clock>
	</Tray>
</JWM>
The problem is that these comments can be multiline. My rough idea is to let sed move a line to the hold buffer when a '<!--' is detected, load the hold buffer into the pattern space and remove the comment, clear the hold buffer and continue with the next cycle. It may not be the right way, and I'm not even close to achieving the goal. Does anybody know how to do this?

I use this one-line command, but it only deletes a comment when the comment is on separate lines.

Code: Select all

#sed -e '/<!--/,/-->/d' xml
<JWM>
   <Tray  autohide="false" insert="right" x="0" y="-1" border="1" height="28" >
      <TaskList maxwidth="200"/>
      <Dock/>
      <Swallow name="xload" width="32">
         xload -nolabel -bg "#888888" -fg red -hl white
      </Swallow>
      <Clock format="%H:%M">minixcal</Clock>
   </Tray>
</JWM>
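
The limitation of a range delete can be shown with a small made-up sample (the variable names are hypothetical). The `d` command works on whole lines, so a tag that shares a line with the end of a comment is lost as well.

```shell
# Hypothetical sample: the comment ends on a line that also holds a real tag.
sample='<a/>
<!-- one
two --><b/>
<c/>'

# Range delete: drops every line from the one containing <!--
# up to and including the one containing -->.
range_result=$(printf '%s\n' "$sample" | sed -e '/<!--/,/-->/d')

printf '%s\n' "$range_result"
```

Only `<a/>` and `<c/>` survive; `<b/>` disappears together with the comment.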

MochiMoppel · #25 Post by **MochiMoppel** » Wed 15 Jan 2020, 12:07

Keef wrote:Is this any good?
https://stackoverflow.com/questions/405 ... ing-regexp

If it solves the problem it can't be that bad, right?

But frankly it's not really good: a useless cat, a useless '?' in  and a strange positioning of the :a label. Still, the idea is nice. No pattern space <-> hold space acrobatics, just a clever use of a label. The next suggestion in your link is better:

Code: Select all

sed -r '
/<!--/!b
:a
/-->/! {N;ba}
s/<!--.*-->//
' "$TESTFILE"
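
To see what the label loop does, here is a sketch that runs the same commands on a small made-up sample (written with the block spread over several lines, which should behave the same way). Lines without `<!--` are printed at once; otherwise `N` keeps appending lines until `-->` shows up, and only then is the comment removed.

```shell
# Hypothetical sample with one comment spanning two lines.
sample='<a/>
<!-- x
y -->
<b/>'

# /<!--/!b       no comment start on this line: skip to the end and print
# :a ... N; ba   otherwise append the next line until --> has been seen
# s/<!--.*-->//  then delete the collected comment
loop_result=$(printf '%s\n' "$sample" | sed '
/<!--/!b
:a
/-->/!{
N
ba
}
s/<!--.*-->//
')

printf '%s\n' "$loop_result"
```

The comment is gone, but the line it occupied remains as an empty line between `<a/>` and `<b/>`.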

sc0ttman wrote:This thing nearly does the job, except that it minifies stuff inside <pre> tags..

Which job? Where do you eliminate comments? And what "stuff inside <pre> tags". Do you want to preserve comments inside pre tags? What for? The browser wouldn't show them anyway, so you might delete them as well. Unless you provide a sample of your input it is hard to tell what you are after.

s243a wrote:I get the same output with:
Code: Select all
...some really ugly code here....

jamesbond wrote:Confirmed to work with gnu sed and busybox sed.

That's already a nice achievement. But here is my problem with all suggestions so far: They all assume that a line contains only 1 comment, which is a bold assumption. Surely I take the blame for not providing a better example and I will think of a better one. Generally speaking an XML document is whitespace agnostic. Linefeeds don't matter and even a huge HTML page can be written as a single line (e.g. Google does this). A pattern like <!--.*--> is greedy and would eliminate everything from the first <!-- to the last --> instead of catching only the next comment termination.
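
The greedy behaviour is easy to reproduce on a single line with two comments. This is a sketch with a made-up input and hypothetical variable names.

```shell
# Two comments on one line, with text that should be kept in between.
one_line='<!-- a -->keep<!-- b -->tail'

# .* is greedy: it runs from the first <!-- to the LAST --> on the line.
greedy_result=$(printf '%s\n' "$one_line" | sed 's/<!--.*-->//')

printf '[%s]\n' "$greedy_result"
```

Only `tail` is left; the word `keep` between the two comments is deleted along with them.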

step wrote:Just a reminder to also test a Windows-created HTML file, for which \r\n is the line termination sequence.

Thanks for the reminder. I can imagine that Mac documents are even more fun, as sed probably would treat the whole document as a single line.

@recobayu Thanks, but this is just too limited

@all I now cooked my own solutions, which appear to do what I want. I'll share them if they pass my acid tests. Let's see.

jamesbond · #26 Post by **jamesbond** » Wed 15 Jan 2020, 15:50

MochiMoppel wrote:That's already a nice achievement. But here is my problem with all suggestions so far: They all assume that a line contains only 1 comment, which is a bold assumption.

True enough.

Surely I take the blame for not providing a better example and I will think of a better one.

You did say it was for general HTML/XML. I will now take this to mean as __valid__ HTML/XML which does not allow nested comments.

My updated test case:

Code: Select all

<p>1</p>
<!--2-->3<br>
<!--
4
--><b>5</b><!--
-6 -7 --8 -9- <10> <-11-> <u>12</u>
-->13
<!-- 14 -->15<!-- <-16-> -->17

Expected output:

Code: Select all

<p>1</p>
3<br>
<b>5</b>13
1517

Here is my updated take on the challenge. Still works on busybox sed and gnu sed too.

Code: Select all

sed -r -e ':a;N;$!ba;s/<!--([^-]*|[^-]*-[^-]|[^-]*--[^>])*-->//g;' test.html

sc0ttman · #27 Post by **sc0ttman** » Wed 15 Jan 2020, 19:39

MochiMoppel wrote:
sc0ttman wrote:This thing nearly does the job, except that it minifies stuff inside <pre> tags..
Which job? Where do you eliminate comments? And what "stuff inside <pre> tags". Do you want to preserve comments inside pre tags? What for? The browser wouldn't show them anyway, so you might delete them as well. Unless you provide a sample of your input it is hard to tell what you are after.

Sorry, I should have been more clear, I'm posting "off topic" .. not even attempting to "remove comments"...

So.. I mean it "does the job" of minifying HTML.. Nothing to do with removing comments... Though it is related (I also want to remove comments at some point), hence me posting here..

So the snippet I posted does the job of minifying HTML, except that it _also_ minifies the contents of <pre> tags... which I don't want...

Carry on ....

s243a · #28 Post by **s243a** » Wed 15 Jan 2020, 20:44

sc0ttman wrote:
MochiMoppel wrote:
sc0ttman wrote:This thing nearly does the job, except that it minifies stuff inside <pre> tags..
Which job? Where do you eliminate comments? And what "stuff inside <pre> tags". Do you want to preserve comments inside pre tags? What for? The browser wouldn't show them anyway, so you might delete them as well. Unless you provide a sample of your input it is hard to tell what you are after.
Sorry, I should have been more clear, I'm posting "off topic" .. not even attempting to "remove comments"...

So.. I mean it "does the job" of minifying HTML.. Nothing to do with removing comments... Though it is related (I also want to remove comments at some point), hence me posting here..

So the snippet I posted does the job of minifying HTML, except that it _also_ minifies the contents of <pre> tags... which I don't want...

Carry on ....

Did you try my suggestion above, which was removing the backslash before the ".*" inside the pre tags? If you give us some test input then we can try some tests.

sc0ttman · #29 Post by **sc0ttman** » Wed 15 Jan 2020, 21:37

I didn't really try anything - it's not my snippet, and already way beyond anything I know about sed (next to nothing)...

And a valid test case would be any HTML file from mdsh that contains highlighted code like this one: https://sc0ttj.github.io/mdsh/posts/201 ... uages.html

jamesbond · #30 Post by **jamesbond** » Thu 16 Jan 2020, 03:19

Scott, I've read your few posts, and I still don't get it. Perhaps it would be good if you could give us a sample input and the expected output, as well as the "incorrect output" produced by the currently not-working script, so we can get an idea of what it is that you want to do. As it stands, the current sed script will more or less empty out the text in-between html tags - leaving basically a blank page full of tags but no text in between. I'm not sure whether that counts as "minify". (I heard of minifying javascript, but minifying html is news to me ...).

s243a · #31 Post by **s243a** » Thu 16 Jan 2020, 03:28

jamesbond wrote:Scott, I've read your few posts, and I still don't get it. Perhaps it would be good if you could give us a sample input and the expected output, as well as the "incorrect output" produced by the currently not-working script, so we can get an idea of what it is that you want to do. As it stands, the current sed script will more or less empty out the text in-between html tags - leaving basically a blank page full of tags but no text in between. I'm not sure whether that counts as "minify". (I heard of minifying javascript, but minifying html is news to me ...).

To me it looks like it will only empty out the space between tags if the space between tags is whitespace:

'\s'
Matches whitespace characters (spaces and tabs). Newlines embedded
in the pattern/hold spaces will also match:

https://www.gnu.org/software/sed/manual/sed.txt
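
A quick check of this reading, using the substitution from Scott's script on a made-up line. It is a sketch that assumes GNU sed, since `\s` is a GNU extension.

```shell
# Whitespace between tags, plus real text inside the <li> elements.
html='<ul>   <li>a b</li>  <li>c</li> </ul>'

# Only runs of pure whitespace between > and < are removed.
squeezed=$(printf '%s\n' "$html" | sed 's@>\s*<@><@g')

printf '%s\n' "$squeezed"
```

The text `a b` and `c` is kept; only the gaps between a closing `>` and the next `<` are closed up.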

However, if we are truly trying to minimize the HTML shouldn't we also delete the enclosing tags?

MochiMoppel · #32 Post by **MochiMoppel** » Thu 16 Jan 2020, 06:56

jamesbond wrote:Here is my updated take on the challenge. Still works on busybox sed and gnu sed too.
Code: Select all
sed -r -e ':a;N;$!ba;s/<!--([^-]*|[^-]*-[^-]|[^-]*--[^>])*-->//g;' test.html

Wow! Well done! That's an impressive pattern. Took me a while to digest.

At first it seemed perfect but it choked on these 2 variations of your test cases
Case1:

Code: Select all

<!-- 14 -->15<!-- <-16-> -->17

Expected output:

Code: Select all

1517

Output received:

Code: Select all

<!-- 14 -->15<!-- <-16-> -->17

Case2:

Code: Select all

<p>1</p>
<!----NEW---->
text<!----NEW---->
<!--2-->3<br> 
<!-- 
4 
--><b>5</b><!-- 
-6 -7 --8 -9- <10> <-11-> <u>12</u> 
-->13 
<!-- 14 -->15<!-- <-16-> -->17

Expected output:

Code: Select all

<p>1</p>

text
3<br> 
<b>5</b>13 
1517

Output received:

Code: Select all

<p>1</p>
3<br> 
<b>5</b>13 
1517

My own homebrew may look less sophisticated but so far it passed all tests (well, as expected fails on Mac files. Can be fixed. For the time being let's keep it simple):

Code: Select all

sed $':a;$!{N;ba;};s/<!--/\x1/g;s/-->/\x2/g;s/\x1[^\x1]*\x2//g'  test.html
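
The marker idea can be tried out as follows. This is a sketch, not a replacement for the one-liner above: the function and variable names are made up, the control characters are produced with `printf` instead of bash's `$'...'` quoting, and the read-everything loop is written as separate `-e` pieces.

```shell
# One-byte stand-ins for the comment delimiters (same idea as \x1 and \x2).
SOH=$(printf '\001')
STX=$(printf '\002')

# Hypothetical helper; reads HTML on stdin and writes it without comments.
strip_comments() {
  # 1. read the whole input into the pattern space
  # 2. turn <!-- into SOH and --> into STX
  # 3. delete from each SOH up to the nearest following STX
  sed -e ':a' -e '$!{N;ba' -e '}' \
      -e 's/<!--/'"$SOH"'/g' \
      -e "s/-->/$STX/g" \
      -e "s/$SOH[^$SOH]*$STX//g"
}

one_line=$(printf '%s\n' '<!-- 14 -->15<!-- <-16-> -->17' | strip_comments)
multi_line=$(printf '%s\n' '<p>1</p>' '<!--' '4' '--><b>5</b>' | strip_comments)

printf '%s\n' "$one_line" "$multi_line"
```

Because `[^SOH]*` cannot cross the start of the next comment, each match stops inside its own comment, which is what the greedy `.*` could not guarantee.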

Now it's your turn to break it

Burunduk · #33 Post by **Burunduk** » Thu 16 Jan 2020, 20:49

Keef wrote:Is this any good?
https://stackoverflow.com/questions/405 ... ing-regexp

Yes, this link contains another link: https://catonmat.net/sed-one-liners-explained-part-one -- an interesting article I've never come across before. Thank you.

sc0ttman wrote: I'd love to get this working:

An HTML minifier...

This thing nearly does the job, except that it minifies stuff inside <pre> tags...

If I understand the task correctly, this sed script should remove gaps between tags as well as line feeds except inside (possibly nested) <pre> tags:

Code: Select all

sed ':a;$!{N;ba;};s/@/@a/g;s/\n/@n/g;s/<pre/\n&/g;s/<\/pre>/&\n/g' test.html \
  | sed -r '/(^<pre|<\/pre>$)/!{s/@n//g;s/>\s+</></g;}' \
  | sed ':a;$!{N;ba;};s/\n//g;s/@n/\n/g;s/@a/@/g' >min.html
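
To follow what the three stages do, the pipeline can be wrapped in a function and fed a small sample. This is a sketch: the function name and the sample are made up, the commands are the ones above reading stdin instead of test.html, and GNU sed is assumed (`\s`, and `\n` in replacements).

```shell
# Hypothetical wrapper around the three-stage pipeline.
minify_keep_pre() {
  # Stage 1: escape @ as @a, turn every newline into @n,
  #          then put each <pre>...</pre> block on a line of its own.
  # Stage 2: on lines that are not a <pre> block, drop the @n markers
  #          and the whitespace between tags.
  # Stage 3: join the lines again and restore the newlines and @ signs
  #          that were protected inside <pre>.
  sed ':a;$!{N;ba;};s/@/@a/g;s/\n/@n/g;s/<pre/\n&/g;s/<\/pre>/&\n/g' \
    | sed -r '/(^<pre|<\/pre>$)/!{s/@n//g;s/>\s+</></g;}' \
    | sed ':a;$!{N;ba;};s/\n//g;s/@n/\n/g;s/@a/@/g'
}

sample='<div>
  <p>hi</p>
<pre>a
  b</pre>
</div>'

minified=$(printf '%s\n' "$sample" | minify_keep_pre)
printf '%s\n' "$minified"
```

Everything outside `<pre>` ends up on one line, while the line break and indentation inside `<pre>` are preserved.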

Three sed commands in a row! I think this code itself needs to be minified.

jamesbond wrote:Here is my updated take on the challenge. Still works on busybox sed and gnu sed too.
Code: Select all
sed -r -e ':a;N;$!ba;s/<!--([^-]*|[^-]*-[^-]|[^-]*--[^>])*-->//g;' test.html

This is clever. It has a problem though and MochiMoppel's test revealed it. The 3rd alternative eats too many hyphens at the end of a comment:

Code: Select all

# echo '<!--remove-me-not--->' | sed -r -e ':a;N;$!ba;s/<!--([^-]*|[^-]*-[^-]|[^-]*--[^>])*-->//g;'
<!--remove-me-not--->

That can be fixed by adding any other character to break the series of hyphens:

Code: Select all

sed -r -e ':a;$!{N;ba;};s/-->/@&/g;s/<!--(-?[^-]|--[^>])*-->//g;' test.htm

MochiMoppel wrote:
Code: Select all
sed $':a;$!{N;ba;};s/<!--/\x1/g;s/-->/\x2/g;s/\x1[^\x1]*\x2//g'  test.html
Now it's your turn to break it

You know it's unbreakable!

Maybe just one unlikely-to-appear-in-the-input-file character is enough here:

Code: Select all

sed $':a;$!{N;ba;};s/-->/\x1/g;s/<!--[^\x1]*\x1//g'  test.html

And here is my own attempt (now obviously superfluous):

Code: Select all

sed ':a;$!{N;ba;};:c;/<!--/s/-->/&&/;s/<!--.*-->-->//;tc' test.html

jamesbond · #34 Post by **jamesbond** » Fri 17 Jan 2020, 15:41

@Mochi/@Burunduk: Thanks for the entertainment and insight. I admit defeat

Yours are simple, yet effective, and most importantly, clear and easy to understand.

(PS: @Mochi, it's easy to break yours - just pepper 0x01 and 0x02 in the html and those will get deleted when they shouldn't; but normal html files __won't__ have these in them so the point is moot - the script works as intended for normal HTML, so as far as normal HTML files are concerned, this is now a solved problem).

---

Next, we probably should tackle Scott's request (all the comments about deleting the whitespace are correct, I missed the "\s" in the script) if you guys still want to play

MochiMoppel · #35 Post by **MochiMoppel** » Sat 18 Jan 2020, 03:23

Burunduk wrote:You know it's unbreakable!

Yeah, as unbreakable as the windows of Elon Musk's Cybertruck

Don't you worry, I just broke it

Maybe just one unlikely-to-appear-in-the-input-file character is enough here:

You are right when you consider my flawed code. For making the code bullet-proof, as I originally intended, I need the second character.

And here is my own attempt (now obviously superfluous):

Nothing is superfluous in this world. Thank you for introducing sed's t command. It shows that sed is able to perform while...do loops and helps to fix my code.
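
The loop behaviour of `t` can be seen in isolation with a toy example that has nothing to do with HTML. It is a sketch with a made-up input: each pass removes one innermost pair of parentheses, and `t` jumps back to the label only if the preceding `s` changed something.

```shell
# :c  marks the top of the loop
# s   removes one innermost (...) group (no parentheses inside it)
# tc  goes back to :c only if s made a substitution, otherwise falls through
peeled=$(echo 'a((b)c)d' | sed -e ':c' -e 's/([^()]*)//' -e 'tc')

printf '%s\n' "$peeled"
```

The first pass removes `(b)`, the second removes `(c)`, the third finds nothing and the loop ends, leaving `ad`.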

Let's raise the bar one notch higher and enclose text that contains a comment with another comment, effectively creating a nested comment. Nested comments may be invalid, but they are a fact of life. It's easy to select and comment out a large portion of XML text without noticing that this block already contains comments. With the same ease it is possible to delete a portion and forget to include an opening or closing tag, thus creating an orphan.

In case of nested comments my code will leave orphans. With Burunduk's "superfluous" code it will even create an infinite loop and will not work at all.

So here is my attempt to fix all problems and build Cybertruck 2.0: (changes marked)

sed -r $':a;$!{N;ba;};s/\\r\\n?/\\n/g;s/<!--/\x1/g;s/-->/\x2/g;:c;s/\x1[^\x1\x2]*\x2//g;tc;s/\x1|\x2//g' test.html

The first change converts any Mac or Win line endings to Unix style.
The second change uses Burunduk's loop idea and peels nested comment onions from inside out.
Lastly any orphans are removed.
That should do it. Unbreakable. Reminds me of my face masks ("Keeps out 99.9% of all viruses"). Still leaves me with the chance to catch "only" every 1000th virus.

[EDIT]: This didn't last long. Script fails on Scott's linked page which contains a weird comment in the DOCTYPE section:

<html lang="en">

What is that? The browser sees a closing tag, my script sees an opening tag. OK, if I change the replacement order my script will behave like a browser and leave only <html lang="en"> uncommented:

sed -r $':a;$!{N;ba;};s/\\r\\n?/\\n/g;s/-->/\x2/g;s/<!--/\x1/g;:c;s/\x1[^\x1\x2]*\x2//g;tc;s/\x1|\x2//g' test.htm
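
The effect of the loop and the final orphan removal can be checked on a deliberately broken line. This is a sketch under assumptions: the function and variable names are made up, the markers are produced with `printf` rather than `$'...'`, the script is split into `-e` pieces, and GNU sed is assumed for `\r` and for `\n` in the replacement.

```shell
SOH=$(printf '\001')   # stands in for <!--
STX=$(printf '\002')   # stands in for -->

# Hypothetical helper; same steps as the one-liner above.
strip_nested_comments() {
  # read all input, normalise CR and CRLF, mark --> first and <!-- second,
  # peel innermost comments in a loop, then drop any orphaned markers
  sed -E -e ':a' -e '$!{N;ba' -e '}' \
      -e 's/\r\n?/\n/g' \
      -e "s/-->/$STX/g" \
      -e 's/<!--/'"$SOH"'/g' \
      -e ':c' -e "s/$SOH[^$SOH$STX]*$STX//g" -e 'tc' \
      -e "s/$SOH|$STX//g"
}

# A nested comment followed by an orphaned closing delimiter.
nested=$(printf '%s\n' 'a<!-- x <!-- y --> z -->b-->c' | strip_nested_comments)

printf '%s\n' "$nested"
```

The first pass removes the inner comment, the second pass removes the outer one, and the last command deletes the leftover `-->`, so only `abc` remains.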

jamesbond wrote:@Mochi, it's easy to break yours - just pepper 0x01 and 0x02 in the html

Pretty difficult to create intentionally or accidentally. I would consider this to be a corrupted file, in which case eliminating comments should be the least concern of the user

HerrBert · #36 Post by **HerrBert** » Sun 19 Jan 2020, 18:05

I did not read the whole thread, but maybe this could be of interest too:
http://sed.sourceforge.net/sed1line.txt

sc0ttman · #37 Post by **sc0ttman** » Sun 19 Jan 2020, 23:31

I completely forgot about the IE conditional comments - they shouldn't be removed, or minified ... Should be ignored..

I can't remember any other caveats...

But it does remind me to revisit the conditional comments and go with something a little simpler...

EDIT:

Yep, this seems to work for me (minifies the HTML):

Code: Select all

sed ':a;$!{N;ba;};s/@/@a/g;s/\n/@n/g;s/<pre/\n&/g;s/<\/pre>/&\n/g' test.html \
  | sed -r '/(^<pre|<\/pre>$)/!{s/@n//g;s/>\s+</></g;}' \
  | sed ':a;$!{N;ba;};s/\n//g;s/@n/\n/g;s/@a/@/g' >min.html

Thanks very much Burunduk

...now onto getting a sed based CSS minifier that can remove multi-line comments, based on the above..

This CSS minifier fails on appended and multi-line comments, and is probably crap in 10 other ways:

Code: Select all

    cat $css_bundle \
      | grep -v '/\*' \
      | tr -d '\n' \
      | sed -e '/\/\*/,/\*\//d' \
            -e 's/  / /g' \
            -e 's/ {/{/g' \
            -e 's/{ /{/g' \
            -e 's/ }/}/g' \
            -e 's/: /:/g' \
            -e 's/; /;/g' > "${css_file//.css/.min.css}"

($css_bundle is a space separated list of valid CSS files)
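
For the multi-line comment part, the single-marker trick from the HTML discussion might carry over to CSS. The following is only a sketch under that assumption: the function name and the sample are made up, and odd input such as `/*/` is not handled.

```shell
SOH=$(printf '\001')   # one-byte stand-in for the closing */

# Hypothetical helper; reads CSS on stdin and removes /* ... */ comments.
strip_css_comments() {
  # read all input, replace every */ with the marker,
  # then delete from each /* up to the nearest marker
  sed -e ':a' -e '$!{N;ba' -e '}' \
      -e "s#\*/#$SOH#g" \
      -e "s#/\*[^$SOH]*$SOH##g"
}

css='a{color:red}/* one */
/* two
lines */
b{margin:0}'

css_result=$(printf '%s\n' "$css" | strip_css_comments)
printf '%s\n' "$css_result"
```

Both the appended comment and the two-line comment are removed; the empty line left behind would still have to be squeezed out by a later step.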

...I really need to go learn how sed actually works..

MochiMoppel · #38 Post by **MochiMoppel** » Mon 20 Jan 2020, 08:00

sc0ttman wrote:Yep, this seems to work for me (minifies the HTML)

Looks wrong to me. Not Burunduk's fault as he probably assumed that you would like all linefeeds removed, except those in <pre> tags, which in well constructed HTML pages would be no problem. In the case of your test page linefeeds are present in <p> tags and must not be removed, otherwise your page produces text like

If you have TCC installed, you can evenembed C code in your Markdown

sc0ttman · #39 Post by **sc0ttman** » Mon 20 Jan 2020, 08:47

MochiMoppel wrote:
sc0ttman wrote:Yep, this seems to work for me (minifies the HTML)
Looks wrong to me. Not Burunduk's fault as he probably assumed that you would like all linefeeds removed, except those in <pre> tags, which in well constructed HTML pages would be no problem. In the case of your test page linefeeds have also to be preserved within <p> tags (or preferably changed to spaces), otherwise your page produces text like
If you have TCC installed, you can evenembed C code in your Markdown

EDIT: It's prettier.js that removes the trailing spaces from the source Markdown.. I disabled it ..

Fixed.. rebuilt a local version of the page without that little annoyance... Probably
also improves the screen reader experience.

..Anyway, I spotted other issues Burunduk's code has yesterday (not stripping whitespace inside <a> tags), but I can live with it as
it is TBH - HTML minification would only mainly be for huge pages (over 500kb of HTML or so) - but
obviously an improved answer would remove newlines outside of pre tags generally.

MochiMoppel · #40 Post by **MochiMoppel** » Mon 20 Jan 2020, 10:58

sc0ttman wrote:It's prettier.js that removes the trailing spaces from the source Markdown.. I disabled it ..

I have no clue what you are talking about. What trailing spaces?

..Anyway, I spotted other issues Burunduk's code has yesterday (not stripping whitespace inside <a> tags), but I can live with it as

You mean the 10 spaces between consecutive <a> tags?

Code: Select all

          <a href="/mdsh/tags/seo.html">seo</a>,
          <a href="/mdsh/tags/shell.html">shell</a>,
          <a href="/mdsh/tags/xml.html">xml</a>,

This looks like garbage and is not removed because Burunduk may have tried to guess your requirements from your first post. Your original script (s@>\s*<@><@g) was designed to remove pure whitespace between tags, i.e. spaces, tabs or linefeeds, no other characters. You said that this is what you want, except that you don't want to apply this to <pre> tags. This is basically what Burunduk delivered. As soon as you put any other character between tags, even a single comma, nothing is or should be removed. Apart from the funny <a> tag spacings there is more questionable code, e.g. the seemingly useless '<span></span>' combos, that could be removed.

Wouldn't it be much more effective if you clean the HTML code first? With a clean HTML design there will be not much left to do for a minify script.

(old)Puppy Linux Discussion Forum

Re: html minifier in sed