sed scratch pad -- A thread of sed examples
Just a reminder to also test a Windows-created HTML file, for which \r\n is the line termination sequence. (I didn't but I remember being scorched about this before).
[url=http://murga-linux.com/puppy/viewtopic.php?t=117546]Fatdog64-810[/url]|[url=http://goo.gl/hqZtiB]+Packages[/url]|[url=http://goo.gl/6dbEzT]Kodi[/url]|[url=http://goo.gl/JQC4Vz]gtkmenuplus[/url]
Re: html minifier in sed
I'll think about the general problem more later, but for now I notice that in "<pre>\.*<\/pre>" you are escaping the period. I think what you actually want is "<pre>.*<\/pre>" (notice the period is not escaped), because even with basic regular expressions the period character still has its special meaning, and in this case we want that special meaning, so we don't want to escape it.
sc0ttman wrote: I'd love to get this working:
An HTML minifier...
This thing nearly does the job, except that it minifies stuff inside <pre> tags...
I would love love love to fix that!!
Code: Select all
function minify_html {
  # temp fix to IFS, just in case the html files contain spaces
  OLD_IFS=$IFS
  IFS="
"
  for html_file in $html_files
  do
    :
    # dont minify HTML until we can skip contents of <pre>..</pre>
    #sed ':a;N;$!ba;/<div class="highlight"><pre>\.*<\/pre><\/div>/! s@>\s*<@><@g' $html_file > ${html_file//.html/.minhtml}
    #mv ${html_file//.html/.minhtml} ${html_file}
  done
  IFS=$OLD_IFS
}
https://www.gnu.org/software/sed/manual ... BRE-vs-ERE
In GNU sed, the only difference between basic and extended regular expressions is in the behavior of a few special characters: ‘?’, ‘+’, parentheses, braces (‘{}’), and ‘|’.
With basic (BRE) syntax, these characters do not have special meaning unless prefixed with a backslash (‘\’); While with extended (ERE) syntax it is reversed: these characters are special unless they are prefixed with backslash (‘\’).
You can test this with something like:
# echo abc | sed 's/a.c//'
BTW, why do we need the "div" tags in the above expression?
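To make the escaped vs. unescaped period concrete, here is a minimal sketch. The function names are made up for illustration; the only assumption is a POSIX-style sed, where an unescaped period matches any single character and `\.` matches only a literal period.

```shell
# Hypothetical helpers, not part of anyone's script in this thread.

# Unescaped period: "." matches any single character.
unescaped_dot() { printf '%s\n' "$1" | sed 's/a.c/X/'; }

# Escaped period: "\." matches only a literal ".".
escaped_dot() { printf '%s\n' "$1" | sed 's/a\.c/X/'; }

unescaped_dot abc   # the "." matches "b", so the whole string is replaced
escaped_dot abc     # no literal period in the input, nothing is replaced
escaped_dot a.c     # the literal period matches
```

So in "<pre>\.*<\/pre>" the `\.*` part can only match a run of literal periods, which is why it never spans the content of a <pre> block.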
Find me on [url=https://www.minds.com/ns_tidder]minds[/url] and on [url=https://www.pearltrees.com/s243a/puppy-linux/id12399810]pearltrees[/url].
Re: html minifier in sed
s243a wrote: BTW, why do we need the "div" tags in the above expression?
mdsh/pygments generates divs around pre tags.. this minifier is for mdsh.
[b][url=https://bit.ly/2KjtxoD]Pkg[/url], [url=https://bit.ly/2U6dzxV]mdsh[/url], [url=https://bit.ly/2G49OE8]Woofy[/url], [url=http://goo.gl/bzBU1]Akita[/url], [url=http://goo.gl/SO5ug]VLC-GTK[/url], [url=https://tiny.cc/c2hnfz]Search[/url][/b]
MochiMoppel wrote: Looks like another abandoned thread
I'll give it a try anyway since I don't know where to ask.
The challenge is to remove all comments from an XML/HTML document, using only sed.
Example text:
Code: Select all
<JWM>
<Tray autohide="false" insert="right" x="0" y="-1" border="1" height="28" >
<!-- Additional TrayButton attribute: label -->
<TrayButton label="Menu" icon="logo-mini.png" border="true">root:3</TrayButton>
border="true">exec:urxvt</TrayButton>
<Pager/>
<!-- Additional TaskList attribute: maxwidth -->
<TaskList maxwidth="200"/>
<Dock/>
<!-- Additional Swallow attribute: height -->
<!--
<Swallow name="blinky">
blinkydelayed -bg "#DCDAD5"
</Swallow>
-->
<!--
<Swallow name="xtmix-launcher">
xtmix -launch
</Swallow>
-->
<!--
<Swallow name="asapm">
asapmshell -u 4
</Swallow>
-->
<!--
<Swallow name="freememapplet" width="34">
freememappletshell
</Swallow>
-->
<Swallow name="xload" width="32">
xload -nolabel -bg "#888888" -fg red -hl white
</Swallow>
<Clock format="%H:%M">minixcal</Clock>
</Tray>
</JWM>
The problem is that these comments can be multiline. My rough idea is to let sed move a line to the hold buffer when a '<!--' tag is detected, then continue to fill the hold buffer until a '-->' is detected, load the hold buffer into the pattern space and remove the comment, clear the hold buffer and continue with the next cycle. May not be the right way and I'm not even close to achieving the goal. Does anybody know how to do this?
I use this code, just one line, but it only deletes the <!-- and --> if they are on different lines.
Code: Select all
#sed -e '/<!--/,/-->/d' xml
<JWM>
<Tray autohide="false" insert="right" x="0" y="-1" border="1" height="28" >
<TaskList maxwidth="200"/>
<Dock/>
<Swallow name="xload" width="32">
xload -nolabel -bg "#888888" -fg red -hl white
</Swallow>
<Clock format="%H:%M">minixcal</Clock>
</Tray>
</JWM>
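A small sketch of why the range form loses lines such as the <TrayButton> and <Pager/> entries in the output above. When the second address of a range is a regex, sed starts looking for it on the line after the one that opened the range, so a one-line comment keeps the range open until some later line contains "-->". The function name and the tiny input are made up for illustration.

```shell
# Hypothetical wrapper around the one-liner posted above.
range_delete() { sed -e '/<!--/,/-->/d'; }

# The one-line comment opens the range; "<b/>" is deleted along with
# the following multi-line comment, because the range only closes at
# the next line that contains "-->".
printf '%s\n' '<a/>' '<!-- one-line comment -->' '<b/>' '<!--' 'multi' '-->' '<c/>' | range_delete
```

Only `<a/>` and `<c/>` survive here, although `<b/>` was never inside a comment.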
- MochiMoppel
- Posts: 2084
- Joined: Wed 26 Jan 2011, 09:06
- Location: Japan
Keef wrote: Is this any good?
https://stackoverflow.com/questions/405 ... ing-regexp
If it solves the problem it can't be that bad, right?
But frankly it's not really good: a useless cat, a useless '?' in <!--.*?--> and a strange positioning of the :a label. Still, the idea is nice. No pattern space <-> hold space acrobatics, just a clever use of a label. The next suggestion in your link is better:
Code: Select all
sed -r '
/<!--/!b
:a
/-->/! {N;ba}
s/<!--.*-->//
' "$TESTFILE"
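For readers new to labels, here is the same loop wrapped in a function that reads stdin, with each step commented. This is only a sketch: the function name is invented, the -r flag is dropped because the script needs no extended syntax, and the braces are spread over several lines to stay portable.

```shell
# Hypothetical wrapper; same logic as the script quoted above.
strip_comment_loop() {
  sed '
# Lines without a comment opener: branch to the end of the script.
/<!--/!b
:a
# No closing "-->" in the pattern space yet: append the next line, retry.
/-->/!{
N
ba
}
# Opener and closer are now both in the pattern space: delete the comment.
s/<!--.*-->//
'
}

printf '%s\n' 'a<!-- x' 'y -->b' 'c' | strip_comment_loop
```

The comment that spans the first two input lines is removed and the rest is printed unchanged.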
sc0ttman wrote: This thing nearly does the job, except that it minifies stuff inside <pre> tags..
Which job? Where do you eliminate comments? And what "stuff inside <pre> tags"? Do you want to preserve comments inside pre tags? What for? The browser wouldn't show them anyway, so you might delete them as well. Unless you provide a sample of your input it is hard to tell what you are after.
s243a wrote:I get the same output with:Code: Select all
...some really ugly code here....
jamesbond wrote: Confirmed to work with gnu sed and busybox sed.
That's already a nice achievement. But here is my problem with all suggestions so far: They all assume that a line contains only 1 comment, which is a bold assumption. Surely I take the blame for not providing a better example and I will think of a better one. Generally speaking an XML document is whitespace agnostic. Linefeeds don't matter and even a huge HTML page can be written as a single line (e.g. Google does this). A pattern like <!--.*--> is greedy and would eliminate everything from the first <!-- up to the last --> instead of catching only the next comment termination.
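The greediness is easy to see on a single line with two comments. A minimal sketch (the function name and the sample line are made up):

```shell
# Hypothetical helper: the naive greedy pattern on its own.
greedy_strip() { sed 's/<!--.*-->//'; }

# ".*" runs from the first "<!--" to the LAST "-->", so the text
# between the two comments is deleted as well.
printf '%s\n' 'a<!--1-->KEEP<!--2-->b' | greedy_strip
```

The word KEEP sits between the comments, not inside one, yet it disappears together with them.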
step wrote: Just a reminder to also test a Windows-created HTML file, for which \r\n is the line termination sequence.
Thanks for the reminder. I can imagine that Mac documents are even more fun, as sed probably would treat the whole document as a single line.
@recobayu Thanks, but this is just too limited
@all I now cooked my own solutions, which appear to do what I want. I'll share them if they pass my acid tests. Let's see.
MochiMoppel wrote: That's already a nice achievement. But here is my problem with all suggestions so far: They all assume that a line contains only 1 comment, which is a bold assumption.
True enough.
MochiMoppel wrote: Surely I take the blame for not providing a better example and I will think of a better one.
You did say it was for general HTML/XML. I will now take this to mean __valid__ HTML/XML, which does not allow nested comments.
My updated test case:
Code: Select all
<p>1</p>
<!--2-->3<br>
<!--
4
--><b>5</b><!--
-6 -7 --8 -9- <10> <-11-> <u>12</u>
-->13
<!-- 14 -->15<!-- <-16-> -->17
Expected output:
Code: Select all
<p>1</p>
3<br>
<b>5</b>13
1517
Here is my updated take on the challenge. Still works on busybox sed and gnu sed too.
Code: Select all
sed -r -e ':a;N;$!ba;s/<!--([^-]*|[^-]*-[^-]|[^-]*--[^>])*-->//g;' test.html
Fatdog64 forum links: [url=http://murga-linux.com/puppy/viewtopic.php?t=117546]Latest version[/url] | [url=https://cutt.ly/ke8sn5H]Contributed packages[/url] | [url=https://cutt.ly/se8scrb]ISO builder[/url]
sc0ttman wrote: This thing nearly does the job, except that it minifies stuff inside <pre> tags..
MochiMoppel wrote: Which job? Where do you eliminate comments? And what "stuff inside <pre> tags". Do you want to preserve comments inside pre tags? What for? The browser wouldn't show them anyway, so you might delete them as well. Unless you provide a sample of your input it is hard to tell what you are after.
Sorry, I should have been more clear, I'm posting "off topic" .. not even attempting to "remove comments"...
So.. I mean it "does the job" of minifying HTML.. Nothing to do with removing comments... Though it is related (I also want to remove comments at some point), hence me posting here..
So the snippet I posted does the job of minifying HTML, except that it _also_ minifies the contents of <pre> tags... which I don't want...
Carry on ....
sc0ttman wrote: So the snippet I posted does the job of minifying HTML, except that it _also_ minifies the contents of <pre> tags... which I don't want...
Did you try my suggestion above, which was removing the backslash before the ".*" inside the pre tags? If you give us some test input then we can try some tests.
I didn't really try anything - it's not my snippet, and already way beyond anything I know about sed (next to nothing)...
And a valid test case would be any HTML file from mdsh that contains highlighted code like this one: https://sc0ttj.github.io/mdsh/posts/201 ... uages.html
Scott, I've read your few posts, and I still don't get it. Perhaps it would be good if you could give us a sample input and the expected output, as well as the "incorrect output" produced by the currently not-working script, so we can get an idea of what it is that you want to do. As it stands, the current sed script will more or less empty out the text in between html tags, leaving basically a blank page full of tags but no text in between. I'm not sure whether that counts as "minify". (I heard of minifying javascript, but minifying html is news to me ...).
jamesbond wrote: As it stands, the current sed script will more or less empty out text in-between html tags - leaving basically a blank page full of tags but no text in between.
To me it looks like it will only empty out the space between tags if the space between tags is whitespace:
https://www.gnu.org/software/sed/manual/sed.txt
'\s'
Matches whitespace characters (spaces and tabs). Newlines embedded
in the pattern/hold spaces will also match:
However, if we are truly trying to minimize the HTML shouldn't we also delete the enclosing tags?
- MochiMoppel
jamesbond wrote: Here is my updated take on the challenge. Still works on busybox sed and gnu sed too.
Code: Select all
sed -r -e ':a;N;$!ba;s/<!--([^-]*|[^-]*-[^-]|[^-]*--[^>])*-->//g;' test.html
Wow! Well done! That's an impressive pattern. Took me a while to digest.
At first it seemed perfect but it choked on these 2 variations of your test cases
Case1:
Code: Select all
<!-- 14 -->15<!-- <-16-> -->17
Code: Select all
1517
Code: Select all
<!-- 14 -->15<!-- <-16-> -->17
Case2:
Code: Select all
<p>1</p>
<!----NEW---->
text<!----NEW---->
<!--2-->3<br>
<!--
4
--><b>5</b><!--
-6 -7 --8 -9- <10> <-11-> <u>12</u>
-->13
<!-- 14 -->15<!-- <-16-> -->17
Expected output:
Code: Select all
<p>1</p>
text
3<br>
<b>5</b>13
1517
Actual output:
Code: Select all
<p>1</p>
3<br>
<b>5</b>13
1517
Code: Select all
sed $':a;$!{N;ba;};s/<!--/\x1/g;s/-->/\x2/g;s/\x1[^\x1]*\x2//g' test.html
Keef wrote: Is this any good?
https://stackoverflow.com/questions/405 ... ing-regexp
Yes, this link contains another link: https://catonmat.net/sed-one-liners-explained-part-one -- an interesting article I've never come across before. Thank you.
sc0ttman wrote: I'd love to get this working:
An HTML minifier...
This thing nearly does the job, except that it minifies stuff inside <pre> tags...
If I understand the task correctly, this sed script should remove gaps between tags as well as line feeds except inside (possibly nested) <pre> tags:
Code: Select all
sed ':a;$!{N;ba;};s/@/@a/g;s/\n/@n/g;s/<pre/\n&/g;s/<\/pre>/&\n/g' test.html \
| sed -r '/(^<pre|<\/pre>$)/!{s/@n//g;s/>\s+</></g;}' \
| sed ':a;$!{N;ba;};s/\n//g;s/@n/\n/g;s/@a/@/g' >min.html
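For anyone trying to follow the three stages, here is the same pipeline as a stdin-to-stdout function with comments. This is a sketch of my reading of it, not an authoritative description; the function name is invented, and it assumes GNU sed (for `\n` in replacements, `\s` and `-r`).

```shell
# Hypothetical wrapper around the three-stage pipeline quoted above.
minify_keep_pre() {
  # Stage 1: slurp the file, escape "@" as "@a", turn every newline into
  # the marker "@n", then put each <pre>...</pre> block on a line of its own.
  sed ':a;$!{N;ba;};s/@/@a/g;s/\n/@n/g;s/<pre/\n&/g;s/<\/pre>/&\n/g' |
  # Stage 2: on lines that are not <pre> blocks, drop the newline markers
  # and the whitespace between tags.
  sed -r '/(^<pre|<\/pre>$)/!{s/@n//g;s/>\s+</></g;}' |
  # Stage 3: join everything again, restore the newlines that were kept
  # inside <pre>, and undo the "@" escaping.
  sed ':a;$!{N;ba;};s/\n//g;s/@n/\n/g;s/@a/@/g'
}

printf '%s\n' '<p>a</p>' '  <p>b</p>' '<pre>x' '  y</pre>' '<p>c</p>' | minify_keep_pre
```

The line break and indentation inside the <pre> block survive, while the breaks and gaps between the other tags are removed. Escaping "@" first is what keeps a literal "@n" in the input from being mistaken for a newline marker.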
jamesbond wrote: Here is my updated take on the challenge. Still works on busybox sed and gnu sed too.
Code: Select all
sed -r -e ':a;N;$!ba;s/<!--([^-]*|[^-]*-[^-]|[^-]*--[^>])*-->//g;' test.html
This is clever. It has a problem though and MochiMoppel's test revealed it. The 3rd alternative eats too many hyphens at the end of a comment:
Code: Select all
# echo '<!--remove-me-not--->' | sed -r -e ':a;N;$!ba;s/<!--([^-]*|[^-]*-[^-]|[^-]*--[^>])*-->//g;'
<!--remove-me-not--->
Code: Select all
sed -r -e ':a;$!{N;ba;};s/-->/@&/g;s/<!--(-?[^-]|--[^>])*-->//g;' test.htm
MochiMoppel wrote:
Code: Select all
sed $':a;$!{N;ba;};s/<!--/\x1/g;s/-->/\x2/g;s/\x1[^\x1]*\x2//g' test.html
Now it's your turn to break it
You know it's unbreakable! Maybe just one unlikely-to-appear-in-the-input-file character is enough here:
Code: Select all
sed $':a;$!{N;ba;};s/-->/\x1/g;s/<!--[^\x1]*\x1//g' test.html
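A sketch of the one-placeholder idea that avoids the bash-only `$'...'` quoting by building the control character with printf. The function and variable names are made up, and the whole approach assumes the byte 0x01 never occurs in the input. Once every "-->" is a single character, `[^...]*` can stop at the first terminator, which a plain `.*` cannot do.

```shell
# Hypothetical helper: same substitutions as the one-liner above.
strip_comments_one_marker() {
  soh=$(printf '\001')   # placeholder byte, assumed absent from the input
  # Slurp the file, replace each "-->" with the placeholder, then delete
  # from each "<!--" up to the nearest placeholder.
  sed ":a;\$!{N;ba;};s/-->/${soh}/g;s/<!--[^${soh}]*${soh}//g"
}

printf '%s\n' 'x<!--1-->y<!--2' '3-->z' | strip_comments_one_marker
```

Both the one-line comment and the one that spans two lines are removed, and the text between them is kept.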
And here is my own attempt (now obviously superfluous):
Code: Select all
sed ':a;$!{N;ba;};:c;/<!--/s/-->/&&/;s/<!--.*-->-->//;tc' test.html
@Mochi/@Burunduk: Thanks for the entertainment and insight. I admit defeat
Yours are simple, yet effective, and most importantly, clear and easy to understand.
(PS: @Mochi, it's easy to break yours - just pepper 0x01 and 0x02 in the html and those will get deleted when they shouldn't; but normal html files __won't__ have these in them so the point is moot - the script works as intended for normal HTML, so as far as normal HTML files are concerned, this is now a solved problem).
---
Next, we probably should tackle Scott's request (all the comments about deleting the whitespace are correct, I missed the "\s" in the script) if you guys still want to play
- MochiMoppel
Burunduk wrote: You know it's unbreakable!
Yeah, as unbreakable as the windows of Elon Musk's Cybertruck. Don't you worry, I just broke it.
Burunduk wrote: Maybe just one unlikely-to-appear-in-the-input-file character is enough here:
You are right when you consider my flawed code. To make the code bullet-proof, as I originally intended, I need the second character.
Burunduk wrote: And here is my own attempt (now obviously superfluous):
Nothing is superfluous in this world. Thank you for introducing sed's t command. It shows that sed is able to perform while...do loops and helps to fix my code.
Let's raise the bar one notch higher and enclose text that contains a comment with another comment, effectively creating a nested comment. Nested comments may be invalid, but they are a fact of life. It's easy to select and comment out a large portion of XML text without noticing that this block already contains comments. With the same ease it is possible to delete a portion and forget to include an opening or closing tag, thus creating an orphan.
In case of nested comments my code will leave orphans. With Burunduk's "superfluous" code it will even create an infinite loop and will not work at all.
So here is my attempt to fix all problems and build Cybertruck 2.0: (changes marked)
- sed -r $':a;$!{N;ba;};s/\\r\\n?/\\n/g;s/<!--/\x1/g;s/-->/\x2/g;:c;s/\x1[^\x1\x2]*\x2//g;tc;s/\x1|\x2//g' test.html
The second change uses Burunduk's loop idea and peels nested comment onions from inside out.
Lastly any orphans are removed.
That should do it. Unbreakable. Reminds me of my face masks ("Keeps out 99.9% of all viruses"). Still leaves me with the chance to catch "only" every 1000th virus.
[EDIT]: This didn't last long. Script fails on Scott's linked page which contains a weird comment in the DOCTYPE section:
- <!--[if gte IE 11]><!--><html lang="en"><!--<![endif]-->
- sed -r $':a;$!{N;ba;};s/\\r\\n?/\\n/g;s/-->/\x2/g;s/<!--/\x1/g;:c;s/\x1[^\x1\x2]*\x2//g;tc;s/\x1|\x2//g' test.htm
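Here is a hedged sketch of the two-marker loop in the revised order ("-->" replaced before "<!--"), written without `$'...'` so it also runs in a plain POSIX shell. The function and variable names are invented, the \r\n handling is left out for brevity, and it assumes the bytes 0x01 and 0x02 do not occur in the input. `-E` is used as the equivalent of the thread's `-r`.

```shell
# Hypothetical helper following the revised one-liner above.
strip_nested_comments() {
  soh=$(printf '\001')   # stands in for "<!--"
  stx=$(printf '\002')   # stands in for "-->"
  # 1. slurp the file; 2. mark closers, then openers;
  # 3. :c ... tc repeatedly removes innermost opener..closer pairs;
  # 4. strip any orphaned markers that are left.
  sed -E ":a;\$!{N;ba;};s/-->/${stx}/g;s/<!--/${soh}/g;:c;s/${soh}[^${soh}${stx}]*${stx}//g;tc;s/${soh}|${stx}//g"
}

printf '%s\n' 'a<!-- x <!-- y --> z -->b' | strip_nested_comments
printf '%s\n' '<!--[if gte IE 11]><!--><html lang="en"><!--<![endif]-->' | strip_nested_comments
```

The `t` command jumps back to `:c` only if a substitution succeeded since the last jump, so the loop ends as soon as no innermost pair is left. Replacing "-->" first matters for the conditional comment: the "<!-->" in it then becomes "<!" plus a closer instead of an opener followed by a stray ">".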
jamesbond wrote: @Mochi, it's easy to break yours - just pepper 0x01 and 0x02 in the html
Pretty difficult to create intentionally or accidentally. I would consider this to be a corrupted file, in which case eliminating comments should be the least concern of the user.
I did not read the whole thread, but maybe this could be of interest too:
http://sed.sourceforge.net/sed1line.txt
I completely forgot about the IE conditional comments - they shouldn't be removed, or minified ... Should be ignored..
I can't remember any other caveats...
But it does remind me to revisit the conditional comments and go with something a little simpler...
EDIT:
Yep, this seems to work for me (minifies the HTML):
Code: Select all
sed ':a;$!{N;ba;};s/@/@a/g;s/\n/@n/g;s/<pre/\n&/g;s/<\/pre>/&\n/g' test.html \
| sed -r '/(^<pre|<\/pre>$)/!{s/@n//g;s/>\s+</></g;}' \
| sed ':a;$!{N;ba;};s/\n//g;s/@n/\n/g;s/@a/@/g' >min.html
Thanks very much Burunduk
...now onto getting a sed based CSS minifier that can remove multi-line comments, based on the above..
This CSS minifier fails on appended and multi-line comments, and is probably crap in 10 other ways:
($css_bundle is a space separated list of valid CSS files)
Code: Select all
cat $css_bundle \
| grep -v '/\*' \
| tr -d '\n' \
| sed -e '/\/\*/,/\*\//d' \
-e 's/ / /g' \
-e 's/ {/{/g' \
-e 's/{ /{/g' \
-e 's/ }/}/g' \
-e 's/: /:/g' \
-e 's/; /;/g' > "${css_file//.css/.min.css}"
...I really need to go learn how sed actually works..
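Since the goal is "based on the above", here is a rough sketch of how the placeholder trick from the HTML comment discussion might carry over to CSS comments. It is an untested-in-the-wild idea, not a full minifier: the function name is invented, it assumes the byte 0x01 never appears in the CSS, and it does not try to protect "/*" sequences inside quoted strings or url() values.

```shell
# Hypothetical helper: remove /* ... */ comments, including multi-line ones.
strip_css_comments() {
  soh=$(printf '\001')   # placeholder byte, assumed absent from the input
  # Slurp all lines, turn each "*/" into the placeholder, then delete
  # from each "/*" up to the nearest placeholder. "#" is the s-command
  # delimiter so the slashes need no escaping.
  sed ":a;\$!{N;ba;};s#[*]/#${soh}#g;s#/[*][^${soh}]*${soh}##g"
}

printf '%s\n' 'a{color:red;/* c1 */}' '/* multi' 'line */' 'b{}' | strip_css_comments
```

The appended comment and the multi-line comment are both removed, leaving an empty line where the latter stood; the whitespace-squeezing steps would still have to run afterwards.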
- MochiMoppel
sc0ttman wrote: Yep, this seems to work for me (minifies the HTML)
Looks wrong to me. Not Burunduk's fault, as he probably assumed that you would like all linefeeds removed except those in <pre> tags, which in well-constructed HTML pages would be no problem. In the case of your test page, linefeeds are present in <p> tags and must not be removed; otherwise your page produces text like
If you have TCC installed, you can evenembed C code in your Markdown
MochiMoppel wrote: In the case of your test page linefeeds have also to be preserved within <p> tags (or preferably changed to spaces), otherwise your page produces text like "If you have TCC installed, you can evenembed C code in your Markdown"
EDIT: It's prettier.js that removes the trailing spaces from the source Markdown.. I disabled it ..
Fixed.. rebuilt a local version of the page without that little annoyance... Probably
also improves the screen reader experience.
..Anyway, I spotted other issues with Burunduk's code yesterday (not stripping whitespace inside <a> tags), but I can live with it as
it is TBH - HTML minification would only mainly be for huge pages (over 500kb of HTML or so) - but
obviously an improved answer would remove newlines outside of pre tags generally.
- MochiMoppel
sc0ttman wrote: It's prettier.js that removes the trailing spaces from the source Markdown.. I disabled it ..
I have no clue what you are talking about. What trailing spaces?
sc0ttman wrote: ..Anyway, I spotted other issues Burunduks code has yesterday (not stripping whitespace inside <a> tags), but I can live with it as
You mean the 10 spaces between consecutive <a> tags?
Code: Select all
<a href="/mdsh/tags/seo.html">seo</a>,
<a href="/mdsh/tags/shell.html">shell</a>,
<a href="/mdsh/tags/xml.html">xml</a>,
Wouldn't it be much more effective if you clean the HTML code first? With a clean HTML design there will be not much left to do for a minify script.