AWK Based Version Comparison

s243a · #1 Post by **s243a** » Mon 23 Sep 2019, 20:25

Intro

The woof-CE team has noticed that Jamesbond's repo database parser is faster than the Puppy Package Manager at parsing this database. Aside from speed, AWK has another advantage over languages like Perl and Python: it is more stable and more widely available on systems.

I'm exploring the idea of doing version comparison in AWK so that we can filter the database to include only results within a given version range. Note that there is a difference between the lib version, as given by the soname of the primary lib within a package (e.g. libc6, which has a lib version of 6), and the package version (e.g. glibc-2.24) (See post).

We could build tools to construct a database which equates the lib version with the package version either by downloading packages and using ldd or in some cases (e.g. Debian Repos) extracting the lib version from the package name.

Anyway, in the version comparison tool that I'm working on, we strip the lib version when searching for repo records.

Bash Code:

Code: Select all

stripped="$(echo "$pkg_name" | sed -e 's/[0-9]*$//')"

AWK code:

Code: Select all

           match(\$2,/^(.*[^[:digit:]])([[:digit:]]*$|$)/,pkg_split)
           if ( pkg_split[1] == \"$stripped\" $CMP_Function ) {
             print
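
For illustration, here is a minimal self-contained sketch of the stripped-name match. It assumes a repo-db stream with `|`-separated fields where field 2 is the package name. The helper name `find_stripped` is hypothetical, and `sub()` is used in place of the three-argument `match()` so that the sketch also runs on non-GNU awks.

```shell
#!/bin/sh
# Hypothetical helper: print repo-db records whose package name (field 2),
# with any trailing digits removed, equals the stripped form of $1.
find_stripped() {
  pkg_name=$1
  # Same idea as the sed command above: drop the trailing lib version.
  stripped=$(printf '%s\n' "$pkg_name" | sed -e 's/[0-9]*$//')
  awk -F'|' -v want="$stripped" '
    {
      name = $2
      sub(/[0-9]+$/, "", name)   # strip trailing digits from the record name
      if (name == want) print
    }'
}

# Usage: only the libc6 record is printed; the bash record is filtered out.
printf '%s\n' 'libc6_2.24|libc6|2.24|' 'bash_4.4|bash|4.4|' | find_stripped libc6
```

Because both sides are stripped, a search for libc6 would also return a record named libc7 if the database contained one, which is the point of stripping before the version filter is applied.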

Methodology

My methodology here is to build my AWK code in a Bash function called "mk_AWK_prg()".

Version comparisons are taken from the inputs to this function and used to create the arrays that will be used in the AWK function.

Code: Select all

    declare -a options="$(getopt -o f:l:m:np:su:v: --long full-version:,lib-version:,min-version:,no-strip,package:,stripped,version:,gt:,ge:,lt:,le: -- "$@")"
    eval set -- "$options"
    while [ $# -gt 0 ]; do
      case "$1" in
...
      -m|--min-version|--ge)
         n_cmp=$((n_cmp+1))
         # $'\n' must sit outside the double quotes to expand to a newline
         awk_cmp_ary_op="$awk_cmp_ary_op"$'\n'"awk_cmp_ary_op[$n_cmp]=\"ge\""
         awk_cmp_ary_val="$awk_cmp_ary_val"$'\n'"awk_cmp_ary_val[$n_cmp]=\"$2\""
         shift 2  ;;
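
As a rough sketch of how the accumulated strings end up as AWK arrays, the following uses a hypothetical helper `add_cmp` in place of the option loop. The variable names follow the post; everything else is an assumption, and a literal newline variable is used instead of `$'\n'` so that it also runs under plain sh.

```shell
#!/bin/sh
# Sketch: collect comparison options as AWK array assignments.
# add_cmp is a hypothetical stand-in for one branch of the case statement.
n_cmp=0
awk_cmp_ary_op=''
awk_cmp_ary_val=''
nl='
'
add_cmp() {   # $1 = operator (ge, lt, ...), $2 = version string
  n_cmp=$((n_cmp + 1))
  awk_cmp_ary_op="${awk_cmp_ary_op}${nl}awk_cmp_ary_op[$n_cmp]=\"$1\""
  awk_cmp_ary_val="${awk_cmp_ary_val}${nl}awk_cmp_ary_val[$n_cmp]=\"$2\""
}

add_cmp ge 2.1.12
add_cmp lt 9.10

# The accumulated text is pasted into a BEGIN block, one assignment per line,
# and the loop prints each operator/value pair back out.
awk "BEGIN{ $awk_cmp_ary_op
$awk_cmp_ary_val
  for (i = 1; i <= $n_cmp; i++) print awk_cmp_ary_op[i], awk_cmp_ary_val[i]
}"
```

Generating the assignments as text keeps the AWK program a single string, at the cost of having to escape the quotes on the shell side.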

The version comparison function uses regular expressions to parse the package version:

Code: Select all

      AWK_Functions="\
         function version_split(version1,version_array,split_chars,       i1,i2,remainder1,matches,num1_epoch_split){
           match(version1,/^([[:digit:]]+):(.*)$/,num1_epoch_split)
           if (length(num1_epoch_split) > 0 ){
             version_array[1]=num1_epoch_split[1]
             remainder1=num1_epoch_split[2]
           } else {
             version_array[1]=0
             remainder1=version1
           }
           split_chars[1]=\":\"
           i1=2
           # Note: gawk keeps only the last capture of a repeated group
           match(remainder1,/^([^+.~:-]+)(([+.~])([^+.~:-]+))*(([-])([^+.~:-]+))?$/,matches)
           version_array[i1]=matches[1]
           
           for (i2 = 4; i2<length(matches); i2=i2+3){
             i1=i1+1
             version_array[i1]=matches[i2]
             split_chars[i1-1]=matches[i2-1]
           }
         }
...

see: https://linux.die.net/man/5/deb-version

Note that the epoch is split first because we want to successively match up the other delimiters.

We test each element in the array of version comparisons against the repo record. If any test fails we return '0'; otherwise we return '1'.
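
The "every comparison must pass" rule can be sketched on its own as follows. This is a simplified illustration, not the code from the post: the helper names `filter_versions`, `vcmp` and `all_pass` are hypothetical, versions are split on "." only and compared numerically, and epochs and Debian revisions are ignored. It avoids the gawk-only `switch` so that it runs on any POSIX awk.

```shell
#!/bin/sh
# Sketch: print records whose version (field 3) passes every comparison.
# Simplified on purpose: dotted numeric versions only.
filter_versions() {   # usage: filter_versions "ge 2.1 lt 9.10" < repo-db
  awk -F'|' -v spec="$1" '
    # Returns -1, 0 or 1 as version a is lower than, equal to or higher than b.
    function vcmp(a, b,    x, y, n, m, i, len) {
      n = split(a, x, "."); m = split(b, y, ".")
      len = (n > m) ? n : m
      for (i = 1; i <= len; i++) {
        if ((x[i] + 0) < (y[i] + 0)) return -1
        if ((x[i] + 0) > (y[i] + 0)) return 1
      }
      return 0
    }
    # Returns 1 only if every operator/value pair holds for version.
    function all_pass(version, ops, vals, n,    i, c) {
      for (i = 1; i <= n; i++) {
        c = vcmp(version, vals[i])
        if (ops[i] == "lt" && !(c <  0)) return 0
        if (ops[i] == "le" && !(c <= 0)) return 0
        if (ops[i] == "gt" && !(c >  0)) return 0
        if (ops[i] == "ge" && !(c >= 0)) return 0
      }
      return 1
    }
    BEGIN {
      # spec is a flat list of "operator value" pairs
      k = split(spec, t, " ")
      for (j = 1; j < k; j += 2) { n_cmp++; ops[n_cmp] = t[j]; vals[n_cmp] = t[j + 1] }
    }
    all_pass($3, ops, vals, n_cmp) { print }'
}
```

Note that the loop counters are declared as extra function parameters, which is how AWK makes them local; a shared global counter would be clobbered when one function calls another.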

AWK Code (part of the bash string "AWK_Functions")

Code: Select all

         function arry_cmp(version,ops_array,val_array,    i){
           for (i=1; i<=length(ops_array); i++){
             #https://www.gnu.org/software/gawk/manual/gawk.html#Switch-Statement
             switch(ops_array[i]){
             case \"<\":
             case \"lt\":
               if (v_lt(version,val_array[i]) == 0 ){
                 return 0
               }
               break

If any version comparisons were supplied as inputs to mk_AWK_prg() then we add AWK_Functions and the related arrays to our code. Otherwise these functions aren't added to the AWK code.

Code: Select all

    CMP_Function="&& arry_cmp(\$3,awk_cmp_ary_op,awk_cmp_ary_val)"$'\n'
    else
      AWK_Functions=''
      CMP_Function=''
    fi
   
    #https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
    if [ $stripped_match -eq 1 ]; then
        AWK_PRG="$AWK_Functions
        BEGIN{FS=\"|\"
         $awk_cmp_ary_op
         $awk_cmp_ary_val  
       }
       {
         if( \$2 == \"$pkg_name\" $CMP_Function ) {
           print
         }
         else {
           match(\$2,/^(.*[^[:digit:]])([[:digit:]]*$|$)/,pkg_split)
           if ( pkg_split[1] == \"$stripped\" $CMP_Function ) {
             print
           }
         }
       }"

Note that this code is very preliminary, with untested changes.

Further Work
1. Aside from the epoch, I'm not making much of a distinction based on the type of version separator. This will require further research. However, in most cases the version separator that we will be primarily interested in is the period (i.e. '.'). Therefore the code should be useful before I do this investigation.
2. .... (more to follow)

Link to preliminary code: https://pastebin.com/wXQ6uLwY

musher0 · #2 Post by **musher0** » Mon 23 Sep 2019, 23:24

Hi, s243a.

Sorry for being so dumb, but you wish to compare the version of what with what,
exactly?

Within what? The PPM database?

If you wish to compare the versions of execs and libs from one Puppy to the next,
I hope you have lots of time on your hands.

I at least got that you do NOT want to check the version of AWK itself. Phew!

That said, if you really are after speed, use mawk, not the (usually) native gawk.
Ref.: https://brenocon.com/blog/2009/09/dont- ... g-language

Respectfully.

s243a · #3 Post by **s243a** » Tue 24 Sep 2019, 01:04

musher0 wrote:Hi, s243a.

Sorry for being so dumb, but you wish to compare the version of what with what,
exactly?

If we look at the repo db record for bash we see:

Code: Select all

bash_4.4-5|bash|4.4-5||BuildingBlock|5949K|pool/DEBIAN/main/b/bash|bash_4.4-5_i386.deb|+base-files&ge2.1.12,+debianutils&ge2.15|GNU Bourne Again SHell|devuan|ascii|

We see from the dependencies that base-files must be greater than or equal to 2.1.12 and debianutils must be greater than or equal to 2.15. So if we are looking for dependencies, we might want to verify that the version meets these requirements before installing it.
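
To make the dependency field concrete, here is a small sketch that lists each dependency of a record as "name operator version". It assumes the `+name&opVERSION` layout seen in the bash record above, with two-letter operators and the dependency list in field 9; the helper name `list_deps` is hypothetical and other layouts are not handled.

```shell
#!/bin/sh
# Sketch: list each dependency of a repo-db record (field 9) as
# "name operator version". Assumes the "+name&opVERSION" layout.
list_deps() {
  awk -F'|' '
    {
      n = split($9, deps, ",")
      for (i = 1; i <= n; i++) {
        dep = deps[i]
        sub(/^[+]/, "", dep)                # drop the leading "+"
        m = split(dep, part, "&")           # part[1] = name, part[2..m] = constraints
        if (m < 2) { print part[1]; continue }
        for (j = 2; j <= m; j++) {
          op  = substr(part[j], 1, 2)       # two-letter operator, e.g. "ge"
          ver = substr(part[j], 3)
          print part[1], op, ver
        }
      }
    }'
}
```

Fed the bash record above on standard input, this prints one line for base-files and one for debianutils, each with the operator `ge` and the required version.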

Within what? The PPM database?

If you wish to compare the versions of execs and libs from one Puppy to the next,
I hope you have lots of time on your hands.

I want to be able to mix different repos. Puppy actually does this because we have Package-puppy-common64-official and Package-puppy-common32-official, but these Puppy-specific binaries are compiled in such a way as to be widely compatible.

I have two rules of thumb, which are:
1. repos which appear first in the list have priority
2. stick primarily to binary-compatible repos (e.g. similar versions of glibc)

The repos to which I give highest priority are those of the binary-compatible distro. However, say we end up installing a package from a repo of lesser priority; which versions of the dependencies should we install? My default preference is to install the dependencies which have versions matching the binary-compatible distro (when available), but perhaps this behaviour can be configured via settings, and where available other information can be used to resolve conflicts (e.g. the more stringent version range).
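
Rule 1 can be sketched in one AWK line: when several repo-db files list the same package, keep only the record from the file given first. The helper name `first_match` and the file names in the usage line are placeholders, not names from any existing tool.

```shell
#!/bin/sh
# Sketch of rule 1: awk reads the files in the order given, so the first
# record seen for a package comes from the highest-priority repo.
first_match() {   # usage: first_match pkg_name db_file...
  pkg=$1; shift
  awk -F'|' -v pkg="$pkg" '$2 == pkg && !seen[$2]++' "$@"
}
# Example call (placeholder file names):
# first_match bash Packages-compatible-distro Packages-other-distro
```

A version filter could be added to the same pattern so that the first record that also satisfies the version range wins, rather than simply the first record.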

I at least got that you do NOT want to check the version of AWK itself. Phew!
That said, if you really are after speed, use mawk, not the (usually) native gawk.
Ref.: https://brenocon.com/blog/2009/09/dont- ... g-language

Respectfully.

Thank you for the tip. I don't know if this will be helpful to me or not because I'm probably using gawk features that aren't available in mawk. However, it might be more efficient to pre-process with mawk.

musher0 · #4 Post by **musher0** » Tue 24 Sep 2019, 02:34

Thanks for the example. That clears things up a bit.

As to the GNU EXTENSIONS, there is a list at
http://man7.org/linux/man-pages/man1/gawk.1.html
May I suggest working with < gawk --traditional > to avoid
gawk peculiarities in the code.
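
As an illustration of what a traditional-mode rewrite might look like, the sketch below splits a package name into its stem and trailing lib version using only the two-argument `match()` with `RSTART`. The three-argument form with a capture array is a gawk extension, which compatibility mode is expected to reject. The helper name `split_name` is hypothetical.

```shell
#!/bin/sh
# Sketch: split a package name into stem and trailing lib version without
# the gawk-only third argument to match().
split_name() {
  printf '%s\n' "$1" | awk '
    {
      if (match($0, /[0-9]+$/)) {          # sets RSTART and RLENGTH (POSIX)
        stem = substr($0, 1, RSTART - 1)
        num  = substr($0, RSTART)
      } else {
        stem = $0
        num  = ""
      }
      print stem "|" num
    }'
}
split_name libc6     # prints: libc|6
split_name bash      # prints: bash|
```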

BFN.

sc0ttman · #5 Post by **sc0ttman** » Tue 24 Sep 2019, 21:52

s243a wrote:I'm exploring the idea of doing version comparison in AWK

if it's just the version comparison, and not about extracting it, why not use vercmp?

(sorry if i missed the point..)

s243a · #6 Post by **s243a** » Sun 29 Sep 2019, 03:18

sc0ttman wrote:
s243a wrote:I'm exploring the idea of doing version comparison in AWK
if it's just the version comparison, and not about extracting it, why not use vercmp?

(sorry if i missed the point..)

I want the whole repo-db record. I'm using AWK to filter the repo-db. I actually considered your suggestion, because I was having a bit of trouble debugging my AWK code, but I'm making progress. I'll keep your idea in mind though as a fallback (e.g. in case gawk isn't installed).

Anyway, I started writing some test code:

Test file "~/repo_test_file"

Code: Select all

base-files_9.9+devuan2.5|base-files|9.9+devuan2.5||Network|368K|pool/DEVUAN/main/b/base-files|base-files_9.9+devuan2.5_all.deb||Devuan base system miscellaneous files|devuan|ascii|

Test command

Code: Select all

cat ~/repo_test_file 2>/dev/null | ~/Find_Base_Files

Test AWK code "~/Find_Base_Files"
** Contains lots of debugging print statements and some commented out old code.

Code: Select all

#!/usr/bin/gawk -f
	     function version_split(version1,version_array,split_chars,       i1,remainder1,matches,num1_epoch_split){
	       print "version1=" version1
	       #capture the whole epoch (all digits before the colon), not just its last digit
	       match(version1,/^([[:digit:]]+):(.*)$/,num1_epoch_split)
	       if (length(num1_epoch_split) > 0 ){
	         version_array[1]=num1_epoch_split[1]
	         remainder1=num1_epoch_split[2]
	       } else {
	       	 version_array[1]=0 
	       	 remainder1=version1
	       }
	       delete num1_epoch_split
	       split_chars[1]=":"
	       i1=2
	       #match(remainder1,/^([^+\.\-~:]+)(([+\.~])([^+\.~\-:]+))*((\-)([^+\.~\-:])+)?$/,matches)   
	       #match(remainder1,/^([^+\.\-~:]+)(([+\.\-~:])(.*))?$/,matches) 
	       match(remainder1,/^([[:digit:]]+)(([+\.\-~:])(.*))?$/,matches) 	       
	       while (length(matches) > 0) {
	         version_array[i1]=matches[1]	
             print "version_array[" i1 "]=" version_array[i1]		
             if (length(matches)>1){       
	           split_chars[i1]=matches[3]
               print "split_chars[" i1 "]=" split_chars[i1]
               remainder1=matches[4]
               print "remainder1=" remainder1
               #match(remainder1,/^([^+\.\-~:]+)(([+\.\-~:]+)(.*))?$/,matches) 
               match(remainder1,/^([[:digit:]]+)(([+\.\-~:]+)(.*))?$/,matches)
             }
             else{
               break
             }	   
                   
 	         i1=i1+1	            	                  
	       }
	     }
	     function v_le(ver_split, val_split,       len_ver){
	       return v_ge(val_split,ver_split)
	     }
	     function v_ge(ver_split, val_split,       len_ver, i){
	       print "v_ge"
	       if (length(ver_split)<length(val_split)){
	         len_ver=length(ver_split)
	       }
	       else{
	        len_ver=length(val_split)
	       } 
           for (i=1; i<=len_ver; i++){
             print "ver_split[" i "]=" ver_split[i]
             print "val_split[" i "]=" val_split[i]
             if ( ver_split[i] < val_split[i] ) return 0
             if ( ver_split[i] > val_split[i] ) return 1
	       }
	       print "finished ge compare"
	       print "length_ver_split=" length(ver_split) 
	       print "length_val_split=" length(val_split) 
	       if ( length(ver_split) >= length(val_split) )
	         return 1
	       else
	         return 0 
	     }
	     function v_gt(num1, num2,    ge){
	       ge = v_ge(num2,num1)
	       if ( ge == 1 ){
	         return 0
	       }
	       else {
	         return 1
	       }
	     }
	     function v_lt(num1, num2){
	       return v_gt(num2, num1)
	     }
	     #An equal-ish function: compares only the components both versions have.
	     function v_e(ver_split, val_split,       len_ver, i){
	       if (length(ver_split)<length(val_split)) len_ver=length(ver_split)
	       else len_ver=length(val_split)
           for (i=1; i<=len_ver; i++){
             if (ver_split[i] != val_split[i])
               return 0
	       }
          return 1
	     }
	     function arry_cmp(version,ops_array,val_array,    i){
	       print "test"
	       #clear components left over from the previous record
	       delete ver_split
	       delete ver_split_chars
	       version_split(version,ver_split,ver_split_chars)	     
	       for (i=1; i<=length(ops_array); i++){
	         print "Ops_array " ops_array[i] " " val_array[i]
             version_split(val_array[i],val_split,val_split_chars)	 
	         #https://www.gnu.org/software/gawk/manual/gawk.html#Switch-Statement
	         switch(ops_array[i]){
	         case "<":
	         case "lt":
	           if (v_lt(ver_split,val_split) == 0 ){
	             return 0
		       }
		       break
		     case ">":
		     case "gt":
	           if (v_gt(ver_split,val_split) == 0 ){
	             return 0
		       }
		       break
		     case "<=":
	         case "le":
	           if (v_le(ver_split,val_split) == 0 ){
	             return 0
		       }
		       break	       
		     case ">=":
	         case "ge":
	           if (v_ge(ver_split,val_split) == 0 ){
	             return 0
		       }
		       break	       
		     case "==":
	         case "e":
	           if (v_e(ver_split,val_split) == 0 ){
	             return 0
		       }
		       break	       
		     }
		     #https://unix.stackexchange.com/questions/147957/delete-an-array-in-awk
		     delete val_split
		     delete val_split_chars
	       }
	       print "returning result=1"
	       return 1
	     }
	          BEGIN{FS="|"
	     
awk_cmp_ary_op[1]="lt"
	     
awk_cmp_ary_val[1]="9.10" #"2.1.12"
        }
        {
          #print "wtf"
          if( $2 == "base-files") {
                          if ( arry_cmp($3,awk_cmp_ary_op,awk_cmp_ary_val) == 1 ){
                 print "printing result 1"                            
                print
              }
          }
          else{ 
            match($2,/^(.*[^[:digit:]])([[:digit:]]*$|$)/,pkg_split)
            if ( pkg_split[1] == "base-files" ) {
                            if ( arry_cmp($3,awk_cmp_ary_op,awk_cmp_ary_val) == 1 ){
                print "printing result 2"            
                print
              }
            }
          }
          delete pkg_split
        }

https://pastebin.com/dyPMD49F

In line 133

Code: Select all

awk_cmp_ary_val[1]="9.9" #"2.1.12"

the version number can be changed to compare against the test file. The test code seems to work so I need to integrate it back into my bash test code then eventually back into my pkg fork.

One thing that I had to give up on is repeated capture groups in my regular expressions. AWK seems to only save the last capture (in regular expressions) of a repeated capture group.

Note that my code behaves differently from vercmp. I noticed that the version number of the record I was looking at ended in "+devuan2.5". Presumably this is the revision of Devuan rather than the revision of the package. This information isn't useful for comparing packages between different versions of Linux, but might be indirectly useful for deducing binary compatibility (e.g. glibc requirements). What I decided to do is to stop parsing when the next non-separator token is a non-numeric value.

Code: Select all

match(remainder1,/^([[:digit:]]+)(([+\.\-~:]+)(.*))?$/,matches)
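
The stop-at-first-non-numeric-token rule can be sketched without capture groups at all, by peeling one component at a time off the front of the string with `match()` and `substr()`. This is an illustrative rewrite under the same assumptions as the test script (separators are any of `+ . - ~ :`, a missing epoch counts as 0); the helper name `numeric_parts` is hypothetical.

```shell
#!/bin/sh
# Sketch: split a version into numeric components, stopping at the first
# component that does not start with a digit (e.g. "devuan2.5").
# Uses match()/substr() in a loop instead of a repeated capture group.
numeric_parts() {
  printf '%s\n' "$1" | awk '
    {
      rest = $0
      epoch = 0
      if (match(rest, /^[0-9]+:/)) {               # optional "epoch:" prefix
        epoch = substr(rest, 1, RLENGTH - 1)
        rest  = substr(rest, RLENGTH + 1)
      }
      out = epoch
      while (match(rest, /^[0-9]+/)) {
        out  = out " " substr(rest, 1, RLENGTH)
        rest = substr(rest, RLENGTH + 1)
        if (!match(rest, /^[-+.~:]+/)) break       # no separator: done
        rest = substr(rest, RLENGTH + 1)
      }
      print out
    }'
}
numeric_parts 9.9+devuan2.5   # prints: 0 9 9
numeric_parts 1:2.24-11       # prints: 1 2 24 11
```

The first field of the output is the epoch, so both versions of a comparison line up component by component even when only one of them carries an explicit epoch.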
