
AWK: match($2,/^(.*[^:digit:])([:digit:]*$|$)/,pkg_split)

Posted: Sat 21 Sep 2019, 03:19
by s243a
I want to use AWK to match libcN where 'N' is the major version number. I'm using AWK on a puppy database file for a package repo. These repo files follow the petspet format where the second field (i.e. $2) is the package name. To the best of my understanding the regular expression to do this should be:

Code: Select all

match($2,/^(.*[^:digit:])([:digit:]*$|$)/,pkg_split)
https://www.gnu.org/software/gawk/manua ... tions.html

but for some inexplicable reason it appears to be matching 'g' as a numeric digit even though the docs say the following:
A character class is only valid in a regexp inside the brackets of a bracket expression. Character classes consist of ‘[:’, a keyword denoting the class, and ‘:]’. Table 3.1 lists the character classes defined by the POSIX standard.
....
[:digit:] Numeric characters
https://www.gnu.org/software/gawk/manua ... xpressions

Here is my debugging output which shows the awk program:

Code: Select all

++ cat /var/packages/repo/Packages-devuan-ascii-non-free
++ awk '    BEGIN{FS="|"}
    {
      match($2,/^(.*[^:digit:])([:digit:]*$|$)/,pkg_split)
      if ( pkg_split[1] == "libc" ) {
        print
      }
    }'
+ awk_result='libcg_3.1.0013-2+b1|libcg|3.1.0013-2+b1||BuildingBlock|11609K|pool/DEBIAN/non-free/n/nvidia-cg-toolkit|libcg_3.1.0013-2+b1_i386.deb|+libc6&ge2.3.6-6|Nvidia Cg core runtime library|devuan|ascii|'
+ '[' '!' -z 'libcg_3.1.0013-2+b1|libcg|3.1.0013-2+b1||BuildingBlock|11609K|pool/DEBIAN/non-free/n/nvidia-cg-toolkit|libcg_3.1.0013-2+b1_i386.deb|+libc6&ge2.3.6-6|Nvidia Cg core runtime library|devuan|ascii|' ']'

The debugging output is produced as follows:

Code: Select all

bash -x /usr/sbin/pkg-list-alias libc 2>&1 | tee pkg_list_alias.log
and my script can be found at:
https://pastebin.com/Yb7gNV2r

which is an updated version of a script which I discussed at:
http://murga-linux.com/puppy/viewtopic. ... 47#1037047

Here is the line of code which calls the AWK program:

Code: Select all

awk_result="$(cat $aRepoDB | awk "$AWK_PRG")"

Re: AWK: match($2,/^(.*[^:digit:])([:digit:]*$|$)/,pkg_split)


Posted: Sat 21 Sep 2019, 04:25
by technosaurus
It may save you some time to test it here:
https://regex101.com/

When you use parens, you can usually print out the matches with \N for debugging (where N is the Nth set of parens), I don't recall how to do it in awk though.

Posted: Sat 21 Sep 2019, 04:56
by MochiMoppel
Please post a simple example. Your input string and the expected output.
Your regex pattern looks wrong as you may need an additional set of square brackets.

Posted: Sat 21 Sep 2019, 11:58
by Burunduk
As MochiMoppel has pointed out, you definitely need additional square brackets here. [^:digit:] is the same as [^:dgit] and it matches anything but 'g' and the other four characters.
So, if all you want to match is something like AAANNN where A is not a digit and N is, then this will do:

Code: Select all

/([^0-9]+)([0-9]*)/
or using POSIX character classes:

Code: Select all

/([^[:digit:]]+)([[:digit:]]*)/
Note also that an array argument is a GAWK extension not supported by the busybox awk.
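
To see the difference between the two bracket forms directly, here is a minimal sketch. The helper name ends_in_class is made up for illustration; it only tests whether the last character of a name is matched by each form.

```shell
# Hypothetical helper: report whether the last character of a name is
# matched by the mistaken and by the corrected bracket expression.
ends_in_class() {
  printf '%s\n' "$1" | awk '{
    loose  = ($0 ~ /[:digit:]$/)     # a plain set of the characters : d i g t
    strict = ($0 ~ /[[:digit:]]$/)   # the POSIX digit class inside a bracket expression
    print $0, loose, strict
  }'
}

ends_in_class libc6   # prints: libc6 0 1
ends_in_class libcg   # prints: libcg 1 0
```

gawk may print a warning on stderr about the single-bracket form, which is a useful hint in itself. Very old mawk builds may not know POSIX classes at all, in which case [0-9] is the safer spelling.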

Posted: Sat 21 Sep 2019, 12:53
by s243a
Burunduk wrote:As MochiMoppel has pointed out, you definitely need additional square brackets here. [^:digit:] is the same as [^:dgit] and it matches anything but 'g' and the other four characters.
So, if all you want to match is something like AAANNN where A is not a digit and N is, then this will do:

Code: Select all

/([^0-9]+)([0-9]*)/
or using POSIX character classes:

Code: Select all

/([^[:digit:]]+)([[:digit:]]*)/
Thank you. That was most helpful :). Now I get the correct debugging output:

Code: Select all

++ cat /var/packages/repo/Packages-devuan-ascii-main
++ awk '    BEGIN{FS="|"}
    {
      match($2,/^(.*[^[:digit:]])([[:digit:]]*$|$)/,pkg_split)
      if ( pkg_split[1] == "libc" ) {
        print
      }
    }'
+ awk_result='libc6_2.24-11+deb9u4|libc6|2.24-11+deb9u4||BuildingBlock|9579K|pool/DEBIAN/main/g/glibc|libc6_2.24-11+deb9u4_i386.deb|+libgcc1|GNU C Library: Shared libraries|devuan|ascii|'
At first I didn't read your post carefully enough, so I only fixed the first set of square brackets, [^[:digit:]], and didn't realize that I also needed to double up the second set, [[:digit:]]. In hindsight, I can see how this would make parsing easier for awk.
Note also that an array argument is a GAWK extension not supported by the busybox awk.
I was wondering that. This is good to know. We can do something similar with grep, if we have the full version of grep but not awk. However, awk is more efficient for this application.

Edit: an updated version of the original script (with the bracket fix) can be found at: https://pastebin.com/KtUikhdS

Posted: Sat 21 Sep 2019, 12:58
by MochiMoppel
Burunduk wrote:As MochiMoppel has pointed out, you definitely need additional square brackets here.
Yes, there :lol:
Note also that an array argument is a GAWK extension not supported by the busybox awk.
Methinks that s243a doesn't need array arguments at all. For pulling out a leading non numeric string something like this should do

Code: Select all

       sub(/[0-9].*/,"",$2)
       print $2
but I don't know what his strings look like and what he wants to achieve.

Posted: Sat 21 Sep 2019, 13:23
by s243a
MochiMoppel wrote:
Burunduk wrote:As MochiMoppel has pointed out, you definitely need additional square brackets here.
Yes, there :lol:
Note also that an array argument is a GAWK extension not supported by the busybox awk.
Methinks that s243a doesn't need array arguments at all. For pulling out a leading non numeric string something like this should do

Code: Select all

       sub(/[0-9].*/,"",$2)
       print $2
but I don't know what his strings look like and what he wants to achieve.
I want the entire repo db record. So something like the following might also work (untested):

Code: Select all

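# gensub() is gawk-only; it returns the modified copy and leaves $2 intact
pkg = gensub(/[0-9].*/, "", "g", $2)
if ( pkg == "libc" ) {   # compare with ==; a single = would assign instead
  print
}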
https://www.gnu.org/software/gawk/manua ... tions.html

Note that I like the array syntax because it is more general and more efficient than the gensub approach, even if it isn't as widely supported.
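
For busybox awk, which has neither gensub() nor the array argument to match(), the same filter can be sketched with plain sub() on a copy of the field. The function name filter_pkg and the two sample records are made up for illustration. Unlike the patterns above, this one strips only a trailing run of digits, which follows the intent of the original match() pattern.

```shell
# Hypothetical filter: print whole records whose package name, minus any
# trailing digits, equals the wanted base name.
filter_pkg() {
  awk -F'|' -v want="$1" '{
    base = $2                  # work on a copy so the record stays untouched
    sub(/[0-9]+$/, "", base)   # drop a trailing run of digits only
    if (base == want) print
  }'
}

# Made-up sample records in the pipe-delimited layout used in this thread
printf 'libc6_2.24|libc6|2.24|\nlibcg_3.1|libcg|3.1|\n' | filter_pkg libc
```

With this input only the libc6 record is printed, because libcg has no trailing digits to strip and so never equals libc.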

As a side note I thought the '6' in libc6 was the major package version but I see from above that the package version is 2.24-11+deb9u4. However, I note that if I look at the file names in the package that the actual lib is called:

Code: Select all

/lib/i386-linux-gnu/libc.so.6
https://packages.debian.org/stretch/i386/libc6/filelist

By Linux standards, the '6' should be the version of the lib rather than the version of the package:
3.1.1. Shared Library Names

Every shared library has a special name called the ``soname''. The soname has the prefix ``lib'', the name of the library, the phrase ``.so'', followed by a period and a version number that is incremented whenever the interface changes (as a special exception, the lowest-level C libraries don't start with ``lib''). A fully-qualified soname includes as a prefix the directory it's in; on a working system a fully-qualified soname is simply a symbolic link to the shared library's ``real name''.
http://tldp.org/HOWTO/Program-Library-H ... aries.html

I didn't realize that the package versions were different than the lib versions for some linux packages. I will think about the implications of this distinction.

I will note that some package repos do not use the lib version as a suffix in the package name.
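
As a rough illustration of the soname layout quoted above, this sketch splits a library file name into its name and interface version. The helper name soname_parts is made up, and it assumes the name ends in ".so." followed by digits only.

```shell
# Hypothetical helper: split a path such as /lib/.../libc.so.6 into the
# library name and the soname version.
soname_parts() {
  printf '%s\n' "$1" | awk '{
    file = $0
    sub(/.*\//, "", file)                    # strip the directory part
    if (match(file, /\.so\.[0-9]+$/)) {      # sets RSTART and RLENGTH
      name = substr(file, 1, RSTART - 1)     # text before ".so."
      ver  = substr(file, RSTART + 4)        # digits after ".so."
      print name, ver
    }
  }'
}

soname_parts /lib/i386-linux-gnu/libc.so.6   # prints: libc 6
```

This only uses match() with RSTART, so it should not depend on gawk.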

Posted: Sat 21 Sep 2019, 16:45
by technosaurus
Burunduk wrote:As MochiMoppel has pointed out, you definitely need additional square brackets here. [^:digit:] is the same as [^:dgit] and it matches anything but 'g' and the other four characters.
So, if all you want to match is something like AAANNN where A is not a digit and N is, then this will do:

Code: Select all

/([^0-9]+)([0-9]*)/
or using POSIX character classes:

Code: Select all

/([^[:digit:]]+)([[:digit:]]*)/
Note also that an array argument is a GAWK extension not supported by the busybox awk.
\d is also the same as [[:digit:]] in some regex engines.
You can emulate n-dimensional arrays in busybox awk with a separator - usually a comma.
... so instead of array[i][j][k] you'd use array[i,j,k] (works in gawk and mawk too)
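
A small sketch of the comma-subscript idea, assuming made-up input with a distro, a release and a package name per line. The function name count_pairs is hypothetical. awk joins the subscripts with the built-in SUBSEP variable, so the key can be split apart again later.

```shell
# Hypothetical example: count lines per (distro, release) pair using a
# comma subscript, then recover the two parts of each key.
count_pairs() {
  awk '{
    count[$1, $2]++                  # key is $1 SUBSEP $2
  }
  END {
    for (key in count) {
      split(key, part, SUBSEP)       # undo the join
      print part[1], part[2], count[key]
    }
  }'
}

printf 'devuan ascii libc6\ndevuan ascii libgcc1\n' | count_pairs   # prints: devuan ascii 2
```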

The link I posted is useful to build and test your regex, but I used to always just do a Ctrl+h in geany.

Posted: Mon 23 Sep 2019, 05:10
by s243a
technosaurus wrote:It may save you some time to test it here:
https://regex101.com/

When you use parens, you can usually print out the matches with \N for debugging (where N is the Nth set of parens), I don't recall how to do it in awk though.
I found an example where awk behaves differently than this test program.

Code: Select all

# echo ac | awk "{match(\$1,/(:?a|b)(c|d)/,matches); print matches[1]}"
a
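If AWK supported non-capturing groups, then the result would be "c". AWK doesn't appear to support non-capturing groups. (Note that the usual syntax for a non-capturing group is (?:...), not (:?...). As written, the first group here is an ordinary capturing group that matches an optional colon followed by 'a', or 'b', so this test on its own cannot tell the difference. AWK regexps are POSIX extended regular expressions, which have no non-capturing groups in any case.) The following link seems to agree with my claim about AWK's limitation here: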
https://comp.unix.programmer.narkive.co ... gawk-regex