The only encoding-related information I could find for goawk is in the goawk -h help:
Additional GoAWK features:
-c use Unicode chars for index, length, match, substr, and %c
So my understanding was that it works in the C/POSIX locale by default (logically) regardless of the system locale (LC_* values), and using -c switches to some "Unicode mode", where "Unicode mode" probably means UTF-8.
Is this understanding correct?
Either way, it's probably worth documenting this more explicitly somewhere (the README, maybe?). Additionally, I think I found an issue: regex matching always behaves as if -c was specified:
$ # '\342\225\213' is UTF-8 of U+254B (boxdraw bold horizontal and vertical)
$ FMT='X\nYYY\n\342\225\213\n'
$ printf "$FMT"
X
YYY
╋
$ ACMD='/^.$/ {print length($1) " /^.$/ " $1}; /^...$/ {print length($1) " /^...$/ " $1}'
$ # --- without -c ---
$ printf "$FMT" | ./goawk "$ACMD"
1 /^.$/ X
3 /^...$/ YYY
3 /^.$/ ╋
$ printf "$FMT" | LC_ALL=C ./goawk "$ACMD"
1 /^.$/ X
3 /^...$/ YYY
3 /^.$/ ╋
$ # --- with -c ---
$ printf "$FMT" | ./goawk -c "$ACMD"
1 /^.$/ X
3 /^...$/ YYY
1 /^.$/ ╋
$ printf "$FMT" | LC_ALL=C ./goawk -c "$ACMD"
1 /^.$/ X
3 /^...$/ YYY
1 /^.$/ ╋
(EDIT: for clarity, moved the printf and goawk arguments into $FMT and $ACMD, respectively)
This seems to confirm that LC_ALL=C is ignored (my default locale is en_US.UTF-8), as it doesn't affect the result, at least of this test case.
-c does affect the result, at least that of length($1) for a single code point encoded as 3 bytes of UTF-8: it's 3 without -c and 1 with -c. So far this looks OK.
However, the match result is unaffected by -c as far as I can tell. ^.$ always matches the 3-byte code point whether or not -c is used, and ^...$ never matches the same 3 bytes.