As far as I can tell, POSIX awk is required to respect the current locale, but goawk doesn't do that. Instead, it operates in byte mode by default unless `-c` is specified, in which case it operates in UTF-8 codepoint mode.
And while goawk probably can't do arbitrary locales, and ignoring bugs, it seems to have support for LC_CTYPE of either UTF-8 or plain bytes (ASCII/C ?).
So assuming it's desirable for goawk to try and respect the current locale where possible, I think it could look like this:
- Add support for a `-b` argument for binary/bytes mode. gawk has a similar `-b` as an alias for `--characters-as-bytes`.
- During argument parsing, if `-b` or `-c` is specified (or `-u`, if `-c` gets replaced with it), use the specified mode.
- Else try to deduce it from the environment, like so:
  - If any of `LC_ALL`, `LC_CTYPE`, `LANG` (in this override order) is defined, even if empty (in Go: `os.LookupEnv(name)`), stop the search and use its value:
    - If its lowercased value includes `.utf8` or `.utf-8`, enable UTF-8 mode, else enable byte/binary mode.
  - Else (no `-b`/`-c` and none of these vars is defined), pick some default, maybe depending on the platform (e.g. on Windows, and maybe also elsewhere, probably enable UTF-8 because that's what most text files are likely to be).
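
To make the proposal concrete, here is a minimal Go sketch of the lookup order described above. It is not goawk's actual code: `Mode`, `modeFromLocale`, and `detectMode` are hypothetical names, the flags are passed in as plain booleans, and letting `-b` win when both flags are given is an arbitrary choice. It uses the POSIX variable name `LANG` for the last variable in the search.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Mode is a hypothetical type naming the two character modes discussed above.
type Mode int

const (
	ModeBytes Mode = iota // one character == one byte
	ModeUTF8              // one character == one UTF-8 codepoint
)

func (m Mode) String() string {
	if m == ModeUTF8 {
		return "utf8"
	}
	return "bytes"
}

// modeFromLocale returns ModeUTF8 if the lowercased locale value contains
// ".utf8" or ".utf-8", and ModeBytes otherwise (including for an empty value).
func modeFromLocale(value string) Mode {
	v := strings.ToLower(value)
	if strings.Contains(v, ".utf8") || strings.Contains(v, ".utf-8") {
		return ModeUTF8
	}
	return ModeBytes
}

// detectMode applies the proposed precedence:
//  1. explicit -b / -c flags (assumption: -b wins if both are set),
//  2. the first of LC_ALL, LC_CTYPE, LANG that is defined, even if empty,
//  3. the caller-supplied fallback (e.g. a per-platform default).
//
// lookup has the same signature as os.LookupEnv so it can be faked in tests.
func detectMode(flagB, flagC bool, lookup func(string) (string, bool), fallback Mode) Mode {
	switch {
	case flagB:
		return ModeBytes
	case flagC:
		return ModeUTF8
	}
	for _, name := range []string{"LC_ALL", "LC_CTYPE", "LANG"} {
		if value, ok := lookup(name); ok {
			// Defined (possibly empty): stop searching and use this value.
			return modeFromLocale(value)
		}
	}
	return fallback
}

func main() {
	// The UTF-8 fallback here is only one possible default, as discussed above.
	fmt.Println(detectMode(false, false, os.LookupEnv, ModeUTF8))
}
```

Note that with the "even if empty" rule, `LC_ALL=` (defined but empty) selects byte mode and hides `LC_CTYPE` and `LANG`, which is what the proposal as written implies.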
Does something like this make sense? I think it should be fairly trivial to implement, so the real question is whether such behavior is desirable, right?
Is there some assumption or empirical observation that awk scripts tend to behave better in goawk in one mode or the other?
Is there a meaningful performance impact depending on the Unicode mode? I think bytes mode is typically faster in general, but considering that it might be hard for goawk to do regexp matching in bytes mode, does it still matter for goawk?