[suggestion] revise unicode/binary mode decision #274

As far as I can tell, POSIX awk is required to respect the current locale, but goawk doesn't do that. Instead, it operates in byte mode by default, unless `-c` is specified, in which case it operates in UTF-8 codepoint mode.

While goawk probably can't support arbitrary locales, it does seem (bugs aside) to support the equivalent of an LC_CTYPE of either UTF-8 or plain bytes (ASCII/C?).

So assuming it's desirable for goawk to try and respect the current locale where possible, I think it could look like this:
- Add support for a `-b` argument for binary/bytes mode. gawk has a same-ish `-b` as an alias for `--characters-as-bytes`.
- During argument parsing, if `-b` or `-c` is specified (or `-u`, if `-c` gets replaced with it), use the specified mode.
- Else try to deduce it from the environment, like so:
  - If any of `LC_ALL`, `LC_CTYPE`, `LANG` (checked in this override order) is defined, even if empty (in Go: `os.LookupEnv(name)`), stop the search and use its value:
    - If its lowercased value includes `.utf8` or `.utf-8`, enable UTF-8 mode; otherwise enable byte/binary mode.
- Else (no `-b`/`-c` and none of these vars is defined), pick some default, maybe depending on the platform (e.g. on Windows, and maybe also elsewhere, probably enable UTF-8 because that's what most text files are likely to be).
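To make the proposal concrete, here is a rough sketch of the decision logic in Go. It is not goawk code: `useUTF8`, its parameters, and the injectable `lookup` function are hypothetical names I made up for illustration, and it assumes the third variable is the standard `LANG`. What happens when both `-b` and `-c` are given is also an arbitrary choice here (`-b` wins).

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// useUTF8 reports whether UTF-8 mode should be enabled.
// flagBytes and flagChars stand for the proposed -b and the existing -c flag.
// lookup has the same signature as os.LookupEnv so it can be replaced in tests.
// platformDefault is used when no flag is given and no locale variable is defined.
func useUTF8(flagBytes, flagChars bool, lookup func(string) (string, bool), platformDefault bool) bool {
	// Explicit flags always win over the environment.
	if flagBytes {
		return false
	}
	if flagChars {
		return true
	}
	// The first variable that is defined (even if empty) decides the mode.
	for _, name := range []string{"LC_ALL", "LC_CTYPE", "LANG"} {
		if value, ok := lookup(name); ok {
			lower := strings.ToLower(value)
			return strings.Contains(lower, ".utf8") || strings.Contains(lower, ".utf-8")
		}
	}
	return platformDefault
}

func main() {
	// Assumed default for this demo only: UTF-8 when nothing else is known.
	fmt.Println("UTF-8 mode:", useUTF8(false, false, os.LookupEnv, true))
}
```

One thing to decide: POSIX treats a locale variable that is set to the empty string as if it were unset, so a stricter version would skip empty values and continue to the next variable instead of stopping.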

Does something like this make sense? I think it should be fairly trivial to implement, so the real question is whether such behavior is desirable, right?

Is there some assumption or empirical observation that awk scripts tend to behave better in goawk in one mode or the other?

Is there a meaningful performance impact depending on the unicode mode? I think in general bytes mode is typically faster, but considering that it might be hard for goawk to do regexp in bytes mode, does it still matter for goawk?


