matching always behaves as if "-c" was specified #273

@avih

The only encoding-related information about goawk I could find is in the goawk -h help:

Additional GoAWK features:
  -c                use Unicode chars for index, length, match, substr, and %c

So my understanding was that, by default, it behaves as if in the C/POSIX locale regardless of the system locale (LC_* values), and that -c switches to a "Unicode mode", which presumably means UTF-8.

Is this understanding correct?

Either way, it's probably worth documenting this more explicitly somewhere (the README, maybe?). Additionally, I think I found an issue: regex matching always behaves as if -c was specified:

$ # '\342\225\213' is UTF-8 of U+254B (boxdraw bold horizontal and vertical) 

$ FMT='X\nYYY\n\342\225\213\n'

$ printf "$FMT"
X
YYY
╋

$ ACMD='/^.$/ {print length($1) " /^.$/ " $1}; /^...$/ {print length($1) " /^...$/ " $1}'

$ # --- without -c ---

$ printf "$FMT" | ./goawk "$ACMD"
1 /^.$/ X
3 /^...$/ YYY
3 /^.$/ ╋

$ printf "$FMT" | LC_ALL=C ./goawk "$ACMD"
1 /^.$/ X
3 /^...$/ YYY
3 /^.$/ ╋

$ # --- with -c ---

$ printf "$FMT" | ./goawk -c "$ACMD"
1 /^.$/ X
3 /^...$/ YYY
1 /^.$/ ╋

$ printf "$FMT" | LC_ALL=C ./goawk -c "$ACMD"
1 /^.$/ X
3 /^...$/ YYY
1 /^.$/ ╋

(EDIT: for clarity, moved the printf and goawk arguments into $FMT and $ACMD, respectively)

This seems to confirm that LC_ALL=C is ignored (my default locale is en_US.UTF-8), as it doesn't affect the result, at least of this test case.

-c does affect the result, at least that of length($1) for a single UTF-8 code point encoded as 3 bytes: it's 3 without -c and 1 with -c. So far this looks OK.
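To make the two notions of length concrete, here is a minimal Go sketch (Go being the language goawk is written in). The helper names byteLen and charLen are hypothetical and only illustrate the byte count versus the code point count of the test character; they are not taken from goawk's source.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteLen (hypothetical name) counts bytes, like length() without -c above.
func byteLen(s string) int { return len(s) }

// charLen (hypothetical name) counts UTF-8 code points, like length() with -c above.
func charLen(s string) int { return utf8.RuneCountInString(s) }

func main() {
	s := "\342\225\213" // the same octal escapes as in $FMT: U+254B
	fmt.Println(byteLen(s), charLen(s)) // prints "3 1"
}
```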

However, the match result is unaffected by -c as far as I can tell: ^.$ always matches the 3-byte code point whether or not -c is used, and ^...$ never matches the same 3 bytes.
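A possible explanation, assuming goawk hands regex matching to Go's standard regexp package (an assumption, not something verified in this report): that package always interprets its input as UTF-8, so "." consumes a whole code point and there is no switch for byte-wise matching. The sketch below runs the two patterns from $ACMD through regexp directly; the variable names are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
)

// The two patterns from $ACMD, compiled with Go's standard regexp package.
var (
	oneChar    = regexp.MustCompile(`^.$`)
	threeChars = regexp.MustCompile(`^...$`)
)

func main() {
	// "\342\225\213" is the 3-byte UTF-8 encoding of U+254B.
	for _, s := range []string{"X", "YYY", "\342\225\213"} {
		fmt.Printf("%q: ^.$=%v ^...$=%v\n",
			s, oneChar.MatchString(s), threeChars.MatchString(s))
	}
}
```

With plain regexp, the 3-byte character matches ^.$ and not ^...$, which is the same behavior seen with and without -c above.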
