New command: Parse Lines (Alt+A) #453

@ProgerXP

This command is placed last in Edit > Block (P&arse). It is similar to Modify Lines (Alt+M) in how it treats the initial selection, in its dialog layout with a SysLink, and in remembering input values. The dialog has two buttons on the side and, from top to bottom: the label &Parse each line: with its input; the checkbox Use |scanf()| instead of regular &expression (checked by default; |...| = link to an online reference); a bunch of links and texts similar to Alt+M's; the label &Replace matching line: with its input; and another bunch of links and texts. The links and texts will be determined later.

Operation:

  1. If the scanf() checkbox is checked, preprocess the Parse string (see below).
    • if it isn't and the regexp is malformed, either disable OK (as done in Find) or focus Parse and exit
  2. Walk document line by line like Alt+M does; for every line:
    • run vsscanf() or the regexp on it; if vsscanf()'s result is not exactly N or if the regexp doesn't match, skip to the next line
    • replace the line with the result of calling vsprintf()
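
A minimal sketch of step 2 for one fixed case, assuming the standard sscanf()/snprintf() can be used as-is. `swap_two_fields` is a hypothetical name and the buffer sizes are arbitrary. %[^;] stands in for the first %s because the standard %s only stops at whitespace and would swallow the ; as well:

```c
#include <stdio.h>

/* Hypothetical helper: applies Parse = "%[^;];%s" and the equivalent of
   Replace = "%2$s;%1$s" to one line. Returns 1 and fills `out` if the
   line matched, 0 if the line should be skipped. */
int swap_two_fields(const char *line, char *out, size_t out_size)
{
    char first[64], second[64];
    const int expected = 2;          /* N: data-producing specifiers */

    /* Width 63 keeps each field within its buffer (63 chars + NUL). */
    int got = sscanf(line, "%63[^;];%63s", first, second);
    if (got != expected)
        return 0;                    /* no match: leave the line as-is */

    /* Plain argument reordering stands in for the n$ positions. */
    int n = snprintf(out, out_size, "%s;%s", second, first);
    return n >= 0 && (size_t)n < out_size;
}
```

A line without a ; makes sscanf() return 1 rather than 2, so it is skipped and stays unchanged.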

Bad format strings can very easily cause undefined behaviour and even a crash. It's also possible that the vs...f() functions cannot be used in our scenario. In that case, rather than calling them once per input line, call them once per format specifier in the format string and/or write a custom implementation. If one call per line is made, preprocessing determines N, the number of format specifiers producing data (i.e. non-%* and non-%% specifiers), and may do some checks to avoid UB.
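
The preprocessing that determines N could look roughly like this. It is only a sketch: `count_scanf_fields` is a hypothetical name, and it assumes N is compared against the return value of the standard sscanf(), which does not count %n, so %n is not counted here either. Anything outside the supported set is rejected:

```c
#include <ctype.h>

/* Hypothetical helper: returns how many values sscanf() would report for
   `fmt`, or -1 if a specifier is malformed or outside the supported set. */
int count_scanf_fields(const char *fmt)
{
    int count = 0;
    for (const char *p = fmt; *p; p++) {
        if (*p != '%')
            continue;                 /* literal text or whitespace */
        p++;
        if (*p == '%')
            continue;                 /* %% consumes a literal % */
        int suppressed = (*p == '*');
        if (suppressed)
            p++;
        while (isdigit((unsigned char)*p))
            p++;                      /* optional width */
        switch (*p) {
        case 'd': case 'i': case 'x': case 'f': case 's': case 'c':
            if (!suppressed)
                count++;
            break;
        case 'n':
            break;                    /* stores a position, not counted */
        case '[':
            p++;
            if (*p == '^') p++;
            if (*p == ']') p++;       /* leading ] is a literal member */
            while (*p && *p != ']')
                p++;
            if (*p == '\0')
                return -1;            /* unterminated character class */
            if (!suppressed)
                count++;
            break;
        default:
            return -1;                /* unsupported or truncated */
        }
    }
    return count;
}
```

Rejecting "length" modifiers and unterminated classes up front is one of the cheap checks that avoids handing a dangerous format string to vsscanf().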

All functions use the neutral locale ("C"), in particular no digit grouping (12,345) and . as the decimal separator (3.14). This corresponds to Math Eval's copy result format.
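
For illustration, forcing the neutral numeric locale with the standard setlocale() might look like this. `format_neutral` is a hypothetical name and the %.2f format is arbitrary; note that setlocale() changes the locale for the whole process, so real code would probably set it once or restore the previous value:

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: formats `value` with "." as the decimal separator
   and no digit grouping, regardless of the user's regional settings. */
int format_neutral(double value, char *out, size_t out_size)
{
    setlocale(LC_NUMERIC, "C");       /* process-wide side effect */
    return snprintf(out, out_size, "%.2f", value);
}
```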

We must support the limited feature set of scanf():

  • format specifier begins with %, followed by % to consume literal % or by n to store number of characters read so far
  • otherwise, % may be followed by * (performs the match but doesn't store result), then by sign-less positive number ("width") used by c and s
  • finally, specifier ends on one of d i x f s c or a character class (below); aliases X e g E a are not supported
  • "length" (l h hh j ll L q t z) is not supported; d i x are always signed int or long, f float or double (same type as used for Math Eval), s c [...] always wide
  • character class is defined between two [ ] brackets; the initial [ may be followed by ^ (negative class), by ] (literal ] part of the class), then any number of characters and/or 7-bit ASCII ranges made with - (a-z), then an optional - before the final ] (literal - part of the class)
  • if a custom implementation is used, whitespace in the format string must be treated in a non-standard way: require and consume at least 1 isspace symbol (the standard allows 0)
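
If a custom implementation is written, the character class and whitespace rules above could be sketched as follows. Both function names are hypothetical, and `spec` is assumed to point just past the opening [:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `c` is accepted by the class whose
   text starts at `spec`, 0 if it is rejected, -1 if the final ] is missing. */
int class_accepts(const char *spec, char c)
{
    int negate = 0, found = 0;
    const char *p = spec;

    if (*p == '^') { negate = 1; p++; }
    if (*p == ']') { found = (c == ']'); p++; }   /* leading ] is literal */

    while (*p && *p != ']') {
        if (p[1] == '-' && p[2] != '\0' && p[2] != ']') {
            /* range such as a-z */
            if ((unsigned char)p[0] <= (unsigned char)c &&
                (unsigned char)c <= (unsigned char)p[2])
                found = 1;
            p += 3;
        } else {
            if (*p == c)
                found = 1;            /* single member, incl. trailing - */
            p++;
        }
    }
    if (*p != ']')
        return -1;
    return negate ? !found : found;
}

/* Hypothetical helper for the non-standard whitespace rule: consumes one
   or more isspace characters, or returns NULL if there are none. */
const char *skip_required_space(const char *input)
{
    if (!isspace((unsigned char)*input))
        return NULL;
    while (isspace((unsigned char)*input))
        input++;
    return input;
}
```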

Featureset of printf():

  • format specifier begins with %, followed by % to output literal %
  • otherwise, % may be followed by position$ (a sign-less positive number followed by $), by "flags", by "width" and by .precision
  • "position" consumes next argument, or nth argument if $ is present
  • "flags" may be - (change justification from right to left) or one of (_ = space) 0 _ _0 + +0 (for numeric specifiers; 0 changes padding symbol from space to 0; _ outputs a space if the number is non-negative while + outputs literal + in this case)
  • "width" sets padding and may be a positive number or * (consume next argument as value) or *n$ (consume nth argument as value)
  • "precision" (non-negative number) sets length of mantissa (for numeric specifiers) or whole string; may be .* or .*n$ (non-standard, only if custom implementation)
  • finally, specifier ends on one of d o x X f c s and, if using a custom sprintf() implementation, n with different behaviour (outputs the number of characters written so far rather than storing it)
  • aliases and specifiers i u e E F g G a A C S p m are not supported, and neither is n unless a custom implementation is used
  • as with scanf(), "length" (hh h l ll q L j z Z t) is not supported and is automatic
  • lifted limitations of standard C printf(): n$ style may be mixed with non-n$ (may be solved by preprocessing format string) and it may skip arguments (1$ and 3$ may be used in format string with 2$ unused)
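
One possible way to preprocess the Replace string so that n$ and non-n$ specifiers can be mixed is to give every specifier an explicit position. The sketch below is an assumption, not a settled design: `make_positional` is a hypothetical name, non-positional specifiers are numbered 1, 2, 3... independently of any explicit n$, and * widths or precisions are not handled:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies `fmt` to `out`, adding "n$" to every
   specifier that lacks it. Returns 0 on success, -1 if `out` is too small. */
int make_positional(const char *fmt, char *out, size_t out_size)
{
    size_t used = 0;
    int next = 1;                     /* next implicit argument number */

    if (out_size == 0)
        return -1;
    while (*fmt) {
        char piece[32];
        int len;
        if (fmt[0] != '%') {
            len = snprintf(piece, sizeof piece, "%c", *fmt++);
        } else if (fmt[1] == '%') {
            len = snprintf(piece, sizeof piece, "%%%%");   /* keep %% */
            fmt += 2;
        } else {
            const char *digits = fmt + 1;
            const char *p = digits;
            while (isdigit((unsigned char)*p))
                p++;
            if (p > digits && *p == '$') {
                /* already positional: copy "%n$" unchanged */
                len = snprintf(piece, sizeof piece, "%.*s",
                               (int)(p + 1 - fmt), fmt);
                fmt = p + 1;
            } else {
                /* the digits, if any, were a width: insert a position */
                len = snprintf(piece, sizeof piece, "%%%d$", next++);
                fmt += 1;
            }
        }
        if (len < 0 || (size_t)len >= sizeof piece ||
            used + (size_t)len >= out_size)
            return -1;
        memcpy(out + used, piece, (size_t)len);
        used += (size_t)len;
    }
    out[used] = '\0';
    return 0;
}
```

Flags, width, precision and the conversion character after the position are copied through as ordinary characters, so %6s becomes %1$6s.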

This feature will allow changing column order in a CSV: Parse = %s;%s, Replace = %2$s;%1$s; as well as aligning text:

int n = 0;
char[] str = "foo\n";

Parse = %s %s = %s;, Replace = %6s %-3s = %s;:

   int n   = 0;
char[] str = "foo\n";
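
The alignment above can be reproduced with the standard functions, with one caveat: the standard %s reads up to the next whitespace, so the last field would include the ; and the output would end in ;;. The sketch below therefore assumes %[^;] for the last field; `align_declaration` is a hypothetical name and the buffer sizes are arbitrary:

```c
#include <stdio.h>

/* Hypothetical helper: Parse = "%s %s = %[^;];", Replace = "%6s %-3s = %s;".
   Returns 1 and fills `out` if the line matched, 0 otherwise. */
int align_declaration(const char *line, char *out, size_t out_size)
{
    char type[32], name[32], value[64];

    if (sscanf(line, "%31s %31s = %63[^;];", type, name, value) != 3)
        return 0;                     /* fewer than N = 3 fields: skip */

    /* %6s right-justifies the type, %-3s left-justifies the name. */
    int n = snprintf(out, out_size, "%6s %-3s = %s;", type, name, value);
    return n >= 0 && (size_t)n < out_size;
}
```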

Combined with Sort (Alt+O), one can order lines by their length:

a
ccc
bb

First Alt+A (Parse = %s%n, Replace = %2$06d %1$s), then Alt+O (Logical number comparison):

000001 a
000002 bb
000003 ccc
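
The Alt+A step of this example, sketched with the standard functions. `prefix_with_length` is a hypothetical name, and plain argument order stands in for the %2$ and %1$ positions:

```c
#include <stdio.h>

/* Hypothetical helper: Parse = "%s%n", Replace = "%2$06d %1$s".
   Returns 1 and fills `out` if the line matched, 0 otherwise. */
int prefix_with_length(const char *line, char *out, size_t out_size)
{
    char word[64];
    int length = 0;

    /* %n stores the number of characters consumed so far; it is not
       included in sscanf()'s return value, so a match returns 1. */
    if (sscanf(line, "%63s%n", word, &length) != 1)
        return 0;

    int n = snprintf(out, out_size, "%06d %s", length, word);
    return n >= 0 && (size_t)n < out_size;
}
```

The zero-padded prefix is what lets Sort order the lines by length.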
