Add `name-pattern` token used in prefix operators `?` and `??` by rocky · Pull Request #158 · Mathics3/mathics-scanner

rocky · 2026-03-08T18:13:54Z

Add name-pattern token used in prefix operators: ? and ??.
Add a test for some of the regular expressions in tokeniser.py
More cleanup of the tokenization module
- more strings marked Final, and some names have been capitalized when they are constant
- change_token_scanning_mode made public; it is used by the parser.
- Information token is now QuestionQuestion.

This will be used to support @ pattern matching (match one or more characters, but not uppercase letters).

mmatera · 2026-03-09T03:01:34Z

mathics_scanner/tokeniser.py


+    # FIXME: Add after we figure out how to deal with prefix ? (Information)
+    # versus infix PatternTest.
+    # def t_Question(self, pattern_match: re.Match) -> Token:


The rule seems to be that ? is interpreted as a "Question" if it is the first tag in the command (apart from spaces). For example:

In[1]:= ? x * y //HoldForm//FullForm
Out[1]//FullForm= HoldForm[Information["x * y", Rule[LongForm, False]]]

In[2]:= a; ? x * y //HoldForm//FullForm
Out[2]//FullForm= HoldForm[Information["x * y", Rule[LongForm, False]]]

In[3]:= a \[NewLine] ? x * y //HoldForm//FullForm
Out[3]= a
Out[3]//FullForm= HoldForm[Information["x * y", Rule[LongForm, False]]]

On the other hand, if an element comes before ?, then it is interpreted as a PatternTest:

In[4]:= "O" ? x * y //HoldForm//FullForm
Out[4]//FullForm= HoldForm[Times[PatternTest["O", x], y]]

In[5]:= 3 ? x * y //HoldForm//FullForm
Out[5]//FullForm= HoldForm[Times[PatternTest[3, x], y]]

In[6]:= x ? x * y //HoldForm//FullForm
Out[6]//FullForm= HoldForm[Times[PatternTest[x, x], y]]

Update: ? is also interpreted as Question when it is the first operator inside parentheses:

In[7]:= a (? s) //FullForm
Out[7]//FullForm= Times[a, Missing["UnknownSymbol", "s"]]

or it is the first token inside a part of a sequence:

In[8]:= F[a,?b] //FullForm
Out[8]//FullForm= F[a, Missing["UnknownSymbol", "b"]]

In[9]:= a {?b} //Hold//FullForm
Out[9]//FullForm= Hold[Times[a, List[Information["b", Rule[LongForm, False]]]]]

The rule seems to be that ? is interpreted as a "Question" if it is the first tag in the command (apart from spaces). For example:

This is correct when describing things from an operational level.

However, conceptually, here is how the Mathics3 parser understands this. There is a binary infix "?" and a prefix unary "?". The argument of a prefix unary "?" is a name pattern, while for binary infix operators, the operands are expressions.

And the reason it is important to understand this at a conceptual level rather than at an operational level is that when programming a solution to the problem, one can code (hack) something that tries to address the operational situation, oblivious of the conceptual problem. These kinds of solutions tend to be more complicated, harder to understand, more likely to be incomplete, more fragile, and just plain more code.

The reason I mention this is that we've seen operational but non-conceptual hacks, especially in the code for the scanner and parser.

I'll have a solution for this pretty soon.

Begin rant about the code.

Even after several years working on this, I am still amazed and saddened at how often, when we try to address a very simple problem, there is a lot of code rewriting and sometimes even flaws in design that have to be corrected.

The lack of documentation and vagueness behind why the code was written the way it is often hides the flaws in the concept and intent behind the coding. It makes it harder to even discuss ways to approach a problem because you really can't be certain what the initial approach was.

End rant about the code and onto discussion

What I am coming to understand from the current scanner is that token mode switching (such as between expression, file name, and now name pattern) was hitherto assumed to be something handled strictly in the parser. As noted with handling binary versus unary ?, the parser needs to inform the scanner which pattern to use in scanning, specifically, whether we need a name pattern or an expression pattern.

From an operational standpoint, _change_token_scanning_mode should have its leading underscore dropped, since this should be noted as a public (specifically parser-accessible) function.
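To make the mode-switching idea concrete, here is a minimal sketch of a scanner whose token table depends on a mode that the parser sets. Everything in it is hypothetical: `ToyScanner`, `change_mode`, the mode names, and the simplified regular expressions are illustrations and are not the mathics_scanner API.

```python
import re
from typing import NamedTuple, Optional


class Token(NamedTuple):
    tag: str
    text: str


# Hypothetical, much-simplified token tables: one list per scanning mode.
MODES = {
    "expr": [
        ("Symbol", re.compile(r"[A-Za-z$][A-Za-z0-9$]*")),
        ("Number", re.compile(r"\d+")),
        ("Times", re.compile(r"\*")),
        ("Question", re.compile(r"\?")),
    ],
    "name_pattern": [
        # Identifier characters plus the assumed wildcards "*" and "@".
        ("NamePattern", re.compile(r"[A-Za-z$`*@][A-Za-z0-9$`*@]*")),
    ],
}


class ToyScanner:
    def __init__(self, source: str) -> None:
        self.source = source
        self.pos = 0
        self.mode = "expr"

    def change_mode(self, mode: str) -> None:
        """Public on purpose: the parser, not the scanner, decides the mode."""
        if mode not in MODES:
            raise ValueError(f"unknown scanning mode: {mode}")
        self.mode = mode

    def next_token(self) -> Optional[Token]:
        # Skip blanks, then try each pattern of the current mode in order.
        while self.pos < len(self.source) and self.source[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.source):
            return None
        for tag, regex in MODES[self.mode]:
            match = regex.match(self.source, self.pos)
            if match:
                self.pos = match.end()
                return Token(tag, match.group())
        raise SyntaxError(f"cannot scan at position {self.pos} in mode {self.mode}")
```

In this sketch a parser that has just read a prefix `?` would call `change_mode("name_pattern")` before asking for the next token, so that `x*y` arrives as one NamePattern token instead of Symbol, Times, Symbol.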

The rule seems to be that ? is interpreted as a "Question" if it is the first tag in the command (apart from spaces). For example:

This is correct when describing things from an operational level.

For me at least, understanding things at the operational level is useful for arriving at the conceptual level.

However, conceptually, here is how the Mathics3 parser understands this. There is a binary infix "?" and a prefix unary "?". The argument of a prefix unary "?" is a name pattern, while for binary infix operators, the operands are expressions.

Looking at the previous examples and the ones I just added, the decision between a prefix or infix interpretation depends on what came before: if there is a valid left operand for an infix operator, then it should be an infix operator. Otherwise, it is a prefix operator.
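A rough sketch of that decision, assuming the parser can see the tag of the previous token. The tag names and the `OPERAND_END_TAGS` set are hypothetical and chosen only to mirror the examples in this thread; the actual parser may decide this differently.

```python
from typing import Optional

# Hypothetical tags of tokens that can end a left operand.
OPERAND_END_TAGS = {
    "Symbol",
    "Number",
    "String",
    "RightParen",
    "RightBracket",
    "RightBrace",
}


def classify_question(previous_tag: Optional[str]) -> str:
    """Classify a "?" token from the tag of the token before it.

    Returns "PatternTest" (infix) when the previous token can end a left
    operand, and "Question" (prefix) otherwise, including at start of input,
    where previous_tag is None.
    """
    if previous_tag in OPERAND_END_TAGS:
        return "PatternTest"
    return "Question"
```

Under these assumptions, `x ? x * y` sees a Symbol before `?` and yields PatternTest, while `a; ? x`, `a (? s)`, and `F[a,?b]` see a separator or an opening bracket and yield Question.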

And the reason it is important to understand this at a conceptual level rather than at an operational level is that when programming a solution to the problem, one can code (hack) something that tries to address the operational situation, oblivious of the conceptual problem. These kinds of solutions tend to be more complicated, harder to understand, more likely to be incomplete, more fragile, and just plain more code.

Yes, I see that.

The reason I mention this is that we've seen operational but non-conceptual hacks, especially in the code for the scanner and parser.

I'll have a solution for this pretty soon.
Great!

For me at least, understanding things at the operational level is useful for arriving at the conceptual level.

Of course! It is hard, if not impossible, to understand conceptual flaws without having detailed behavior.

That leads to a second aspect of problem-solving: narrowing the data to the smallest example that shows the problem. Having lots of data is important to confirm behavior, but when communicating an idea, whittling the data to the smallest example that shows the boundaries helps to convey the concept.

(In a paper, the full details are relegated to footnotes and endnotes.)

mmatera · 2026-03-09T11:43:16Z

mathics_scanner/tokeniser.py

-        ">": ("PutAppend", "Put", "GreaterEqual", "Greater"),
-        "?": ("Information", "PatternTest"),
+        "=": (
+            # Note that "Set" has to come last.


There are several comments like this, which do not say why it has to come last: the reason, I guess, is that shorter sequences should come last to ensure that a part of the largest pattern is not confused with the shortest pattern. This behavior could be reinforced at the end of the code, by doing something like

# Ensure that longer tokens come first
literal_tokens = {key: tuple(sorted(val, key=lambda x: -len(x))) for key, val in literal_tokens.items()}

or with an assertion

# Ensure that longer tokens come first
assert len({key: val for key, val in literal_tokens.items() if tuple(sorted(val, key=lambda x: -len(x))) != tuple(val)}) == 0, "Some tokens were not properly sorted. Longer literals should come first."

There are several comments like this, which do not say why it has to come last: the reason, I guess, is that shorter sequences should come last to ensure that a part of the largest pattern is not confused with the shortest pattern.

Are the comments on lines 413-415 sufficient?

# Note the tuple is in priority order. In particular, tokens
# associated with a single character tokens like Factorial (!), has to
# come after both Unequal (!=), and Factorial2 (!!).

# Ensure that longer tokens come first
assert len({key: val for key, val in literal_tokens.items() if tuple(sorted(val, key=lambda x: -len(x))) != tuple(val)}) == 0, "Some tokens were not properly sorted. Longer literals should come first."

Good idea! This would make an excellent addition to the tests. Would you care to add this to the tests?

There are several comments like this, which do not say why it has to come last: the reason, I guess, is that shorter sequences should come last to ensure that a part of the largest pattern is not confused with the shortest pattern.

Are the comments on lines 413-415 sufficient?

I think so

assert len({key: val for key, val in literal_tokens.items() if tuple(sorted(val, key=lambda x: -len(x))) != tuple(val)}) == 0, "Some tokens were not properly sorted. Longer literals should come first."

OK, I can add this to the pytest module.
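One possible shape for such a test is sketched below. If the tuples hold tag names, as in the diff hunk above, then sorting by `len(x)` compares the names rather than the operator text ("Greater" is longer than "Put", although ">" is shorter than ">>"), so this sketch compares the literal text instead. The `LITERAL_TEXT` mapping, the trimmed `LITERAL_TOKENS` table, and the helper name are assumptions made for illustration; the real tables in tokeniser.py may be shaped differently.

```python
# Hypothetical stand-ins for the tables in mathics_scanner/tokeniser.py.
LITERAL_TEXT = {
    "PutAppend": ">>>",
    "Put": ">>",
    "GreaterEqual": ">=",
    "Greater": ">",
    "Unequal": "!=",
    "Factorial2": "!!",
    "Factorial": "!",
}

LITERAL_TOKENS = {
    ">": ("PutAppend", "Put", "GreaterEqual", "Greater"),
    "!": ("Unequal", "Factorial2", "Factorial"),
}


def misordered_pairs(literal_tokens, literal_text):
    """Return (first_char, earlier_tag, later_tag) triples in which the
    earlier tag's text is a proper prefix of the later tag's text, so the
    later tag would be shadowed when tags are tried in tuple order."""
    problems = []
    for first_char, tags in literal_tokens.items():
        for i, earlier in enumerate(tags):
            for later in tags[i + 1:]:
                short, long_ = literal_text[earlier], literal_text[later]
                if short != long_ and long_.startswith(short):
                    problems.append((first_char, earlier, later))
    return problems


def test_longer_literals_come_first():
    assert misordered_pairs(LITERAL_TOKENS, LITERAL_TEXT) == []
```

Checking the prefix relation, rather than raw length, lets unrelated literals of different lengths (such as ">=" and ">>>") appear in either order without failing the test.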

mmatera · 2026-03-09T11:52:47Z

mathics_scanner/tokeniser.py

 (\*\^(\+|-)?\d+)?                           (?# Exponent)
 """

+# The additional characters that can appear as Names[] metacharacters.


Another note: the argument of Names is always a String. It happens that when an expression starts with ? or ??, the subsequent code is parsed as a string, as long as it is a valid input string for Names. As far as I could see, this tokenizer mode applies only to these two prefix operators.

Another note: the argument of Names is always a String. It happens that when an expression starts with ? or ??, the subsequent code is parsed as a string, as long as it is a valid input string for Names.

What I was trying to convey is that this kind of pattern is described in the WMA documentation under Names; the Information docs, which have a form that accepts this pattern unquoted, refer to the Names section.

As far as I could see, this tokenizer mode applies only to these two prefix operators.

Yes, that kind of custom behavior is a feature of WMA syntax. The FileName syntax also has very limited use in Get, Put, and PutAppend.
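For readers unfamiliar with the wildcards, here is a small sketch of how a Names-style pattern could be matched against a list of symbol names: `*` matches zero or more characters, and `@` matches one or more characters that are not uppercase letters, as noted in the PR description. The function names are hypothetical, and the uppercase test is simplified to ASCII; this is not how Mathics3 implements Names.

```python
import re
from typing import List


def names_wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a Names-style wildcard pattern into a Python regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")        # zero or more characters
        elif ch == "@":
            parts.append("[^A-Z]+")   # one or more non-uppercase (ASCII) characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matching_names(pattern: str, names: List[str]) -> List[str]:
    """Return the names that match the whole wildcard pattern."""
    regex = names_wildcard_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]
```

For example, `matching_names("Plot*", ["Plot", "Plot3D", "ListPlot"])` keeps "Plot" and "Plot3D", while `matching_names("x@", ["x", "xy", "xY"])` keeps only "xy".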

@mmatera a743fa7 changes the wording. Is this clear/correct now?


mmatera · 2026-03-09T12:05:13Z

mathics_scanner/tokeniser.py

+    # However, in standalone token reading, that is, without the parser,
+    # having this will give more correct answers. In particular,
+    # it makes mathics3-tokens give more correct answers, and
+    # test_tokeniser has a test that ??X identifies X as a NamePattern.


Right now (in master), what the parser does is just wrong. In my patch, I assume that the tokenizer returns a String or a Symbol as the following token. What should happen is that the tokenizer always returns a String as the token following a Question/QuestionQuestion token.

Right now (in master), what the parser does is just wrong. In my patch, I assume that the tokenizer returns a String or a Symbol as the following token. What should happen is that the tokenizer always returns a String as the token following a Question/QuestionQuestion token.

In the revised behavior, the token tag is a NamePattern, which is neither a Symbol nor a String. In my opinion, there is nothing wrong with this. Quite the contrary, this is extremely clear. It is more explicit in indicating that, when it is treated as a String, it may contain only identifier characters and the NAMES_WILDCARDS characters.

A NamePattern token is OK. What is (was?) wrong was getting a Symbol and otherwise showing a "Missing[...]" expression.
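As an illustration of what a parser could build from such a token pair, the sketch below turns a Question or QuestionQuestion tag plus the NamePattern text into the FullForm string seen in the WMA output earlier in this thread. The function name is hypothetical, and mapping `??` to `LongForm -> True` is an assumption taken from the WMA documentation of Information, not something shown in this PR.

```python
# Assumed mapping from the prefix operator's tag to the LongForm option.
LONG_FORM = {"Question": "False", "QuestionQuestion": "True"}


def information_full_form(operator_tag: str, name_pattern: str) -> str:
    """Build the FullForm text for a prefix "?" or "??" applied to a name
    pattern. The pattern is always emitted as a string, never as a Symbol."""
    if operator_tag not in LONG_FORM:
        raise ValueError(f"not a prefix information operator: {operator_tag}")
    return f'Information["{name_pattern}", Rule[LongForm, {LONG_FORM[operator_tag]}]]'
```

With this, `information_full_form("Question", "x * y")` gives `Information["x * y", Rule[LongForm, False]]`, the same text that appears inside HoldForm in Out[1].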

Co-authored-by: Juan Mauricio Matera <matera@fisica.unlp.edu.ar>

mmatera

LGTM

rocky · 2026-03-09T16:34:29Z

Merging this has to be coordinated with #1714, and #1715. So for now, we have to wait on merging.

rocky and others added 4 commits March 8, 2026 14:10

Add Names Pattern tokenization

b686852

Start regexp text. Change branch for CI testing

d91ea41

Update tokeniser.py

24e7a7b

Update tokeniser.py

f1f3906

rocky force-pushed the add-Name-pattern-token branch from 9717f36 to f1f3906 Compare March 8, 2026 18:18

rocky added 2 commits March 8, 2026 16:29

Uppercase _letter{,likes}

3b5f6d7

This will be used to support @ pattern matching (match one or more characters, but not uppercase letters).

Information -> QuestionQuestion

95c679d

mmatera reviewed Mar 9, 2026


Make change_token_scanning_mode public

28dd5ab

mmatera reviewed Mar 9, 2026


mmatera reviewed Mar 9, 2026


rocky and others added 5 commits March 9, 2026 08:08

Update mathics_scanner/tokeniser.py

c441b3d

Co-authored-by: Juan Mauricio Matera <matera@fisica.unlp.edu.ar>

Clarify a comment.

a743fa7

Names vs Information prefix operand in comments

364c0a3

Tweak comments, yet again.

f8ebd66

with-names-pattern -> with-names-wildcard

09e6c03

rocky changed the title ~~Add name pattern token~~ Add name-pattern token used in prefix operators ? and ?? Mar 9, 2026

mmatera approved these changes Mar 9, 2026


rocky merged commit 08e224d into master Mar 9, 2026
12 checks passed

rocky deleted the add-Name-pattern-token branch March 9, 2026 16:59
