
Add name-pattern token used in prefix operators ? and ??#158

Merged
rocky merged 12 commits into master from add-Name-pattern-token
Mar 9, 2026

Member

@rocky rocky commented Mar 8, 2026

  • Add name-pattern token used in prefix operators: ? and ??.
  • Add a test for some of the regular expressions in tokenise.py
  • More cleanup of the tokenization module
    • more strings marked Final, and some names have been capitalized when they are constant
    • change_token_scanning_mode made public; it is used by the parser.
    • Information token is now QuestionQuestion.

@rocky rocky force-pushed the add-Name-pattern-token branch from 9717f36 to f1f3906 Compare March 8, 2026 18:18
rocky added 2 commits March 8, 2026 16:29
This will be used to support @ pattern matching (match one or more
characters, but not uppercase letters).
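
As a rough illustration of the matching described in this commit message, a name pattern can be translated into a Python regular expression. This is a minimal sketch, not the Mathics3 implementation: the helper name `name_pattern_to_regex` is hypothetical, and it assumes that `*` matches zero or more characters while `@` matches one or more characters other than (ASCII) uppercase letters.

```python
import re


def name_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Names[]-style pattern into a compiled regex (hypothetical helper)."""
    parts = []
    for ch in pattern:
        if ch == "*":
            # Assumption: "*" matches zero or more characters of any kind.
            parts.append(".*")
        elif ch == "@":
            # "@" matches one or more characters, but not uppercase letters.
            # Only ASCII uppercase is excluded in this sketch.
            parts.append("[^A-Z]+")
        else:
            # Everything else must match literally.
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


names = ["Plot", "Plot3D", "Plus", "Power"]
at_matches = [n for n in names if name_pattern_to_regex("Pl@").fullmatch(n)]
star_matches = [n for n in names if name_pattern_to_regex("Pl*").fullmatch(n)]
```

Here `Pl@` rejects `Plot3D` because of the uppercase `D`, while `Pl*` accepts it.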

# FIXME: Add after we figure out how to deal with prefix ? (Information)
# versus infix PatternTest.
# def t_Question(self, pattern_match: re.Match) -> Token:
Contributor

@mmatera mmatera Mar 9, 2026

The rule seems to be that ? is interpreted as a "Question" if it is the first tag in the command (apart from spaces). For example:

In[1]:=   ? x * y //HoldForm//FullForm                                         

Out[1]//FullForm= HoldForm[Information["x * y", Rule[LongForm, False]]]

In[2]:= a; ? x * y //HoldForm//FullForm                                         

Out[2]//FullForm= HoldForm[Information["x * y", Rule[LongForm, False]]]


In[3]:= a \[NewLine] ? x * y //HoldForm//FullForm                              

Out[3]= a

Out[3]//FullForm= HoldForm[Information["x * y", Rule[LongForm, False]]]

On the other hand, if an element comes before ?, then it is interpreted as a PatternTest:

In[4]:= "O" ? x * y //HoldForm//FullForm                                       

Out[4]//FullForm= HoldForm[Times[PatternTest["O", x], y]]

In[5]:= 3 ? x * y //HoldForm//FullForm                                         

Out[5]//FullForm= HoldForm[Times[PatternTest[3, x], y]]

In[6]:= x ? x * y //HoldForm//FullForm                                         

Out[6]//FullForm= HoldForm[Times[PatternTest[x, x], y]]

Update: ? is also interpreted as Question when it is the first operator inside parentheses:

In[7]:= a (? s) //FullForm                                                      

Out[7]//FullForm= Times[a, Missing["UnknownSymbol", "s"]]

or when it is the first token of an element in a sequence:

In[8]:= F[a,?b] //FullForm                                                      

Out[8]//FullForm= F[a, Missing["UnknownSymbol", "b"]]

In[9]:= a {?b} //Hold//FullForm                                                 

Out[9]//FullForm= Hold[Times[a, List[Information["b", Rule[LongForm, False]]]]]
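
The examples above suggest an operational rule based on the token that precedes `?`. The following is a minimal sketch of that rule only; the token tag names and the `classify_question` helper are hypothetical and do not come from the Mathics3 scanner.

```python
from typing import Optional

# Hypothetical token tags after which "?" cannot have a left operand.
PREFIX_CONTEXT = {
    None,           # start of input:   ? x * y
    "Semicolon",    # after a ";":      a; ? x * y
    "NewLine",      # after a newline
    "OpenParen",    # a (? s)
    "OpenBracket",  # F[?b]
    "OpenBrace",    # a {?b}
    "Comma",        # F[a, ?b]
}


def classify_question(previous_tag: Optional[str]) -> str:
    """Return "Information" for a prefix "?" and "PatternTest" for an infix one."""
    return "Information" if previous_tag in PREFIX_CONTEXT else "PatternTest"


at_start = classify_question(None)
after_symbol = classify_question("Symbol")
after_comma = classify_question("Comma")
```

With these assumptions, a `?` at the start of input or after a comma is classified as prefix, and a `?` after a symbol, string, or number is classified as infix.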

Member Author

The rule seems to be that ? is interpreted as a "Question" if it is the first tag in the command (apart from spaces). For example:

This is correct when describing things from an operational level.

However, conceptually, here is how the Mathics3 parser understands this. There is a binary infix "?" and a prefix unary "?". The argument of a prefix unary "?" is a name pattern, while for binary infix operators, the operands are expressions.

And the reason it is important to understand this at a conceptual level rather than at an operational level is that when programming a solution to the problem, one can code (hack) something that tries to address the operational situation, oblivious of the conceptual problem. These kinds of solutions tend to be more complicated, harder to understand, more likely to be incomplete, more fragile, and just plain more code.

The reason I mention this is that we've seen operational but non-conceptual hacks, especially in the code for the scanner and parser.

I'll have a solution for this pretty soon.

Member Author

@rocky rocky Mar 9, 2026

Begin rant about the code.

Even after several years working on this, I am still amazed and saddened at how often addressing a very simple problem requires a lot of code rewriting, and sometimes even correcting flaws in the design.

The lack of documentation, and the vagueness about why the code was written the way it was, often hide flaws in the concept and intent behind the code. This makes it harder to even discuss ways to approach a problem, because you really can't be certain what the initial approach was.

End rant about the code; on to the discussion.

What I am coming to understand from the current scanner is that token mode switching (such as between expression, file name, and now name pattern) was hitherto assumed to be something handled strictly in the parser. As noted with handling binary versus unary ?, the parser needs to inform the scanner which pattern to use in scanning, specifically, whether we need a name pattern or an expression pattern.

From an operational standpoint, _change_token_scanning_mode should have its leading underscore dropped, since this should be noted as a public (specifically parser-accessible) function.
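
To make the idea of the parser informing the scanner concrete, here is a toy sketch of parser-driven mode switching. It is not the Mathics3 code: `ToyScanner`, `change_mode`, the mode names, and the regular expressions are all hypothetical stand-ins chosen for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical token patterns per scanning mode, tried in order.
MODES = {
    "expr": [
        ("Question", re.compile(r"\?")),
        ("Times", re.compile(r"\*")),
        ("Symbol", re.compile(r"[A-Za-z$][A-Za-z0-9$]*")),
    ],
    "name_pattern": [
        ("NamePattern", re.compile(r"[A-Za-z0-9$`*@]+")),
    ],
}


class ToyScanner:
    def __init__(self, text: str):
        self.text = text
        self.pos = 0
        self.mode = "expr"

    def change_mode(self, mode: str) -> None:
        """Public on purpose: the parser calls this to pick the token patterns."""
        self.mode = mode

    def next_token(self) -> Tuple[str, str]:
        # Skip blanks, then try the patterns of the current mode in order.
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.text):
            return ("END", "")
        for tag, regex in MODES[self.mode]:
            match = regex.match(self.text, self.pos)
            if match:
                self.pos = match.end()
                return (tag, match.group())
        raise SyntaxError(f"unexpected character at position {self.pos}")


def parse_prefix_question(scanner: ToyScanner) -> List[Tuple[str, str]]:
    """Toy parser step: a leading "?" takes a name pattern as its operand."""
    tokens = [scanner.next_token()]
    if tokens[0][0] == "Question":
        scanner.change_mode("name_pattern")  # the parser informs the scanner
        tokens.append(scanner.next_token())
        scanner.change_mode("expr")          # back to expression scanning
    return tokens


tokens = parse_prefix_question(ToyScanner("? Pl*"))
```

In expression mode the same text `Pl*` would be scanned as a Symbol followed by Times; only the parser knows that the operand of a prefix `?` should be read as a single name pattern.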

Contributor

The rule seems to be that ? is interpreted as a "Question" if it is the first tag in the command (apart from spaces). For example:

This is correct when describing things from an operational level.

For me at least, understanding things at the operational level is useful for arriving at the conceptual level.

However, conceptually, here is how the Mathics3 parser understands this. There is a binary infix "?" and a prefix unary "?". The argument of a prefix unary "?" is a name pattern, while for binary infix operators, the operands are expressions.

Looking at the previous examples and the ones I just added, the decision between a prefix or infix interpretation depends on what came before: if there is a valid argument for an infix operator, then it should be an infix operator. Otherwise, it is a prefix operator.

And the reason it is important to understand this at a conceptual level rather than at an operational level is that when programming a solution to the problem, one can code (hack) something that tries to address the operational situation, oblivious of the conceptual problem. These kinds of solutions tend to be more complicated, harder to understand, more likely to be incomplete, more fragile, and just plain more code.

Yes, I see that.

The reason I mention this is that we've seen operational but non-conceptual hacks, especially in the code for the scanner and parser.

I'll have a solution for this pretty soon.
Great!

Member Author

@rocky rocky Mar 9, 2026

For me at least, understanding things at the operational level is useful for arriving at the conceptual level.

Of course! It is hard, if not impossible, to understand conceptual flaws without having detailed behavior.

That leads to a second aspect of problem-solving: narrowing the data to the smallest example that shows the problem. Having lots of data is important to confirm behavior, but when communicating an idea, whittling the data to the smallest example that shows the boundaries helps to convey the concept.

(In a paper, the full details are relegated to footnotes and endnotes.)

">": ("PutAppend", "Put", "GreaterEqual", "Greater"),
"?": ("Information", "PatternTest"),
"=": (
# Note that "Set" has to come last.
Contributor

There are several comments like this, which do not say why it has to come last. The reason, I guess, is that shorter sequences should come last, so that the start of a longer token is not mistaken for a shorter token. This behavior could be enforced at the end of the code, by doing something like

# Ensure that longer tokens come first
literal_tokens = {key: tuple(sorted(val, key=lambda x: -len(x))) for key, val in literal_tokens.items()}

or with an assertion

# Ensure that longer tokens come first
assert len({key: val for key, val in literal_tokens.items() if tuple(sorted(val, key=lambda x: -len(x))) != val}) == 0, "Some tokens were not properly sorted. Longer literals should come first."

Member Author

There are several comments like this, which do not say why it has to come last. The reason, I guess, is that shorter sequences should come last, so that the start of a longer token is not mistaken for a shorter token.

Are the comments on lines 413-415 sufficient?

    # Note the tuple is in priority order. In particular, a token
    # associated with a single character, like Factorial (!), has to
    # come after both Unequal (!=) and Factorial2 (!!).

# Ensure that longer tokens come first
assert len({key: val for key, val in literal_tokens.items() if tuple(sorted(val, key=lambda x: -len(x))) != val}) == 0, "Some tokens were not properly sorted. Longer literals should come first."

Good idea! This would make an excellent addition to the tests. Would you care to add this to the tests?

Contributor

There are several comments like this, which do not say why it has to come last. The reason, I guess, is that shorter sequences should come last, so that the start of a longer token is not mistaken for a shorter token.

Are the comments on lines 413-415 sufficient?

I think so

Contributor

assert len({key: val for key, val in literal_tokens.items() if tuple(sorted(val, key=lambda x: -len(x))) != val}) == 0, "Some tokens were not properly sorted. Longer literals should come first."

OK, I can add this to the pytest module.
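
A pytest-style version of this check might look like the sketch below. One caveat, assuming the tuples hold token names as in the excerpt above: `len(x)` then measures the name rather than the literal, so "GreaterEqual" would sort ahead of "PutAppend" even though `>>>` is longer than `>=`. The sketch therefore compares the literals themselves. Both tables here are hypothetical stand-ins, not the scanner's actual data, and `find_shadowed_tokens` is an invented helper.

```python
# Hypothetical stand-in for the scanner's table: first character -> token
# names, listed in the order in which they are tried.
literal_tokens = {
    "!": ("Unequal", "Factorial2", "Factorial"),
    ">": ("PutAppend", "Put", "GreaterEqual", "Greater"),
}

# Hypothetical mapping from token name to the literal text it matches.
token_literals = {
    "Unequal": "!=", "Factorial2": "!!", "Factorial": "!",
    "PutAppend": ">>>", "Put": ">>", "GreaterEqual": ">=", "Greater": ">",
}


def find_shadowed_tokens(table, literals):
    """Return (earlier, later) name pairs where the earlier literal is a proper
    prefix of the later one, so the later token could never be reached."""
    shadowed = []
    for names in table.values():
        for i, earlier in enumerate(names):
            for later in names[i + 1:]:
                early_text, late_text = literals[earlier], literals[later]
                if late_text != early_text and late_text.startswith(early_text):
                    shadowed.append((earlier, later))
    return shadowed


def test_longer_literals_come_first():
    assert find_shadowed_tokens(literal_tokens, token_literals) == []
```

Checking for shadowing by prefix, rather than comparing against a sorted copy, also avoids false alarms for literals of equal length that are listed in either order.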

(\*\^(\+|-)?\d+)? (?# Exponent)
"""

# The additional characters that can appear as Names[] metacharacters.
Contributor

Another note: the argument of Names is always a String. It happens that when an expression starts with ? or ??, the subsequent code is parsed as a string, as long as it is a valid input string for Names. As far as I could see, this tokenizer mode applies only to these two prefix operators.

Member Author

@rocky rocky Mar 9, 2026

Another note: the argument of Names is always a String. It happens that when an expression starts with ? or ??, the subsequent code is parsed as a string, as long as it is a valid input string for Names.

What I was trying to convey is that this kind of pattern is described in the WMA documentation under Names, and in the Information docs, which have a form that takes this pattern unquoted and refer to the Names section.

As far as I could see, this tokenizer mode applies only to these two prefix operators.

Yes, that kind of custom behavior is a feature of WMA syntax. The FileName syntax also has very limited use in Get, Put, and PutAppend.

Member Author

@mmatera a743fa7 changes the wording. Is this clear/correct now?

# However, in standalone token reading, that is, without the parser,
# having this will give more correct answers. In particular,
# it makes mathics3-tokens give more correct answers, and
# test_tokeniser has a test that ??X identifies X as a NamePattern.
Contributor

Right now (in master), what the parser does is just wrong. In my patch, I assume that the tokenizer returns a String or a Symbol as the following token. What should happen is that the tokenizer always returns a String as the token following a Question/QuestionQuestion token.

Member Author

@rocky rocky Mar 9, 2026

Right now (in master), what the parser does is just wrong. In my patch, I assume that the tokenizer returns a String or a Symbol as the following token. What should happen is that the tokenizer always returns a String as the token following a Question/QuestionQuestion token.

In the revised behavior, the token tag is a NamePattern, which is neither a Symbol nor a String. In my opinion, there is nothing wrong with this. Quite the contrary, this is extremely clear. It is more explicit in indicating that when it is considered a String, it is only allowed to have identifier symbols and the NAMES_WILDCARDS characters.
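
The restriction described here can be sketched as a single regular expression. This is only an illustration under stated assumptions: the wildcard set below is a guess (the actual contents of NAMES_WILDCARDS are not shown in this conversation), the identifier character set is simplified to ASCII, and `is_name_pattern` is a hypothetical helper.

```python
import re

# Assumption: the real NAMES_WILDCARDS may contain different characters.
ASSUMED_WILDCARDS = "*@"

# Simplified identifier characters (letters, digits, "$", and the context
# mark "`") plus the assumed wildcard characters.
NAME_PATTERN_RE = re.compile(rf"[A-Za-z0-9$`{re.escape(ASSUMED_WILDCARDS)}]+")


def is_name_pattern(text: str) -> bool:
    """True if the whole text could be scanned as one name-pattern token."""
    return NAME_PATTERN_RE.fullmatch(text) is not None


plain = is_name_pattern("Plot")
wild = is_name_pattern("System`Pl@t*")
operator = is_name_pattern("a+b")
```

A dedicated tag makes this restriction visible to the parser, which a generic String tag would not.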

Contributor

A NamePattern token is OK. What is (was?) wrong was to get a Symbol, and otherwise show a "Missing[...]" expression.

@rocky rocky changed the title Add name pattern token Add name-pattern token used in prefix operators ? and ?? Mar 9, 2026
Contributor

@mmatera mmatera left a comment

LGTM

Member Author

rocky commented Mar 9, 2026

Merging this has to be coordinated with #1714 and #1715, so for now we have to wait on merging.

@rocky rocky merged commit 08e224d into master Mar 9, 2026
12 checks passed
@rocky rocky deleted the add-Name-pattern-token branch March 9, 2026 16:59