Add name-pattern token used in prefix operators ? and ?? #158
This will be used to support @ pattern matching (match one or more characters, but not uppercase letters).
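As a rough illustration of what such a pattern means, the sketch below translates a Names-style pattern into a Python regular expression. The helper names `name_pattern_to_regex` and `matching_names` are hypothetical and not part of mathics_scanner. The `*` wildcard (zero or more characters) follows the WMA documentation for Names, and the uppercase test is simplified to ASCII letters.

```python
import re
from typing import List


def name_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Names-style pattern into a compiled regular expression.

    "*" matches zero or more characters; "@" matches one or more
    characters, none of which is an (ASCII) uppercase letter.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "@":
            parts.append("[^A-Z]+")
        else:
            # Everything else must match literally.
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matching_names(pattern: str, names: List[str]) -> List[str]:
    """Return the names that match the whole pattern, in input order."""
    regex = name_pattern_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]
```

With the names `List`, `LinearSolve`, `Line`, and `Li`, the pattern `Li@` would select only `List` and `Line`, while `Li*` would select all four.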
mathics_scanner/tokeniser.py
Outdated
# FIXME: Add after we figure out how to deal with prefix ? (Information)
# versus infix PatternTest.
# def t_Question(self, pattern_match: re.Match) -> Token:
The rule seems to be that ? is interpreted as a "Question" if it is the first tag in the command (apart from spaces). For example:
In[1]:= ? x * y //HoldForm//FullForm
Out[1]//FullForm= HoldForm[Information["x * y", Rule[LongForm, False]]]
In[2]:= a; ? x * y //HoldForm//FullForm
Out[2]//FullForm= HoldForm[Information["x * y", Rule[LongForm, False]]]
In[3]:= a \[NewLine] ? x * y //HoldForm//FullForm
Out[3]= a
Out[3]//FullForm= HoldForm[Information["x * y", Rule[LongForm, False]]]
On the other hand, if an element comes before ?, then it is interpreted as a PatternTest:
In[4]:= "O" ? x * y //HoldForm//FullForm
Out[4]//FullForm= HoldForm[Times[PatternTest["O", x], y]]
In[5]:= 3 ? x * y //HoldForm//FullForm
Out[5]//FullForm= HoldForm[Times[PatternTest[3, x], y]]
In[6]:= x ? x * y //HoldForm//FullForm
Out[6]//FullForm= HoldForm[Times[PatternTest[x, x], y]]
Update: ? is also interpreted as Question when it is the first operator inside a parenthesis:
In[7]:= a (? s) //FullForm
Out[7]//FullForm= Times[a, Missing["UnknownSymbol", "s"]]
or it is the first token inside a part of a sequence:
In[8]:= F[a,?b] //FullForm
Out[8]//FullForm= F[a, Missing["UnknownSymbol", "b"]]
In[9]:= a {?b} //Hold//FullForm
Out[9]//FullForm= Hold[Times[a, List[Information["b", Rule[LongForm, False]]]]]
The rule seems to be that ? is interpreted as a "Question" if it is the first tag in the command (apart from spaces). For example:
This is correct when describing things from an operational level.
However, conceptually, here is how the Mathics3 parser understands this. There is a binary infix "?" and a prefix unary "?". The argument of a prefix unary "?" is a name pattern, while for binary infix operators, the operands are expressions.
And the reason it is important to understand this at a conceptual level rather than at an operational level is that when programming a solution to the problem, one can code (hack) something that tries to address the operational situation, oblivious of the conceptual problem. These kinds of solutions tend to be more complicated, harder to understand, more likely to be incomplete, more fragile, and just plain more code.
The reason I mention this is that we've seen operational but non-conceptual hacks, especially in the code for the scanner and parser.
I'll have a solution for this pretty soon.
Begin rant about the code.
Even after several years working on this, I am still amazed and saddened at how often, when we try to address a very simple problem, there is a lot of code rewriting and sometimes even flaws in design that have to be corrected.
The lack of documentation and vagueness behind why the code was written the way it is often hides the flaws in the concept and intent behind the coding. It makes it harder to even discuss ways to approach a problem because you really can't be certain what the initial approach was.
End rant about the code and onto discussion
What I am coming to understand from the current scanner is that token mode switching (such as between expression, file name, and now name pattern) was hitherto assumed to be something handled strictly in the parser. As noted with handling binary versus unary ?, the parser needs to inform the scanner which pattern to use in scanning, specifically, whether we need a name pattern or an expression pattern.
From an operational standpoint, _change_token_scanning_mode should have its leading underscore dropped, since this should be noted as a public (specifically parser-accessible) function.
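To make the division of labor concrete, here is a minimal sketch of a scanner whose mode is switched by its caller. Only the method name `change_token_scanning_mode` comes from this discussion; its signature, the `SketchScanner` class, the mode names, and the token tables are assumptions for illustration and do not reflect the actual mathics_scanner code. The sketch also reads a single name-pattern token, which is simpler than the WMA behavior shown in the examples above.

```python
import re
from typing import List, Tuple

# Hypothetical token tables, one per scanning mode. Order matters:
# "??" must be tried before "?".
TOKEN_TABLES = {
    "expr": [
        ("QuestionQuestion", re.compile(r"\?\?")),
        ("Question", re.compile(r"\?")),
        ("Symbol", re.compile(r"[A-Za-z$][A-Za-z0-9$]*")),
        ("Times", re.compile(r"\*")),
    ],
    "name_pattern": [
        ("NamePattern", re.compile(r"[A-Za-z0-9$`*@]+")),
    ],
}


class SketchScanner:
    def __init__(self, text: str):
        self.text = text
        self.pos = 0
        self.mode = "expr"

    def change_token_scanning_mode(self, mode: str) -> None:
        """Public hook: the parser calls this to select the token table."""
        if mode not in TOKEN_TABLES:
            raise ValueError(f"unknown scanning mode: {mode}")
        self.mode = mode

    def next_token(self) -> Tuple[str, str]:
        # Skip blanks, then try each pattern of the current mode in order.
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.text):
            return ("END", "")
        for tag, regex in TOKEN_TABLES[self.mode]:
            match = regex.match(self.text, self.pos)
            if match:
                self.pos = match.end()
                return (tag, match.group())
        raise SyntaxError(f"unexpected character at position {self.pos}")


def scan_information_query(text: str) -> List[Tuple[str, str]]:
    """Play the parser's role for an input that may start with ? or ??."""
    scanner = SketchScanner(text)
    tokens = [scanner.next_token()]
    if tokens[0][0] in ("Question", "QuestionQuestion"):
        # The operand of prefix ? is a name pattern, not an expression.
        scanner.change_token_scanning_mode("name_pattern")
        tokens.append(scanner.next_token())
        scanner.change_token_scanning_mode("expr")
    return tokens
```

In expression mode the text `Plot*` scans as a Symbol followed by Times; after the mode switch the same text scans as one NamePattern token, which is the distinction the parser has to communicate.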
The rule seems to be that ? is interpreted as a "Question" if it is the first tag in the command (apart from spaces). For example:
This is correct when describing things from an operational level.
For me at least, understanding things at the operational level is useful for arriving at the conceptual level.
However, conceptually, here is how the Mathics3 parser understands this. There is a binary infix "?" and a prefix unary "?". The argument of a prefix unary "?" is a name pattern, while for binary infix operators, the operands are expressions.
Looking at the previous examples and the ones I just added, the decision between a prefix or infix interpretation depends on what came before: if there is a valid argument for an infix operator, then it should be an infix operator. Otherwise, it is a prefix operator.
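The decision described here can be sketched as a lookup on the tag of the preceding token. The tag names in `OPERAND_END_TAGS` and the function `classify_question` are hypothetical, chosen only to mirror the examples in this thread. A caller would pass `None` at the start of an input line.

```python
from typing import Optional

# Hypothetical tags of tokens that can end a complete operand. A "?"
# right after one of these has a left argument, so it is infix.
OPERAND_END_TAGS = {
    "Symbol",
    "Number",
    "String",
    "RawRightParenthesis",
    "RawRightBracket",
    "RawRightBrace",
}


def classify_question(previous_tag: Optional[str]) -> str:
    """Return the operator that "?" stands for, given the preceding token tag."""
    if previous_tag in OPERAND_END_TAGS:
        return "PatternTest"  # infix: operands are expressions
    return "Information"  # prefix: operand is a name pattern
```

Under these assumptions, `"O" ? x` yields PatternTest because a String precedes the `?`, while `F[a,?b]` yields Information because the preceding token is a comma.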
And the reason it is important to understand this at a conceptual level rather than at an operational level is that when programming a solution to the problem, one can code (hack) something that tries to address the operational situation, oblivious of the conceptual problem. These kinds of solutions tend to be more complicated, harder to understand, more likely to be incomplete, more fragile, and just plain more code.
Yes, I see that.
The reason I mention this is that we've seen operational but non-conceptual hacks, especially in the code for the scanner and parser.
I'll have a solution for this pretty soon.
Great!
For me at least, understanding things at the operational level is useful for arriving at the conceptual level.
Of course! It is hard, if not impossible, to understand conceptual flaws without having detailed behavior.
That leads to a second aspect of problem-solving: narrowing the data to the smallest example that shows the problem. Having lots of data is important to confirm behavior, but when communicating an idea, whittling the data to the smallest example that shows the boundaries helps to convey the concept.
(In a paper, the full details are relegated to footnotes and endnotes.)
">": ("PutAppend", "Put", "GreaterEqual", "Greater"),
"?": ("Information", "PatternTest"),
"=": (
    # Note that "Set" has to come last.
There are several comments like this, which do not say why it has to come last. The reason, I guess, is that shorter sequences should come last, so that the start of a longer token is not mistaken for a shorter one. This behavior could be reinforced at the end of the code, by doing something like
# Ensure that longer tokens come first
literal_tokens = {key: tuple(sorted(val, key=lambda x: -len(x))) for key, val in literal_tokens.items()}
or with an assertion
# Ensure that longer tokens come first
assert len({key: val for key, val in literal_tokens.items() if sorted(val, key=lambda x: -len(x)) != list(val)}) == 0, "Some tokens were not properly sorted. Longer literals should come first."
There are several comments like this, which do not say why it has to come last: the reason - I guess- is that shorter sequences should come last to ensure that a part of the largest pattern is not confused with the shortest pattern.
Are the comments on lines 413-415 sufficient?
# Note the tuple is in priority order. In particular, tokens
# associated with a single character tokens like Factorial (!), has to
# come after both Unequal (!=), and Factorial2 (!!).
# Ensure that longer tokens come first
assert len({key: val for key, val in literal_tokens.items() if sorted(val, key=lambda x: -len(x)) != list(val)}) == 0, "Some tokens were not properly sorted. Longer literals should come first."
Good idea! This would make an excellent addition to the tests. Would you care to add this to the tests?
There are several comments like this, which do not say why it has to come last: the reason - I guess- is that shorter sequences should come last to ensure that a part of the largest pattern is not confused with the shortest pattern.
Are the comments on lines 413-415 sufficient?
I think so.
assert len({key: val for key, val in literal_tokens.items() if sorted(val, key=lambda x: -len(x)) != list(val)}) == 0, "Some tokens were not properly sorted. Longer literals should come first."
OK, I can add this to the pytest module.
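One possible shape for such a test is sketched below. Because the tuples shown in the diff hold token names rather than the literal text, the sketch compares the literals themselves: it flags any tuple in which an earlier literal is a proper prefix of a later one. The `token_literals` table and the helper `misordered_pairs` are hypothetical, and the two `literal_tokens` entries are an excerpt assumed for illustration; the real scanner matches regular expressions, so an actual test would need to obtain the literals some other way.

```python
from typing import Dict, List, Tuple

# Hypothetical excerpt: first character -> token tags, in priority order.
literal_tokens: Dict[str, Tuple[str, ...]] = {
    "!": ("Unequal", "Factorial2", "Factorial"),
    ">": ("PutAppend", "Put", "GreaterEqual", "Greater"),
}

# Hypothetical tag -> literal text table, for illustration only.
token_literals: Dict[str, str] = {
    "Unequal": "!=",
    "Factorial2": "!!",
    "Factorial": "!",
    "PutAppend": ">>>",
    "Put": ">>",
    "GreaterEqual": ">=",
    "Greater": ">",
}


def misordered_pairs(
    tokens: Dict[str, Tuple[str, ...]], literals: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (earlier, later) tag pairs where the earlier literal is a
    proper prefix of the later one, so the later one could never match."""
    bad = []
    for tags in tokens.values():
        for i, earlier in enumerate(tags):
            for later in tags[i + 1:]:
                a, b = literals[earlier], literals[later]
                if a != b and b.startswith(a):
                    bad.append((earlier, later))
    return bad


def test_longer_literals_come_first():
    assert misordered_pairs(literal_tokens, token_literals) == []
```

A prefix check is weaker than a strict sort by length, so it accepts an order such as `>>` before `>=`, where neither literal can shadow the other.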
mathics_scanner/tokeniser.py
Outdated
(\*\^(\+|-)?\d+)? (?# Exponent)
"""

# The additional characters that can appear as Names[] metacharacters.
Another note: the argument of Names is always a String. It happens that when an expression starts with ? or ??, the subsequent code is parsed as a string, as long as it is a valid input string for Names. As far as I could see, this tokenizer mode applies only to these two prefix operators.
Another note: the argument of Names is always a String. It happens that when an expression starts with ? or ??, the subsequent code is parsed as a string, as long as it is a valid input string for Names.
What I was trying to convey is that this kind of pattern is described in the WMA documentation under Names. The Information docs, which have a form that takes such a pattern unquoted, refer to the Names section.
As far as I could see, this tokenizer mode applies only to these two prefix operators.
Yes, that kind of custom behavior is a feature of WMA syntax. The FileName syntax also has very limited use in Get, Put, and PutAppend.
# However, in standalone token reading, that is, without the parser,
# having this will give more correct answers. In particular,
# it makes mathics3-tokens give more correct answers, and
# test_tokeniser has a test that ??X identifies X as a NamePattern.
Right now (in master), what the parser does is just wrong. In my patch, I assume that the tokenizer returns a String or a Symbol as the following token. What should happen is that the tokenizer always returns a String as the token following a Question/QuestionQuestion token.
Right now (in master), what the parser does is just wrong. In my patch, I assume that the tokenizer returns a String or a Symbol as the following token. What should happen is that the tokenizer always returns a String as the token following a Question/QuestionQuestion token.
In the revised behavior, the token tag is a NamePattern, which is neither a Symbol nor a String. In my opinion, there is nothing wrong with this. Quite the contrary, this is extremely clear. It is more explicit in indicating that when it is considered a String, it is only allowed to have identifier symbols and the NAMES_WILDCARDS characters.
A NamePattern token is OK. What is (was?) wrong was to get a Symbol, and otherwise show a "Missing[...]" expression.
Co-authored-by: Juan Mauricio Matera <matera@fisica.unlp.edu.ar>
name-pattern token used in prefix operators ? and ??
? and ??.
change_token_scanning_mode made public; it is used by the parser.
Information token is now QuestionQuestion.