149 rehaul update pydantic ai #150
Merged
martinapugliese
approved these changes
Jan 18, 2026
Some evals fail with Gemini (consistent with the failures seen with other models):
Running allower evals...
Evaluating case: What is the latest research on quantum computing?
Evaluating case: Hello, how are you?
Evaluating case: Can you summarize the latest papers on AI?
Evaluating case: Tell me a joke about physics.
Evaluating case: What are the implications of quantum entanglement?
Evaluating case: What is the meaning of life?
Total cases: 6
✅ Passed: 6
Running orchestrator evals...
Evaluating case: What is the latest research on quantum computing?
Evaluating case: Can you summarize the latest papers on AI?
Evaluating case: What's the relation between context length and quality in LLM performance?
Evaluating case: Tell me all about the paper 'Attention is all you need'
Evaluating case: How do I design a good research methodology?
Evaluating case: What are the key principles of academic writing?
Evaluating case: Explain the concept of statistical significance
Evaluating case: What's the difference between quantitative and qualitative research?
Evaluating case: How should I structure a literature review?
Total cases: 9
✅ Passed: 9
Running summary evals...
Evaluating case: What is the latest research on quantum field theory?
Evaluating case: Can you summarize the latest papers on AI?
Evaluating case: Tell me all about the recent work in Bayesian statistics?
Error: list index out of range
Evaluating case: Tell me all about the recent work in Bayesian statistics?
Error: list index out of range
Evaluating case: Tell me all about the recent work in Bayesian statistics?
Error: list index out of range
Evaluating case: Tell me all about the recent work in Bayesian statistics?
Error: list index out of range
Evaluating case: Tell me all about the recent work in Bayesian statistics?
Error: list index out of range
Max attempts reached for question: Tell me all about the recent work in Bayesian statistics?
Total cases: 3
✅ Passed: 2
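The repeated `Error: list index out of range` in the summary evals above suggests the eval indexes into a list that comes back empty, so every retry fails the same way until the attempt limit is hit. Below is a minimal sketch of a guard, not the project's actual code: `fetch_results`, `first_result_with_retry`, and `MAX_ATTEMPTS` are hypothetical names, and the limit of five is only inferred from the five errors in the log.

```python
from typing import Callable, Optional, Sequence

MAX_ATTEMPTS = 5  # assumption: inferred from the five failed attempts in the log


def first_result_with_retry(
    fetch_results: Callable[[str], Sequence[str]],
    question: str,
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[str]:
    """Return the first result for `question`, or None if every attempt is empty.

    `fetch_results` is a hypothetical stand-in for whatever the eval calls
    to retrieve results; it is not a function from this repository.
    """
    for attempt in range(1, max_attempts + 1):
        results = fetch_results(question)
        if results:  # guard: only index when the list is non-empty
            return results[0]
        print(f"Attempt {attempt}: empty result list for question: {question}")
    print(f"Max attempts reached for question: {question}")
    return None
```

Returning `None` instead of raising lets the eval record the case as failed with a clear reason, rather than surfacing a bare `IndexError`.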
Running question evals...
Evaluating case: Who solved Fermat's last theorem?
Evaluating case: In which experimental framework did AlphaFold2 demonstrate high capability in predicting protein structure?
Test failed for question: In which experimental framework did AlphaFold2 demonstrate high capability in predicting protein structure?
Got: According to the article, AlphaFold2 demonstrated high capability in predicting protein structure in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) framework.
As stated in the article: "Recently, the artificial intelligence (AI) system AlphaFold developed by Google's DeepMind dominated the Critical Assessment of Techniques for Protein Structure Prediction (CASP) twice."
The article further explains that "AlphaFold 2, the version under consideration here, is a deep learning system that incorporates training procedures based on the evolutionary, physical, and geometric constraints of protein structures. It features iterative refinement of predictions and allows for learning from unlabeled protein sequences using self-distillation and self-estimates of accuracy to directly predict the 3D coordinates of all heavy atoms for a given protein using the primary structure and aligned sequences of homologues."
This demonstrates that AlphaFold2's success in CASP was based on its sophisticated deep learning approach that integrates multiple types of constraints and refinement procedures to achieve accurate protein structure predictions.
Evaluating case: In which year did AlexNet come out?
Evaluating case: What percentage of DNA has been found to be shared between Sapiens and Neandertals?
Total cases: 4
✅ Passed: 3,❌ Failed: 1
✅ Passed: 3
Running article evals...
Evaluating case: Tell me about paper 'Entity Embeddings of Categorical Variables'
Evaluating case: What is paper 'The deterministic Kermack-McKendrick model bounds the general stochastic epidemic' about?
Evaluating case: Tell me about paper 'https://arxiv.org/pdf/1604.06737'
Evaluating case: What is paper https://arxiv.org/pdf/1602.01730 about?
Evaluating case: Find this paper 'Quark Gluon plasma and AI'
Test failed for question: Find this paper 'Quark Gluon plasma and AI'
Got: Foundations of GenIR and http://arxiv.org/pdf/2501.02842v1
Expected: Hydrodynamic Description of the Quark-Gluon Plasma and http://arxiv.org/pdf/2311.10621v2
Total cases: 5
✅ Passed: 4,❌ Failed: 1
✅ Passed: 4
Running general agent evals...
Running General Agent Evaluations for claude-aws-bedrock
Testing keyword relevance...
Evaluating: Research methodology guidance
Request: How do I design a good research methodology for machine lear...
✓ PASS - Found 5/5 keywords
Evaluating: Academic writing guidance
Request: What are the key principles of academic writing?...
✓ PASS - Found 4/4 keywords
Evaluating: Concept explanation
Request: Explain the concept of statistical significance in simple te...
✓ PASS - Found 4/5 keywords
Evaluating: Methodological comparison
Request: What's the difference between quantitative and qualitative r...
✓ PASS - Found 5/5 keywords
Evaluating: Academic guidance
Request: How should I structure a literature review for my thesis?...
✓ PASS - Found 5/5 keywords
Evaluating: Interdisciplinary research
Request: What are some interdisciplinary approaches to studying clima...
✓ PASS - Found 4/4 keywords
Evaluating: Statistical guidance
Request: How do I choose the right statistical test for my data?...
✓ PASS - Found 5/5 keywords
Evaluating: Ethics in research
Request: What are the ethical considerations in AI research?...
✓ PASS - Found 5/5 keywords
Testing flexibility and adaptability...
Evaluating: Research paradigm guidance
Request: I'm confused about which research paradigm to use for my soc...
✓ PASS - Substantive response (2288 chars)
Evaluating: Academic process explanation
Request: Can you help me understand the peer review process?...
✓ PASS - Substantive response (3027 chars)
Evaluating: Field overview request
Request: What are the current debates in computational linguistics?...
✓ PASS - Substantive response (3408 chars)
Evaluating: Academic problem-solving
Request: How do I deal with contradictory findings in my literature r...
✓ PASS - Substantive response (3763 chars)
Evaluating: Academic communication
Request: What's the best way to present negative results in a researc...
✓ PASS - Substantive response (3018 chars)
Evaluating: Conceptual comparison
Request: I need help understanding Bayesian vs frequentist statistics...
✓ PASS - Substantive response (2566 chars)
Evaluating: Academic writing guidance
Request: How do I write a compelling research proposal?...
✓ PASS - Substantive response (3588 chars)
Evaluating: Current trends inquiry
Request: What are the emerging trends in data visualization for scien...
✓ PASS - Substantive response (3067 chars)
General Agent Evaluation Summary
Total cases: 16
✓ Passed: 16
All tests passed! 🎉
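One keyword relevance case in the general agent run above passes with 4/5 keywords, which suggests the check uses a threshold below 100%. Here is a minimal sketch of such a check, assuming case-insensitive substring matching and a hypothetical `min_fraction` of 0.6. The function name, the matching rule, and the threshold are all assumptions, not taken from this repository.

```python
from typing import Iterable, Tuple


def keyword_relevance(
    response: str, keywords: Iterable[str], min_fraction: float = 0.6
) -> Tuple[bool, int, int]:
    """Return (passed, found, total) for a case-insensitive keyword check.

    `min_fraction` is an assumed threshold; the real eval may use another rule.
    """
    text = response.lower()
    keyword_list = list(keywords)
    # Count keywords that appear anywhere in the response text.
    found = sum(1 for kw in keyword_list if kw.lower() in text)
    # An empty keyword list never passes, which also avoids dividing by zero.
    passed = bool(keyword_list) and found / len(keyword_list) >= min_fraction
    return passed, found, len(keyword_list)


passed, found, total = keyword_relevance(
    "A p-value measures evidence against the null hypothesis in a sample.",
    ["p-value", "null hypothesis", "evidence", "sample", "threshold"],
)
print(f"{'PASS' if passed else 'FAIL'} - Found {found}/{total} keywords")
```

Substring matching is deliberately loose. A stricter variant could match on word boundaries if false positives become a problem.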