149 rehaul update pydantic ai #150

Merged
bernomone merged 5 commits into main from 149-rehaul-update-pydantic-ai
Jan 18, 2026
Conversation

@bernomone
Collaborator

  • Update Pydantic AI and pydantic to the latest versions
  • Update model usage to the provider:model pattern
  • Fix tests to work with the new version
  • Rename GEMINI_API_KEY to GOOGLE_API_KEY
  • Fix evals to work with the new model definition
  • Update both AWS Bedrock models
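
For reviewers unfamiliar with the two renames above, here is a minimal sketch of what they imply. It is not the code from this PR: `split_model_id` and `google_api_key` are hypothetical helpers, the model identifier in the usage note is illustrative only, and the fallback to the old variable name is an assumption about how a migration could be softened.

```python
from typing import Mapping, Optional, Tuple
import os


def split_model_id(model_id: str) -> Tuple[str, str]:
    """Split a 'provider:model' string into (provider, model).

    Hypothetical helper. Only the first colon separates the provider,
    so model names that themselves contain colons stay intact.
    """
    provider, sep, model = model_id.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider:model', got {model_id!r}")
    return provider, model


def google_api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return GOOGLE_API_KEY, falling back to the old GEMINI_API_KEY.

    The fallback is an assumption for illustration; the PR only states
    that the variable was renamed.
    """
    return env.get("GOOGLE_API_KEY") or env.get("GEMINI_API_KEY")
```

For example, `split_model_id("bedrock:some.model-v1:0")` yields `("bedrock", "some.model-v1:0")`, because only the first colon is treated as the separator.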

Some evals still fail with Gemini (the same failures also occur with other models):

Running allower evals...
Evaluating case: What is the latest research on quantum computing?
Evaluating case: Hello, how are you?
Evaluating case: Can you summarize the latest papers on AI?
Evaluating case: Tell me a joke about physics.
Evaluating case: What are the implications of quantum entanglement?
Evaluating case: What is the meaning of life?
Total cases: 6
✅ Passed: 6

Running orchestrator evals...
Evaluating case: What is the latest research on quantum computing?
Evaluating case: Can you summarize the latest papers on AI?
Evaluating case: What's the relation between context length and quality in LLM performance?
Evaluating case: Tell me all about the paper 'Attention is all you need'
Evaluating case: How do I design a good research methodology?
Evaluating case: What are the key principles of academic writing?
Evaluating case: Explain the concept of statistical significance
Evaluating case: What's the difference between quantitative and qualitative research?
Evaluating case: How should I structure a literature review?
Total cases: 9
✅ Passed: 9

Running summary evals...
Evaluating case: What is the latest research on quantum field theory?
Evaluating case: Can you summarize the latest papers on AI?
Evaluating case: Tell me all about the recent work in Bayesian statistics?
Error: list index out of range
Evaluating case: Tell me all about the recent work in Bayesian statistics?
Error: list index out of range
Evaluating case: Tell me all about the recent work in Bayesian statistics?
Error: list index out of range
Evaluating case: Tell me all about the recent work in Bayesian statistics?
Error: list index out of range
Evaluating case: Tell me all about the recent work in Bayesian statistics?
Error: list index out of range
Max attempts reached for question: Tell me all about the recent work in Bayesian statistics?
Total cases: 3
✅ Passed: 2
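
The summary eval output above shows one case being retried five times on the same `list index out of range` error before giving up. A minimal sketch of that retry pattern follows, assuming a cap of five attempts as seen in the log; `run_case` and the `evaluate` callable are hypothetical names, not the project's actual eval runner.

```python
from typing import Callable, Optional


def run_case(
    question: str,
    evaluate: Callable[[str], str],
    max_attempts: int = 5,  # assumed from the five attempts in the log
) -> Optional[str]:
    """Retry a single eval case until it succeeds or attempts run out."""
    for _ in range(max_attempts):
        print(f"Evaluating case: {question}")
        try:
            return evaluate(question)
        except IndexError as exc:
            # e.g. indexing into an empty result list
            print(f"Error: {exc}")
    print(f"Max attempts reached for question: {question}")
    return None
```

Since the error is identical on every attempt, retrying alone is unlikely to help; checking whether the indexed list is empty before the eval runs may be a more useful fix.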

Running question evals...
Evaluating case: Who solved Fermat's last theorem?
Evaluating case: In which experimental framework did AlphaFold2 demonstrate high capability in predicting protein structure?
Test failed for question: In which experimental framework did AlphaFold2 demonstrate high capability in predicting protein structure?
Got: According to the article, AlphaFold2 demonstrated high capability in predicting protein structure in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) framework.

As stated in the article: "Recently, the artificial intelligence (AI) system AlphaFold developed by Google's DeepMind dominated the Critical Assessment of Techniques for Protein Structure Prediction (CASP) twice."

The article further explains that "AlphaFold 2, the version under consideration here, is a deep learning system that incorporates training procedures based on the evolutionary, physical, and geometric constraints of protein structures. It features iterative refinement of predictions and allows for learning from unlabeled protein sequences using self-distillation and self-estimates of accuracy to directly predict the 3D coordinates of all heavy atoms for a given protein using the primary structure and aligned sequences of homologues."

This demonstrates that AlphaFold2's success in CASP was based on its sophisticated deep learning approach that integrates multiple types of constraints and refinement procedures to achieve accurate protein structure predictions.
Evaluating case: In which year did AlexNet come out?
Evaluating case: What percentage of DNA has been found to be shared between Sapiens and Neandertals?
Total cases: 4
✅ Passed: 3,❌ Failed: 1
✅ Passed: 3

Running article evals...
Evaluating case: Tell me about paper 'Entity Embeddings of Categorical Variables'
Evaluating case: What is paper 'The deterministic Kermack-McKendrick model bounds the general stochastic epidemic' about?
Evaluating case: Tell me about paper 'https://arxiv.org/pdf/1604.06737'
Evaluating case: What is paper https://arxiv.org/pdf/1602.01730 about?
Evaluating case: Find this paper 'Quark Gluon plasma and AI'
Test failed for question: Find this paper 'Quark Gluon plasma and AI'
Got: Foundations of GenIR and http://arxiv.org/pdf/2501.02842v1
Expected: Hydrodynamic Description of the Quark-Gluon Plasma and http://arxiv.org/pdf/2311.10621v2

Total cases: 5
✅ Passed: 4,❌ Failed: 1
✅ Passed: 4

Running general agent evals...
Running General Agent Evaluations for claude-aws-bedrock

Testing keyword relevance...
Evaluating: Research methodology guidance
Request: How do I design a good research methodology for machine lear...
✓ PASS - Found 5/5 keywords
Evaluating: Academic writing guidance
Request: What are the key principles of academic writing?...
✓ PASS - Found 4/4 keywords
Evaluating: Concept explanation
Request: Explain the concept of statistical significance in simple te...
✓ PASS - Found 4/5 keywords
Evaluating: Methodological comparison
Request: What's the difference between quantitative and qualitative r...
✓ PASS - Found 5/5 keywords
Evaluating: Academic guidance
Request: How should I structure a literature review for my thesis?...
✓ PASS - Found 5/5 keywords
Evaluating: Interdisciplinary research
Request: What are some interdisciplinary approaches to studying clima...
✓ PASS - Found 4/4 keywords
Evaluating: Statistical guidance
Request: How do I choose the right statistical test for my data?...
✓ PASS - Found 5/5 keywords
Evaluating: Ethics in research
Request: What are the ethical considerations in AI research?...
✓ PASS - Found 5/5 keywords

Testing flexibility and adaptability...
Evaluating: Research paradigm guidance
Request: I'm confused about which research paradigm to use for my soc...
✓ PASS - Substantive response (2288 chars)
  • Used 3 sources
  • Provided 5 follow-ups
Evaluating: Academic process explanation
Request: Can you help me understand the peer review process?...
✓ PASS - Substantive response (3027 chars)
  • Used 6 sources
  • Provided 5 follow-ups
Evaluating: Field overview request
Request: What are the current debates in computational linguistics?...
✓ PASS - Substantive response (3408 chars)
  • Used 8 sources
  • Provided 4 follow-ups
Evaluating: Academic problem-solving
Request: How do I deal with contradictory findings in my literature r...
✓ PASS - Substantive response (3763 chars)
  • Provided 4 follow-ups
Evaluating: Academic communication
Request: What's the best way to present negative results in a researc...
✓ PASS - Substantive response (3018 chars)
  • Used 3 sources
  • Provided 4 follow-ups
Evaluating: Conceptual comparison
Request: I need help understanding Bayesian vs frequentist statistics...
✓ PASS - Substantive response (2566 chars)
  • Used 3 sources
  • Provided 4 follow-ups
Evaluating: Academic writing guidance
Request: How do I write a compelling research proposal?...
✓ PASS - Substantive response (3588 chars)
  • Used 2 sources
  • Provided 4 follow-ups
Evaluating: Current trends inquiry
Request: What are the emerging trends in data visualization for scien...
✓ PASS - Substantive response (3067 chars)
  • Used 6 sources
  • Provided 5 follow-ups

General Agent Evaluation Summary
Total cases: 16
✓ Passed: 16
All tests passed! 🎉

@bernomone bernomone merged commit eb53697 into main Jan 18, 2026
3 checks passed