
Conversation

@avirajsingh7
Collaborator

@avirajsingh7 avirajsingh7 commented Dec 10, 2025

Summary

This change refactors the evaluation run process to use a stored configuration reference instead of an inline configuration dictionary. It introduces config_id, config_version, and model fields in the evaluation run table, streamlining the evaluation process and improving data integrity.
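
Illustratively, the stored fields change roughly as follows (a simplified sketch only; the class name EvaluationRunSketch and the defaults are placeholders, not the real SQLModel definition in backend/app/models/evaluation.py):

from uuid import UUID
from sqlmodel import Field as SQLField, SQLModel


class EvaluationRunSketch(SQLModel):
    # Previously: config: dict[str, Any] stored the full configuration inline (JSONB column)
    # Now the run references a stored config and snapshots the resolved model name
    config_id: UUID | None = SQLField(default=None, foreign_key="config.id", nullable=True)
    config_version: int | None = SQLField(default=None, nullable=True, ge=1)
    model: str | None = SQLField(default=None, nullable=True)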

Checklist

Before submitting a pull request, please ensure that you have completed these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested the changes.
  • If you've fixed a bug or added new code, ensured it is covered by test cases.

Summary by CodeRabbit

  • New Features

    • Evaluation runs now track and store the LLM model used for each evaluation.
    • Configuration is now referenced by ID and version instead of storing complete configurations inline, improving efficiency and maintainability.
  • Bug Fixes

    • Enhanced validation and error handling for missing or invalid configurations during evaluation setup.
  • Chores

    • Database schema updated to support configuration references.


@coderabbitai

coderabbitai bot commented Dec 10, 2025

📝 Walkthrough

Walkthrough

The PR refactors evaluation configuration handling to use stored configuration references instead of inline config objects. The evaluate endpoint now accepts config_id and config_version parameters, resolving the stored configuration via ConfigVersionCrud and resolve_config_blob. The EvaluationRun model is updated to store references and a resolved model field. Database schema is migrated with appropriate foreign keys. Tests are updated to use the new configuration reference pattern.
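
In rough outline, the resolution step now looks something like the following (a condensed sketch assembled from the code snippets quoted later in this thread, not the actual route implementation; the helper name _resolve_eval_config, the import paths, the untyped project_id, and the exact status codes are assumptions):

from uuid import UUID

from fastapi import HTTPException
from sqlmodel import Session

# Import paths inferred from the file listing in this review; treat them as assumptions.
from app.crud.config.version import ConfigVersionCrud
from app.models.llm.request import LLMCallConfig
from app.services.llm.jobs import resolve_config_blob


def _resolve_eval_config(session: Session, project_id, config_id: UUID, config_version: int):
    """Illustrative sketch: resolve a stored config and extract the model name."""
    config_crud = ConfigVersionCrud(
        session=session, config_id=config_id, project_id=project_id
    )
    config, error = resolve_config_blob(
        config_crud=config_crud,
        config=LLMCallConfig(id=config_id, version=config_version),
    )
    if error or config is None:
        raise HTTPException(status_code=400, detail=f"Config resolution failed: {error}")
    if config.completion.provider != "openai":
        raise HTTPException(
            status_code=422,
            detail="Only 'openai' provider is supported for evaluation configs",
        )
    # The resolved model name is snapshotted onto the EvaluationRun for cost tracking.
    model = config.completion.params.get("model")
    return config, model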

Changes

• Evaluation API & Route Handler — backend/app/api/routes/evaluation.py
  Refactored evaluate() method to accept config_id: UUID and config_version: int instead of config: dict and assistant_id. Added config resolution via ConfigVersionCrud, provider validation (OPENAI only), and HTTP error handling for missing/invalid configs. Updated batch evaluation to use resolved config parameters.
• Core CRUD Operations — backend/app/crud/evaluations/core.py
  Updated create_evaluation_run() signature to accept config_id and config_version instead of config dict. Added new resolve_model_from_config() function to extract model name from stored configuration. Updated logging and docstrings. Added imports for UUID, ConfigVersionCrud, LLMCallConfig, and resolve_config_blob.
• Data Model Definitions — backend/app/models/evaluation.py
  Replaced config: dict[str, Any] with config_id: UUID | None, config_version: int | None, and new model: str | None field in both EvaluationRun and EvaluationRunPublic to reflect stored config references and resolved model.
• Processing & Embeddings — backend/app/crud/evaluations/processing.py, backend/app/crud/evaluations/embeddings.py
  Updated model resolution in processing flow to use new resolve_model_from_config(). Hard-coded embedding model to "text-embedding-3-large" in embeddings batch handler, removing dynamic retrieval.
• Module Exports — backend/app/crud/evaluations/__init__.py
  Added resolve_model_from_config to public API exports via __all__ list.
• Database Migration — backend/app/alembic/versions/041_add_config_in_evals_run_table.py
  New migration adds config_id (UUID, foreign key to config table) and config_version (Integer) columns to evaluation_run table; removes legacy config JSONB column. Includes downgrade path.
• Test Suite — backend/app/tests/api/routes/test_evaluation.py
  Updated test cases to create test configs via create_test_config() and reference via config_id/config_version instead of embedding full config objects. Added uuid4 usage for negative test scenarios. Updated error message assertions for config-not-found handling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • Prajna1999
  • kartpop

Poem

🐰 Hops with glee
Config stored, no more to carry,
Just an ID, oh how merry!
Versions tracked with careful care,
References bloom through the air! ✨
Resolution flows so clean and bright,
Configurations bundled just right! 🌟

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 warning
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 68.75%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately describes the main refactoring: evaluation runs now use stored configuration management (config_id/config_version) instead of inline config dicts.



@avirajsingh7 avirajsingh7 linked an issue Dec 10, 2025 that may be closed by this pull request
@avirajsingh7 avirajsingh7 self-assigned this Dec 10, 2025
@avirajsingh7 avirajsingh7 added the enhancement (New feature or request) and ready-for-review labels Dec 10, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
backend/app/crud/evaluations/embeddings.py (1)

366-367: Misleading comment - update to reflect actual behavior.

The comment says "Get embedding model from config" but the code hardcodes the value. Update the comment to accurately describe the implementation.

-        # Get embedding model from config (default: text-embedding-3-large)
-        embedding_model = "text-embedding-3-large"
+        # Use fixed embedding model (text-embedding-3-large)
+        embedding_model = "text-embedding-3-large"
backend/app/tests/api/routes/test_evaluation.py (1)

524-545: Consider renaming function to match its new purpose.

The function test_start_batch_evaluation_missing_model was repurposed to test invalid config_id scenarios. The docstring was updated but the function name still references "missing_model". Consider renaming for clarity.

-    def test_start_batch_evaluation_missing_model(self, client, user_api_key_header):
-        """Test batch evaluation fails with invalid config_id."""
+    def test_start_batch_evaluation_invalid_config_id(self, client, user_api_key_header):
+        """Test batch evaluation fails with invalid config_id."""
backend/app/api/routes/evaluation.py (1)

499-510: Consider validating that model is present in config params.

The model is extracted with .get("model") which returns None if not present. Since model is critical for cost tracking (used in create_langfuse_dataset_run), consider validating its presence and returning an error if missing.

     # Extract model from config for storage
     model = config.completion.params.get("model")
+    if not model:
+        raise HTTPException(
+            status_code=400,
+            detail="Config must specify a 'model' in completion params for evaluation",
+        )

     # Create EvaluationRun record with config references
backend/app/crud/evaluations/core.py (1)

15-69: Config-based create_evaluation_run refactor is correctly implemented; consider logging model for improved traceability.

The refactor from inline config dict to config_id: UUID and config_version: int is properly implemented throughout:

  • The sole call site in backend/app/api/routes/evaluation.py:503 correctly passes all new parameters with the right types (config_id as UUID, config_version as int, model extracted from config).
  • The EvaluationRun model in backend/app/models/evaluation.py correctly defines all three fields with appropriate types and descriptions.
  • All type hints align with Python 3.11+ guidelines.

One suggested improvement for debugging:

Include model in the creation log for better traceability when correlating evaluation runs with model versions:

logger.info(
    f"Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
-   f"config_id={config_id}, config_version={config_version}"
+   f"config_id={config_id}, config_version={config_version}, model={model}"
)

Since the model is already extracted at the call site and passed to the function, including it in the log will provide fuller context for operational debugging without any additional cost.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30ef268 and d5f9d4d.

📒 Files selected for processing (7)
  • backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py (1 hunks)
  • backend/app/api/routes/evaluation.py (5 hunks)
  • backend/app/crud/evaluations/core.py (5 hunks)
  • backend/app/crud/evaluations/embeddings.py (1 hunks)
  • backend/app/crud/evaluations/processing.py (1 hunks)
  • backend/app/models/evaluation.py (3 hunks)
  • backend/app/tests/api/routes/test_evaluation.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use type hints in Python code (Python 3.11+ project)

Files:

  • backend/app/api/routes/evaluation.py
  • backend/app/models/evaluation.py
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
  • backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Expose FastAPI REST endpoints under backend/app/api/ organized by domain

Files:

  • backend/app/api/routes/evaluation.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define SQLModel entities (database tables and domain objects) in backend/app/models/

Files:

  • backend/app/models/evaluation.py
backend/app/crud/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement database access operations in backend/app/crud/

Files:

  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
🧬 Code graph analysis (2)
backend/app/tests/api/routes/test_evaluation.py (2)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (62-115)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-130)
  • EvaluationRun (133-248)
backend/app/crud/evaluations/processing.py (1)
backend/app/crud/evaluations/langfuse.py (1)
  • create_langfuse_dataset_run (20-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (3)
backend/app/crud/evaluations/processing.py (1)

257-264: LGTM! Clean refactor to use stored model field.

The change correctly retrieves the model from eval_run.model instead of extracting it from config. This aligns with the new data model where the model is snapshotted at evaluation creation time.

backend/app/models/evaluation.py (1)

148-157: LGTM! Well-structured config reference fields.

The new config_id and config_version fields properly establish the relationship to stored configs with appropriate constraints (ge=1 for version). The nullable design allows backward compatibility with existing data.

backend/app/api/routes/evaluation.py (1)

478-495: LGTM! Robust config resolution with provider validation.

The config resolution flow properly validates that the stored config exists and uses the OPENAI provider. Error handling returns appropriate HTTP 400 responses with descriptive messages.

@codecov

codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 56.09756% with 18 lines in your changes missing coverage. Please review.

Files with missing lines (patch coverage):
  • backend/app/crud/evaluations/core.py — 35.71%, 9 lines missing ⚠️
  • backend/app/api/routes/evaluation.py — 46.15%, 7 lines missing ⚠️
  • backend/app/crud/evaluations/embeddings.py — 0.00%, 1 line missing ⚠️
  • backend/app/crud/evaluations/processing.py — 50.00%, 1 line missing ⚠️



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
backend/app/models/evaluation.py (1)

148-158: Align EvaluationRun type hints with nullable DB columns for config fields

config_id and config_version are nullable in the schema but annotated as non-optional types. This can mislead callers and type checkers into assuming they’re always present, even for legacy runs or transitional data.

Consider updating the annotations to reflect nullability:

-    config_id: UUID = SQLField(
+    config_id: UUID | None = SQLField(
         foreign_key="config.id",
         nullable=True,
         description="Reference to the stored config used for this evaluation",
     )
-    config_version: int = SQLField(
+    config_version: int | None = SQLField(
         nullable=True,
         ge=1,
         description="Version of the config used for this evaluation",
     )

This keeps the schema the same while making runtime and type expectations clearer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5f9d4d and eda7762.

📒 Files selected for processing (1)
  • backend/app/models/evaluation.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use type hints in Python code (Python 3.11+ project)

Files:

  • backend/app/models/evaluation.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define SQLModel entities (database tables and domain objects) in backend/app/models/

Files:

  • backend/app/models/evaluation.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (1)
backend/app/models/evaluation.py (1)

271-273: Public model nullability now matches the schema

Making config_id, config_version, and model nullable in EvaluationRunPublic correctly reflects the DB fields and avoids validation issues for existing rows. This resolves the earlier mismatch between the table and the public model.

Collaborator

@Prajna1999 Prajna1999 left a comment


lgtm

@avirajsingh7
Collaborator Author

Hold merge until the frontend is ready.

@Prajna1999
Collaborator

good to go. Can be merged


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/api/routes/evaluation.py (1)

8-16: Critical: Missing Depends import.

The pipeline failure indicates NameError: name 'Depends' is not defined at line 130. The import statement for Depends from fastapi is missing.

🐛 Add missing import
 from fastapi import (
     APIRouter,
     Body,
+    Depends,
     File,
     Form,
     HTTPException,
     Query,
     UploadFile,
 )
🧹 Nitpick comments (1)
backend/app/alembic/versions/041_add_config_in_evals_run_table.py (1)

1-60: Consider a multi-step migration strategy for safer deployment.

Given the destructive nature of this schema change (dropping the config column) and the PR status ("hold merge - until frontend is ready"), consider deploying this as a multi-phase migration (a hedged phase 1 sketch follows the list below):

Phase 1: Add new columns without dropping old ones

  • Add config_id and config_version (nullable)
  • Add foreign key constraint
  • Deploy application code that writes to both old and new columns

Phase 2: Backfill existing data

  • Create a data migration script to populate config_id/config_version from existing config JSONB
  • Validate data integrity

Phase 3: Cut over

  • Deploy application code that only uses new columns
  • Monitor for issues

Phase 4: Cleanup

  • Drop the old config column in a subsequent migration

This approach provides:

  • Zero downtime deployment
  • Easy rollback at each phase
  • Data preservation and validation
  • Safer production deployment
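
As a rough illustration of that strategy, a phase 1 migration could look like the following (a minimal sketch only: the revision identifiers, constraint name, and comments are placeholders, and the legacy config column is deliberately left in place for a later phase):

import sqlalchemy as sa
from alembic import op

# Placeholder revision identifiers for illustration only
revision = "041"
down_revision = "040"
branch_labels = None
depends_on = None


def upgrade() -> None:
    # Phase 1: add nullable reference columns alongside the legacy config JSONB column
    op.add_column("evaluation_run", sa.Column("config_id", sa.Uuid(), nullable=True))
    op.add_column("evaluation_run", sa.Column("config_version", sa.Integer(), nullable=True))
    op.create_foreign_key(
        "fk_evaluation_run_config_id", "evaluation_run", "config", ["config_id"], ["id"]
    )
    # The legacy "config" column is intentionally NOT dropped here; that happens in a later phase.


def downgrade() -> None:
    op.drop_constraint("fk_evaluation_run_config_id", "evaluation_run", type_="foreignkey")
    op.drop_column("evaluation_run", "config_version")
    op.drop_column("evaluation_run", "config_id")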
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eda7762 and 31d9523.

📒 Files selected for processing (8)
  • backend/app/alembic/versions/041_add_config_in_evals_run_table.py
  • backend/app/api/routes/evaluation.py
  • backend/app/crud/evaluations/__init__.py
  • backend/app/crud/evaluations/core.py
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/models/evaluation.py
  • backend/app/tests/api/routes/test_evaluation.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/models/evaluation.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/alembic/versions/041_add_config_in_evals_run_table.py
  • backend/app/api/routes/evaluation.py
  • backend/app/crud/evaluations/__init__.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/api/routes/test_evaluation.py
backend/app/alembic/versions/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Generate database migrations using alembic revision --autogenerate -m "Description" --rev-id <number> where rev-id is the latest existing revision ID + 1

Files:

  • backend/app/alembic/versions/041_add_config_in_evals_run_table.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

backend/app/api/**/*.py: Define FastAPI REST endpoints in backend/app/api/ organized by domain
Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")

Files:

  • backend/app/api/routes/evaluation.py
🧠 Learnings (2)
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Applies to backend/app/alembic/versions/*.py : Generate database migrations using `alembic revision --autogenerate -m "Description" --rev-id <number>` where rev-id is the latest existing revision ID + 1

Applied to files:

  • backend/app/alembic/versions/041_add_config_in_evals_run_table.py
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Organize backend code in `backend/app/` following the layered architecture: Models, CRUD, Routes, Core, Services, and Celery directories

Applied to files:

  • backend/app/api/routes/evaluation.py
🧬 Code graph analysis (5)
backend/app/tests/api/routes/test_evaluation.py (3)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (62-115)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-168)
  • EvaluationRun (171-322)
backend/app/tests/utils/test_data.py (1)
  • create_test_config (239-302)
backend/app/api/routes/evaluation.py (6)
backend/app/crud/config/version.py (1)
  • ConfigVersionCrud (15-142)
backend/app/models/llm/request.py (1)
  • LLMCallConfig (132-188)
backend/app/services/llm/jobs.py (1)
  • resolve_config_blob (84-116)
backend/app/services/llm/providers/registry.py (1)
  • LLMProvider (14-41)
backend/app/utils.py (4)
  • APIResponse (33-57)
  • get_langfuse_client (212-248)
  • get_openai_client (179-209)
  • load_description (393-398)
backend/app/crud/evaluations/core.py (1)
  • create_evaluation_run (18-71)
backend/app/crud/evaluations/__init__.py (1)
backend/app/crud/evaluations/core.py (1)
  • resolve_model_from_config (308-349)
backend/app/crud/evaluations/processing.py (2)
backend/app/crud/evaluations/core.py (2)
  • update_evaluation_run (154-206)
  • resolve_model_from_config (308-349)
backend/app/crud/evaluations/langfuse.py (1)
  • create_langfuse_dataset_run (21-164)
backend/app/crud/evaluations/core.py (3)
backend/app/crud/config/version.py (1)
  • ConfigVersionCrud (15-142)
backend/app/models/llm/request.py (1)
  • LLMCallConfig (132-188)
backend/app/services/llm/jobs.py (1)
  • resolve_config_blob (84-116)
🪛 GitHub Actions: Kaapi CI
backend/app/api/routes/evaluation.py

[error] 130-130: NameError: name 'Depends' is not defined.

🔇 Additional comments (4)
backend/app/crud/evaluations/__init__.py (1)

8-8: LGTM!

The new resolve_model_from_config function is correctly imported and exported for public use.

Also applies to: 43-43

backend/app/tests/api/routes/test_evaluation.py (1)

3-3: LGTM!

The test updates correctly reflect the shift from inline config dictionaries to stored config references. The use of create_test_config factory function aligns with the coding guidelines for test fixtures, and the error scenarios properly test config-not-found cases.

Also applies to: 10-10, 499-545, 728-803

backend/app/api/routes/evaluation.py (1)

492-509: Verify config resolution error handling covers all failure modes.

The config resolution logic handles errors from resolve_config_blob and validates the provider, but ensure that:

  1. Config version not found scenarios are properly handled
  2. Invalid/corrupted config blobs are caught
  3. The provider validation matches actual config schemas used in production
backend/app/crud/evaluations/core.py (1)

66-69: LGTM!

The logging statement correctly follows the coding guideline format with function context and includes the new config_id and config_version fields.

depends_on = None


def upgrade():

⚠️ Potential issue | 🟡 Minor

Add return type hints to migration functions.

Both upgrade() and downgrade() functions are missing return type hints.

As per coding guidelines, all functions should have type hints.

📝 Proposed fix
-def upgrade():
+def upgrade() -> None:
-def downgrade():
+def downgrade() -> None:

Also applies to: 45-45

🤖 Prompt for AI Agents
In @backend/app/alembic/versions/041_add_config_in_evals_run_table.py at line
20, The migration functions upgrade() and downgrade() lack return type hints;
update both function definitions (upgrade and downgrade) to include explicit
return types (e.g., change "def upgrade():" and "def downgrade():" to "def
upgrade() -> None:" and "def downgrade() -> None:") so they conform to the
project's typing guidelines.

Comment on lines +22 to +41
    op.add_column(
        "evaluation_run",
        sa.Column(
            "config_id",
            sa.Uuid(),
            nullable=True,
            comment="Reference to the stored config used",
        ),
    )
    op.add_column(
        "evaluation_run",
        sa.Column(
            "config_version",
            sa.Integer(),
            nullable=True,
            comment="Version of the config used",
        ),
    )
    op.create_foreign_key(None, "evaluation_run", "config", ["config_id"], ["id"])
    op.drop_column("evaluation_run", "config")

⚠️ Potential issue | 🔴 Critical

Critical: Data loss and foreign key constraint naming issues.

This migration has several critical problems:

  1. Data loss: Line 41 drops the config column without migrating existing data to the new config_id/config_version columns. Any existing evaluation runs will lose their configuration data permanently.

  2. Foreign key constraint naming: Line 40 creates a foreign key with None as the constraint name, causing Alembic to auto-generate a name. However, the downgrade function (Line 57) also uses None to drop the constraint, which won't match the auto-generated name and will fail.

Required actions:

  1. Add a data migration step before dropping the config column. You'll need to:

    • Parse each existing config JSONB object
    • Look up or create corresponding config records with appropriate versions
    • Update config_id and config_version for each evaluation_run
    • Or, if data migration isn't feasible, add a comment explaining why data loss is acceptable
  2. Specify an explicit constraint name instead of None:

🔧 Proposed fix for FK constraint naming
-    op.create_foreign_key(None, "evaluation_run", "config", ["config_id"], ["id"])
+    op.create_foreign_key(
+        "fk_evaluation_run_config_id", 
+        "evaluation_run", 
+        "config", 
+        ["config_id"], 
+        ["id"]
+    )

And update the downgrade:

-    op.drop_constraint(None, "evaluation_run", type_="foreignkey")
+    op.drop_constraint("fk_evaluation_run_config_id", "evaluation_run", type_="foreignkey")

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +47 to +56
    op.add_column(
        "evaluation_run",
        sa.Column(
            "config",
            postgresql.JSONB(astext_type=sa.Text()),
            autoincrement=False,
            nullable=False,
            comment="Evaluation configuration (model, instructions, etc.)",
        ),
    )

⚠️ Potential issue | 🔴 Critical

Critical: Downgrade will fail with existing data.

The downgrade re-adds the config column with nullable=False (Line 53). If the evaluation_run table contains any records when downgrading, this operation will fail because PostgreSQL cannot add a non-nullable column to a table with existing rows without specifying a default value.

Either:

  1. Make the column nullable during downgrade: nullable=True
  2. Provide a server default value
  3. Add a data migration to populate the column before setting it non-nullable
🔧 Proposed fix (Option 1: Make nullable)
     op.add_column(
         "evaluation_run",
         sa.Column(
             "config",
             postgresql.JSONB(astext_type=sa.Text()),
             autoincrement=False,
-            nullable=False,
+            nullable=True,
             comment="Evaluation configuration (model, instructions, etc.)",
         ),
     )
🤖 Prompt for AI Agents
In @backend/app/alembic/versions/041_add_config_in_evals_run_table.py around
lines 47 - 56, The downgrade currently re-adds the "config" column on the
"evaluation_run" table using op.add_column with sa.Column(..., nullable=False)
which will fail if rows exist; update that op.add_column call in the downgrade
to use nullable=True (or alternatively add a server_default or a prior data
migration to populate values before setting non-nullable), ensuring the column
is created nullable during downgrade to avoid PostgreSQL errors.

Comment on lines +505 to 509
    elif config.completion.provider != LLMProvider.OPENAI:
        raise HTTPException(
            status_code=422,
            detail="Only 'openai' provider is supported for evaluation configs",
        )
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Invalid constant reference - LLMProvider.OPENAI does not exist.

The code references LLMProvider.OPENAI but the LLMProvider registry (backend/app/services/llm/providers/registry.py) only defines OPENAI_NATIVE = "openai-native". However, the error message and test configs use "openai" as the provider string.

This mismatch will cause an AttributeError at runtime.

🔍 Verify the correct provider constant
#!/bin/bash
# Check what constants are defined in LLMProvider
ast-grep --pattern 'class LLMProvider:
  $$$
'

# Check what provider values are used in evaluation configs
rg -n --type=py "provider.*=.*[\"']openai[\"']" backend/app/

Based on the error message expecting "openai" and test data using provider="openai", you likely need either:

  1. Add OPENAI = "openai" constant to LLMProvider, or
  2. Change the validation logic to check the string directly: != "openai"
🤖 Prompt for AI Agents
In @backend/app/api/routes/evaluation.py around lines 505 - 509, The code
references a non-existent constant LLMProvider.OPENAI in the evaluation config
validation, causing AttributeError; update the check in evaluation.py (the block
that raises HTTPException) to compare against the actual provider string
"openai" (i.e., use config.completion.provider != "openai") or alternatively add
a new constant OPENAI = "openai" to the LLMProvider class in
backend/app/services/llm/providers/registry.py so the symbol exists and matches
tests; pick one approach and ensure the error message and tests remain
consistent with the chosen value.

Comment on lines +308 to +349
def resolve_model_from_config(
    session: Session,
    eval_run: EvaluationRun,
) -> str:
    """
    Resolve the model name from the evaluation run's config.

    Args:
        session: Database session
        eval_run: EvaluationRun instance

    Returns:
        Model name from config

    Raises:
        ValueError: If config is missing, invalid, or has no model
    """
    if not eval_run.config_id or not eval_run.config_version:
        raise ValueError(
            f"Evaluation run {eval_run.id} has no config reference "
            f"(config_id={eval_run.config_id}, config_version={eval_run.config_version})"
        )

    config_version_crud = ConfigVersionCrud(
        session=session,
        config_id=eval_run.config_id,
        project_id=eval_run.project_id,
    )

    config, error = resolve_config_blob(
        config_crud=config_version_crud,
        config=LLMCallConfig(id=eval_run.config_id, version=eval_run.config_version),
    )

    if error or config is None:
        raise ValueError(
            f"Config resolution failed for evaluation {eval_run.id} "
            f"(config_id={eval_run.config_id}, version={eval_run.config_version}): {error}"
        )

    model = config.completion.params.get("model")
    return model

⚠️ Potential issue | 🔴 Critical

Fix type mismatch: model extraction can return None.

The function's return type is str, but line 348 uses config.completion.params.get("model") which can return None if the "model" key is missing. This violates the type contract and could cause issues when the model is passed to downstream functions expecting a string.

✅ Validate that model exists
     model = config.completion.params.get("model")
+    if not model:
+        raise ValueError(
+            f"Config for evaluation {eval_run.id} does not specify a model "
+            f"(config_id={eval_run.config_id}, version={eval_run.config_version})"
+        )
     return model
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (full function with the validation added):

def resolve_model_from_config(
    session: Session,
    eval_run: EvaluationRun,
) -> str:
    """
    Resolve the model name from the evaluation run's config.

    Args:
        session: Database session
        eval_run: EvaluationRun instance

    Returns:
        Model name from config

    Raises:
        ValueError: If config is missing, invalid, or has no model
    """
    if not eval_run.config_id or not eval_run.config_version:
        raise ValueError(
            f"Evaluation run {eval_run.id} has no config reference "
            f"(config_id={eval_run.config_id}, config_version={eval_run.config_version})"
        )

    config_version_crud = ConfigVersionCrud(
        session=session,
        config_id=eval_run.config_id,
        project_id=eval_run.project_id,
    )

    config, error = resolve_config_blob(
        config_crud=config_version_crud,
        config=LLMCallConfig(id=eval_run.config_id, version=eval_run.config_version),
    )

    if error or config is None:
        raise ValueError(
            f"Config resolution failed for evaluation {eval_run.id} "
            f"(config_id={eval_run.config_id}, version={eval_run.config_version}): {error}"
        )

    model = config.completion.params.get("model")
    if not model:
        raise ValueError(
            f"Config for evaluation {eval_run.id} does not specify a model "
            f"(config_id={eval_run.config_id}, version={eval_run.config_version})"
        )
    return model
🤖 Prompt for AI Agents
In @backend/app/crud/evaluations/core.py around lines 308 - 349,
resolve_model_from_config currently declares returning str but assigns model =
config.completion.params.get("model") which may be None; update
resolve_model_from_config to validate that model is present and a str (e.g., if
not model: raise ValueError(...) with context including eval_run.id, config_id,
config_version) before returning, or coerce/choose a safe default only if
intended; reference the resolve_model_from_config function and the model
variable from config.completion.params.get("model") when implementing the check.

Comment on lines +257 to +263
    # Use model stored at creation time for cost tracking
    model = resolve_model_from_config(session=session, eval_run=eval_run)

    trace_id_mapping = create_langfuse_dataset_run(
        langfuse=langfuse,
        dataset_name=eval_run.dataset_name,
        model=model,

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, locate the resolve_model_from_config function
rg -n "def resolve_model_from_config" --type=py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 146


🏁 Script executed:

# Check the imports and context around line 257-263 in processing.py
head -n 35 backend/app/crud/evaluations/processing.py | tail -n 10

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 446


🏁 Script executed:

# Look at the core.py file around line 348
sed -n '340,360p' backend/app/crud/evaluations/core.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 379


🏁 Script executed:

# Get the full resolve_model_from_config function with type hints
sed -n '308,330p' backend/app/crud/evaluations/core.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 712


🏁 Script executed:

# Check the create_langfuse_dataset_run function signature
rg -n "def create_langfuse_dataset_run" -A 5 --type=py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 502


🏁 Script executed:

# Check if there's any validation before resolve_model_from_config is called
sed -n '250,270p' backend/app/crud/evaluations/processing.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 816


🏁 Script executed:

# Get the complete resolve_model_from_config function
sed -n '308,365p' backend/app/crud/evaluations/core.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1340


Add validation to ensure model key exists in config params.

The resolve_model_from_config function's return type is annotated as str, and its docstring promises to raise ValueError if the config has no model. However, the implementation at line 350 uses .get("model") without validation, allowing it to return None and violating the type contract.

Add a validation check after retrieving the model to ensure the key exists, or update the return type annotation to str | None if None is acceptable. This fixes the contract mismatch between the type hint and actual implementation.

🤖 Prompt for AI Agents
In @backend/app/crud/evaluations/processing.py around lines 257 - 263,
resolve_model_from_config currently uses config.get("model") which can return
None despite its str return annotation and docstring promise; modify
resolve_model_from_config to validate the retrieved value and raise ValueError
if missing (or alternatively change the function signature to return str | None
and update callers), e.g., after fetching model = config.get("model") check if
model is truthy and raise ValueError("missing model in config") to enforce the
contract so callers like resolve_model_from_config(session=session,
eval_run=eval_run) always receive a str or an explicit None-aware type is used
consistently.


Labels

enhancement (New feature or request), ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add config management in Evals

5 participants