test(browser): make signup email test fail-fast and accept login flow#801
test(browser): make signup email test fail-fast and accept login flow#801leo-notte wants to merge 5 commits into
Conversation
Drop the @pytest.mark.flaky(reruns=3) wrapper on test_signup_email_extraction so a failed run fails the test instead of silently retrying up to 3 more times. Rework the agent prompt to: - accept either signup or login (existing accounts no longer fail), - forbid Google/GitHub/SSO and force the plain email flow, - accept any logged-in landing page as success (the 'One more second' interstitial is no longer always shown, which was causing validator rejections). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughA CI workflow step was added to upload signup debug logs from /tmp/signup-debug/ as an artifact named Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/browser/test_tools.py (1)
55-73: 💤 Low valueLGTM — the fail-fast rework is clean.
The removal of
@pytest.mark.flakyand the reworked prompt look correct. One small note on the prompt's stop condition:"CRITICAL: do not fill in any onboarding form — stop immediately once authenticated."If the first post-authentication page is the onboarding form (i.e., there is no intermediate logged-in page before it), an agent following this instruction literally would stop on the onboarding form and call it success, which matches the intent — but the phrasing "do not fill in" could be misread as "skip past it somehow." A slightly cleaner phrasing would be:
"CRITICAL: do not fill in or interact with any onboarding form — landing on any page inside the console (including an onboarding form) counts as success. Stop immediately."This is purely cosmetic/clarifying and does not affect the current functional behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/browser/test_tools.py` around lines 55 - 73, Update the prompt string in test_signup_email_extraction (inside the agent.run call) to clarify the stop condition: replace the sentence "CRITICAL: do not fill in any onboarding form — stop immediately once authenticated." with wording like "CRITICAL: do not fill in or interact with any onboarding form — landing on any page inside the console (including an onboarding form) counts as success. Stop immediately." so the agent won't misinterpret "do not fill in" as needing to skip past an onboarding page; keep this change confined to the task string passed to agent.run in the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/browser/test_tools.py`:
- Around line 55-73: Update the prompt string in test_signup_email_extraction
(inside the agent.run call) to clarify the stop condition: replace the sentence
"CRITICAL: do not fill in any onboarding form — stop immediately once
authenticated." with wording like "CRITICAL: do not fill in or interact with any
onboarding form — landing on any page inside the console (including an
onboarding form) counts as success. Stop immediately." so the agent won't
misinterpret "do not fill in" as needing to skip past an onboarding page; keep
this change confined to the task string passed to agent.run in the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b49479a1-8028-471d-beb8-622ed5278568
📒 Files selected for processing (1)
tests/browser/test_tools.py
|
| Filename | Overview |
|---|---|
| tests/browser/test_tools.py | Removed flaky retry decorator and rewrote agent task prompt to accept login or signup path; resp.success check is unchanged but email extraction is no longer exercised when the login path is taken |
Comments Outside Diff (1)
-
tests/browser/test_tools.py, line 55-73 (link)Test name no longer matches actual coverage on repeat runs
The test is called
test_signup_email_extraction, but its core value was exercising the persona's email-inbox tool (reading the verification/magic-link email). Now that the prompt accepts the login path, a persona whose account already exists will authenticate without any email extraction at all — the inbox tool is never invoked, andresp.successcan pass entirely on the password/session login branch. On every run after the first, the "email extraction" capability goes untested. Consider either renaming the test to reflect its new scope (test_authenticate_email_flow) or adding an assertion that confirms the inbox tool was called when the signup path was taken.Prompt To Fix With AI
This is a comment left during a code review. Path: tests/browser/test_tools.py Line: 55-73 Comment: **Test name no longer matches actual coverage on repeat runs** The test is called `test_signup_email_extraction`, but its core value was exercising the persona's email-inbox tool (reading the verification/magic-link email). Now that the prompt accepts the login path, a persona whose account already exists will authenticate without any email extraction at all — the inbox tool is never invoked, and `resp.success` can pass entirely on the password/session login branch. On every run after the first, the "email extraction" capability goes untested. Consider either renaming the test to reflect its new scope (`test_authenticate_email_flow`) or adding an assertion that confirms the inbox tool was called when the signup path was taken. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
tests/browser/test_tools.py:55-73
**Test name no longer matches actual coverage on repeat runs**
The test is called `test_signup_email_extraction`, but its core value was exercising the persona's email-inbox tool (reading the verification/magic-link email). Now that the prompt accepts the login path, a persona whose account already exists will authenticate without any email extraction at all — the inbox tool is never invoked, and `resp.success` can pass entirely on the password/session login branch. On every run after the first, the "email extraction" capability goes untested. Consider either renaming the test to reflect its new scope (`test_authenticate_email_flow`) or adding an assertion that confirms the inbox tool was called when the signup path was taken.
Reviews (1): Last reviewed commit: "test(browser): tighten signup email test..." | Re-trigger Greptile
Re-add @pytest.mark.flaky(reruns=3, reruns_delay=5) so transient agent flakes don't fail the suite, but pin a 90s @pytest.mark.timeout per attempt instead of relying on the global 300s. Each retry now fails fast on a hang rather than burning the full 5-minute budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bump @pytest.mark.timeout from 90s to 120s. A clean run takes ~50s locally, but variance from LLM rate-limit backoff (vertex_ai 429s under concurrent xdist load) can easily double that. 120s gives a safety margin while still well below the global 300s and capping worst-case wall time across reruns at ~8 minutes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the soft 'never use Google sign-in' instruction with an ABSOLUTE RULE that enumerates the forbidden button labels (Google, GitHub, SSO, Microsoft, Apple, social) and instructs the agent to read each button's text BEFORE clicking. Also whitelist the acceptable email-flow button labels and add an explicit recovery clause if a forbidden button is clicked by accident. Why: even when the agent's stated intent was 'click Send magic link', the action mapper was sometimes resolving to id=B2 = 'Use Google', landing on a Google OAuth flow the agent couldn't escape. The soft prompt let it slip through; the explicit forbid-list does not. 9/9 local runs pass with the new prompt vs 1/3 previously. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a dual loguru + stdlib FileHandler in test_signup_email_extraction
that writes to /tmp/signup-debug/signup-{ts}-{pid}.log, and an
actions/upload-artifact step that uploads that directory on every CI
run (if: always).
Why: in CI the test crashes its xdist worker at the pytest-timeout
('node down: Not properly terminated'), and the worker's captured
stdout dies with it — we have zero visibility into what the agent
or litellm were doing. Writing directly to a file (with
enqueue=False, immediate flush) means the log survives a SIGKILL.
Captures both notte's loguru output and litellm's stdlib logging
(rate-limit / 429 messages, etc.), so we can finally tell whether CI
crashes are 429 storms, action misclicks, or something else.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/browser/test_tools.py (1)
67-71: ⚡ Quick winTwo independent file descriptors are writing to the same
log_path, risking interleaved/garbled output.
logger.add(str(log_path), ...)opens one file descriptor;logging.FileHandler(str(log_path))opens a second independent one to the same path. Writes from both can interleave mid-line, which defeats the debugging purpose.Loguru supports passing a
logging.Handlerdirectly as a sink, so you can eliminate the second fd entirely and let loguru own the file:✅ Proposed fix
- loguru_sink_id = logger.add(str(log_path), level="DEBUG", enqueue=False, backtrace=True, diagnose=True) - stdlib_handler = logging.FileHandler(str(log_path)) - stdlib_handler.setLevel(logging.DEBUG) - stdlib_handler.setFormatter(logging.Formatter("%(asctime)s [stdlib %(name)s %(levelname)s] %(message)s")) - logging.getLogger().addHandler(stdlib_handler) + stdlib_handler = logging.FileHandler(str(log_path)) + stdlib_handler.setLevel(logging.DEBUG) + stdlib_handler.setFormatter(logging.Formatter("%(asctime)s [stdlib %(name)s %(levelname)s] %(message)s")) + # Use loguru as the sole writer; pass the stdlib FileHandler as a sink so + # both loguru and stdlib records flow through one fd without interleaving. + loguru_sink_id = logger.add(stdlib_handler, level="DEBUG", backtrace=True, diagnose=True) + logging.getLogger().addHandler(stdlib_handler)The
finallycleanup block remains correct as-is.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/browser/test_tools.py` around lines 67 - 71, Two independent file descriptors are being opened for the same log_path (logger.add(str(log_path), ...) and logging.FileHandler(str(log_path))), which can interleave output; fix by creating a single logging.Handler and passing that handler as the sink to logger.add instead of opening the file twice: instantiate stdlib_handler = logging.FileHandler(str(log_path)), setLevel and setFormatter on stdlib_handler, then call logger.add(stdlib_handler, level="DEBUG", enqueue=False, backtrace=True, diagnose=True) and remove the separate logging.getLogger().addHandler(stdlib_handler) and the logger.add(str(log_path), ...) call so only one FD is used (references: logger.add, logging.FileHandler, logging.getLogger().addHandler).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/browser/test_tools.py`:
- Around line 64-66: The current creation of log_dir via
Path(os.getenv("SIGNUP_DEBUG_LOG_DIR", "/tmp/signup-debug")) is flagged for
using a predictable /tmp path; replace the default with a safe temp-backed
directory: use tempfile.mkdtemp() or construct Path(tempfile.gettempdir()) /
"signup-debug" as the fallback when SIGNUP_DEBUG_LOG_DIR is not set, then ensure
the existing log_dir.mkdir(parents=True, exist_ok=True) and log_path creation
(the variables log_dir and log_path) continue to work unchanged so tests still
write to a valid directory.
- Around line 61-62: Remove the flaky reruns decorator so the test fails fast:
delete the line with `@pytest.mark.flaky`(reruns=3, reruns_delay=5) leaving only
`@pytest.mark.timeout`(120) above the test function (i.e., remove the
`@pytest.mark.flaky` decorator attached to the same test so it no longer performs
reruns).
---
Nitpick comments:
In `@tests/browser/test_tools.py`:
- Around line 67-71: Two independent file descriptors are being opened for the
same log_path (logger.add(str(log_path), ...) and
logging.FileHandler(str(log_path))), which can interleave output; fix by
creating a single logging.Handler and passing that handler as the sink to
logger.add instead of opening the file twice: instantiate stdlib_handler =
logging.FileHandler(str(log_path)), setLevel and setFormatter on stdlib_handler,
then call logger.add(stdlib_handler, level="DEBUG", enqueue=False,
backtrace=True, diagnose=True) and remove the separate
logging.getLogger().addHandler(stdlib_handler) and the logger.add(str(log_path),
...) call so only one FD is used (references: logger.add, logging.FileHandler,
logging.getLogger().addHandler).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b2d05379-d936-497d-8faf-f0f675a0c1a4
📒 Files selected for processing (2)
.github/workflows/test-cicd.ymltests/browser/test_tools.py
| @pytest.mark.timeout(120) | ||
| @pytest.mark.flaky(reruns=3, reruns_delay=5) |
There was a problem hiding this comment.
@pytest.mark.flaky was supposed to be removed by this PR.
The stated objective of this PR is to make the test fail-fast by removing @pytest.mark.flaky(reruns=3, reruns_delay=5), but the decorator is still present on Line 62. With @pytest.mark.timeout(120) added, the test can now spend up to 3 × (120 s + 5 s delay) = 375 s before reporting a hard failure — the opposite of the fail-fast intent described in the PR title and description.
🐛 Proposed fix
`@pytest.mark.timeout`(120)
-@pytest.mark.flaky(reruns=3, reruns_delay=5)
def test_signup_email_extraction(persona: NottePersona):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/browser/test_tools.py` around lines 61 - 62, Remove the flaky reruns
decorator so the test fails fast: delete the line with
`@pytest.mark.flaky`(reruns=3, reruns_delay=5) leaving only
`@pytest.mark.timeout`(120) above the test function (i.e., remove the
`@pytest.mark.flaky` decorator attached to the same test so it no longer performs
reruns).
| log_dir = Path(os.getenv("SIGNUP_DEBUG_LOG_DIR", "/tmp/signup-debug")) | ||
| log_dir.mkdir(parents=True, exist_ok=True) | ||
| log_path = log_dir / f"signup-{int(time.time() * 1000)}-{os.getpid()}.log" |
There was a problem hiding this comment.
Ruff S108: predictable /tmp path could allow a symlink attack in shared environments.
In a typical single-tenant CI runner the risk is low, and the path is already overridable via SIGNUP_DEBUG_LOG_DIR. Using tempfile.mkdtemp() or Path(tempfile.gettempdir()) / "signup-debug" would silence the lint and is marginally safer.
🧰 Tools
🪛 Ruff (0.15.12)
[error] 64-64: Probable insecure usage of temporary file or directory: "/tmp/signup-debug"
(S108)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/browser/test_tools.py` around lines 64 - 66, The current creation of
log_dir via Path(os.getenv("SIGNUP_DEBUG_LOG_DIR", "/tmp/signup-debug")) is
flagged for using a predictable /tmp path; replace the default with a safe
temp-backed directory: use tempfile.mkdtemp() or construct
Path(tempfile.gettempdir()) / "signup-debug" as the fallback when
SIGNUP_DEBUG_LOG_DIR is not set, then ensure the existing
log_dir.mkdir(parents=True, exist_ok=True) and log_path creation (the variables
log_dir and log_path) continue to work unchanged so tests still write to a valid
directory.
There was a problem hiding this comment.
Needs attention — 1 issue in 1 file
The new logging teardown has a bug: logger.remove(loguru_sink_id) raises ValueError: There is no existing handler with id 5 when the test is retried by @pytest.mark.flaky. This is confirmed in the CI failure for run 25514905384. The other 4 failures (vertex_ai rate limits, test_observe_with_instructions) are pre-existing integration flakiness tracked in insight 01KP9XGDY6FQENXZ0W8V6QK5H4 and not caused by this PR.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<assessment>
The new logging teardown has a bug: `logger.remove(loguru_sink_id)` raises `ValueError: There is no existing handler with id 5` when the test is retried by `@pytest.mark.flaky`. This is confirmed in the CI failure for run [25514905384](https://github.com/nottelabs/notte/actions/runs/25514905384). The other 4 failures (vertex_ai rate limits, `test_observe_with_instructions`) are pre-existing integration flakiness tracked in [insight 01KP9XGDY6FQENXZ0W8V6QK5H4](https://app.mendral.com/insights/01KP9XGDY6FQENXZ0W8V6QK5H4) and not caused by this PR.
</assessment>
<file name="tests/browser/test_tools.py">
<issue location="tests/browser/test_tools.py:107">
`logger.remove(loguru_sink_id)` raises `ValueError` on reruns because the sink ID from a previous attempt is no longer valid when the timeout fires mid-run and the `finally` block executes. Wrap the removal in a try/except to make teardown idempotent.
</issue>
</file>
Tag @mendral-app with feedback or questions. View session
| logger.remove(loguru_sink_id) | ||
| logging.getLogger().removeHandler(stdlib_handler) | ||
| stdlib_handler.close() |
There was a problem hiding this comment.
bug (P1): logger.remove(loguru_sink_id) raises ValueError on reruns because the sink ID from a previous attempt is no longer valid when the timeout fires mid-run and the finally block executes. Wrap the removal in a try/except to make teardown idempotent.
Suggested change
| logger.remove(loguru_sink_id) | |
| logging.getLogger().removeHandler(stdlib_handler) | |
| stdlib_handler.close() | |
| try: | |
| logger.remove(loguru_sink_id) | |
| except ValueError: | |
| pass | |
| logging.getLogger().removeHandler(stdlib_handler) | |
| stdlib_handler.close() |
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/browser/test_tools.py, line 107:
<issue>
`logger.remove(loguru_sink_id)` raises `ValueError` on reruns because the sink ID from a previous attempt is no longer valid when the timeout fires mid-run and the `finally` block executes. Wrap the removal in a try/except to make teardown idempotent.
</issue>
Summary
@pytest.mark.flaky(reruns=3, reruns_delay=5)ontest_signup_email_extractionso the test fails on the first bad run instead of silently retrying up to 3 more times (each rerun is a full 15-step agent run).Test plan
uv run pytest tests/browser/test_tools.py::test_signup_email_extraction -vpasses against a working persona🤖 Generated with Claude Code
Summary by CodeRabbit
Tests
Chores
Note
Hardens
test_signup_email_extractionwith an explicit forbidden-button list (Google, GitHub, SSO, Microsoft, Apple, social), whitelists acceptable email-flow button labels, adds a recovery clause for accidental forbidden clicks, bumps the per-attempt timeout to 120s, restores@pytest.mark.flaky(reruns=3, reruns_delay=5), and adds file-based debug logging that survives xdist worker crashes.Written by Mendral for commit 9b8dc75.