Add comprehensive test suite#69

Open
Quiterion wants to merge 51 commits into
anima-research:mainfrom
Quiterion:feature/testing-infrastructure
Conversation

@Quiterion
Contributor

Summary

Adds a comprehensive automated test suite across the entire backend, achieving 87.66% statement and 78.92% branch coverage. This provides a regression safety net for future changes.

  • 2,469 tests across 56 test files, all passing
  • No source code modifications — all tests are characterization tests capturing existing behavior
  • CI workflow added to run tests with coverage reporting on every PR

What's covered

| Area | Stmts | Branches | Highlights |
| --- | --- | --- | --- |
| Config | 98.6% | 93.8% | loader, model-loader, site-config-loader |
| Database | 89.3% | 75.5% | index.ts (5400 lines), all sub-stores, compaction, migration |
| Middleware | 100% | 100% | auth |
| Routes | 82.4% | 76.1% | All 18 route files tested |
| Services | 93.1% | 85.1% | All 5 AI providers (93-99%), inference, enhanced-inference, context manager |
| WebSocket | 94.5% | 80.0% | handler, room-manager |

Test plan

  • All 2,469 backend tests pass
  • Frontend store tests pass (129 tests)
  • Coverage targets met (75%+ stmts and branches)

Tests for generateToken, verifyToken, and authenticateToken middleware.
Coverage: 100% statements, 100% branches, 100% functions, 100% lines.

Tests cover:
- generateToken: valid JWT creation, 7-day expiry, unique per userId
- verifyToken: valid tokens, wrong secret, expired, malformed, tampered payload
- authenticateToken: valid Bearer token, missing header, empty token,
  invalid token, expired token, wrong secret, complex user IDs
- encryption.test.ts: 34 tests covering roundtrip encrypt/decrypt for all
  data types, tampered ciphertext/IV/auth tag detection, wrong key rejection,
  malformed input handling, cross-instance key derivation, env var fallback
- error-messages.test.ts: 46 tests verifying all message constants contain
  expected key phrases, template functions interpolate correctly, and all
  categories have appropriate structure

Coverage: encryption.ts 91% stmt/100% branch, error-messages.ts 100%/100%
- pricing-cache.test.ts: 20 tests covering cache update/lookup, pricing
  computation (string/numeric/NaN/null/zero), staleness detection, cache
  replacement, refresh callback invocation and error handling
- cache-strategies.test.ts: 37 tests covering DefaultCacheStrategy (Opus
  refresh vs rebuild, first-request token threshold at 500, context rotation,
  performance analysis with hit/expire/saved metrics), AggressiveCacheStrategy
  (always-refresh behavior), CostOptimizedCacheStrategy (Opus 2000-token vs
  non-Opus 5000-token thresholds)

Coverage: pricing-cache.ts 97%/96%, cache-strategies.ts 100%/97%
41 tests covering:
- User connection registration/unregistration (including multi-tab)
- Room join/leave lifecycle and automatic cleanup of empty rooms
- Multi-user room membership and user deduplication
- Broadcast messaging with sender exclusion and closed connection skipping
- AI request tracking: start, conflict detection, end, state queries
- Heartbeat: pinging alive connections, terminating unresponsive ones
- Stats reporting with room counts and AI request state
- Edge cases: no userId, non-existent rooms, send errors, ping errors

Documents hasActiveAiRequest quirk: returns true for non-existent rooms
(undefined !== null) — getActiveAiRequest is the reliable alternative.

Coverage: room-manager.ts 99% stmt, 93% branch
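
The hasActiveAiRequest quirk noted above can be reproduced in isolation. The following is a minimal sketch, assuming a simplified room shape; the `Room` interface and the `rooms` map are hypothetical stand-ins, not the actual RoomManager internals.

```typescript
// Hypothetical, simplified room state; the real RoomManager is not shown in this PR.
interface Room {
  activeAiRequest: string | null; // null means "no request in flight"
}

const rooms = new Map<string, Room>();
rooms.set("room-1", { activeAiRequest: null });

// Quirk: for a missing room the lookup yields undefined,
// and `undefined !== null` evaluates to true.
function hasActiveAiRequest(roomId: string): boolean {
  return rooms.get(roomId)?.activeAiRequest !== null;
}

// More reliable alternative: collapse "no room" and "no request" into null.
function getActiveAiRequest(roomId: string): string | null {
  return rooms.get(roomId)?.activeAiRequest ?? null;
}
```

A characterization test pins the surprising `true` for a missing room rather than fixing it, which is consistent with the "no source code modifications" constraint of this PR.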
94 tests covering all parser functions: parseBasicJson, parseAnthropic,
parseChromeExtension, parseArcChat, parseOpenAI, parseCursor, parseCursorJson,
parseColonFormat. Tests include format detection, participant dedup, title
extraction, edge cases (empty, invalid, branching), and MIME type guessing.

Coverage: 98.2% stmt, 85.3% branch, 100% functions.
67 tests covering estimateTokens, getMessageTokens, and all 5 strategy
implementations: AppendContextStrategy, RollingContextStrategy,
LegacyRollingContextStrategy, StaticContextStrategy, AdaptiveContextStrategy.

Tests include token estimation for text/images/thinking blocks, cache marker
placement with arithmetic positioning, rolling window rotation, grace period
behavior, branch change detection, and edge cases.

Coverage: 96.3% stmt, 87.6% branch, 100% functions.
ConfigLoader (21 tests):
- Load/cache/reload config from CONFIG_PATH env var
- Default config fallback when file missing or invalid JSON
- getBestProfile filtering: allowedModels, allowedUserGroups, modelCosts
- Load balancing strategies: first, round-robin, least-used, random
- getDefaultModel with/without config, getProviderProfiles, singleton

ModelLoader (19 tests):
- Load/cache/reload models from MODELS_CONFIG_PATH
- getAllModels: system-only vs merged with user-defined models
- User model settings conversion (topP/topK optional handling)
- getModelsByProvider filtering, getModelById with user lookup
- getModelProvider, missing file/invalid JSON fallbacks, singleton

Coverage: loader.ts 98.79% stmt / 92.59% branch
         model-loader.ts 100% stmt / 85.71% branch
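
For readers unfamiliar with the four load balancing strategies named above, here is a rough sketch of how such a selector might look. The `Profile` shape, the `pickProfile` name, and the module-level cursor are assumptions for illustration; the real ConfigLoader may differ.

```typescript
type Strategy = "first" | "round-robin" | "least-used" | "random";

// Hypothetical profile shape; only the fields needed for selection.
interface Profile {
  id: string;
  usageCount: number;
}

// Cursor shared across calls so round-robin advances on every pick.
let roundRobinCursor = 0;

function pickProfile(profiles: Profile[], strategy: Strategy): Profile | undefined {
  if (profiles.length === 0) return undefined;
  switch (strategy) {
    case "first":
      return profiles[0];
    case "round-robin": {
      const chosen = profiles[roundRobinCursor % profiles.length];
      roundRobinCursor += 1;
      return chosen;
    }
    case "least-used":
      // Keep the earliest profile on ties.
      return profiles.reduce((best, p) => (p.usageCount < best.usageCount ? p : best));
    case "random":
      return profiles[Math.floor(Math.random() * profiles.length)];
  }
}
```

The deterministic strategies can be asserted exactly; for `random`, a test would typically stub `Math.random` or only assert membership.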
authenticity.ts (43 tests):
- computeAuthenticity: empty/null input, single unaltered message, legacy
  messages, human-written AI, user messages, edit/split/posthoc propagation,
  name collisions (case-insensitive), multi-message conversations
- getAuthenticityLevel: all 8 levels with priority ordering
- getAuthenticityColor: all 8 level-to-color mappings verified
- getAuthenticityTooltip: content validation for all levels

modelColors.ts (46 tests):
- Direct model ID matches for major model families
- Pattern matching for Opus, Sonnet, Haiku, GPT, Llama, Gemini, Mistral,
  Command, DeepSeek, O1 variants with provider prefixes
- Default fallback for unknown models, undefined/empty input
- getLighterColor hex-to-rgba conversion with various opacities

latex.ts (22 tests):
- Display math ($$...$$, \[...\]), inline math ($...$, \(...\))
- Skip optimization (no delimiters = no processing)
- Error recovery: all 4 catch blocks tested via katex mock throw
- Mixed content, delimiter edge cases

avatars.ts (35 tests):
- loadAvatarPacks: API loading, caching, error handling
- getAvatarUrl/getAvatarColor: pack lookup, null/missing cases
- getModelAvatarUrl: canonicalId direct + derived
- getParticipantAvatarUrl: override priority chain (participant > persona > model)
- getParticipantColor: override priority chain, user-type returns null

Coverage: authenticity 100%/98.61%, avatars 96.2%/92.98%,
          latex 100%/100%, modelColors 96.2%/97.54%
- Add vitest.config.ts with projects config so `npx vitest run` works from monorepo root
- Change avatars.test.ts to use relative imports instead of @/ alias (works in all contexts)
Anthropic (60 tests):
- formatMessagesForAnthropic: user/assistant/system messages, multi-turn ordering,
  active branch selection, image/PDF/text attachments, mixed attachments
- Cache control: simple messages, attachment messages, cache breakpoints
- Thinking blocks: signed (structured), unsigned (XML text), redacted, mixed
- Prefill-format thinking: thinking/redacted_thinking tag prepending
- Image resize: under limit, over limit, no dimensions, sharp error
- splitAtCacheBreakpoints: multi-section, empty sections, no breakpoints
- calculateCacheSavings: known models, unknown models, zero tokens
- parseThinkingTags: single/multiple blocks, no tags, empty tags
- Helper methods: isImage, isPdf, isAudio, isVideo, getMediaType, getImageMediaType

Bedrock (49 tests):
- formatMessagesForClaude: user/assistant/system, multi-turn, active branch
- Attachments: image, PDF, text inline, mixed, resize edge cases
- buildRequestBody: Claude 3 Messages API vs Claude 2 legacy prompt format,
  system prompt, stop_sequences, temperature vs top_p/top_k exclusion,
  content block text extraction for Claude 2
- extractContentFromChunk: Claude 3 deltas, Claude 2 completions
- isStreamComplete: Claude 3 message_stop, Claude 2 stop_reason
- Helper methods: isImage, isPdf, getMediaType, validateApiKey

Mutation tests passed (7 mutations, all caught):
- Anthropic: formatMessagesForAnthropic system filter, splitAtCacheBreakpoints
  cache_control, calculateCacheSavings multiplier, parseThinkingTags regex
- Bedrock: buildRequestBody Claude 3 detection, extractContentFromChunk field,
  isStreamComplete event type
88 tests across 3 provider service files:
- openrouter.test.ts (41): formatMessagesForOpenRouter, detectProviderFromModelId,
  calculateCacheSavings, getMediaType, attachments, cache_control, thinking blocks
- gemini.test.ts (26): formatMessagesForGemini, getMimeType, isSupportedMediaType,
  role mapping, thought_signature, blob store, attachments
- openai-compatible.test.ts (21): formatMessagesForOpenAI, parseThinkingTags,
  think tags, redacted_thinking, attachments

9 mutations tested and caught (3 per file).
51 tests covering:
- checkContentSync regex fallback patterns
- checkContent tiered moderation (always-blocked, age-restricted, researcher-exempt)
- Admin bypass
- Threshold boundary precision (critical=0.5, blocking=0.7)
- Tier priority ordering (tier 1 > tier 2 > tier 3)
- API error handling (fail open on 5xx, network error, empty results)
- checkMessages combining and filtering logic
- No API key scenario

Coverage: 100% stmts, 100% branch, 100% funcs, 100% lines
Mutation tested: checkContentSync (blocked→false), CRITICAL_THRESHOLD (0.5→0.9),
checkMessages filter removal, admin bypass removal — all caught
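
The threshold boundary tests and the CRITICAL_THRESHOLD mutation above follow a common pattern: probe just below, exactly at, and just above the threshold. The sketch below illustrates the idea; `isFlagged`, `boundaryProbe`, and the inclusive comparison are assumptions, since the real moderation code is not part of this PR text.

```typescript
// Threshold values quoted in the commit message above.
const CRITICAL_THRESHOLD = 0.5;
const BLOCKING_THRESHOLD = 0.7;

// Hypothetical helper: a score at or above the threshold counts as flagged.
// Whether the real comparison is inclusive is an assumption.
function isFlagged(score: number, threshold: number): boolean {
  return score >= threshold;
}

// Probe just below, exactly at, and just above the threshold.
function boundaryProbe(threshold: number): boolean[] {
  const epsilon = 0.001;
  return [threshold - epsilon, threshold, threshold + epsilon].map((score) =>
    isFlagged(score, threshold),
  );
}
```

If a mutation moves the threshold from 0.5 to 0.9, a score of exactly 0.5 stops being flagged, so an assertion at the boundary fails and the mutation is caught.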
31 tests covering:
- Priority ordering: user key > config profile > env var fallback
- User API key behavior (allowed/disallowed, provider matching, DB errors)
- Environment variable fallback for all providers (Anthropic, Bedrock, OpenRouter, OpenAI-compatible)
- Bedrock default region, missing secret key handling
- Config profile lookup parameter passing
- Rate limit checks (disabled, no limits, no features)
- Usage tracking with provider/billed cost calculations and margin
- getCostForModel matching and edge cases

Coverage: 98.63% stmts, 92.15% branch, 100% funcs, 98.57% lines
Mutation tested: getEnvApiKey (Anthropic→null), source user→config,
getCostForModel (find→first) — all caught
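
The priority ordering described above (user key, then config profile, then environment variable) amounts to a first-match chain. A minimal sketch, assuming hypothetical `KeyCandidates` and `ResolvedKey` shapes that are not taken from the actual API key manager:

```typescript
type KeySource = "user" | "config" | "env";

interface ResolvedKey {
  source: KeySource;
  apiKey: string;
}

// Hypothetical inputs: each candidate is optional.
interface KeyCandidates {
  userKey?: string; // key stored by the user
  configKey?: string; // key from a config provider profile
  envKey?: string; // key from an environment variable
}

// First non-empty candidate wins, in priority order.
function resolveApiKey(candidates: KeyCandidates): ResolvedKey | null {
  if (candidates.userKey) return { source: "user", apiKey: candidates.userKey };
  if (candidates.configKey) return { source: "config", apiKey: candidates.configKey };
  if (candidates.envKey) return { source: "env", apiKey: candidates.envKey };
  return null;
}
```

Asserting on `source` as well as `apiKey` is what lets a mutation such as "source user→config" be detected.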
Email (16 tests):
- sendVerificationEmail: subject, URL, 24h expiry, HTML template structure
- sendPasswordResetEmail: subject, URL, 1h expiry, HTML/plaintext versions
- No API key: verification returns true (dev mode), reset returns false
- Error handling: API errors and exceptions return false
- Template: DOCTYPE, button links, fallback text
Coverage: 100% stmts, 93.75% branch

Persona context builder (19 tests):
- buildPersonaContextById: persona not found, delegation to buildPersonaContext
- History assembly: combine historical + backscroll, skip current conversation
- Participation ordering: chronological by logicalStart, filter incomplete
- Context strategies: rolling (most recent), anchored (prefix + suffix), unknown
- Branch inheritance: recursive parent collection
- Token estimation: 1 token per 4 chars, active branch content
- Pre-computed canonicalHistory path with missing message handling
- Error: throws on missing conversation
Coverage: 99.08% stmts, 90% branch

Mutation tested:
- Email: verification no-key true→false, password reset subject swap — caught
- Persona: leftAt filter removal, sort order reversal — caught
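
The "1 token per 4 chars" estimate and the rolling (most recent) strategy listed above can be sketched together. Rounding up and the `takeMostRecent` helper are assumptions made for this illustration, not the persona context builder's actual code.

```typescript
// Rough estimate: one token per four characters (rounding up is an assumption).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical rolling selection: walk backwards from the newest message
// and keep messages until the token budget would be exceeded.
function takeMostRecent(messages: string[], tokenBudget: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const message of [...messages].reverse()) {
    const cost = estimateTokens(message);
    if (used + cost > tokenBudget) break;
    kept.unshift(message); // restore chronological order
    used += cost;
  }
  return kept;
}
```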
Add tests for persistence, blob-store, collaboration, and shares stores.

persistence.ts (13 tests): JSON serialization roundtrip, JSONL line parsing,
empty file handling, malformed line handling, large event append+load, close
idempotency. 91% stmts / 90% branch.

blob-store.ts (24 tests): Save/retrieve by hash, deduplication, metadata-only
retrieval, sharded directory structure, deletion with dedup cleanup, MIME type
extension mapping, JSON blob roundtrip, error-rethrow branches. 91% stmts /
80% branch.

collaboration.ts (45 tests): Share CRUD, permission updates, revocation with
index cleanup, invite creation with expiration/max-uses/labels, invite token
lookup with expiration/max-uses enforcement, invite usage tracking, creator-only
deletion, full event replay for all 6 event types. 98% stmts / 89% branch.

shares.ts (24 tests): Share creation with settings, token lookup with view count
increment, expiration enforcement, owner-only deletion, bulk conversation deletion,
event replay for created/deleted/viewed events. 98% stmts / 93% branch.

Mutation testing (3+ methods per file):
- persistence: appendEvent timestamp serialization, loadEvents ENOENT return, init guard
- blob-store: computeHash algorithm, dedup logic, default extension
- collaboration: getUserPermission return, deleteInvite creator check, expiration check
- shares: deleteShare owner check, viewCount increment, expiration check
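
As an illustration of the JSONL cases listed for persistence.ts (line parsing, empty file, malformed lines), here is a small sketch. Skipping malformed lines and counting them is an assumed policy; the real store may log or throw instead.

```typescript
interface ParseResult {
  events: unknown[];
  skipped: number;
}

// Parse newline-delimited JSON, ignoring blank lines.
// Malformed lines are skipped and counted (assumed behavior).
function parseJsonl(content: string): ParseResult {
  const events: unknown[] = [];
  let skipped = 0;
  for (const line of content.split("\n")) {
    if (line.trim() === "") continue;
    try {
      events.push(JSON.parse(line));
    } catch {
      skipped += 1;
    }
  }
  return { events, skipped };
}
```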
persona.ts (75 tests): Persona CRUD, custom options, archiving blocks new
participations, deletion cleans up shares. History branch creation, head
switching, cross-persona branch rejection. Participation tracking with
sequential logical times, interleaving constraint, canonical branch history,
fork-point filtering in collectBranchParticipations. Share CRUD with
duplicate prevention, permission updates, revocation. Event replay for all
13 event types. 93% stmts / 81% branch.

conversation-ui-state.ts (27 tests): Shared state save/load with caching,
active branch set/get, branch count increment/decrement with floor at zero.
Per-user state save/load/update, speakingAs, selectedResponder, detached
mode with branch clearing on re-attach. Read tracking with deduplication
and lastReadAt timestamps. Cache management (clearCache, clearUserCache).
deleteConversation removes files and clears caches. 89% stmts / 75% branch.

Mutation testing (3+ methods per file):
- persona: interleaving constraint, logicalEnd<=logicalStart, owner permission
- conversation-ui-state: Math.max(0) floor, detached branch clearing, read dedup
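
The Math.max(0) floor targeted by the mutation above is easy to show in isolation. This sketch assumes a hypothetical in-memory counter keyed by conversation ID; the real store also persists state to files.

```typescript
// Hypothetical in-memory branch counter.
const branchCounts = new Map<string, number>();

function incrementBranchCount(conversationId: string): number {
  const next = (branchCounts.get(conversationId) ?? 0) + 1;
  branchCounts.set(conversationId, next);
  return next;
}

// Decrement never goes below zero, even when called more often than increment.
function decrementBranchCount(conversationId: string): number {
  const next = Math.max(0, (branchCounts.get(conversationId) ?? 0) - 1);
  branchCounts.set(conversationId, next);
  return next;
}
```

Removing the `Math.max(0, ...)` wrapper would let the second decrement return -1, which is the kind of change a floor-at-zero test is meant to catch.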
33 tests covering event handling, message queuing, exponential backoff
reconnection, intentional disconnect, room management, connection state,
visibility handler, keep-alive/staleness detection, connection timeout,
and message parsing. Coverage: 84%/81% (stmt/branch).
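
For context on the exponential backoff reconnection mentioned above, a capped doubling schedule might look like the sketch below. The base delay, the cap, and the `reconnectDelay` name are assumptions; the actual WebSocket service's values are not stated in this PR.

```typescript
// Delay before reconnect attempt N (0-based): base * 2^N, capped at maxMs.
// The defaults here are illustrative, not the project's real settings.
function reconnectDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}
```

With fake timers, a test can advance time by each expected delay and assert that a reconnect was attempted only after the delay elapsed.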
14 tests covering initial state, ensureLoaded caching, concurrent load
deduplication, isLoading lifecycle, error fallback with/without message,
reload after error, reloadConfig force-fetch, getConfig sync access, and
convenience getters. Coverage: 100%/100% (stmt/branch).
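
Concurrent load deduplication, as tested above, usually means sharing one in-flight promise between callers. A minimal sketch, assuming a hypothetical `fetchConfig` stand-in and `SiteConfig` shape rather than the store's real API call:

```typescript
interface SiteConfig {
  siteName: string;
}

let cached: SiteConfig | null = null;
let inFlight: Promise<SiteConfig> | null = null;
let fetchCount = 0;

// Hypothetical stand-in for the real network request.
function fetchConfig(): Promise<SiteConfig> {
  fetchCount += 1;
  return Promise.resolve({ siteName: "example" });
}

// Callers that arrive while a load is pending share the same promise.
function ensureLoaded(): Promise<SiteConfig> {
  if (cached) return Promise.resolve(cached);
  if (inFlight) return inFlight;
  const pending = fetchConfig().then(
    (config) => {
      cached = config;
      inFlight = null;
      return config;
    },
    (error) => {
      inFlight = null; // allow a retry after failure
      throw error;
    },
  );
  inFlight = pending;
  return pending;
}
```

Clearing `inFlight` on failure is what makes a "reload after error" path possible.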
Remove 13 tests that only verified constructors exist (no behavioral
assertions) and 1 incomplete test in shares.test.ts with no assertions.
Flagged by quality review as specification-gaming patterns.
40 tests covering the user-related public API of Database:
- createUser: fields, flags (emailVerified, ageVerified, tosAccepted), duplicate rejection
- getUserById / getUserByEmail: lookup, missing, case-sensitivity (exact match only)
- validatePassword: correct/wrong/missing
- Email verification: token create, verify, expired token, consumed token
- Manual verification: verified/already-verified/nonexistent
- Age verification: set/check/nonexistent
- ToS acceptance: set/nonexistent
- Password reset: full flow, expired/consumed tokens, getPasswordResetTokenData
- getAllUsers: returns all users
- Event replay: user survives, email verification survives, password reset does NOT
  survive (password_reset event doesn't log new hash), age/ToS do NOT survive
  (no replay handlers for user_age_verified/user_tos_accepted events)
- Init auto-creates test users on fresh DB

Characterization quirks captured:
- getUserByEmail is case-SENSITIVE (no lowercasing)
- Password reset lost on DB reload (event doesn't persist new hash)
- Age verification and ToS acceptance lost on DB reload (no replay handlers)

Mutation tested: createUser duplicate check, verifyEmail expiry, validatePassword
hash comparison, resetPassword hash update — all caught.
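
The "lost on DB reload" quirks above come down to events that are logged but have no replay handler. The sketch below shows the shape of the problem; the event names `user_created` and `email_verified` and the state fields are hypothetical, while `user_age_verified` is the event named in the commit message.

```typescript
type UserEvent =
  | { type: "user_created"; userId: string }
  | { type: "email_verified"; userId: string }
  | { type: "user_age_verified"; userId: string };

interface UserState {
  emailVerified: boolean;
  ageVerified: boolean;
}

// Rebuild in-memory state from the event log.
function replay(events: UserEvent[]): Map<string, UserState> {
  const users = new Map<string, UserState>();
  for (const event of events) {
    switch (event.type) {
      case "user_created":
        users.set(event.userId, { emailVerified: false, ageVerified: false });
        break;
      case "email_verified": {
        const user = users.get(event.userId);
        if (user) user.emailVerified = true;
        break;
      }
      // No handler for "user_age_verified": the event is in the log but is
      // ignored on replay, so the flag does not survive a reload.
      default:
        break;
    }
  }
  return users;
}
```

A characterization test asserts `ageVerified === false` after replay, which documents the current behavior without endorsing it.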
32 tests covering the grant-related public API of Database:
- recordGrantInfo: mint increases balance, burn decreases, send transfers,
  tally adds; multiple mints aggregate, different currencies tracked independently
- Balance goes negative on excessive burn (no enforcement)
- Zero-amount mint is a no-op
- Currency migration: opus→claude3opus, sonnets→old_sonnets
- Undefined currency defaults to 'credit'
- Grant details normalized (string→number coercion)
- recordGrantCapability: grant/revoke, latest-wins, expiry enforcement
- userHasActiveGrantCapability: active/revoked/expired/no-expiry/nonexistent
- getUserGrantSummary: returns totals + infos + capabilities; empty for fresh user
- Invite system: create, validate, claim (mints credits), maxUses enforcement,
  expired rejection, duplicate code rejection
- Event replay: minted grants, capabilities, and burn balance all survive reload

Mutation tested: updateGrantTotals mint delta sign flip, migrateCurrencyName
skip, capabilityIsActive always-true — all caught.
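
Several grant behaviors listed above (currency migration, the 'credit' default, and unenforced negative balances) fit in a few lines. This is a sketch under assumptions: `applyGrant` and the totals map are hypothetical, and only the mint and burn cases are modeled.

```typescript
type GrantType = "mint" | "burn";

// Renames quoted in the commit message above.
const CURRENCY_RENAMES = new Map<string, string>([
  ["opus", "claude3opus"],
  ["sonnets", "old_sonnets"],
]);

// An undefined currency falls back to 'credit'; legacy names are migrated.
function migrateCurrencyName(currency?: string): string {
  const name = currency ?? "credit";
  return CURRENCY_RENAMES.get(name) ?? name;
}

// Hypothetical totals update: mint adds, burn subtracts, with no lower bound.
function applyGrant(
  totals: Map<string, number>,
  type: GrantType,
  amount: number,
  currency?: string,
): void {
  const key = migrateCurrencyName(currency);
  const delta = type === "mint" ? amount : -amount;
  totals.set(key, (totals.get(key) ?? 0) + delta);
}
```

Flipping the sign of the mint delta, as in the mutation above, would make the first balance assertion negative and fail immediately.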
41 tests covering the message and branching public API (MOST CRITICAL):
- createMessage: creates with correct fields, UUID, activeBranchId, parentBranchId
- getConversationMessages: returns messages sorted by tree order
- Linear conversation (A→B→C): correct ordering and parent chain
- Single branch (A→B1, A→B2): two branches on same message, active branch
  defaults to newest, setActiveBranch switches between them
- Nested branches: multi-level tree (A→B1→C1, A→B1→C2, A→B2), switching
  between branches at different levels
- addMessageBranch: edit-creates-new-branch semantics, preserveActiveBranch flag
- setActiveBranch: switches active, returns false for nonexistent branch/message
- deleteMessage: removes from conversation, doesn't affect siblings
- deleteMessageBranch: preserves sibling branches, deletes entire message if
  only branch, cascade-deletes descendants, switches active when deleting active
- Post-hoc operations: hide and edit with operation metadata
- getMessage / updateMessage: CRUD operations
- hiddenFromAi flag stored correctly
- Tree ordering: parents always before children
- Event replay: messages, branches, and deletions all survive DB reload
- Edge cases: nonexistent conversation throws, attachments, auto-parent linking,
  root parentBranchId for first message, creationSource stored on branches

Mutation tested: addMessageBranch activeBranchId update, setActiveBranch
nonexistent-branch return, createMessage auto-parent linking — all caught.
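
The "parents always before children" invariant above can be checked with a single pass over the message list. The `Branch` and `Message` shapes below are simplified assumptions; the use of a 'root' parentBranchId for the first message follows the edge case noted in the commit message.

```typescript
interface Branch {
  id: string;
  parentBranchId: string; // "root" for branches of the first message
}

interface Message {
  id: string;
  activeBranchId: string;
  branches: Branch[];
}

// True when every branch's parent appears in an earlier message (or is "root").
function parentsBeforeChildren(messages: Message[]): boolean {
  const seen = new Set<string>(["root"]);
  for (const message of messages) {
    for (const branch of message.branches) {
      if (!seen.has(branch.parentBranchId)) return false;
    }
    for (const branch of message.branches) {
      seen.add(branch.id);
    }
  }
  return true;
}
```

The example tree A→B1, A→B2, B1→C1 passes in the order [A, B, C] and fails if C is listed first.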
26 tests covering collaboration shares, permission levels (viewer/
collaborator/editor), revocation, public shares (SharesStore),
and event replay persistence. Mutation tested canUserAccessConversation
owner check, canUserChatInConversation permission bypass, and
revokeCollaborationShare no-op — all caught.
…t/78% branch)

Mock outermost layer (DB, providers, API key manager, model loader) and
let real InferenceService logic run. Covers:
- determineActualFormat: standard/prefill/messages/completion routing
- modelSupportsPrefill / providerSupportsPrefill
- applyPostHocOperations: hide, hide_before, edit, hide_attachment, unhide
- formatMessagesForConversation: standard, prefill, messages modes
- consolidateConsecutiveMessages: bedrock alternating turns
- truncateMessagesToFit: head truncation, oversized messages, multimodal
- createMessagesModeChunkHandler: name prefix stripping
- parseThinkingTags: think block extraction
- streamCompletion: provider routing, stop sequences, thinking mode,
  rate limits, API key management, custom endpoints, usage tracking
- buildPrompt: full pipeline integration
- Mutation tested: 4 mutations on 4 methods, all caught
Install supertest, create shared test helper (createTestApp with real Database
in temp dir), write 28 auth tests (register, login, profile, api-keys, grants,
user lookup, forgot/reset password) and 29 conversation tests (CRUD, archive,
messages, metrics, export, duplicate, UI state, mark-read, permission checks).
Auth: 71% stmts / 65% branch (58 tests)
- Add user-not-found profile test, mixed API key listing masking
- Add grant send with default currency/reason, invite code claim path
- Add password reset flow exercise

Conversations: 75% stmts / 65% branch (112 tests)
- Add successful post-hoc delete, hide_attachment operation type
- Add fork truncated mode, delete post-hoc non-owner check
- Add UI state clearing (empty/null values), branch privacy not-found
- Add duplicate with options, create validation, post-hoc with reason
Add integration tests for participants, bookmarks, models, site-config,
and system routes. All files exceed 70% statement / 65% branch coverage:

- participants.ts: 76.54% stmts / 71.05% branch (20 tests)
- bookmarks.ts: 78.26% stmts / 75% branch (13 tests)
- models.ts: 79.68% stmts / 83.33% branch (12 tests)
- site-config.ts: 78.94% stmts / 100% branch (6 tests)
- system.ts: 80% stmts / 100% branch (3 tests)

Key testing techniques:
- ConfigLoader injection for admin provider detection branches
- Pre-populated OpenRouter pricing cache for cache-hit path
- Admin user (cassandra) grants minting for currency coverage
- Custom middleware injection for site-config admin check branches
- Demo user login for user-defined model by ID tests
…tmt coverage)

Add comprehensive characterization tests for AnthropicService.streamCompletion covering:
- Request parameter building (model, temperature, top_p/top_k exclusivity, stop sequences)
- Thinking configuration and max_tokens adjustment for budget
- System prompt caching when _cacheControl is present
- Streaming event handling (text deltas, thinking blocks, redacted thinking, signatures)
- Cache metrics extraction from message_start events
- Error handling with failure metrics recording
- Demo mode simulation
- llmLogger integration (request/response/cache metrics logging)
- Edge cases: error chunks, stop sequences, thinking-only responses
…tmt coverage)

Add comprehensive characterization tests for BedrockService.streamCompletion covering:
- Claude 3 Messages API streaming (content_block_delta events, message_stop)
- Claude 2 legacy streaming (completion field, stop_reason)
- Request body construction for both API formats
- InvokeModelWithResponseStreamCommand parameter verification
- Error handling (empty response body, API errors, non-Error throws)
- llmLogger integration (request/response logging, error logging)
- Demo mode simulation (word-by-word streaming, completion signaling)
- rawRequest return value structure
…mt coverage)

Add comprehensive characterization tests for GeminiService covering:
- streamCompletion: SSE stream parsing, text/thinking/image content handling
- Request building: generationConfig (temp, topP, topK, maxOutputTokens, stopSequences)
- System instruction, thinking config, Google Search tool, response modalities
- Image generation: inlineData handling, preview-to-final replacement, blob storage
- Thought signature capture and propagation to content blocks
- Error handling: HTTP errors, malformed JSON, no response body, failure metrics
- generateContent (non-streaming): text, thinking, image generation, tool config
- Usage metadata extraction with defensive defaults
… coverage)

Coverage: 22.3% → 96.15% stmts, 90.47% branches. Tests cover
streamCompletion (request building, SSE parsing, thinking blocks,
thought signatures, image generation with blob storage, error handling,
usage metrics) and generateContent (non-streaming path, thinking,
images, system instructions, tool configs).
…s, 98% stmt coverage)

Coverage: 32.1% → 98.21% stmts, 95.77% branches. Tests cover
streamCompletion (request building, SSE parsing, thinking tag extraction,
usage/token tracking, error handling, llmLogger integration),
listModels (success, error, missing data), and validateApiKey
(models fallback, auth status codes, network errors).
… stmt coverage)

Add comprehensive characterization tests for OpenRouterService covering:
- streamCompletion: request building, headers, Anthropic provider forcing,
  thinking/reasoning support (max_tokens adjustment), SSE streaming and
  content assembly, usage/token tracking with cache metrics, all 3
  reasoning field formats (reasoning_content, reasoning, reasoning_details)
  with priority ordering, image generation (delta.images, message.images,
  inlineData), blob replacement with old blob cleanup, error handling
  (HTTP errors, null body, network failures, failure metrics estimation)
- streamCompletionExactTest: non-streaming request, Anthropic provider
  config, content delivery, cache token calculation, error paths
- listModels: API fetching, error handling, missing data field
- validateApiKey: success/failure paths, key passthrough
- constructor: API key fallback chain, missing key warning

Coverage: 30.0% → 96.47% stmts, 87.8% branches
…mt/70% branch)

Covers connection/auth, chat flows (standard + prefill), regenerate, edit,
continue, delete, abort, room management, credit checks, content filtering,
error handling, hiddenFromAi, and parallel sampling. Mutation tests validate
userHasSufficientCredits, handleAbort, handleDelete, and filterHiddenFromAi.
…77% branch)

Comprehensive characterization tests for the Vue reactive store covering:
- Authentication (login, logout, register, loadUser)
- Message visibility (getVisibleMessages with caching, branch following)
- Branch switching (single, batch, cascade, detached mode)
- WebSocket event handlers (message_created, stream, message_edited,
  message_deleted, message_restored, message_split, branch_visibility)
- loadConversation (with detached mode, read state flush, retry)
- loadMessages, sendMessage, continueGeneration deeper paths
- Conversation CRUD (create, update, archive, duplicate, compact)
- Model management (load, custom CRUD, OpenRouter)
- Read tracking (mark as read, debounced persist, unread counts)
- Mutation tests on 4 methods (getVisibleMessages, switchBranch,
  setDetachedMode snapshot/restore)
Update completed work section with Tiers 1-3+ results (~2300 tests).
Add Tier 4: remaining 11 routes, DB utilities, context manager,
site-config-loader (8 new tasks, 35-42).
Admin tests (58 tests, 80.6% stmt / 68% branch):
- requireAdmin middleware (401/403 for unauth/non-admin)
- GET /admin/users, GET /admin/users/:id
- POST /admin/users/:id/capabilities (grant/revoke, validation)
- POST /admin/users/:id/credits (amount/currency validation)
- POST /admin/users/:id/reload
- GET /admin/stats, usage endpoints (user/system/model)
- Config management (GET/PATCH /admin/config, reload, models visibility)
- Bulk admin ops (verify-legacy-users, set-all-age-verified, set-all-tos-accepted)
- GET /admin/conversation-size/:id

Personas tests (66 tests, 76.4% stmt / 75% branch):
- CRUD (create, list, get, update, delete with permission checks)
- Archive persona
- History branches (list, fork, set head)
- Join/leave conversation (with roomManager mock)
- Participations listing with branchId filter
- Canonical branch and logical time updates
- Sharing (create, update permission, revoke, access verification)
- Permission hierarchy (owner > editor > user > viewer)
…oute tests

- collaboration.test.ts (42 tests): public invite lookup, share CRUD,
  permission checks, invite creation/claiming/deletion, shared-with-me,
  my-permission endpoint, access control for viewers vs editors
- invites.test.ts (18 tests): create invite with/without mint capability,
  auto-generated and custom codes, duplicate rejection, expiration,
  max-uses enforcement, public code validation, claim flow
- import.test.ts (35 tests): preview and execute for basic_json,
  anthropic, arc_chat, and chrome_extension formats; branch import,
  orphan filtering, participant mapping, system message handling,
  messages-raw endpoint validation and import
- shares.test.ts (19 tests): create tree and branch shares, public
  token retrieval with sanitized data, settings (model info, timestamps,
  download), user share listing, deletion with auth checks
- custom-models.test.ts (35 tests): full CRUD with Zod validation,
  user isolation, localhost HTTP vs external HTTPS enforcement,
  private IP rejection (10.x, 172.16-31.x, 192.168.x, 169.254.x),
  test endpoint error paths (unsupported provider, missing API key,
  missing endpoint, unreachable server)
- Extended test-helpers.ts to mount collaboration, invites, import,
  shares, and custom-models routes

Coverage: collaboration 84%/95%, invites 81%/74%, import 82%/67%,
shares 88%/89%, custom-models 65%/61% (test endpoint requires
external services for full coverage)
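
The private IP ranges listed for custom-models (10.x, 172.16-31.x, 192.168.x, 169.254.x) can be expressed as a small predicate. This is a sketch, not the route's actual validator; `isPrivateIPv4` is a hypothetical name and only dotted-quad IPv4 input is handled.

```typescript
// Returns true for IPv4 addresses in the private or link-local ranges
// named above. Anything that is not a dotted quad returns false.
function isPrivateIPv4(host: string): boolean {
  const parts = host.split(".").map((part) => Number(part));
  const valid =
    parts.length === 4 &&
    parts.every((p) => Number.isInteger(p) && p >= 0 && p <= 255);
  if (!valid) return false;
  const a = parts[0] ?? -1;
  const b = parts[1] ?? -1;
  return (
    a === 10 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}
```

Tests at the edges of the 172.16-31.x range (172.15, 172.16, 172.31, 172.32) are the ones most likely to catch an off-by-one in the comparison.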
Route tests: avatars (25), blobs (10), prompt (9), public-models (10)
DB utility tests: compaction (17), migration (17), fix-branches (5)
Service tests: context-manager (36)
Config tests: site-config-loader additions to loader.test.ts (13)
…d conversations route

- Add index.config.test.ts (91 tests): custom model CRUD, API key CRUD, admin stats,
  conversation CRUD, bookmarks, metrics, usage stats, collaboration invites, participants
- Add index.search.test.ts (85 tests): branch operations, post-hoc operations, restore,
  delete cascade, archive, events, duplicate, import, update, UI state, event replay,
  usage aggregation, collaboration access, grant summary
- Extend conversations.test.ts (+29 tests): restore message/branch success paths,
  split message, delete non-posthoc, fork with prefill/bookmarks/contentBlocks/
  multi-branch/private-branches, backfill with shared conversations, compact by admin,
  Zod validation errors, detachedBranch UI state, subtree with children

Coverage improvements:
- database/index.ts: 48.96% → 72.70% branches (+23.74%)
- routes/conversations.ts: 65.49% → 75.31% branches (+9.82%)
- Overall backend: 76.79% → 77.58% branches
Task anima-research#3: Add enhanced-inference.test.ts (98.56% stmts / 91.27% branches)
Task anima-research#4: Improve handler.test.ts branch coverage (70.08% → 79.02%)
Task anima-research#5: Improve remaining gap files coverage:
  - context-manager.ts: 68.8% → 87.20% stmts, 85.05% branches
  - avatars.ts: 58.7% → 84.8% branches (upload tests, GIF handling, auth checks)
  - custom-models.ts: 61.5% → 70.8% branches (endpoint validation, auth checks)
  - import.ts: 66.5% → 76.1% branches (branching, arc_chat edges, auth, errors)

All 2264 tests pass across 54 test files.
- Add GitHub Actions workflow to run tests with coverage on PRs
- Remove test plan (all 47 tasks complete, 75% coverage target achieved)
@greptile-apps

greptile-apps Bot commented Feb 12, 2026

Greptile Overview

Greptile Summary

This PR adds a comprehensive test suite achieving 87.66% statement and 78.92% branch coverage across 2,469 tests in 56 test files. The tests are characterization tests that capture existing behavior without modifying any source code.

Major additions:

  • CI workflow (.github/workflows/test.yml) runs tests with coverage on every PR
  • Complete backend test coverage across all routes, services, database operations, and WebSocket functionality
  • Frontend store tests with proper mocking of external dependencies
  • Shared type validation tests ensuring Zod schema correctness
  • Test infrastructure using Vitest with isolated database setup via temporary directories and process.chdir
  • Documentation in CLAUDE.md for testing commands and workflows

Test quality highlights:

  • Proper test isolation using temp directories and cleanup
  • Comprehensive mocking of external services (AI providers, sharp, AWS SDK)
  • Integration tests using real database instances with supertest for HTTP testing
  • Edge case coverage including security scenarios (tampered JWT tokens, invalid inputs)
  • Large test files indicate thorough coverage (e.g., 5453 lines for WebSocket handler, 2158 for Anthropic service)

Coverage areas:

  • Config loaders: 98.6% statements
  • Database operations: 89.3% statements including the massive 5400-line index.ts
  • All 18 route files tested with realistic scenarios
  • All 5 AI provider services at 93-99% coverage
  • WebSocket streaming and room management: 94.5% statements

Confidence Score: 5/5

  • This PR is safe to merge with very high confidence - it only adds tests and CI configuration without modifying any production code
  • Score of 5/5 because: (1) Zero production code changes - all additions are test files, test configuration, and documentation, (2) Tests follow best practices with proper isolation, mocking, and cleanup, (3) CI workflow is well-configured with appropriate Node version pinning and coverage reporting, (4) Test infrastructure uses established patterns (Vitest, supertest, temp directories for isolation), (5) Comprehensive coverage across all critical systems reduces regression risk for future changes
  • No files require special attention - all test infrastructure and configuration files are well-implemented

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/test.yml | Adds CI workflow to run tests with coverage reporting on PRs - well-configured with proper workspace builds and coverage thresholds |
| deprecated-claude-app/backend/src/routes/test-helpers.ts | Test infrastructure with isolated database setup using temp directories and process.chdir for complete isolation |
| deprecated-claude-app/backend/src/services/anthropic.test.ts | Extensive Anthropic service tests (2158 lines) with proper mocking and comprehensive coverage of message formatting and attachments |
| deprecated-claude-app/backend/src/services/inference.test.ts | Large inference orchestration test suite (2038 lines) testing multi-provider routing and context management |
| deprecated-claude-app/backend/src/websocket/handler.test.ts | Massive WebSocket handler test suite (5453 lines) covering streaming, room management, and message routing |
| deprecated-claude-app/frontend/src/store/index.test.ts | Frontend store tests (1882 lines) with proper mocking of localStorage, API, and WebSocket services |
| CLAUDE.md | New documentation file providing comprehensive guidance for working with the codebase including testing commands |

@Meganeuridae

Read through this carefully and wanted to share findings + a recommendation. Short version: the test suite is genuinely good and worth merging, with one realistic caveat — it was authored against a Feb 2026 snapshot of main and a chunk of the assertions now capture previous behavior that intentional changes have since superseded.

What's solid

  • Vitest + @vitest/coverage-v8 — right tool for an ESM TS monorepo. Per-workspace config plus a root project config that aggregates is the clean shape.
  • test-helpers.ts design: createTestApp() spins up a real Database backed by a per-test temp dir, mounts the actual routers, exposes a supertest agent. That means tests catch real persistence + wiring bugs, not just mock-shape regressions. cleanupTestApp properly restores cwd and removes the temp dir.
  • CI workflow runs on PRs with coverage targets (75% stmts + branches) and a markdown summary. The targets are visible in the GitHub Actions summary tab on every PR — much better than coverage-as-vibes.
  • Coverage breadth — middleware 100%, services 93–99% per-provider, routes 82.4%. Test types are characterization-style (lock down observed behavior) which is the right call for a codebase without prior tests: it gives a regression net first, leaves spec-style "tests as documented intent" as a separate later effort.
  • No source modifications — the PR description's claim holds; the test suite is purely additive against src/.
  • CLAUDE.md is also a nice add — an accurate architectural overview that helps any Claude instance working on the repo orient quickly.
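
The temp-dir lifecycle described in the test-helpers.ts bullet can be sketched roughly as follows. This is a minimal illustration using only Node's standard library, not the PR's actual helper: `enterTempCwd`, `leaveTempCwd`, and the `TempCwd` shape are hypothetical names, and the real createTestApp() additionally builds the Database, routers, and supertest agent.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical handle; the real helper returns more (app, agent, db).
interface TempCwd {
  dir: string;         // per-test temp directory
  previousCwd: string; // cwd to restore during cleanup
}

// Create a fresh temp directory and make it the process cwd, so code
// that resolves relative data paths ends up writing inside it.
function enterTempCwd(prefix = "test-db-"): TempCwd {
  const previousCwd = process.cwd();
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), prefix));
  process.chdir(dir);
  return { dir, previousCwd };
}

// Restore the original cwd first, then delete the temp directory.
function leaveTempCwd(handle: TempCwd): void {
  process.chdir(handle.previousCwd);
  fs.rmSync(handle.dir, { recursive: true, force: true });
}
```

Restoring cwd before removing the directory matters: deleting the directory a process is currently sitting in fails on some platforms and leaves later relative paths dangling on others.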

Merge state

I rebased onto current main in a local sandbox to assess. Conflicts:

  • backend/package.json: relative to main, Quiterion's branch is missing express-rate-limit and the import:claude-archive script, both of which landed in main after the branch was cut. The resolution is mechanical: keep main's deps + script, add Quiterion's test deps + scripts.
  • package-lock.json — derived; resolved by npm install after fixing package.json.

The other 69 files apply clean — no source-code conflicts at all.

What the test suite catches: 45 failures, 2402 passes

After merging onto current main and running JWT_SECRET=… NODE_ENV=test npx vitest run:

Test Files  11 failed | 45 passed (56)
Tests       45 failed | 2402 passed | 22 skipped (2469)

97.3% of the suite still captures valid behavior. Every failure I recognized traces to a PR that landed after this branch was cut:

| Failing test cluster | Caused by | Resolution |
| --- | --- | --- |
| services/anthropic.test.ts, bedrock.test.ts, context-strategies.test.ts: "does NOT recognize gif as image" | PR #90 (uniform GIF support across providers) | Update assertions: gif IS now recognized |
| database/collaboration.test.ts, shares.test.ts: token shape/length asserts | PR #92 (token entropy bumped 48-bit → 128-bit) | Update length expectations |
| services/enhanced-inference.test.ts, gemini.test.ts: usage shape, NaN-defaults | PR #104 (four-channel cost tracking) | Update usage-shape fixtures |
| utils/encryption.test.ts: "uses a default key when JWT_SECRET is unset" | PR #91 (JWT_SECRET strict) | Test the new throw-on-unset behavior |
| routes/auth.test.ts: Grant Mint/Send, Forgot/Resend/Verify, Registration | Multiple auth-flow PRs | Re-snapshot against current responses |
| routes/conversations.test.ts: non-owner permission rejections, admin compact | Permission changes since | Re-snapshot rejection paths |
| websocket/handler.test.ts: handles join_room, broadcasts typing event | handler.ts churn | Investigate (could be intentional, could be a regression) |

The pattern is what characterization tests are supposed to do: surface behavior changes between snapshots. None of the failures I traced look like regressions — they look like "main legitimately moved and the snapshot needs refreshing," with the possible exception of the websocket handler ones, which I'd want to look at individually before assuming intentional.
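
To make the token-length row concrete, here is the arithmetic behind a length assertion going stale when entropy moves from 48 to 128 bits. This is a sketch under a stated assumption: `makeShareToken` is a hypothetical generator and hex encoding is assumed purely for illustration; the encoding the repo actually uses is not shown in this thread, so the real expected lengths may differ.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical generator: hex encoding is an assumption, not the repo's API.
function makeShareToken(bits: number): string {
  if (bits % 8 !== 0) throw new RangeError("bits must be a multiple of 8");
  return randomBytes(bits / 8).toString("hex"); // 2 hex chars per byte
}

const oldToken = makeShareToken(48);  // 6 bytes  -> 12 hex chars
const newToken = makeShareToken(128); // 16 bytes -> 32 hex chars

// A characterization test written against the old behavior would have
// pinned length 12; after the entropy bump only length 32 is valid.
console.log(oldToken.length, newToken.length);
```

A test that pinned the old length fails after the bump even though nothing is broken, which is the "snapshot needs refreshing" case rather than a regression.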

One small concern in test-helpers.ts

createTestApp() uses process.chdir() to position Database init against the per-test temp dir. That's a process-global side effect — if two tests' createTestApp() calls interleave (vitest can parallelize), they'll race on cwd and one Database may end up rooted in the wrong temp dir. Easy fix would be passing a base path into Database directly (one-line API change), or restricting test-file concurrency with a vitest config option. Not blocking; flagging because it's the kind of thing that would produce flaky failures rather than consistent ones.
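
The first fix, passing a base path in directly, might look like the sketch below. `FileStore` is a hypothetical stand-in for the real Database class (whose constructor signature is not shown here), and `createIsolatedStore` is likewise an illustrative name. The point is only that each instance is rooted in its own directory without touching process-global state.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical stand-in for Database: the root directory is a constructor
// argument instead of being derived from process.cwd().
class FileStore {
  private readonly baseDir: string;

  constructor(baseDir: string) {
    this.baseDir = baseDir;
    fs.mkdirSync(baseDir, { recursive: true });
  }

  write(name: string, value: unknown): void {
    fs.writeFileSync(path.join(this.baseDir, name), JSON.stringify(value));
  }

  read(name: string): unknown {
    return JSON.parse(fs.readFileSync(path.join(this.baseDir, name), "utf8"));
  }
}

// Each caller gets its own temp directory; cwd is never changed, so
// interleaved setups cannot race on it.
function createIsolatedStore(): { store: FileStore; dir: string; cleanup: () => void } {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "store-"));
  return {
    store: new FileStore(dir),
    dir,
    cleanup: () => fs.rmSync(dir, { recursive: true, force: true }),
  };
}
```

For the second fix, Vitest's `fileParallelism` config option (set to `false`) is one candidate for restricting test-file concurrency, at the cost of slower runs. Whether the race can occur at all depends on which Vitest pool the project uses, so that is worth checking before changing either.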

Suggested path forward

Three options ranging from most-Quiterion-effort to most-already-done:

  1. Rebase + Quiterion refreshes the 45 failing tests. Cleanest, preserves their authorship.
  2. Rebase + I do the test refresh as a follow-up commit on this branch. Faster to merge; I have direct context on most of the changes that broke each test (several were my PRs). I'd push and Quiterion + Antra would review.
  3. Merge as-is, fix failures in a follow-up PR. Worst: it would let CI ship in a state where the test job is red on every PR until #2 lands, defeating the point.

My instinct is (2) — I introduced or shipped many of the post-Feb changes that broke the tests, so I can both rebase and update the assertions with high confidence about whether each new behavior is intentional vs a regression worth flagging. I'd then post a per-test summary on the rebase commit so Quiterion + Antra can sanity-check the intent calls. Happy to defer to either of you if you'd rather a different shape.

Either way: thanks for building this. It's a lot of careful work and the design is the right shape for the codebase.

@Meganeuridae

Followed through on Option 2 from the earlier review: rebased onto current main and refreshed the 45 assertions broken by intentional behavior changes since the Feb 2026 snapshot. All 3057 tests pass (backend 2468 + frontend 322 + shared 267, 100% green).

New PR: #112

Each refresh commit there carries per-test intent-tagging — which PR changed the assertion, whether the new behavior is intentional, and why the test update is the right move. Full Co-Authored-By attribution to @Quiterion preserved on every commit (the framework, design, and 2,469 tests are all yours).

Once #112 lands, #69 can close.
