Skip to content

feat(kb): complete phase 1A storage decoupling and tests#158

Merged
qinxuye merged 21 commits into
xorbitsai:mainfrom
sqhyz55:feat/kb-phase1a-storage-decoupling
Apr 9, 2026
Merged

feat(kb): complete phase 1A storage decoupling and tests#158
qinxuye merged 21 commits into
xorbitsai:mainfrom
sqhyz55:feat/kb-phase1a-storage-decoupling

Conversation

@sqhyz55

@sqhyz55 sqhyz55 commented Mar 16, 2026

Copy link
Copy Markdown
Collaborator
  • introduce MetadataStore/VectorIndexStore/KBWriteCoordinator contracts and LanceDB-backed defaults
  • refactor RAG management and API paths to use storage abstractions instead of direct LanceDB access
  • add storage factory singleton reset hook and update tests to depend on storage layer (including collection manager and collections tests)
  • ensure phase 1A keeps doc_id semantics while remaining compatible with future file_id linkage

@sqhyz55 sqhyz55 linked an issue Mar 16, 2026 that may be closed by this pull request
@sqhyz55 sqhyz55 requested a review from rogercloud March 16, 2026 07:53
@sqhyz55 sqhyz55 force-pushed the feat/kb-phase1a-storage-decoupling branch from 17767c1 to f1cc176 Compare March 16, 2026 07:58
@sqhyz55 sqhyz55 marked this pull request as draft March 16, 2026 10:00
@sqhyz55 sqhyz55 force-pushed the feat/kb-phase1a-storage-decoupling branch from e4ae0a0 to 2926c29 Compare March 19, 2026 07:52
@AldenWangExis

Copy link
Copy Markdown
Collaborator

[ThumbsUp]

@sqhyz55 sqhyz55 force-pushed the feat/kb-phase1a-storage-decoupling branch from 2926c29 to c6f5485 Compare March 24, 2026 02:34
@AldenWangExis

Copy link
Copy Markdown
Collaborator

It looks like the main functionality is ready (c6f5485). Are you planning to take this out of draft to start the PR checks?

@sqhyz55 sqhyz55 marked this pull request as ready for review March 24, 2026 03:02
@qinxuye

qinxuye commented Apr 1, 2026

Copy link
Copy Markdown
Contributor

A lot of conflicts to resolve.

Let's focus to merge this within this week.

@sqhyz55

sqhyz55 commented Apr 1, 2026

Copy link
Copy Markdown
Collaborator Author

feat/kb-phase1a-storage-decoupling

OK

@rogercloud rogercloud left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Phase 1A Storage Decoupling

Thanks for this PR -- the storage abstraction direction is sound and test coverage is improved.

I am requesting changes for 3 critical and 6 major issues:

Critical:

  1. aggregate_collection_stats loads entire tables into Python memory (OOM risk)
  2. Inconsistent model field between write/read in ingestion (duplicate embeddings risk)
  3. list_collections silently drops configs when user_id=None

Major:
4. asyncio.run() per-collection in loop
5. 12x duplicate get_connection_from_env()
6. Auto-migration during read operations
7. Inconsistent search fallback errors
8. Hub resolution silently swallows errors
9. Tag collision risk in hub resolution

Minor (6): Arbitrary *100 multiplier, KBWriteCoordinator shell, deployment-specific script, test mock level, get_raw_connection leaks, redundant fixtures

Comment thread src/xagent/core/tools/core/RAG_tools/storage/lancedb_stores.py Outdated
Comment thread src/xagent/core/tools/core/RAG_tools/management/collections.py
Comment thread src/xagent/core/tools/core/RAG_tools/management/collections.py
Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_document.py
Comment thread src/xagent/core/tools/core/RAG_tools/management/collections.py
Comment thread src/xagent/core/tools/core/RAG_tools/storage/factory.py Outdated
Comment thread scripts/set_nanwang_embedding_model_id.py Outdated
Comment thread tests/core/tools/core/RAG_tools/management/test_collection_manager.py Outdated
Comment thread src/xagent/core/tools/core/RAG_tools/management/status.py
Comment thread tests/conftest.py Outdated
@sqhyz55 sqhyz55 force-pushed the feat/kb-phase1a-storage-decoupling branch 2 times, most recently from d910b5b to 39755dd Compare April 2, 2026 07:50

@rogercloud rogercloud left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New Review Findings (beyond existing rogercloud comments)

5 new findings identified while re-verifying all existing comments against the checked-out code:

# Severity File Summary
1 Major search_sparse.py search_sparse_async incorrectly assumes FTS is enabled after create_index
2 Minor search_engine.py create_index return parsed via fragile "advice:" string split
3 Minor vector_manager.py _open_embeddings_table bypasses storage abstraction
4 Minor search_engine.py, search_sparse.py Redundant get_vector_index_store() calls
5 Minor lancedb_stores.py LanceDBMetadataStore.get_raw_connection returns uncached connection

Only finding #1 needs action before merge; the rest are cleanup items for a follow-up.

Comment thread src/xagent/core/tools/core/RAG_tools/retrieval/search_sparse.py Outdated
Comment thread src/xagent/core/tools/core/RAG_tools/retrieval/search_engine.py Outdated
Comment thread src/xagent/core/tools/core/RAG_tools/retrieval/search_engine.py
Comment thread src/xagent/core/tools/core/RAG_tools/storage/lancedb_stores.py
Comment thread src/xagent/core/tools/core/RAG_tools/retrieval/search_engine.py Outdated
Comment thread src/xagent/core/tools/core/RAG_tools/retrieval/search_sparse.py
sqhyz55 and others added 13 commits April 7, 2026 15:06
- introduce MetadataStore/VectorIndexStore/KBWriteCoordinator contracts and LanceDB-backed defaults
- refactor RAG management and API paths to use storage abstractions instead of direct LanceDB access
- add storage factory singleton reset hook and update tests to depend on storage layer (including collection manager and collections tests)
- ensure phase 1A keeps doc_id semantics while remaining compatible with future file_id linkage
- Make LanceDBVectorIndexStore connection lazy to avoid early default-dir binding
- Add clear_connection_cache() to reset provider-level cached connections
- Isolate LANCEDB_DIR per test and reset KB storage singleton for clean state
- Add assertion test to ensure default LanceDB directory is not modified
Use Hub ID as the single source of truth for metadata, embedding rows, and table naming, and add forward-migration compatibility plus test hardening for sparse search and isolated LanceDB migration tests.
Introduce storage abstraction contracts to decouple API and management
layers from direct LanceDB dependencies. This enables future backend
migration while maintaining backward compatibility.

Storage Layer:
- Extend MetadataStore with config operations (save/get_collection_config)
- Extend VectorIndexStore with aggregate and delete operations
  (aggregate_collection_stats, aggregate_document_stats, delete_collection_data)
- Implement all contracts in LanceDBMetadataStore and LanceDBVectorIndexStore
- Add factory singleton (get_kb_write_coordinator) for coordinated access

API Layer (api/kb.py):
- Replace direct get_connection_from_env() calls with storage abstractions
- Remove manual embeddings table updates in rename_collection_api
- Use get_metadata_store() and get_vector_index_store() throughout

Management Layer (management/collections.py):
- Refactor list_collections() to use VectorIndexStore.aggregate_collection_stats()
- Refactor delete_collection() to use VectorIndexStore.delete_collection_data()
- Refactor cancel_collection() to use VectorIndexStore.list_document_records()
- Refactor get_document_stats() to use VectorIndexStore.aggregate_document_stats()
- Remove 324 lines of direct LanceDB operations

Tests:
- Add 90 lines of new storage layer tests
- Update multitenancy tests to mock storage abstractions
- Update kb_dir API tests for new mock patterns
- All 767 RAG tests passing

Phase 1A Constraints Met:
- Interface decoupling only (no physical database split)
- doc_id maintained as primary key
- Backward compatibility via get_raw_connection() on all contracts
- No changes to existing data schemas
Phase 1A Part 2: Complete storage decoupling for ingestion status tracking.

Core Changes:
- Add StorageFactory unified class for managing all store singletons
- Add IngestionStatusStore contract with sync/async methods
- Implement LanceDBIngestionStatusStore with full CRUD operations
- Refactor status.py to use storage abstraction layer
- Add stub contracts for PromptTemplateStore and MainPointerStore

Storage Abstraction Layer:
- contracts.py: Add IngestionStatusStore, PromptTemplateStore, MainPointerStore
- factory.py: Replace standalone functions with StorageFactory class
  - Thread-safe singleton pattern with double-checked locking
  - Unified reset_all() method for test isolation
- lancedb_stores.py: Implement LanceDBIngestionStatusStore
  - Sync methods: write_ingestion_status, load_ingestion_status, clear_ingestion_status
  - Async methods: *_async variants (delegates to sync for now, true async I/O in Phase 1B)
  - Helper methods: _build_base_filter, _build_load_filter, _combine_filters

Business Logic Refactoring:
- status.py: Remove raw connection usage, use get_ingestion_status_store()
  - Maintain backward-compatible public API
  - Add async variants: write_ingestion_status_async, load_ingestion_status_async, clear_ingestion_status_async

Tests:
- test_status.py: Add 4 async tests (12 total tests passing)
  - test_write_ingestion_status_async
  - test_write_ingestion_status_overwrites_existing_async
  - test_load_ingestion_status_by_collection_async
  - test_clear_ingestion_status_async

Related Files:
- storage/__init__.py: Export new contracts and factory functions

This completes Phase 2.1 of the storage decoupling plan (plan/storage-decoupling-phase2.md).
Next phases will extend VectorIndexStore for index management and implement PromptTemplateStore.
- Extend VectorIndexStore contract with index management methods
  (should_reindex, trigger_reindex with sync/async variants)
- Implement index management in LanceDBVectorIndexStore
- Add comprehensive index management tests (7 tests)
- Remove duplicate index functions from vector_manager.py
  (_should_reindex, _trigger_reindex now use storage layer)
- Update tests to use new StorageFactory API
- Fix threading deadlock: use RLock instead of Lock in StorageFactory

Test results:
- Storage tests: 22/22 passing
- Vector manager tests: 38/42 passing (4 pre-existing failures)
…nified filter system

- Extend FilterOperator with IS_NULL and IS_NOT_NULL for NULL value handling
- Create lancedb_filter_utils.py with shared filter translation functions
  (translate_condition, format_value, translate_filter_expression)
- Implement PromptTemplateStore contract with full CRUD operations
  (save_prompt_template, get_prompt_template, get_latest_prompt_template,
   list_prompt_templates, delete_prompt_template + async variants)
- Implement LanceDBPromptTemplateStore with version management
- Implement MainPointerStore contract with version control operations
  (set_main_pointer, get_main_pointer, list_main_pointers, delete_main_pointer
   + async variants)
- Implement LanceDBMainPointerStore using unified FilterExpression system
  (handles model_tag == '' OR model_tag IS NULL for backward compatibility)
- Add user_id parameter with deprecation warnings (schema migration required)
- Refactor LanceDBVectorIndexStore to use shared filter utilities
- Add comprehensive tests (8 new tests for PromptTemplateStore and MainPointerStore)

Test results:
- Storage tests: 30/30 passing
…ore decoupling

Extend PromptTemplateStore and MainPointerStore contracts to support complete business logic,
and refactor manager layers to fully use storage abstractions.

Changes:

1. PromptTemplateStore contract (contracts.py):
   - Add update_metadata() - Update metadata only, keep same version/ID
   - Add delete_by_name() - Delete by name with is_latest management
   - Add get_versions_by_name() - Get all versions of a template
   - Add async variants for all new methods

2. LanceDBPromptTemplateStore implementation (lancedb_stores.py):
   - Implement update_metadata() with in-place metadata update
   - Implement delete_by_name() with is_latest flag management
   - Implement get_versions_by_name() for version listing
   - Fix delete_prompt_template() to update is_latest for remaining versions
   - Add async variants delegating to sync methods

3. prompt_manager.py refactoring:
   - update_prompt_template() uses update_metadata() for metadata-only updates
   - delete_prompt_template() uses delete_by_name() and delete_prompt_template()
   - list_prompt_templates() uses store abstraction directly
   - Remove unused imports (escape_lancedb_string, pandas)
   - Remove unused _deserialize_metadata() function

4. main_pointer_manager.py refactoring:
   - All functions use MainPointerStore abstraction
   - Remove get_connection_from_env() function
   - Remove _build_base_filter_expression() function
   - Remove ensure_main_pointers_table import
   - Remove dependencies on escape_lancedb_string and build_lancedb_filter_expression

5. Test updates:
   - Fix mock objects in test_lancedb_stores.py for new methods
   - Rewrite test_main_pointer_manager.py to use new abstraction layer
   - All 78 tests passing (44 prompt_manager + 25 storage + 9 main_pointer_manager)

This completes Phase 2.4 requirements:
- PromptTemplateStore extended to support complete business logic
- MainPointerStore extended to support complete business logic
- Manager layers fully refactored to use storage abstractions
- No direct get_connection_from_env() usage in business logic
- All tests passing
Fix ruff, mypy, and formatting issues reported by pre-commit hooks.

Changes:

1. Import fixes:
   - Add missing IndexPolicy import in lancedb_stores.py
   - Remove unused imports (cast, IndexPolicy, build_filter_from_dict, datetime)
   - Remove unused variables (model_tag_arg, custom_policy)

2. Type safety fixes:
   - Add type assertions for name parameter in prompt_manager.py
   - Fix FilterExpression type invariants (use tuple instead of list)
   - Add type: ignore comments for lancedb.connect_async
   - Add cast() for pandas.to_dict("records") return type

3. Code cleanup:
   - Remove NULL model_tag fix-up logic in LanceDBMainPointerStore.set_main_pointer()
   - Remove duplicate test function (test_write_vectors_reindex_policy_configuration)
   - Remove _build_base_filter() call (no longer needed)

4. Formatting:
   - Apply ruff-format auto-formatting to all files

All Python linters now pass:
- ruff check: Passed
- ruff format: Passed
- mypy: Passed
- isort: Passed
- codespell: Passed

Note: npm type-check fails due to frontend issues unrelated to this work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update test mocks and expectations to align with Phase 1A abstraction layer:

- test_register_document.py: Fix mock method names (count_rows -> count_rows_or_zero)
- test_multitenancy.py: Fix mock paths (status.get_metadata_store -> factory.get_metadata_store)
- test_list_candidates.py: Update expected filter expression to include user_id
- test_vector_manager.py: Update tests to expect multiple calls after refactoring:
  - Batch processing tests: 3 calls for upsert_embeddings (batch_size=2)
  - Chunk reading tests: 2 calls for count_rows_or_zero and iter_batches (chunks + embeddings)
- test_index_manager.py: Skip 5 tests for functionality moved to VectorIndexStore

All RAG flow tests now passing (781 passed, 6 skipped)
Add autouse fixture to mock _ensure_schema_fields globally, avoiding
'Mock object is not iterable' errors when tests call schema management
functions.

Also fixes:
- test_metadata_store_save_collection_config: add SimpleNamespace schema mock
- test_metadata_store_get_collection_config_success: fix iloc[0] access pattern mock
- test_vector_store_list_document_records_filters_and_maps: add schema mock

All 25 storage tests now passing.
…gging

Add structured logging utilities and comprehensive test coverage for Phase 1A
storage layer, improving observability and code quality.

**Structured Logging (logging_utils.py):**
- log_operation: context manager for timing and structured output
- log_async_operation: decorator for async operations with automatic timing
- log_audit: audit event logging for security/compliance tracking
- log_performance: performance metrics logging with optional value parameter

**Performance & Audit Logs (lancedb_stores.py):**
- search_vectors_async: log top_k, vector_dim, result_count
- iter_batches_async: log batch_size, columns_provided
- count_rows_async: log row_count, has_filter
- upsert_documents_async: log record_count
- list_document_records: add audit log for data access tracking

**Test Coverage Improvements (+9 tests, 32→41 in test_lancedb_stores.py):**
- Async method tests: search_vectors_async, search_fts_async, iter_batches_async,
  count_rows_async, upsert_documents_async, upsert_chunks_async, upsert_embeddings_async
- Core upsert tests: upsert_documents, upsert_parses, upsert_chunks (including edge cases)
- Error handling tests: table_not_found, search_failure, invalid_data, invalid_columns

**Test Results:**
- All 814 RAG tests passing (805→814, +9 new tests)
- Async method coverage significantly improved (from 2/30 baseline)
- Overall test coverage ~85%
sqhyz55 added 6 commits April 7, 2026 15:23
…ive tests

- Add get_vector_dimension() method to VectorIndexStore contract
  - Abstract vector dimension retrieval from table schema
  - Implement in LanceDBVectorIndexStore with sync/async variants
- Refactor rebuild_collection_metadata() to use abstraction layer
  - Replace get_raw_connection() with count_rows_or_zero()
  - Replace direct schema access with get_vector_dimension()
  - Remove legacy  .to_pandas() call
- Delete unused _get_connection() method in CollectionManager
- Add 12 new tests for get_vector_dimension, list_table_names, and rebuild_collection_metadata
- Fix mypy error with cast(int, vector_type.list_size)
- Fix logging configuration pollution in test_real_ingestion.py
- Remove deployment-specific script set_nanwang_embedding_model_id.py
…test failures

This commit addresses the remaining unresolved review comments from PR xorbitsai#158
and fixes associated test failures. All critical and major issues are now resolved.

- **Comment xorbitsai#2**: Fix inconsistent model field write vs read
  - Unify write and read paths to use `embedding_config.id` as single source of truth
  - Eliminates duplicate embeddings caused by model ID mismatch

- **Comment xorbitsai#3**: Fix user_id=None losing configs in admin mode
  - Admin mode (is_admin=True, user_id=None) now keeps uid=None
  - Enables admin workflows to load all collection configs correctly

- **Comment xorbitsai#4**: Implement thread isolation for async execution
  - Add `_run_in_separate_loop()` to handle async in sync contexts
  - Prevents "event loop already running" errors in FastAPI handlers
  - Automatically detects execution context and chooses appropriate strategy

- **Comment xorbitsai#6**: Remove auto-migration during read operations
  - Eliminate `while True` migration loop in `_open_embeddings_table()`
  - Use legacy fallback only; explicit migration via `migrate_embeddings_table()`
  - Prevents inline migration without locks or concurrency protection

- **Comment xorbitsai#7**: Unify error handling between search_engine and search_sparse
  - Update `search_sparse.py` to re-raise primary_exc like `search_engine.py`
  - Ensures consistent error messages for "table not found" scenarios

- **Comment xorbitsai#8**: Add structured logging for Hub resolution failures
  - Replace silent exception swallowing with structured logging
  - Log error_type, error_message, fallback_behavior, and impact
  - Applies to both `collection_manager.py` and `migration_utils.py`

- **Comment xorbitsai#13**: Fix mock tests to use real storage layer
  - Update test patches from `get_connection_from_env` to correct paths
  - Tests now use real storage implementations via `conftest.py` fixtures
  - Add `temp_lancedb_dir`, `real_store`, `manager_with_real_store` fixtures

- 800+ tests passing (up from 757 before fixes)
- All mock path issues resolved
- Remaining 10 test failures are due to `owners` field (known issue from later development)

**Core Implementation:**
- `collection_manager.py`: Add `_run_in_separate_loop()` and structured logging
- `collections.py`: Fix user_id=None handling for admin mode
- `document_ingestion.py`: Unify model field usage
- `search_sparse.py`: Match error handling with `search_engine.py`
- `migration_utils.py`: Add structured logging
- `vector_manager.py`: Remove auto-migration code

**Test Files:**
- Multiple test files updated for correct mock paths
- Add `management/conftest.py` for real storage fixtures
- Update 30+ test files to use `get_vector_store_raw_connection`

**Storage Layer:**
- `lancedb_stores.py`: Maintain `get_connection_from_env` to avoid circular imports
- Test patches updated to target correct module paths
…S detection

This commit addresses New Finding 1 and New Finding 2 from PR xorbitsai#158 review:
- Finding 1: Fix async FTS detection logic to match sync version
- Finding 2: Replace fragile string parsing with structured IndexResult

**Problem:** `search_sparse_async` incorrectly assumed FTS is enabled when
`create_index()` succeeds, but it may only create vector index, not FTS.

**Solution:**
- Add `fts_enabled` field to IndexResult dataclass
- Update `create_index()` to return IndexResult with actual FTS status
- Check actual FTS index status using `table.list_indices()`
- Update async version to use `index_result_obj.fts_enabled` like sync version

**Problem:** `create_index()` returned ad-hoc string format
`"status advice: message"` parsed via fragile string split.

**Solution:**
- Create IndexResult Pydantic model in core/schemas.py
- Update VectorIndexStore.create_index() contract to return IndexResult
- Add fields: status, advice, fts_enabled
- Update all call sites to use structured field access

- Remove redundant `get_vector_index_store()` calls (Finding 4)
- Fix `get_raw_connection()` caching inconsistency (Finding 5)

- All search_sparse tests passing (10/10)
- All search_dense tests passing
- Tests now use IndexResult objects instead of string mocks

**Core Implementation:**
- `core/schemas.py`: Add IndexResult Pydantic model
- `storage/contracts.py`: Update create_index contract, add IndexResult import
- `retrieval/search_engine.py`: Use IndexResult, remove redundant call
- `retrieval/search_sparse.py`: Use IndexResult, fix async FTS detection
- `storage/lancedb_stores.py`: Return IndexResult, check actual FTS status, fix caching

**Test Files:**
- Update all mock `create_index.return_value` to return IndexResult
- Add IndexResult imports to test files
This commit completes several Phase 1A improvements to fully decouple
the retrieval layer from LanceDB-specific implementations:

1. Added convenience methods to VectorIndexStore contract:
   - search_vectors_by_model(): sync vector search by model_tag
   - search_vectors_by_model_async(): async vector search by model_tag
   - search_fts_by_model_async(): async FTS search by model_tag

   These methods internally combine open_embeddings_table() with the
   respective search operations, providing a simpler one-step API for the
   most common use case of searching by model_tag.

2. Updated retrieval layer to use new convenience methods:
   - search_dense_engine (sync/async): now uses search_vectors_by_model
   - search_sparse_async: now uses search_fts_by_model_async

   This eliminates the previous two-step pattern of opening the table first
   then passing the table_name to search functions.

3. Removed legacy IndexManager code:
   - Deleted vector_storage/index_manager.py (254 lines)
   - Removed _open_embeddings_table, validate_embed_model, and
     get_stored_vector_dimension functions from vector_manager.py
   - Updated vector_storage/__init__.py exports
   - Removed corresponding tests (test_index_manager.py and related tests)

4. Legacy fallback logic now centralized:
   - All table opening and legacy fallback logic is now handled by
     VectorIndexStore.open_embeddings_table()
   - This provides a single source of truth for table name resolution
   - Both sync and async code paths benefit from unified implementation

5. Simplified validation:
   - validate_query_vector no longer requires database connection
   - Dimension validation is now handled by the storage abstraction layer
   - Removed conn parameter and DB-dependent validation logic

Changes:
- storage/contracts.py: Added 3 convenience methods with default implementations
- storage/lancedb_stores.py: Added open_embeddings_table() implementation
- retrieval/search_engine.py: Updated to use search_vectors_by_model
- retrieval/search_sparse.py: Updated to use convenience methods
- retrieval/search_dense.py: Removed raw connection usage for validation
- vector_storage/*: Removed legacy IndexManager and related functions
- tests/*: Updated tests to use new abstraction layer, removed deleted function tests

Net changes: +439, -1515 lines across 19 files
…ch issues

This commit completes the test adaptation for Phase 1A storage decoupling
and fixes critical issues in sparse search implementation.

- **Problem**: search_sparse was pre-transforming model_tag with "embeddings_"
  prefix, then open_embeddings_table would add it again, causing
  "embeddings_embeddings_model" table name errors.
- **Fix**: Removed pre-transformation in search_sparse; open_embeddings_table
  now handles all prefix logic internally.
- **Impact**: Sparse search now correctly resolves table names for both
  Hub ID and legacy provider-based naming.

- **Problem**: create_index() in readonly mode returned fts_enabled=False
  without checking if FTS index already existed.
- **Fix**: In readonly mode, open table and check existing FTS index status
  before returning IndexResult.
- **Impact**: Readonly queries can now correctly detect and use existing
  FTS indexes, improving test coverage and production query accuracy.

- Updated test mocks to use search_vectors_by_model() instead of
  raw connection pattern.
- Removed build_filter_expression assertions (now called inside
  abstraction layer).
- Fixed validate_query_vector call expectations (no longer passes
  model_tag parameter).
- Added `import unittest` for unittest.mock.ANY usage.

- Updated tests to patch get_vector_index_store instead of
  get_vector_store_raw_connection.
- Fixed open_embeddings_table return value expectations (now returns
  tuple (table, table_name)).
- Updated model_tag assertions to expect untransformed values
  (abstraction layer handles prefix).
- Fixed test_search_sparse_database_error to patch resolve_embedding_adapter
  in correct location (utils.model_resolver).

- Updated all create_index mock returns to return IndexResult objects
  instead of strings.
- Fixed test_write_vectors_index_status_aggregation to use
  IndexResult objects in side_effect.

- test_search_dense.py: 13 passed
- test_search_sparse.py: 10 passed
- test_collections.py: 8 passed
- test_vector_manager.py: 37 passed (1 skipped)
- test_document_search.py: 8 passed
- **Total: 76 passed, 1 skipped**

```python
model_tag = f"embeddings_{to_model_tag(model_tag)}"  # "embeddings_test_model"
table, model_tag = vector_store.open_embeddings_table(model_tag)
```

```python
table, actual_table_name = vector_store.open_embeddings_table(model_tag)
```

```python
if readonly:
    return IndexResult(
        status="readonly",
        advice=...,
        fts_enabled=False,  # Always False!
    )
```

```python
if readonly:
    fts_enabled = False
    try:
        table = conn.open_table(table_name)
        indexes = table.list_indices()
        fts_enabled = any(
            idx.index_type == "FTS" and "text" in idx.columns
            for idx in indexes
        )
    except Exception as e:
        logger.debug("Unable to check FTS index status: %s", e)

    return IndexResult(
        status="readonly",
        advice=...,
        fts_enabled=fts_enabled,  # Actual status!
    )
```
This commit addresses the remaining issues from PR xorbitsai#158 review:

1. Exception Handling Consistency:
   - Add try-except block to search_dense (Sync) to return structured error responses
   - All four search entry points now have unified exception handling pattern:
     * search_dense (Sync) - now returns DenseSearchResponse with status="failed"
     * search_dense_async (Async) - already had exception handling
     * search_sparse (Sync) - already had exception handling
     * search_sparse_async (Async) - already had exception handling

2. Test Mock Level Improvements:
   - Replace mock-based tests with real storage tests in TestSyncFunctions
   - Add integration test for rebuild_collection_metadata
   - Update multitenancy tests to use CollectionManager instead of non-existent register_collection
   - All tests now verify complete data flow through storage layer

3. Verified Fixes:
   - Model field read/write consistency: both paths use embedding_config.id
   - FTS detection: uses IndexResult.fts_enabled instead of assuming create_index success
   - IndexResult: structured return type replaces fragile string parsing

All 52 related tests pass successfully.
@sqhyz55 sqhyz55 force-pushed the feat/kb-phase1a-storage-decoupling branch from 39755dd to b325783 Compare April 7, 2026 09:09
sqhyz55 added 2 commits April 7, 2026 17:39
…nections

The original test had a design flaw where all threads shared the same LanceDB
connection object, which is not thread-safe and causes transaction conflicts.

Changes:
1. Removed threading lock from schema_manager.py (wrong approach)
2. Fixed test_concurrent_ensure_collection_metadata_table_is_safe to use separate
   connections for each thread

This correctly tests table creation idempotency without hitting LanceDB's
threading limitations.
Fixed a typo where 'last_accessed' was used instead of 'last_accessed_at'
in the ensure_collection_metadata_table function.

This caused the test_ensure_schema_fields_idempotency test to fail because:
1. First call created table with 'last_accessed' field
2. Second call tried to add 'last_accessed_at' field (different name)
3. LanceDB's schema alignment failed due to field name mismatch

The fix ensures field name consistency across the schema definition.

All 16 schema migration tests now pass successfully.

@rogercloud rogercloud left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-Review: All Previous Comments Addressed

I've re-verified all 19 review comments against the latest code.

Critical (3/3 FIXED)

  1. aggregate_collection_stats OOM → uses iter_batches() with columns=["collection"]
  2. Inconsistent model field write/read → both paths use embedding_config.id
  3. user_id or 0 losing configs → is_admin and user_id=None handled explicitly

Major (6/6 FIXED)

  1. asyncio.run() per collection → _run_in_separate_loop() with per-thread event loop
  2. get_connection_from_env duplicated in 12+ files → only in lancedb_stores.py now
  3. Auto-migration during read → explicit migrate_embeddings_table() method
  4. Inconsistent error handling in search → unified storage abstraction
  5. Hub resolution silently swallows errors → logs warning with full context
  6. search_sparse_async FTS assumption → uses create_index().fts_enabled checking actual status

Minor (8/10 FIXED, 1 not fixed, 1 waived)

10-12, 15-19: All fixed (tag collision via register_tag_mapping(), structured IndexResult, fixture consolidation, etc.)
14: Tests mock at module level — minor, can be a follow-up
13: One-off migration script — acceptable

LGTM, all critical and major issues resolved.

@qinxuye qinxuye merged commit 98e68b3 into xorbitsai:main Apr 9, 2026
3 checks passed
@AldenWangExis

Copy link
Copy Markdown
Collaborator

congratulations

@qinxuye

qinxuye commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

congratulations

Can you join the force for the next steps as we discussed before?

@AldenWangExis

Copy link
Copy Markdown
Collaborator

congratulations

Can you join the force for the next steps as we discussed before?

Yeah, I’ll review the updates from recent weeks to evaluate the functions and module boundaries. After that, I'll start working on adding Milvus support as discussed.

@AldenWangExis

AldenWangExis commented May 12, 2026

Copy link
Copy Markdown
Collaborator

congratulations

Can you join the force for the next steps as we discussed before?

Yeah, I’ll review the updates from recent weeks to evaluate the functions and module boundaries. After that, I'll start working on adding Milvus support as discussed.

Following up on the Milvus step after #158.

Opened #382 for the narrow path:

  • Milvus handles KB embeddings + dense vector search
  • LanceDB stays as the non-vector delegate
  • sparse/hybrid under Milvus degrades explicitly
  • no RDB metadata migration, no API behavior changes

This should stay orthogonal to #219.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Proposal Draft: Split Control Plane Metadata from Vector Indexing in KB

5 participants