This document outlines the architectural design for Endee's filtering system, covering component designs for Numeric, Category, and Boolean types, and the overarching execution strategy.
The system prioritizes Pre-Filtering followed by an adaptive search execution path.
- Filter Analysis:
  - Incoming queries (e.g., `Age: [18-25] AND City: "NY"`) are broken into atomic filter operations.
  - Cardinality Estimation: Each filter estimates its result set size (e.g., "NY" has 500 users, "Age" has 10k).
- Optimization (Cheapest First):
  - Filters are executed in order of increasing cardinality (smallest first).
  - Results are intersected (`AND`) incrementally. If the intermediate result becomes empty, execution stops early.
- Adaptive Search Path:
  - Final `RoaringBitmap` of valid IDs is passed to the Vector Search engine.
  - Small Result (< 1,000 IDs): Bypass HNSW. Fetch vectors for valid IDs directly and perform Brute Force distance calculation. This avoids graph overhead for sparse results.
  - Large Result: Filtered HNSW. Pass the Bitmap to HNSW's `searchKnn` via `BitMapFilterFunctor`.
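The execution strategy above can be sketched in a few lines. This is a minimal illustration, not Endee's implementation: Python sets stand in for Roaring bitmaps, and `AtomicFilter`, `run_filters`, and `choose_search_path` are hypothetical names introduced here. Only the 1,000-ID threshold comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set

# Threshold from the text: results under 1,000 IDs bypass HNSW.
BRUTE_FORCE_THRESHOLD = 1_000


@dataclass
class AtomicFilter:
    """Hypothetical atomic filter: a cheap size estimate plus a lazy evaluator."""
    name: str
    estimated_cardinality: int
    evaluate: Callable[[], Set[int]]  # materializes the matching IDs


def run_filters(filters: List[AtomicFilter]) -> Set[int]:
    """Intersect filters cheapest-first and stop once the result is empty."""
    if not filters:
        raise ValueError("expected at least one filter")
    result: Optional[Set[int]] = None
    for f in sorted(filters, key=lambda f: f.estimated_cardinality):
        ids = f.evaluate()
        result = ids if result is None else result & ids
        if not result:
            break  # early exit: remaining filters cannot add IDs back
    return result


def choose_search_path(valid_ids: Set[int]) -> str:
    """Pick brute force for sparse results, filtered HNSW otherwise."""
    if len(valid_ids) < BRUTE_FORCE_THRESHOLD:
        return "brute_force"
    return "filtered_hnsw"
```

Because evaluation is lazy, a filter that is never reached after an empty intermediate result costs nothing beyond its cardinality estimate.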
The Numeric filter is optimized for range queries, high compression, and sequential access.
The database (LMDB) acts as a coarse-grained B+ Tree.
- Key: `[FieldID] + [Base_Value_32bit]`.
  - Floats are mapped to lexicographically ordered integers to preserve sort order.
  - Keys are stored in Big-Endian to support native cursor iteration.
- Value (Bucket): Fixed-size block (Max 1024 unique values).
  - Summary Bitmap (Roaring): Pre-computed union of all IDs in the bucket. Used for $O(1)$ block retrieval during full overlaps.
  - Data Arrays (Structure of Arrays - SoA):
    - Values: Compressed as `uint16_t` deltas relative to the Key's `Base_Value`.
    - IDs: Raw `idInt` array, index-aligned with values.
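One common way to build such keys is sketched below. The bit-flipping trick for IEEE 754 floats is a standard technique, but whether Endee uses exactly this mapping is an assumption, as is the 32-bit width of `FieldID`. The function names are hypothetical.

```python
import struct


def float_to_sortable_u32(x: float) -> int:
    """Map a float32 to a uint32 whose unsigned order matches numeric order."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    if bits & 0x80000000:
        # Negative: flip every bit so larger magnitudes sort first.
        return bits ^ 0xFFFFFFFF
    # Non-negative: set the sign bit so these sort above all negatives.
    return bits | 0x80000000


def make_key(field_id: int, base_value: int) -> bytes:
    """Big-endian [FieldID][Base_Value] key (FieldID width of 4 bytes is assumed)."""
    return struct.pack(">II", field_id, base_value)
```

With big-endian encoding, plain byte-wise comparison of keys agrees with numeric order of the base values, which is what lets a cursor walk buckets in range order.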
- Buckets Fully Inside Selection (Middle): Use Summary Bitmap. Zero array access.
- Buckets Partially Overlapping (Edges): Scan the `Values` array (SIMD), then use the matching indices to fetch the specific `IDs`.
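The two cases can be illustrated with a small sketch. It assumes non-empty buckets whose deltas are sorted ascending, uses a Python set in place of the Roaring summary bitmap, and replaces the SIMD scan with a plain loop. `Bucket`, `query_bucket`, and `range_query` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Bucket:
    base_value: int
    deltas: List[int]  # uint16 offsets from base_value, assumed sorted ascending
    ids: List[int]     # index-aligned with deltas
    summary: Set[int] = field(init=False)  # stand-in for the Roaring summary bitmap

    def __post_init__(self) -> None:
        self.summary = set(self.ids)


def query_bucket(bucket: Bucket, lo: int, hi: int) -> Set[int]:
    """Return IDs in one non-empty bucket whose values fall in [lo, hi]."""
    first = bucket.base_value + bucket.deltas[0]
    last = bucket.base_value + bucket.deltas[-1]
    if lo <= first and last <= hi:
        # Fully inside the selection: use the summary, no array access.
        return set(bucket.summary)
    # Edge bucket: scan the values and pick IDs by matching index.
    return {
        bucket.ids[i]
        for i, d in enumerate(bucket.deltas)
        if lo <= bucket.base_value + d <= hi
    }


def range_query(buckets: List[Bucket], lo: int, hi: int) -> Set[int]:
    """Union the per-bucket results for an inclusive range."""
    out: Set[int] = set()
    for b in buckets:
        out |= query_bucket(b, lo, hi)
    return out
```

For a wide range, only the first and last buckets touched pay for a scan; every bucket in between contributes its precomputed summary.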
- Split Triggers: Count > 1024 OR Delta > 65,535.
- Sliding Split: To ensure Key Uniqueness in LMDB, splits do not strictly occur at the median. The split point "slides" right to find the first value divergence, ensuring `Key(RightBucket) != Key(LeftBucket)`.
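A minimal sketch of the split rules follows, under one interpretation: "first value divergence" is taken to mean the first index at or after the median where the value differs from its predecessor. The input is assumed to be the bucket's sorted absolute values, and the function names are hypothetical. What happens when no divergence exists to the right is not specified in the text, so the sketch simply returns `None`.

```python
from typing import List, Optional

MAX_ENTRIES = 1024   # split trigger from the text: Count > 1024
MAX_DELTA = 65_535   # split trigger from the text: Delta > 65,535


def needs_split(values: List[int]) -> bool:
    """Check both split triggers on a sorted list of absolute values."""
    if not values:
        return False
    return len(values) > MAX_ENTRIES or values[-1] - values[0] > MAX_DELTA


def sliding_split_index(values: List[int]) -> Optional[int]:
    """Start at the median and slide right until the value changes.

    Returns the first index of the right bucket, or None when every value
    from the median onward is identical (no valid split to the right).
    """
    split = len(values) // 2
    while split < len(values) and values[split] == values[split - 1]:
        split += 1
    return split if split < len(values) else None
```

Since the values are sorted, a right bucket that starts at a divergence point always has a base value strictly greater than the left bucket's, so the two keys cannot collide.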
The Category filter is optimized for exact match lookups and faceting.
- Single Value: `{"City": "NY"}`
- List Membership (`$in`): `{"City": {"$in": ["NY", "London", "Tokyo"]}}`
Utilizes Inverted Indices with Text-Based Keys to enable prefix scanning and faceting.
- Key: `[FieldName] + ":" + [Value]`.
  - Parsing Logic: The system strictly splits on the first occurrence of `:`.
  - Format: `City:New:York` is parsed as Field=`City`, Value=`New:York`.
  - Constraints: `FieldName` must not contain the `:` character (alphanumeric + underscore recommended). `Value` can contain any character including `:`.
- Value: `RoaringBitmap` (Serialized). Contains all IDs that have this attribute value.
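The key format and its parsing rule can be sketched as follows. The helper names are hypothetical; the split-on-first-colon behavior and the `FieldName` constraint come from the description above.

```python
from typing import Tuple


def make_category_key(field_name: str, value: str) -> str:
    """Build a text key, rejecting field names that contain the separator."""
    if ":" in field_name:
        raise ValueError("FieldName must not contain ':'")
    return f"{field_name}:{value}"


def parse_category_key(key: str) -> Tuple[str, str]:
    """Split on the first ':' only, so values may contain ':' themselves."""
    field_name, sep, value = key.partition(":")
    if not sep:
        raise ValueError("key has no ':' separator")
    return field_name, value
```

For example, `parse_category_key("City:New:York")` yields `("City", "New:York")`, which is why the constraint applies to `FieldName` only.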
- Exact Match: Direct Key lookup.
- $in Query:
  - Parse the list `["NY", "London"]`.
  - Perform multiple Key lookups.
  - Compute the Union of the resulting Bitmaps efficiently.
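The `$in` steps amount to a lookup per value followed by a union. In the sketch below, a plain dict stands in for the LMDB-backed inverted index and Python sets stand in for Roaring bitmaps; `lookup_in` is a hypothetical name.

```python
from typing import Dict, Iterable, Set


def lookup_in(index: Dict[str, Set[int]], field_name: str,
              values: Iterable[str]) -> Set[int]:
    """Union the posting sets of every requested value for one field."""
    result: Set[int] = set()
    for value in values:
        # A missing key contributes nothing, matching an absent attribute value.
        result |= index.get(f"{field_name}:{value}", set())
    return result
```

A real implementation would likely hand all fetched bitmaps to the bitmap library's multi-way union in one call rather than merging pairwise.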
The Boolean filter is optimized for operations on extremely dense ID sets.
It is treated as a specialized Category filter with strictly two possible keys per field.
- Keys: `[FieldName]:0` (False) and `[FieldName]:1` (True).
  - Consistent with the text-based key design (uses the `:` separator).
- Value: `RoaringBitmap`.
Boolean filters are typically low-selectivity (often matching ~50% of the DB). They are processed Last in the intersection chain unless statistics indicate high skew (e.g., Is_Active is true for 99% of data, so filtering for False is fast).
To ensure index integrity without a strict schema registry, the system adheres to First-Write Wins typing.
- Immutable Types: Once a `FieldName` is indexed with a specific type (Numeric, Category, or Boolean), that type is bound to the field.
- Validation Logic:
  - If `is_active` is first seen as Boolean, subsequent attempts to insert `is_active: "yes"` (Category) or `is_active: 1` (Numeric bucket) must be rejected.
  - This prevents storage corruption and ambiguous query parsing.
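First-Write Wins typing can be illustrated with a small in-memory registry. This is a sketch under assumptions: the type inference rules (Python `bool` to Boolean, `int`/`float` to Numeric, `str` to Category) and the `TypeRegistry` class are hypothetical, and how Endee persists the binding is not described here.

```python
from enum import Enum
from typing import Dict


class FieldType(Enum):
    NUMERIC = "numeric"
    CATEGORY = "category"
    BOOLEAN = "boolean"


def infer_type(value: object) -> FieldType:
    """Infer the index type of a value (assumed mapping, see lead-in)."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return FieldType.BOOLEAN
    if isinstance(value, (int, float)):
        return FieldType.NUMERIC
    if isinstance(value, str):
        return FieldType.CATEGORY
    raise TypeError(f"unsupported value type: {type(value).__name__}")


class TypeRegistry:
    """Binds each field to the type of its first indexed value."""

    def __init__(self) -> None:
        self._types: Dict[str, FieldType] = {}

    def check(self, field_name: str, value: object) -> FieldType:
        """Bind on first write; reject later writes of a different type."""
        seen = infer_type(value)
        bound = self._types.setdefault(field_name, seen)
        if bound is not seen:
            raise TypeError(
                f"{field_name} is bound to {bound.value}, got {seen.value}"
            )
        return bound
```

With this registry, indexing `is_active: True` first binds the field to Boolean, and a later `is_active: "yes"` or `is_active: 1` raises instead of creating a second index for the same field.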