Implement Term Frequency (TF) Support in Indexer and Ranker #24

@Digvijay-x1

Description

Currently, the search engine's ranking quality is limited because the inverted index (RocksDB) stores only a list of doc_ids per term. As a result, we assume a Term Frequency (TF) of 1 for every match, so BM25 cannot distinguish a document that contains a term once from one that contains it many times.

We need to update the Indexer to calculate and store TF, and update the Ranker to utilize it.

Tasks

  • Indexer (C++): Modify indexer/src/main.cpp to calculate TF during tokenization.
  • Indexer (C++): Update RocksDB storage format to store doc_id:tf pairs (e.g., 123:4,125:1) instead of just doc_ids.
  • Ranker (Python): Update ranker/engine.py to parse the new value format from RocksDB.
  • Ranker (Python): Update the BM25 calculation in search() to use the actual TF value.
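
A minimal sketch of the proposed value format on the Ranker side, to make the tasks above concrete. The function names (`term_frequencies`, `encode_postings`, `parse_postings`) are hypothetical and not taken from ranker/engine.py, and the encoder is only a Python illustration of what the C++ Indexer would write. The fallback to TF = 1 for entries without a `:` is an assumption that existing values are comma-separated bare doc_ids; drop it if the index is always rebuilt from scratch.

```python
from collections import Counter


def term_frequencies(tokens):
    """Count occurrences of each token in a single document."""
    return Counter(tokens)


def encode_postings(postings):
    """Serialize {doc_id: tf} as 'doc_id:tf,doc_id:tf', ordered by doc_id."""
    return ",".join(f"{doc_id}:{tf}" for doc_id, tf in sorted(postings.items()))


def parse_postings(value):
    """Parse a stored value such as '123:4,125:1' into {doc_id: tf}.

    Accepts bytes or str, since key-value stores usually return bytes.
    """
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    postings = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty values and trailing commas
        doc_id, sep, tf = entry.partition(":")
        # Assumption: an entry with no ':' is a legacy bare doc_id, so TF = 1.
        postings[int(doc_id)] = int(tf) if sep else 1
    return postings
```

Parsing into a dict keyed by doc_id keeps the lookup in `search()` simple; a list of tuples would work equally well if posting order matters.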

Acceptance Criteria

  • Indexer stores TF information for each token-document pair.
  • Ranker successfully parses the new format without crashing.
  • Search scores strictly reflect term frequency (a document with 5 occurrences of a term should rank higher than one with 1, all else being equal).
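
The last criterion can be checked against the standard BM25 term score, sketched below. This is not the existing `search()` code: the function name `bm25_term_score`, the parameter names, and the defaults `k1=1.2` and `b=0.75` are assumptions (common textbook values), and the IDF uses the widely used non-negative variant.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document's score."""
    # Inverse document frequency: rarer terms carry more weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Saturating TF component: grows with tf but levels off.
    return idf * tf * (k1 + 1) / (tf + norm)


# Same document length and corpus statistics; only TF differs.
score_tf5 = bm25_term_score(tf=5, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)
score_tf1 = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)
assert score_tf5 > score_tf1
```

Because the TF component saturates, five occurrences score higher than one but not five times higher, which is the expected BM25 behavior.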
