Summary
Compute vector embeddings for each normalized article and provide a semantic search interface to find similar articles by content.
Motivation
- Enables “More like this” recommendations and clustering of related stories.
- Powers a search experience that goes beyond keyword matching, surfacing semantically relevant results.
Scope
In scope: implementation, tests
Acceptance Criteria
- `compute_embedding(text)` returns a fixed-length float vector.
- `embed_task(article_id)` stores the embedding for the article in the vector index.
- `search_similar(query, top_k)` returns the top K most semantically similar articles.
- `embed` and `search` run without errors and print expected output.
Additional Context
- Add dependencies
  - Add `sentence-transformers` and `faiss-cpu` (or equivalent) to `/nlp/requirements.txt`.
- Core function signatures (`/nlp/core.py`)
  - `def compute_embedding(text: str) -> List[float]`
  - `def index_article(article_id: str, embedding: List[float]) -> None`
  - `def search_similar(query: str, top_k: int = 5) -> List[Dict]`
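The three signatures above could hang together as in the following sketch. It is illustrative only: a hashed bag-of-words embedding and a plain dict stand in for sentence-transformers and FAISS so the API shape is runnable on its own; the real module would load a `SentenceTransformer` model and a FAISS index instead.

```python
# Illustrative sketch of the /nlp/core.py API shape. The toy hashed
# bag-of-words embedding and the dict-based index are stand-ins for
# sentence-transformers and FAISS, respectively.
import hashlib
import math
from typing import Dict, List

EMBED_DIM = 64                       # real model: e.g. 384 for all-MiniLM-L6-v2
_index: Dict[str, List[float]] = {}  # article_id -> embedding (stand-in for FAISS)


def compute_embedding(text: str) -> List[float]:
    """Return a fixed-length, L2-normalized float vector for the text."""
    vec = [0.0] * EMBED_DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % EMBED_DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def index_article(article_id: str, embedding: List[float]) -> None:
    """Store the embedding under the article id."""
    _index[article_id] = embedding


def search_similar(query: str, top_k: int = 5) -> List[Dict]:
    """Rank indexed articles by cosine similarity to the query."""
    q = compute_embedding(query)
    scored = [{"article_id": aid, "score": sum(a * b for a, b in zip(q, emb))}
              for aid, emb in _index.items()]
    scored.sort(key=lambda d: d["score"], reverse=True)
    return scored[:top_k]
```

Because embeddings are L2-normalized, the dot product in `search_similar` is cosine similarity, which matches what a FAISS `IndexFlatIP` over normalized vectors would compute.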
- Celery task hook (`/nlp/tasks.py`)
  - Register: `@app.task def embed_task(article_id: str) -> List[float]`
  - Should call `compute_embedding`, then `index_article`.
- CLI entrypoints (`/nlp/cli.py`)
  - `python -m nlp.cli embed --article-id=<id>`
  - `python -m nlp.cli search --query="..." --top-k=5`
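The two entrypoints above map naturally onto argparse subcommands. This is a minimal sketch of the parser wiring; the actual dispatch to `embed_task` and `search_similar` is left as comments since those live in other modules.

```python
# Sketch of /nlp/cli.py: argparse subcommands matching the spec'd entrypoints.
import argparse
import json
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="nlp.cli")
    sub = parser.add_subparsers(dest="command", required=True)

    p_embed = sub.add_parser("embed", help="embed one article by id")
    p_embed.add_argument("--article-id", required=True)

    p_search = sub.add_parser("search", help="semantic search over indexed articles")
    p_search.add_argument("--query", required=True)
    p_search.add_argument("--top-k", type=int, default=5)
    return parser


def main(argv: Optional[List[str]] = None) -> None:
    args = build_parser().parse_args(argv)
    if args.command == "embed":
        # Real module: dispatch the Celery task, e.g. embed_task.delay(args.article_id).
        print(f"embedding queued for article {args.article_id}")
    else:
        # Real module: print search_similar(args.query, args.top_k) results.
        print(json.dumps({"query": args.query, "top_k": args.top_k}))


if __name__ == "__main__":
    main()
```

Note that argparse converts `--article-id` and `--top-k` to the attributes `article_id` and `top_k`, so the parsed namespace lines up with the core function parameters.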
- Tests & documentation
  - Create `/nlp/tests/test_core_embedding.py` to:
    - Assert `compute_embedding()` returns a vector of the expected dimension.
    - Assert that `search_similar()` returns a non-empty list for a sample query.
  - Create `/nlp/tests/test_embed_task.py` to:
    - Mock DB and vector store, verify `embed_task()` calls both core functions.
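The mocking pattern for the task test might look like the sketch below. A `SimpleNamespace` stands in for the real `nlp.tasks` module so the example is self-contained; in the repo you would `import nlp.tasks` and patch its attributes the same way with `unittest.mock.patch`.

```python
# Sketch of the /nlp/tests/test_embed_task.py mocking pattern. The
# SimpleNamespace below is a stand-in for the real nlp.tasks module.
import types
from unittest import mock

# Inline stand-in for nlp.tasks (real tests would import the actual module).
tasks = types.SimpleNamespace()
tasks.fetch_article_text = lambda article_id: "body"
tasks.compute_embedding = lambda text: [0.0]
tasks.index_article = lambda article_id, emb: None


def embed_task(article_id):
    """Mirrors the spec'd task: fetch, embed, then index."""
    emb = tasks.compute_embedding(tasks.fetch_article_text(article_id))
    tasks.index_article(article_id, emb)
    return emb


tasks.embed_task = embed_task


def test_embed_task_calls_core_functions():
    with mock.patch.object(tasks, "compute_embedding",
                           return_value=[0.1, 0.2]) as m_embed, \
         mock.patch.object(tasks, "index_article") as m_index:
        result = tasks.embed_task("a1")
    # The task should embed the fetched text, then index the result.
    m_embed.assert_called_once_with("body")
    m_index.assert_called_once_with("a1", [0.1, 0.2])
    assert result == [0.1, 0.2]
```

Patching at the `nlp.tasks` attribute level (rather than `nlp.core`) is deliberate: it mocks the names the task actually resolves at call time, so the test verifies the wiring without touching the model or the vector store.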
  - Update `/nlp/README.md` with:
    - Installation steps
    - How to run `embed_task` via Celery
    - CLI usage examples for `embed` and `search`