Semantic Indexing Batch 01

Author: John Mitchell (@whmatrix)
Status: SUPERSEDED (by batch-02)
Audience: Researchers / Learners
Environment: CPU sufficient (pre-built indices included)
Fast Path: See batch-02 mini-index

This repository represents early foundational work. For the current production implementation, see semantic-indexing-batch-02.

661,525 vectors · 6 datasets · e5-large-v2 · FAISS IndexFlatIP

This repository contains the final indexing outputs for 6 open datasets, processed with a GPU-based pipeline using the intfloat/e5-large-v2 model and FAISS IndexFlatIP.

All datasets were cleaned and normalized, chunked into 512–800 token segments, embedded in FP16 with batch_size=1300, and indexed into FAISS for semantic search.
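The chunking step described above can be sketched as a simple fixed-size window over a token list. This is a minimal illustration, not the pipeline's actual code: `chunk_tokens` is a hypothetical helper, whitespace splitting stands in for the real tokenizer, and only the 800-token upper bound is enforced (the source does not say how the 512-token lower bound or any overlap was handled).

```python
from typing import List


def chunk_tokens(tokens: List[str], max_tokens: int = 800) -> List[List[str]]:
    """Split a token list into consecutive windows of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Whitespace splitting is an assumption standing in for the real tokenizer.
text = "word " * 2000
chunks = chunk_tokens(text.split())
print([len(c) for c in chunks])  # [800, 800, 400]
```

Note that the final window can fall below 512 tokens, as in the output above; a real pipeline would need a policy for such remainders.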

Datasets

The portfolio_index_results/ directory contains one folder per dataset:

20_newsgroups/
simplewiki/
imdb/
stackoverflow/
ag_news/
disaster_tweets/

Each dataset folder includes: chunks.jsonl, metadata.jsonl, summary.json, vectors.index, index_info.json
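The JSON and JSONL artifacts listed above can be read with the standard library alone. The sketch below assumes nothing about field names inside the files; `read_jsonl` and `load_dataset_artifacts` are hypothetical helper names, and each record is returned as a plain dict.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def read_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_dataset_artifacts(folder: Path) -> Dict[str, Any]:
    """Load the JSON and JSONL artifacts from one dataset folder."""
    return {
        "chunks": list(read_jsonl(folder / "chunks.jsonl")),
        "metadata": list(read_jsonl(folder / "metadata.jsonl")),
        "summary": json.loads((folder / "summary.json").read_text(encoding="utf-8")),
        "index_info": json.loads((folder / "index_info.json").read_text(encoding="utf-8")),
    }
```

For example, `load_dataset_artifacts(Path("portfolio_index_results/imdb"))` would load the imdb folder. Larger datasets may be better served by iterating over `read_jsonl` directly than by materializing full lists.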

Total: 661,525 vectors across 6 datasets.

Tech Stack

Python
CUDA / PyTorch (FP16 inference)
FAISS (IndexFlatIP)
e5-large-v2 encoder
JSONL pipelines
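A note on IndexFlatIP: it ranks by raw inner product, which equals cosine similarity only when the vectors are L2-normalized. Whether the stored vectors were normalized is not stated here, so treat that as an assumption to check. The pure-Python sketch below, with hypothetical helper names, illustrates the equivalence:

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, the score an inner-product index ranks by."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed directly from the unnormalized vectors."""
    return inner_product(a, b) / math.sqrt(inner_product(a, a) * inner_product(b, b))


a, b = [3.0, 4.0], [4.0, 3.0]
# Inner product of the normalized vectors matches cosine of the originals.
print(math.isclose(inner_product(l2_normalize(a), l2_normalize(b)), cosine(a, b)))  # True
```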

Project Overview

A complete semantic indexing pipeline for large-scale text datasets. Each dataset moves through: cleaning → chunking → embedding → indexing → verification.

Deliverables

chunks.jsonl
metadata.jsonl
summary.json
vectors.index
index_info.json

Pipeline Diagram

Raw Text
   ↓
Cleaning & Normalization
   ↓
Chunking (512–800 tokens)
   ↓
Embedding (e5-large-v2 · FP16 · GPU)
   ↓
Vector Index (FAISS IndexFlatIP)
   ↓
Verification & Stats
   ↓
RAG-Ready Dataset
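The verification stage is not described in detail, but a basic consistency check could look like the sketch below. It assumes one vector per chunk and one metadata row per chunk, which is not confirmed here. `verify_counts` is a hypothetical helper; 1024 is the embedding dimension of e5-large-v2.

```python
from typing import List


def verify_counts(
    n_chunks: int,
    n_metadata: int,
    n_vectors: int,
    dim: int,
    expected_dim: int = 1024,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []
    # Assumption: the index holds exactly one vector per chunk record.
    if n_chunks != n_vectors:
        problems.append(f"chunk count {n_chunks} != vector count {n_vectors}")
    # Assumption: metadata.jsonl has exactly one row per vector.
    if n_metadata != n_vectors:
        problems.append(f"metadata count {n_metadata} != vector count {n_vectors}")
    if dim != expected_dim:
        problems.append(f"index dimension {dim} != expected {expected_dim}")
    return problems


print(verify_counts(10, 10, 10, 1024))  # []
```

With FAISS, `n_vectors` and `dim` would come from `index.ntotal` and `index.d` on the loaded index.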

Repository Structure

portfolio_index_results/
 ├── 20_newsgroups/
 ├── ag_news/
 ├── disaster_tweets/
 ├── imdb/
 ├── simplewiki/
 └── stackoverflow/

How to Use These Indexes

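import faiss
# Run from inside a dataset folder, e.g. portfolio_index_results/imdb/
index = faiss.read_index("vectors.index")
print(index.ntotal, index.d)  # vector count and embedding dimension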
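Loading the index is only the first step; a query also has to be embedded with the same model before searching. The sketch below shows one way this might look. It assumes `faiss`, `numpy`, and `sentence-transformers` are installed and that the stored vectors were L2-normalized (not confirmed here). `format_query` and `search` are hypothetical helpers; the "query: " prefix follows the usage convention documented for E5 models.

```python
from typing import List, Tuple


def format_query(text: str) -> str:
    """Prepend the "query: " prefix that E5 models expect on search queries."""
    return "query: " + text.strip()


def search(index_path: str, query: str, k: int = 5) -> List[Tuple[int, float]]:
    """Return (vector id, inner-product score) pairs for the top-k hits."""
    # Heavy dependencies are imported lazily so format_query stays usable without them.
    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    index = faiss.read_index(index_path)
    model = SentenceTransformer("intfloat/e5-large-v2")
    # Assumption: stored vectors are L2-normalized, so the query is normalized too.
    embedding = model.encode([format_query(query)], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(embedding, dtype="float32"), k)
    # FAISS pads missing results with id -1; drop those.
    return [(int(i), float(s)) for i, s in zip(ids[0], scores[0]) if i != -1]
```

A call such as `search("portfolio_index_results/imdb/vectors.index", "films about time travel")` would return vector ids and scores. Mapping an id back to a row of chunks.jsonl assumes the rows are stored in index order, which should be checked against index_info.json.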

What's Actually In This Repository

Included

- portfolio_index_results/: pre-built FAISS indices for all 6 datasets
- Each dataset folder contains: chunks.jsonl, metadata.jsonl, summary.json, vectors.index, index_info.json
- Indices are ready to load and query immediately (see "How to Use These Indexes" above)

Not Included

Source datasets (publicly available; see dataset documentation)
Indexing scripts (see semantic-indexing-batch-02 for the production pipeline)

Quickest Proof

This batch is superseded. For a runnable demo, see the mini-index in batch-02:

git clone https://github.com/whmatrix/semantic-indexing-batch-02
cd semantic-indexing-batch-02/mini-index
pip install sentence-transformers faiss-cpu
python demo_query.py

Limitations & Non-Claims

This batch provides a foundational semantic index that demonstrates the indexing pipeline. It is superseded by semantic-indexing-batch-02 (8.35M+ vectors). No retrieval quality metrics or benchmarks are provided; this is proof-of-concept infrastructure. Index outputs are not tuned for any specific application domain.

Routing

This repo has been superseded by the production portfolio:

Current Production: semantic-indexing-batch-02 (8.3M+ vectors)
Canonical Protocol: Universal Protocol v4.23

Protocol Alignment

This indexing run conforms to the Universal Protocol v4.23.

All dataset ingestion, chunking, embedding, FAISS construction, and validation artifacts follow the schemas and constraints defined there.
