Searchlight

Production-grade document retrieval API with hybrid search combining BM25 keyword matching and HNSW vector similarity using Apache Lucene 9+.

Features

Hybrid Search: Combine keyword (BM25) and vector (HNSW) search with adjustable weighting
High Performance: Lucene HNSW for fast approximate nearest neighbor search
Hexagonal Architecture: Clean separation of domain, ports, and adapters
Observability: Prometheus metrics, OpenTelemetry instrumentation, Grafana dashboards
🔌 Pluggable Embeddings: HTTP-based or ONNX Runtime providers
📰 RSS Ingestion: Automated document ingestion from RSS feeds and URLs
Modern Dashboard: Next.js 14 UI with real-time search
Docker Ready: Complete docker-compose setup with all services
90%+ Test Coverage: Comprehensive test suite with JUnit 5, WireMock, Testcontainers

Architecture

┌─────────────────────────────────────────────────────────┐
│                    REST API Layer                        │
│  (SearchController, AdminController, HealthController)   │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────────┐
│                   Domain Layer                           │
│   Models: DocumentChunk, SearchQuery, SearchResult      │
│   Ports: EmbeddingProvider, Indexer, Searcher           │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────────┐
│               Infrastructure Layer                       │
│                                                          │
│  ┌─────────────────┐  ┌──────────────────────────────┐ │
│  │ Lucene HNSW     │  │ Embedding Providers          │ │
│  │ - Indexer       │  │ - HttpEmbeddingProvider      │ │
│  │ - Searcher      │  │ - OnnxEmbeddingProvider      │ │
│  └─────────────────┘  └──────────────────────────────┘ │
│                                                          │
│  ┌─────────────────────────────────────────────────────┐│
│  │ Ingestion Pipeline                                  ││
│  │ - RssIngestService                                  ││
│  │ - HtmlCleaner (jsoup)                              ││
│  │ - Chunker (token-based splitting)                  ││
│  └─────────────────────────────────────────────────────┘│
└──────────────────────────────────────────────────────────┘

Quick Start

Prerequisites

Java 21+
Docker & Docker Compose (optional)
Node.js 20+ (for dashboard)

Option 1: Docker Compose (Recommended)

# Start all services
docker-compose up --build

# API available at http://localhost:8080
# Dashboard at http://localhost:3000
# Prometheus at http://localhost:9090
# Grafana at http://localhost:3001 (admin/admin)

Option 2: Local Development

# Run API in dev mode (with ONNX stub embeddings)
make dev

# Or manually:
./gradlew bootRun --args='--spring.profiles.active=dev'

Option 3: Using Makefile

# See all available commands
make help

# Build and test
make build
make test

# Run with Docker
make docker-up

Quick Demo

Once the API is running (via make dev or docker-compose up), run the interactive demo:

make demo

This will:

Ingest sample documents from Hacker News RSS
Run search with alpha=0.0 (pure BM25 keyword search)
Run search with alpha=1.0 (pure KNN vector search)
Show how rankings differ based on the alpha parameter

Expected output:

🔦 Searchlight Hybrid Search Demo
====================================

📥 Ingesting sample RSS feed (Hacker News)...
✅ Indexed 15 documents, 73 chunks

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔍 Search 1: Pure Keyword (alpha=0.0, BM25 only)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  [0.95] Best Programming Languages for 2025
  [0.82] Learn to Code: A Beginner's Guide
  [0.71] Programming Paradigms Explained
  ...

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔍 Search 2: Pure Vector (alpha=1.0, KNN only)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  [0.91] Machine Learning Fundamentals
  [0.87] Deep Learning in Practice
  [0.79] Building Neural Networks
  ...

✅ Demo complete! Rankings differ based on alpha parameter.

📖 API Documentation

Swagger UI

Once running, access interactive API docs at:

http://localhost:8080/swagger-ui.html

Core Endpoints

Search Documents

curl -X POST http://localhost:8080/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{
    "q": "machine learning",
    "k": 10,
    "alpha": 0.5
  }'

Request Body (copy to Swagger UI):

{
  "q": "machine learning neural networks",
  "k": 5,
  "alpha": 0.5
}

Parameters:

q (string, required): Query text
k (int, optional): Number of results (default: 10)
alpha (float 0-1, optional): Hybrid weight (default: 0.5)
- 0.0 = pure keyword search (BM25 only)
- 1.0 = pure vector search (KNN only)
- 0.5 = balanced hybrid search
offset (int, optional): Pagination offset (default: 0)
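
One way to read alpha is as a blend weight between the keyword score and the vector score. The sketch below assumes a plain convex combination of scores that are already normalized to the 0-1 range. That is an illustrative assumption, not a confirmed description of how Searchlight fuses results, and the class and method names are hypothetical.

```java
// Hypothetical sketch: blends two normalized scores with a weight alpha.
// Assumes both inputs are already scaled to [0, 1]; Searchlight's actual
// scoring may normalize or combine scores differently.
final class HybridScoreSketch {

    // alpha = 0.0 returns the keyword score, alpha = 1.0 returns the vector score.
    static double combine(double keywordScore, double vectorScore, double alpha) {
        if (!(alpha >= 0.0 && alpha <= 1.0)) {
            throw new IllegalArgumentException("alpha must be in [0, 1]");
        }
        return (1.0 - alpha) * keywordScore + alpha * vectorScore;
    }

    public static void main(String[] args) {
        System.out.println("keyword only: " + combine(0.72, 0.91, 0.0));
        System.out.println("vector only:  " + combine(0.72, 0.91, 1.0));
        System.out.println("balanced:     " + combine(0.72, 0.91, 0.5));
    }
}
```

Under this assumption, moving alpha toward 1.0 lets semantically similar chunks outrank exact keyword matches, which is the ranking shift the demo illustrates.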

Sample Response:

{
  "query": "machine learning neural networks",
  "results": [
    {
      "id": "doc-1-chunk-0",
      "sourceId": "doc-1",
      "title": "Introduction to Neural Networks",
      "url": "https://example.com/neural-networks",
      "snippet": "Neural networks are the foundation of modern machine learning...",
      "score": 0.85,
      "keywordScore": 0.72,
      "vectorScore": 0.91,
      "source": "hacker-news",
      "timestamp": "2025-10-03T10:30:00Z"
    }
  ],
  "total": 1,
  "took": 45
}
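
For callers on the JVM, the same search request can be sent with the JDK's built-in java.net.http client. This is a minimal sketch rather than an official client: the class name, the helper method, and the 10-second timeout are assumptions chosen for illustration, and the response is printed as raw JSON instead of being parsed.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical client sketch for POST /api/v1/search.
final class SearchClientSketch {

    // Builds the same POST request as the curl example above.
    static HttpRequest buildSearchRequest(String baseUrl, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/v1/search"))
                .timeout(Duration.ofSeconds(10)) // assumed timeout, tune as needed
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        String body = "{\"q\": \"machine learning\", \"k\": 10, \"alpha\": 0.5}";
        HttpRequest request = buildSearchRequest("http://localhost:8080", body);
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());
        } catch (IOException e) {
            // The API must be running locally for the call to succeed.
            System.err.println("Search API not reachable: " + e.getMessage());
        }
    }
}
```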

Ingest Documents

curl -X POST http://localhost:8080/api/v1/admin/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://news.ycombinator.com/rss"],
    "mode": "RSS"
  }'

Request Body (copy to Swagger UI):

{
  "urls": [
    "https://news.ycombinator.com/rss",
    "https://example.com/blog/feed.xml"
  ],
  "mode": "RSS"
}

Sample Response:

{
  "message": "Ingestion completed",
  "documentsProcessed": 25,
  "chunksIndexed": 142,
  "errors": 0
}

Get Document by ID

curl http://localhost:8080/api/v1/docs/{id}

🔄 Reindex

curl -X POST http://localhost:8080/api/v1/admin/reindex

Health Check

curl http://localhost:8080/api/v1/health

Metrics (Prometheus)

curl http://localhost:8080/actuator/prometheus

Dashboard

The Next.js dashboard provides a search UI with:

Real-time search results
Adjustable alpha slider for hybrid search tuning
Result scoring visualization
Responsive design

Running Dashboard Locally

cd dashboard
npm install
npm run dev
# Open http://localhost:3000

🧪 Testing

# Run all tests
./gradlew test

# Generate coverage report
./gradlew jacocoTestReport

# View coverage (macOS; on Linux use xdg-open instead of open)
open build/reports/jacoco/test/html/index.html

Test Categories

Unit Tests: Domain logic, embeddings, chunking
Integration Tests: Lucene indexing & searching
API Tests: Controller endpoints with MockMvc
E2E Tests: Full stack smoke tests

⚙ Configuration

application.yaml

searchlight:
  index:
    path: data/index
    similarity: COSINE  # COSINE, DOT_PRODUCT, EUCLIDEAN
    hnsw:
      m: 16
      ef-construction: 100
  
  embedding:
    provider: onnx  # http or onnx
    url: http://localhost:8000/embed
    dimension: 384
    timeout: 30000
  
  chunker:
    size: 512
    overlap: 50
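
To show how chunker.size and chunker.overlap interact, here is a small sketch of sliding-window chunking. It assumes both values are counted in tokens and that tokens are simply a list of strings. The project's Chunker may tokenize and split differently, and the class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of token-window chunking with overlap.
final class ChunkerSketch {

    // Splits tokens into windows of at most `size` tokens; each window
    // repeats the last `overlap` tokens of the previous one.
    static List<List<String>> chunk(List<String> tokens, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("require size > 0 and 0 <= overlap < size");
        }
        List<List<String>> chunks = new ArrayList<>();
        int step = size - overlap; // how far the window advances each time
        for (int start = 0; start < tokens.size(); start += step) {
            int end = Math.min(start + size, tokens.size());
            chunks.add(List.copyOf(tokens.subList(start, end)));
            if (end == tokens.size()) {
                break; // the last window reached the end of the input
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9");
        // With size 4 and overlap 1, adjacent chunks share one token.
        System.out.println(chunk(tokens, 4, 1));
    }
}
```

With the defaults shown above (size 512, overlap 50), each window in this sketch would advance by 462 tokens.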

Environment Variables

SEARCHLIGHT_EMBEDDING_PROVIDER=http|onnx
SEARCHLIGHT_EMBEDDING_URL=http://embedder:8000/embed
SEARCHLIGHT_EMBEDDING_DIMENSION=384
SPRING_PROFILES_ACTIVE=dev|ci|prod

Performance Benchmarks

Run load tests with k6:

make bench

Sample Results (local machine, mock embeddings):

Metric	Value
Requests/sec	~200 RPS
P50 Latency	45ms
P95 Latency	120ms
P99 Latency	180ms
Error Rate	<0.1%

Note: Performance varies based on index size, hardware, and embedding provider.

Development

Project Structure

searchlight/
├── src/main/java/com/searchlight/
│   ├── app/              # Spring Boot application & config
│   ├── domain/           # Core domain models & ports
│   │   ├── model/
│   │   └── ports/
│   ├── infra/            # Infrastructure implementations
│   │   ├── embeddings/   # Embedding providers
│   │   ├── index/        # Lucene HNSW
│   │   └── ingest/       # RSS/HTML processing
│   └── api/              # REST controllers & DTOs
│       ├── controller/
│       └── dto/
├── src/test/java/com/searchlight/
│   ├── fixtures/
│   ├── infra/
│   ├── api/
│   └── e2e/
├── dashboard/            # Next.js frontend
├── scripts/              # Helper scripts
├── config/               # Prometheus, Grafana configs
└── docker-compose.yml

Adding a New Embedding Provider

Implement the EmbeddingProvider interface
Add @ConditionalOnProperty so the provider is selected by configuration
Register it in the Spring context (for example with @Component)
Set searchlight.embedding.provider to the new value

Example:

@Component
@ConditionalOnProperty(name = "searchlight.embedding.provider", havingValue = "custom")
public class CustomEmbeddingProvider implements EmbeddingProvider {
    // Implementation...
}
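
To make the shape of such a provider more concrete, the following sketch fills in a deterministic stub without any Spring dependencies. The interface shown here is an assumption: the real EmbeddingProvider port in com.searchlight.domain.ports may declare different methods, and the hashing scheme is only a placeholder that produces stable vectors for local testing, not meaningful embeddings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, self-contained sketch of a custom embedding provider.
final class EmbeddingSketch {

    // Assumed port shape; the project's EmbeddingProvider may differ.
    interface EmbeddingProvider {
        float[] embed(String text);
        int dimension();
    }

    // Deterministic stub: hashes bytes into buckets, then L2-normalizes.
    // In the real project this class would also carry @Component and
    // @ConditionalOnProperty, as in the example above.
    static final class HashingEmbeddingProvider implements EmbeddingProvider {
        private final int dimension;

        HashingEmbeddingProvider(int dimension) {
            if (dimension <= 0) {
                throw new IllegalArgumentException("dimension must be positive");
            }
            this.dimension = dimension;
        }

        @Override
        public float[] embed(String text) {
            float[] vector = new float[dimension];
            byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i < bytes.length; i++) {
                // floorMod keeps the bucket index non-negative for any byte value.
                int bucket = Math.floorMod(31 * i + bytes[i], dimension);
                vector[bucket] += 1.0f;
            }
            double norm = 0.0;
            for (float v : vector) {
                norm += v * v;
            }
            norm = Math.sqrt(norm);
            if (norm > 0.0) {
                for (int i = 0; i < dimension; i++) {
                    vector[i] /= (float) norm;
                }
            }
            return vector;
        }

        @Override
        public int dimension() {
            return dimension;
        }
    }

    public static void main(String[] args) {
        EmbeddingProvider provider = new HashingEmbeddingProvider(384);
        System.out.println(provider.embed("hybrid search").length); // prints 384
    }
}
```

Whatever the real interface looks like, the vector length returned by a provider should match the configured searchlight.embedding.dimension (384 in the sample configuration).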

🔬 Observability

Metrics

Key metrics exposed via Prometheus:

search_requests_total - Total search requests
search_latency - Search latency histogram
embedding_latency - Embedding generation time
index_docs_count - Total documents in index
ingest_documents_total - Documents ingested
ingest_errors_total - Ingestion errors

Grafana Dashboards

Pre-configured dashboards available at http://localhost:3001:

Request rates and latencies
JVM metrics (heap, GC, threads)
Index statistics
Error rates

🗺 Roadmap

Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure ./gradlew build passes
Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Apache Lucene - High-performance search library
Spring Boot - Application framework
Next.js - React framework for dashboard
HNSW - Hierarchical Navigable Small World algorithm

📧 Contact

For questions or feedback, please open an issue on GitHub.

Built with Java 21, Spring Boot 3, and Apache Lucene
