SpecForge — AI-Powered System Design Spec Generator

An AI-powered application that generates comprehensive system design specifications. Input your project idea, answer targeted questions, and receive a detailed architectural specification with diagrams, data models, API designs, and implementation plans — powered by any OpenAI-compatible LLM endpoint or locally running Ollama model.




Project Overview

SpecForge demonstrates how large language models can be used to generate production-ready system design specifications. It supports multiple LLM providers and works with any OpenAI-compatible inference endpoint or a locally running Ollama instance.

This makes SpecForge suitable for:

  • Enterprise deployments — connect to a GenAI Gateway or any managed LLM API
  • Air-gapped environments — run fully offline with Ollama and a locally hosted model
  • Local experimentation — quick setup with GPU-accelerated inference
  • Professional documentation — generate specs that guide AI coding tools

How It Works

  1. The user enters a project idea in the browser
  2. The React frontend sends the idea to the FastAPI backend
  3. The backend generates 5 targeted clarifying questions using the configured LLM
  4. The user answers the questions
  5. The backend constructs a detailed prompt and streams the spec generation
  6. The LLM returns a comprehensive 9-section specification with diagrams
  7. The user can refine the spec through conversational feedback

All inference logic is abstracted behind a single INFERENCE_PROVIDER environment variable — switching between providers requires only a .env change and a container restart.
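In practice this means the backend talks to every provider through the same OpenAI-compatible client. The sketch below illustrates the pattern; it is not the project's services/api_client.py, and the defaults, placeholder API key for Ollama, and example prompt are illustrative assumptions.

# inference_sketch.py - minimal sketch of env-driven provider switching (not the real api_client.py)
import os

from openai import OpenAI

provider = os.getenv("INFERENCE_PROVIDER", "remote")
endpoint = os.getenv("INFERENCE_API_ENDPOINT", "http://host.docker.internal:11434")
model = os.getenv("INFERENCE_MODEL_NAME", "gpt-4o")

# Ollama ignores the API key, but the SDK requires a non-empty string,
# so a placeholder is used for local inference (an assumption of this sketch).
token = os.getenv("INFERENCE_API_TOKEN") or ("ollama" if provider == "ollama" else "")

# Both remote gateways and Ollama expose the OpenAI-compatible /v1 API.
client = OpenAI(base_url=f"{endpoint}/v1", api_key=token)

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Generate 5 clarifying questions for: a food delivery app"}],
    temperature=float(os.getenv("LLM_TEMPERATURE", "0.7")),
    max_tokens=int(os.getenv("LLM_MAX_TOKENS", "8000")),
)
print(response.choices[0].message.content)

Switching providers then comes down to changing the INFERENCE_* variables; the calling code never changes.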


Architecture

The application follows a modular two-service architecture with a React frontend and a FastAPI backend. The backend handles all inference orchestration and optional LLM observability. The inference layer is fully pluggable — any OpenAI-compatible remote endpoint or a locally running Ollama instance can be used without code changes.

Architecture Diagram

graph TB
    subgraph "User Interface (port 3000)"
        A[React Frontend]
        A1[Idea Input]
        A2[Question/Answer Flow]
        A3[Spec Viewer]
    end

    subgraph "FastAPI Backend (port 8000)"
        B[API Server]
        C[API Client]
    end

    subgraph "Inference - Option A: Remote"
        E[OpenAI / Groq / OpenRouter<br/>Enterprise Gateway]
    end

    subgraph "Inference - Option B: Local"
        F[Ollama on Host<br/>host.docker.internal:11434]
    end

    A1 --> B
    A2 --> B
    A3 --> B
    B --> C
    C -->|INFERENCE_PROVIDER=remote| E
    C -->|INFERENCE_PROVIDER=ollama| F
    E -->|Specification| C
    F -->|Specification| C
    C --> B
    B --> A

Service Components

| Service | Container | Host Port | Description |
|---|---|---|---|
| specforge-api | specforge-api | 8000 | FastAPI backend — question generation, spec generation, refinement |
| specforge-ui | specforge-ui | 3000 | React frontend — served by dev server or Nginx in production |

Ollama is intentionally not a Docker service. On macOS (Apple Silicon), running Ollama in Docker bypasses Metal GPU acceleration, resulting in CPU-only inference. Ollama must run natively on the host so the backend container can reach it via host.docker.internal:11434.


Get Started

Prerequisites

Before you begin, ensure you have the following installed and configured:

  • Docker and Docker Compose (v2)
  • An inference endpoint — one of:
    • A remote OpenAI-compatible API key (OpenAI, Groq, OpenRouter, or enterprise gateway)
    • Ollama installed natively on the host machine

Verify Installation

docker --version
docker compose version
docker ps

Quick Start (Docker Deployment)

1. Clone the Repository

git clone https://github.com/cld2labs/SpecForge.git
cd SpecForge

2. Configure the Environment

cp .env.example .env

Open .env and set INFERENCE_PROVIDER plus the corresponding variables for your chosen provider. See LLM Provider Configuration for per-provider instructions.

3. Build and Start the Application

# Standard (attached)
docker compose up --build

# Detached (background)
docker compose up -d --build

4. Access the Application

Once the containers are running, open:

  • Frontend UI: http://localhost:3000
  • Backend API: http://localhost:8000 (health check at http://localhost:8000/health)

5. Verify Services

# Health check
curl http://localhost:8000/health

# View running containers
docker compose ps

View logs:

# All services
docker compose logs -f

# Backend only
docker compose logs -f specforge-api

# Frontend only
docker compose logs -f specforge-ui

6. Stop the Application

docker compose down

Local Development Setup

Run the backend and frontend directly on the host without Docker.

Backend (Python / FastAPI)

cd backend
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp ../.env.example ../.env       # configure your .env at the repo root
uvicorn main:app --reload --port 8000

Frontend (Node / Vite)

cd frontend
npm install
npm run dev

The Vite dev server proxies /api/ to http://localhost:8000. Open http://localhost:5173.


Project Structure

SpecForge/
├── backend/                    # FastAPI backend
│   ├── config.py               # Environment-driven settings
│   ├── main.py                 # FastAPI app with lifespan
│   ├── models/
│   │   └── schemas.py          # Pydantic request/response models
│   ├── routers/
│   │   ├── questions.py        # Question generation endpoint
│   │   ├── generate.py         # Spec generation (streaming SSE)
│   │   └── refine.py           # Spec refinement endpoint
│   ├── services/
│   │   ├── api_client.py       # Unified LLM inference client
│   │   └── __init__.py
│   ├── prompts/
│   │   ├── generate_questions.txt
│   │   ├── generate_spec.txt
│   │   └── refine_spec.txt
│   ├── Dockerfile
│   └── requirements.txt
├── frontend/                   # React frontend
│   ├── src/
│   │   ├── App.jsx
│   │   ├── components/
│   │   └── main.jsx
│   ├── Dockerfile
│   └── package.json
├── .github/
│   └── workflows/
│       └── code-scans.yaml     # CI/CD security scans
├── docker-compose.yaml         # Service orchestration
├── .env.example                # Environment variable reference
├── README.md
├── CONTRIBUTING.md
├── SECURITY.md
├── DISCLAIMER.md
└── LICENSE.md
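The structure above lists routers/generate.py as a streaming SSE endpoint. The sketch below shows one way such an endpoint can be written with FastAPI and the openai SDK; the route path, request model, prompt, and model name are simplified assumptions rather than the project's actual implementation.

# generate_sketch.py - simplified streaming endpoint in the spirit of routers/generate.py (assumptions throughout)
from fastapi import APIRouter
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel

router = APIRouter()
client = OpenAI(base_url="http://host.docker.internal:11434/v1", api_key="ollama")  # placeholder key

class GenerateRequest(BaseModel):
    idea: str
    answers: list[str]

@router.post("/generate")
def generate_spec(req: GenerateRequest):
    prompt = (
        f"Project idea: {req.idea}\n"
        f"Clarifying answers: {req.answers}\n"
        "Write a comprehensive 9-section system design specification."
    )

    def event_stream():
        # stream=True yields partial chunks; each delta is forwarded as a Server-Sent Event
        stream = client.chat.completions.create(
            model="codellama:34b",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content or ""
            if delta:
                yield f"data: {delta}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")

A production implementation would also handle client disconnects and escape newlines inside each data: frame; the sketch omits both for brevity.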

Usage Guide

Generate a specification:

  1. Open http://localhost:3000
  2. Enter your project idea (e.g., "A food delivery app like UberEats")
  3. Click "Generate Questions"
  4. Answer the 5 targeted questions
  5. Click "Generate Specification"
  6. Watch the spec stream in real-time
  7. Download as markdown or refine with conversational feedback

Refine your spec:

  1. Use the chat interface below the spec
  2. Ask for changes (e.g., "Add a caching layer" or "Use PostgreSQL instead")
  3. The AI updates the spec while maintaining structure

Performance Tips

  • Use larger context windows for complex projects. Models with 128K+ context (like GPT-4o) can handle more detailed requirements without truncation. For deployments with a small serving window (e.g., Llama 3.2 3B served behind an 8K context, as in the metrics below), reduce LLM_MAX_TOKENS to leave room for prompts; see the token-budget note after this list.
  • Lower LLM_TEMPERATURE (e.g., 0.3–0.5) for more consistent, structured specifications. Raise it slightly (e.g., 0.7–0.9) for more creative architectural suggestions.
  • Provide detailed answers to clarifying questions. The more context you provide, the more accurate and comprehensive the generated specification will be.
  • Use the refinement feature iteratively. Start with a basic spec, then refine specific sections (e.g., "Add Redis caching layer", "Switch to PostgreSQL") rather than regenerating from scratch.
  • On Apple Silicon, always run Ollama natively — never inside Docker. The MPS (Metal) GPU backend delivers significantly better throughput than CPU-only inference.
  • For enterprise deployments, choose a model optimized for long-form technical writing. GPT-4o and Claude 3.5 Sonnet excel at structured documentation.
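As a rough budget for the first tip: the metrics below show SpecForge prompts averaging roughly 4,000–4,200 input tokens per request, so on an 8K-context deployment an LLM_MAX_TOKENS of about 3,000–3,500 keeps both the prompt and the generated spec inside the window, while 128K-context models can comfortably use the default of 8000.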

Inference Metrics

The table below compares inference performance across different providers and models using a standardized SpecForge workload (3 runs: question generation + spec generation with 1,000 max output tokens).

| Provider | Model | Deployment | Context Window | Avg Input Tokens | Avg Output Tokens | Avg Tokens / Request | P50 Latency (ms) | P95 Latency (ms) | Throughput (req/s) | Hardware |
|---|---|---|---|---|---|---|---|---|---|---|
| vLLM | meta-llama/Llama-3.2-3B-Instruct | Local | 16.4K | 4,155 | 1,197 | 5,352 | 108,068 | 124,953 | 0.011 | Apple Silicon (Metal) (Macbook Pro M4) |
| Intel OPEA EI | meta-llama/Llama-3.2-3B-Instruct | Enterprise (On-Prem) | 8.1K | 4,158 | 823 | 4,982 | 33,911 | 38,391 | 0.035 | CPU-only (Xeon) |
| OpenAI (Cloud) | gpt-4o | API (Cloud) | 128K | 4,018 | 875 | 4,893 | 13,540 | 24,892 | 0.074 | N/A |

Notes:

  • All metrics use identical SpecForge workflows: idea input → 5 questions → spec generation with LLM_MAX_TOKENS=1000.
  • Token counts are actual values from API responses (not estimates).
  • GPT-4o delivers roughly 2.5x lower P50 latency (13,540 ms vs 33,911 ms) and 2.1x higher throughput (0.074 vs 0.035 req/s) compared to Llama 3.2 3B on the CPU-only gateway.
  • Llama 3.2 3B performance is limited by CPU-only inference on the test gateway. Local GPU inference would significantly improve these numbers.

Model Capabilities

GPT-4o

OpenAI's flagship multimodal model, optimized for speed and intelligence across text and vision tasks.

| Attribute | Details |
|---|---|
| Parameters | Not publicly disclosed |
| Architecture | Multimodal Transformer (text + image input, text output) |
| Context Window | 128,000 tokens input / 16,384 tokens max output |
| Reasoning Mode | Standard inference with strong chain-of-thought reasoning |
| Tool / Function Calling | Supported; parallel function calling |
| Structured Output | JSON mode and strict JSON schema adherence supported |
| Multilingual | Broad multilingual support (50+ languages) |
| Benchmarks | Strong performance on system design, architectural decision-making, and technical documentation |
| Pricing | $2.50 / 1M input tokens, $10.00 / 1M output tokens (as of 2024) |
| Fine-Tuning | Supervised fine-tuning via OpenAI API |
| License | Proprietary (OpenAI Terms of Use) |
| Deployment | Cloud-only — OpenAI API or Azure OpenAI Service; no self-hosted option |
| Knowledge Cutoff | October 2023 |

Llama 3.2 3B Instruct

Meta's small-scale open-weight instruction-tuned model, designed for edge and on-premises deployment.

| Attribute | Details |
|---|---|
| Parameters | 3.21B total parameters |
| Architecture | Transformer decoder with Grouped Query Attention (GQA) |
| Context Window | 131,072 tokens (128K) native |
| Reasoning Mode | Standard instruction-following (no explicit chain-of-thought mode) |
| Tool / Function Calling | Limited native support; can be prompted for structured output |
| Structured Output | JSON formatting supported via prompting |
| Multilingual | Primarily English-focused with limited multilingual capabilities |
| Benchmarks | MMLU: 63.4%; strong small-model performance for reasoning tasks |
| Quantization Formats | GGUF, GPTQ, AWQ — runs on consumer hardware (4 GB+ RAM) |
| Inference Runtimes | Ollama, vLLM, llama.cpp, LMStudio, Transformers |
| Fine-Tuning | Full fine-tuning and LoRA adapters supported |
| License | Llama 3.2 Community License (open for research and commercial use) |
| Deployment | Local, on-prem, air-gapped, cloud — full data sovereignty |

Comparison Summary

| Capability | GPT-4o | Llama 3.2 3B Instruct |
|---|---|---|
| System design specifications | Excellent | Good |
| Architectural diagrams | Excellent | Good (requires careful prompting) |
| Technical documentation | Excellent | Good |
| Function / tool calling | Native support | Prompt-based |
| JSON structured output | Native with schema validation | Prompt-based |
| On-prem / air-gapped deployment | No | Yes |
| Data sovereignty | No (cloud API) | Full (weights run locally) |
| Open weights | No (proprietary) | Yes (Llama 3.2 Community License) |
| Custom fine-tuning | API-based only | Full fine-tuning + LoRA |
| Edge device deployment | N/A | Yes (quantized variants) |
| Multimodal (image input) | Yes | No |
| Native context window | 128K | 128K |

Both models can generate system design specifications, though GPT-4o produces more comprehensive and detailed output with better architectural reasoning. Llama 3.2 3B excels in air-gapped environments, cost-sensitive deployments, and scenarios requiring data sovereignty.


LLM Provider Configuration

All providers are configured via the .env file. Set INFERENCE_PROVIDER=remote for any cloud or API-based provider, and INFERENCE_PROVIDER=ollama for local inference.

OpenAI

INFERENCE_PROVIDER=remote
INFERENCE_API_ENDPOINT=https://api.openai.com
INFERENCE_API_TOKEN=sk-...
INFERENCE_MODEL_NAME=gpt-4o

Recommended models: gpt-4o, gpt-4o-mini, gpt-4-turbo.

Groq

Groq provides OpenAI-compatible endpoints with extremely fast inference (LPU hardware).

INFERENCE_PROVIDER=remote
INFERENCE_API_ENDPOINT=https://api.groq.com/openai
INFERENCE_API_TOKEN=gsk_...
INFERENCE_MODEL_NAME=llama3-70b-8192

Recommended models: llama3-70b-8192, mixtral-8x7b-32768, llama-3.1-8b-instant.

Ollama

Runs inference locally on the host machine with full GPU acceleration.

  1. Install Ollama: https://ollama.com/download
  2. Pull a model:
    # Production — best spec generation quality (~20 GB)
    ollama pull codellama:34b
    
    # Testing / SLM benchmarking (~4 GB, fast)
    ollama pull codellama:7b
    
    # Other strong code models
    ollama pull deepseek-coder:6.7b
    ollama pull qwen2.5-coder:7b
    ollama pull codellama:13b
  3. Confirm Ollama is running:
    curl http://localhost:11434/api/tags
  4. Configure .env:
    INFERENCE_PROVIDER=ollama
    INFERENCE_API_ENDPOINT=http://host.docker.internal:11434
    INFERENCE_MODEL_NAME=codellama:34b
    # INFERENCE_API_TOKEN is not required for Ollama
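Before starting the containers, it can help to smoke-test the local Ollama install directly from the host. A minimal check, assuming the openai Python package is available and codellama:7b has been pulled (the placeholder API key is an assumption of this sketch, not an Ollama requirement):

# ollama_smoke_test.py - quick host-side check against Ollama's OpenAI-compatible API (sketch)
from openai import OpenAI

# From the host, Ollama is reached at localhost:11434; containers use host.docker.internal instead.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key value is ignored by Ollama

reply = client.chat.completions.create(
    model="codellama:7b",
    messages=[{"role": "user", "content": "In one sentence, what is a system design specification?"}],
    max_tokens=100,
)
print(reply.choices[0].message.content)

If this prints a sentence, the backend container should be able to reach the same server through host.docker.internal:11434.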

OpenRouter

OpenRouter provides a unified API across hundreds of models from different providers.

INFERENCE_PROVIDER=remote
INFERENCE_API_ENDPOINT=https://openrouter.ai/api
INFERENCE_API_TOKEN=sk-or-...
INFERENCE_MODEL_NAME=anthropic/claude-3.5-sonnet

Recommended models: anthropic/claude-3.5-sonnet, meta-llama/llama-3.1-70b-instruct, deepseek/deepseek-coder.

Custom OpenAI-Compatible API

Any enterprise gateway that exposes an OpenAI-compatible /v1/completions or /v1/chat/completions endpoint works without code changes.

GenAI Gateway (LiteLLM-backed):

INFERENCE_PROVIDER=remote
INFERENCE_API_ENDPOINT=https://genai-gateway.example.com
INFERENCE_API_TOKEN=your-litellm-master-key
INFERENCE_MODEL_NAME=codellama/CodeLlama-34b-Instruct-hf

If the endpoint uses a private domain mapped in /etc/hosts, also set:

LOCAL_URL_ENDPOINT=your-private-domain.internal

Switching Providers

  1. Edit .env with the new provider's values.
  2. Restart the backend container:
    docker compose restart specforge-api

No rebuild is needed — all settings are injected at runtime via environment variables.


Environment Variables

All variables are defined in .env (copied from .env.example). The backend reads them at startup via python-dotenv.

Core LLM Configuration

| Variable | Description | Default | Type |
|---|---|---|---|
| INFERENCE_PROVIDER | remote for any OpenAI-compatible API; ollama for local inference | remote | string |
| INFERENCE_API_ENDPOINT | Base URL of the inference service (no /v1 suffix) | | string |
| INFERENCE_API_TOKEN | Bearer token / API key. Not required for Ollama | | string |
| INFERENCE_MODEL_NAME | Model identifier passed to the API | gpt-4o | string |

Generation Parameters

| Variable | Description | Default | Type |
|---|---|---|---|
| LLM_TEMPERATURE | Sampling temperature. Lower = more deterministic output (0.0–2.0) | 0.7 | float |
| LLM_MAX_TOKENS | Maximum tokens in the generated output | 8000 | integer |

Server Configuration

| Variable | Description | Default | Type |
|---|---|---|---|
| BACKEND_PORT | Port the FastAPI server listens on | 8000 | integer |
| CORS_ALLOW_ORIGINS | Allowed CORS origins (comma-separated or *). Restrict in production | ["*"] | string |
| LOCAL_URL_ENDPOINT | Private domain in /etc/hosts the container must resolve. Leave as not-needed if not applicable | not-needed | string |
| VERIFY_SSL | Set false only for environments with self-signed certificates | true | boolean |
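The backend loads these values once at startup (config.py in the project structure, using python-dotenv and Pydantic as listed under Technology Stack). A minimal sketch of such an environment-driven settings module is shown below; the class and field names mirror the tables above but are assumptions, not the project's actual config.py.

# config_sketch.py - illustrative environment-driven settings (not the project's config.py)
import os

from dotenv import load_dotenv
from pydantic import BaseModel

load_dotenv()  # reads .env into the process environment at startup

class Settings(BaseModel):
    inference_provider: str = os.getenv("INFERENCE_PROVIDER", "remote")
    inference_api_endpoint: str = os.getenv("INFERENCE_API_ENDPOINT", "")
    inference_api_token: str = os.getenv("INFERENCE_API_TOKEN", "")
    inference_model_name: str = os.getenv("INFERENCE_MODEL_NAME", "gpt-4o")
    llm_temperature: float = float(os.getenv("LLM_TEMPERATURE", "0.7"))
    llm_max_tokens: int = int(os.getenv("LLM_MAX_TOKENS", "8000"))
    backend_port: int = int(os.getenv("BACKEND_PORT", "8000"))
    cors_allow_origins: str = os.getenv("CORS_ALLOW_ORIGINS", "*")
    verify_ssl: bool = os.getenv("VERIFY_SSL", "true").lower() == "true"

settings = Settings()  # imported by the rest of the backend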

Technology Stack

Backend

  • Framework: FastAPI (Python 3.11+) with Uvicorn ASGI server
  • LLM Integration: openai Python SDK — works with any OpenAI-compatible endpoint (remote or Ollama)
  • Local Inference: Ollama — runs natively on host with full Metal (MPS) or CUDA GPU acceleration
  • Config Management: python-dotenv for environment variable injection at startup
  • Data Validation: Pydantic v2 for request/response schema enforcement

Frontend

  • Framework: React 18 with Vite (fast HMR and production bundler)
  • Styling: Tailwind CSS v3 with custom dark mode design
  • UI Features: Real-time streaming, markdown rendering, conversational refinement, dark mode

Troubleshooting

For detailed troubleshooting, see TROUBLESHOOTING.md.

Common Issues

Issue: Backend returns 503 or 500 on generate

# Check backend logs for error details
docker compose logs specforge-api

# Verify the inference endpoint and token are set correctly
grep INFERENCE .env
  • Confirm INFERENCE_API_ENDPOINT is reachable from your machine.
  • Verify INFERENCE_API_TOKEN is valid and has the correct permissions.
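A quick way to test both from the host is to list the models exposed by the configured endpoint with your token. Most OpenAI-compatible services and gateways support GET /v1/models, but confirm with your provider; the snippet below is a sketch, not part of the backend.

# endpoint_check.py - sketch: verify the endpoint is reachable and the token is accepted
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("INFERENCE_API_ENDPOINT", "https://api.openai.com") + "/v1",
    api_key=os.getenv("INFERENCE_API_TOKEN", ""),
)
# Raises a clear authentication or connection error if the URL or token is wrong.
print([m.id for m in client.models.list().data])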

Issue: Ollama connection refused

# Confirm Ollama is running on the host
curl http://localhost:11434/api/tags

# If not running, start it
ollama serve

Issue: Ollama is slow / appears to be CPU-only

  • Ensure Ollama is running natively on the host, not inside Docker.
  • On macOS, verify the Ollama app is using MPS in Activity Monitor (GPU History).
  • See the Ollama section for correct setup.

Issue: SSL certificate errors

# In .env
VERIFY_SSL=false

# Restart the backend
docker compose restart specforge-api

Issue: Frontend cannot connect to API

# Verify both containers are running
docker compose ps

# Check CORS settings
grep CORS .env

Ensure CORS_ALLOW_ORIGINS includes the frontend origin (e.g., http://localhost:3000).

Issue: Private domain not resolving inside container

Set LOCAL_URL_ENDPOINT=your-private-domain.internal in .env — this adds the host-gateway mapping for the container.


License

This project is licensed under the terms described in LICENSE.md; see that file for details.


Disclaimer

SpecForge is provided as-is for demonstration and educational purposes. While we strive for accuracy:

  • AI-generated specifications should be reviewed by qualified engineers before use in production systems
  • Do not rely solely on AI-generated specifications without testing and validation
  • Do not submit confidential or proprietary information to third-party API providers without reviewing their data handling policies
  • The quality of generated specifications depends on the underlying model and may vary

For full disclaimer details, see DISCLAIMER.md.