Merged
98 changes: 98 additions & 0 deletions README.md
@@ -30,6 +30,12 @@ An AI-powered application that generates comprehensive system design specificati
- [Local Development Setup](#local-development-setup)
- [Project Structure](#project-structure)
- [Usage Guide](#usage-guide)
- [Performance Tips](#performance-tips)
- [Inference Benchmarks](#inference-benchmarks)
- [Model Capabilities](#model-capabilities)
- [GPT-4o](#gpt-4o)
- [Llama 3.2 3B Instruct](#llama-32-3b-instruct)
- [Comparison Summary](#comparison-summary)
- [LLM Provider Configuration](#llm-provider-configuration)
- [OpenAI](#openai)
- [Groq](#groq)
@@ -304,6 +310,98 @@ SpecForge/

---

## Performance Tips

- **Use larger context windows for complex projects.** Models with 128K+ context (like GPT-4o) can handle more detailed requirements without truncation. Llama 3.2 3B also has a 128K native window, but gateways often serve it with a much smaller limit (8K on the gateway used in the benchmarks below); in that case, reduce `LLM_MAX_TOKENS` to leave room for prompts.
- **Lower `LLM_TEMPERATURE`** (e.g., `0.3–0.5`) for more consistent, structured specifications. Raise it slightly (e.g., `0.7–0.9`) for more creative architectural suggestions.
- **Provide detailed answers to clarifying questions.** The more context you provide, the more accurate and comprehensive the generated specification will be.
- **Use the refinement feature iteratively.** Start with a basic spec, then refine specific sections (e.g., "Add Redis caching layer", "Switch to PostgreSQL") rather than regenerating from scratch.
- **On Apple Silicon**, run Ollama natively rather than inside Docker: containers on macOS run in a Linux VM without GPU passthrough, while native Ollama uses the Metal (MPS) backend for significantly better throughput than CPU-only inference.
- **For enterprise deployments**, choose a model optimized for long-form technical writing. GPT-4o and Claude 3.5 Sonnet excel at structured documentation.
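
The first tip boils down to a simple token budget. A minimal sketch of that arithmetic (the window sizes, prompt size, and safety margin below are illustrative assumptions, not SpecForge defaults):

```python
# Hypothetical helper: pick a safe LLM_MAX_TOKENS value so that the
# prompt plus the completion fits inside the model's context window.
def safe_max_tokens(context_window: int, prompt_tokens: int, margin: int = 256) -> int:
    """Largest completion budget that still fits; 0 if the prompt alone overflows."""
    remaining = context_window - prompt_tokens - margin
    return max(remaining, 0)

# A ~4,000-token SpecForge prompt leaves ample room on a 128K model,
# but comparatively little on an 8K deployment:
print(safe_max_tokens(128_000, 4_000))  # well over 100K tokens remain
print(safe_max_tokens(8_192, 4_000))    # only 3,936 tokens left
```

With an 8K window, setting `LLM_MAX_TOKENS=1000` (as in the benchmarks below) stays comfortably inside this budget.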

---

## Inference Benchmarks

The table below compares inference performance across providers and models on a standardized SpecForge workload (3 runs, each consisting of question generation plus spec generation, capped at 1,000 output tokens).

| Provider | Model | Deployment | Context Window | Avg Input Tokens | Avg Output Tokens | Avg Tokens / Request | P50 Latency (ms) | P95 Latency (ms) | Throughput (req/s) | Hardware |
| -------------- | ------------------------------ | -------------------- | -------------- | ---------------- | ----------------- | -------------------- | ---------------- | ---------------- | ------------------ | ---------------- |
| OpenAI (Cloud) | `gpt-4o` | API (Cloud) | 128K | 4,018 | 875 | 4,893 | 13,540 | 24,892 | 0.074 | Cloud GPUs |
| LiteLLM | `meta-llama/Llama-3.2-3B-Instruct` | Enterprise Gateway | 8.1K | 4,158 | 823 | 4,982 | 33,911 | 38,391 | 0.035 | CPU (Xeon) |

> **Notes:**
>
> - All benchmarks use identical SpecForge workflows: idea input → 5 questions → spec generation with `LLM_MAX_TOKENS=1000`.
> - Token counts are actual values from API responses (not estimates).
> - GPT-4o's P50 latency is 2.5x lower and its throughput 2.1x higher than Llama 3.2 3B's on the tested infrastructure.
> - Llama 3.2 3B performance is limited by CPU-only inference on the test gateway. Local GPU inference would significantly improve these numbers.
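
For reference, the latency and throughput columns can be derived from raw per-request timings as sketched below. The sample values are made up for illustration; they are not the actual benchmark measurements:

```python
import math
import statistics

def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Reduce sequential per-request latencies to P50, P95, and throughput."""
    ordered = sorted(latencies_ms)
    # With only a handful of runs, P95 is effectively the worst
    # observation: take the ceil(0.95 * n)-th order statistic.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    total_seconds = sum(latencies_ms) / 1000  # sequential wall-clock time
    return {
        "p50_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "throughput_rps": len(latencies_ms) / total_seconds,
    }

stats = summarize([12_800.0, 13_540.0, 24_892.0])  # illustrative timings
print(stats)
```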

---

## Model Capabilities

### GPT-4o

OpenAI's flagship multimodal model, optimized for speed and intelligence across text and vision tasks.

| Attribute | Details |
| --------------------------- | --------------------------------------------------------------------------------- |
| **Parameters** | Not publicly disclosed |
| **Architecture** | Multimodal Transformer (text + image input, text output) |
| **Context Window** | 128,000 tokens input / 16,384 tokens max output |
| **Reasoning Mode** | Standard inference with strong chain-of-thought reasoning |
| **Tool / Function Calling** | Supported; parallel function calling |
| **Structured Output** | JSON mode and strict JSON schema adherence supported |
| **Multilingual** | Broad multilingual support (50+ languages) |
| **Benchmarks** | Strong performance on system design, architectural decision-making, and technical documentation |
| **Pricing** | $2.50 / 1M input tokens, $10.00 / 1M output tokens (as of 2024) |
| **Fine-Tuning** | Supervised fine-tuning via OpenAI API |
| **License** | Proprietary (OpenAI Terms of Use) |
| **Deployment** | Cloud-only — OpenAI API or Azure OpenAI Service. No self-hosted option |
| **Knowledge Cutoff** | October 2023 |
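
The structured-output row translates directly into request options. A minimal sketch of assembling such a request (no network call is made; the helper name, prompt text, and temperature are illustrative, though the field names follow the public Chat Completions API):

```python
# Build a Chat Completions payload that asks GPT-4o for JSON-mode output.
def build_spec_request(idea: str, max_tokens: int = 1000) -> dict:
    return {
        "model": "gpt-4o",
        "response_format": {"type": "json_object"},  # enables JSON mode
        "max_tokens": max_tokens,
        "temperature": 0.4,  # lower temperature for consistent structure
        "messages": [
            {"role": "system",
             "content": "You produce system design specifications as JSON."},
            {"role": "user", "content": idea},
        ],
    }

payload = build_spec_request("A URL shortener with analytics")
print(payload["response_format"]["type"])
```

In practice the payload would be sent via the official OpenAI SDK or any HTTP client; with JSON mode enabled, the system prompt must still mention JSON or the API rejects the request.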

### Llama 3.2 3B Instruct

Meta's small-scale open-weight instruction-tuned model, designed for edge and on-premises deployment.

| Attribute | Details |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------- |
| **Parameters** | 3.21B total parameters |
| **Architecture** | Transformer decoder with Grouped Query Attention (GQA) |
| **Context Window** | 131,072 tokens (128K) native |
| **Reasoning Mode** | Standard instruction-following (no explicit chain-of-thought mode) |
| **Tool / Function Calling** | Limited native support; can be prompted for structured output |
| **Structured Output** | JSON formatting supported via prompting |
| **Multilingual** | Primarily English-focused with limited multilingual capabilities |
| **Benchmarks** | MMLU: 63.4%, strong small-model performance for reasoning tasks |
| **Quantization Formats** | GGUF, GPTQ, AWQ — runs on consumer hardware (4GB+ RAM) |
| **Inference Runtimes** | Ollama, vLLM, llama.cpp, LMStudio, Transformers |
| **Fine-Tuning** | Full fine-tuning and LoRA adapters supported |
| **License** | Llama 3.2 Community License (open for research and commercial use) |
| **Deployment** | Local, on-prem, air-gapped, cloud — full data sovereignty |
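
A back-of-the-envelope check on the consumer-hardware claim (the bit-widths are illustrative; KV cache, activations, and runtime overhead are ignored, so real usage is higher):

```python
# Estimate weight memory for a 3.21B-parameter model at a given
# quantization width. GiB = bytes / 2**30.
def weight_memory_gib(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 2**30

PARAMS = 3.21e9
fp16 = weight_memory_gib(PARAMS, 16)  # full precision: too big for a 4 GB budget
q4 = weight_memory_gib(PARAMS, 4.5)   # ~4-bit GGUF: fits with room for the KV cache
print(round(fp16, 1), round(q4, 1))
```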

### Comparison Summary

| Capability | GPT-4o | Llama 3.2 3B Instruct |
| ------------------------------- | -------------------------------- | -------------------------------- |
| System design specifications | Excellent | Good |
| Architectural diagrams          | Excellent                        | Good (requires careful prompting) |
| Technical documentation | Excellent | Good |
| Function / tool calling | Native support | Prompt-based |
| JSON structured output | Native with schema validation | Prompt-based |
| On-prem / air-gapped deployment | No | Yes |
| Data sovereignty | No (cloud API) | Full (weights run locally) |
| Open weights | No (proprietary) | Yes (Llama 3.2 License) |
| Custom fine-tuning | API-based only | Full fine-tuning + LoRA |
| Edge device deployment | N/A | Yes (quantized variants) |
| Multimodal (image input) | Yes | No |
| Native context window | 128K | 128K |

> Both models can generate system design specifications, though GPT-4o produces more comprehensive and detailed output with better architectural reasoning. Llama 3.2 3B excels in air-gapped environments, cost-sensitive deployments, and scenarios requiring data sovereignty.

---

## LLM Provider Configuration

All providers are configured via the `.env` file. Set `INFERENCE_PROVIDER=remote` for any cloud or API-based provider, and `INFERENCE_PROVIDER=ollama` for local inference.
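
A minimal sketch of how a backend might dispatch on `INFERENCE_PROVIDER` (the function and variable names are illustrative, not SpecForge's actual code; 11434 is Ollama's default API port):

```python
import os

def resolve_base_url() -> str:
    """Pick the LLM endpoint based on the configured inference provider."""
    provider = os.environ.get("INFERENCE_PROVIDER", "remote")
    if provider == "ollama":
        # Local inference via Ollama's default endpoint.
        return os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")
    if provider == "remote":
        # Any cloud / API-based provider; base URL is configurable.
        return os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1")
    raise ValueError(f"Unknown INFERENCE_PROVIDER: {provider}")

os.environ["INFERENCE_PROVIDER"] = "ollama"
print(resolve_base_url())  # local Ollama endpoint by default
```
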
3 changes: 3 additions & 0 deletions backend/Dockerfile
@@ -2,6 +2,9 @@ FROM python:3.11-slim

WORKDIR /app

# Upgrade pip, setuptools, and wheel to fix security vulnerabilities
# (quote the specifiers so the shell does not parse ">=" as a redirection)
RUN pip install --no-cache-dir --upgrade pip "setuptools>=79.1.0" "wheel>=0.46.2"

# Copy requirements first for better caching
COPY requirements.txt .

40 changes: 20 additions & 20 deletions frontend/package-lock.json


5 changes: 4 additions & 1 deletion frontend/package.json
@@ -10,7 +10,7 @@
"preview": "vite preview"
},
"dependencies": {
-    "mermaid": "^11.13.0",
+    "mermaid": "^11.14.0",
"react": "^19.2.4",
"react-dom": "^19.2.4",
"react-markdown": "^10.1.0",
@@ -26,5 +26,8 @@
"eslint-plugin-react-refresh": "^0.5.2",
"globals": "^17.4.0",
"vite": "^8.0.0"
},
"overrides": {
"lodash-es": "^4.18.0"
}
}