diff --git a/README.ko.md b/README.ko.md
index cd14baf..df108b9 100644
--- a/README.ko.md
+++ b/README.ko.md
@@ -22,20 +22,33 @@
 
 ---
 
-## 3줄로 시작하기
+## 빠른 시작
 
+**Ollama 스타일 CLI (v0.12.0+):**
 ```bash
 pip install quantcpp
+
+quantcpp pull llama3.2:1b           # HuggingFace에서 다운로드
+quantcpp run llama3.2:1b            # 대화형 채팅
+quantcpp serve llama3.2:1b -p 8080  # OpenAI 호환 HTTP 서버
+quantcpp list                       # 캐시된 모델 목록
+```
+
+짧은 별칭: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. `run`/`serve` 첫 실행 시 자동 다운로드. `serve`는 OpenAI 호환 `POST /v1/chat/completions` 엔드포인트를 8080 포트에 제공합니다.
+
+**한 줄 질문:**
+```bash
+quantcpp run llama3.2:1b "중력이란 무엇인가요?"
 ```
 
+**Python API (3줄):**
 ```python
 from quantcpp import Model
-
-m = Model.from_pretrained("Llama-3.2-1B")  # 모델 자동 다운로드 (~750 MB)
+m = Model.from_pretrained("Llama-3.2-1B")
 print(m.ask("중력이란 무엇인가요?"))
 ```
 
-API 키 없음. GPU 없음. 설정 없음. [브라우저에서 바로 체험 →](https://quantumaikr.github.io/quant.cpp/) · [**작동 원리 가이드 →**](https://quantumaikr.github.io/quant.cpp/guide/)
+API 키 없음. GPU 없음. 설정 없음. 모델은 `~/.cache/quantcpp/`에 캐시됩니다. [브라우저에서 바로 체험 →](https://quantumaikr.github.io/quant.cpp/) · [**작동 원리 가이드 →**](https://quantumaikr.github.io/quant.cpp/guide/)
 
 ---
 
diff --git a/README.md b/README.md
index 824b33d..23944dd 100644
--- a/README.md
+++ b/README.md
@@ -37,27 +37,31 @@
 
 ## Quick Start
 
-**Terminal (one command):**
+**Ollama-style CLI (v0.12.0+):**
 ```bash
 pip install quantcpp
-quantcpp "What is gravity?"
+
+quantcpp pull llama3.2:1b           # download from HuggingFace
+quantcpp run llama3.2:1b            # interactive chat
+quantcpp serve llama3.2:1b -p 8080  # OpenAI-compatible HTTP server
+quantcpp list                       # show cached models
+```
+
+Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-pulls on first `run`/`serve`. The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080.
+
+**One-shot question:**
+```bash
+quantcpp run llama3.2:1b "What is gravity?"
 ```
 
-**Python (3 lines):**
+**Python API (3 lines):**
 ```python
 from quantcpp import Model
 m = Model.from_pretrained("Llama-3.2-1B")
 print(m.ask("What is gravity?"))
 ```
 
-**Interactive chat:**
-```bash
-quantcpp
-# You: What is gravity?
-# AI: Gravity is a fundamental force...
-```
-
-Downloads Llama-3.2-1B (~750 MB) on first use, cached locally. No API key, no GPU. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**How it works — Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
+Downloads on first use, cached at `~/.cache/quantcpp/`. No API key, no GPU. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
 
 ---
 
diff --git a/site/index.html b/site/index.html
index 69a6cce..745caf2 100644
--- a/site/index.html
+++ b/site/index.html
@@ -727,12 +727,25 @@
Three lines of Python. No GPU, no API key, no setup.
-pip install quantcpp
+ Ollama-style CLI. No GPU, no API key, no setup.
+
+
+ CLI (v0.12.0+)
+ pip install quantcpp
+
+quantcpp pull llama3.2:1b
+quantcpp run llama3.2:1b
+quantcpp serve llama3.2:1b -p 8080
+quantcpp list
+
+
+ Python API
+ from quantcpp import Model
-from quantcpp import Model
m = Model.from_pretrained("Llama-3.2-1B")
print(m.ask("What is gravity?"))
+
+
GitHub
PyPI
@@ -896,7 +909,9 @@ Try It Yourself
"glossary.gguf.term": "GGUF",
"glossary.gguf.def": "The standard file format for quantized LLM model weights, created by the llama.cpp project. quant.cpp loads GGUF models directly.",
"cta.title": "Try It Yourself",
- "cta.desc": "Three lines of Python. No GPU, no API key, no setup.",
+ "cta.desc": "Ollama-style CLI. No GPU, no API key, no setup.",
+ "cta.label.cli": "CLI (v0.12.0+)",
+ "cta.label.python": "Python API",
"rag.label": "Movement",
"rag.title": "Beyond RAG",
"rag.intro": "Traditional RAG splits documents into 512-token chunks, embeds them in a vector database, and retrieves fragments. This was a reasonable engineering compromise when LLMs had 2K context windows. Now they have 128K. The compromise should have started disappearing.",
@@ -1083,7 +1098,9 @@ Try It Yourself
"glossary.gguf.term": "GGUF",
"glossary.gguf.def": "양자화된 LLM 모델 가중치의 표준 파일 형식. llama.cpp 프로젝트에서 만들었습니다. quant.cpp는 GGUF 모델을 직접 로드합니다.",
"cta.title": "직접 해보기",
- "cta.desc": "Python 3줄. GPU도, API 키도, 설치도 필요 없습니다.",
+ "cta.desc": "Ollama 스타일 CLI. GPU도, API 키도, 설치도 필요 없습니다.",
+ "cta.label.cli": "CLI (v0.12.0+)",
+ "cta.label.python": "Python API",
"rag.label": "운동",
"rag.title": "Beyond RAG",
"rag.intro": "전통적인 RAG는 문서를 512토큰 청크로 나누고, 벡터 DB에 임베딩하고, 조각을 검색합니다. 이것은 LLM이 2K 컨텍스트만 가졌을 때 합리적인 엔지니어링 타협이었습니다. 지금은 128K입니다. 그 타협은 사라지기 시작했어야 합니다.",
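The README changes above state that `quantcpp serve` exposes an OpenAI-compatible `POST /v1/chat/completions` endpoint on port 8080. As a minimal client sketch (assuming a server already started with `quantcpp serve llama3.2:1b -p 8080`; the payload shape follows the standard OpenAI chat API, and `build_chat_request`/`ask` are illustrative names, not part of quantcpp):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    # Standard OpenAI chat-completions payload. The diff does not show
    # which optional fields (temperature, stream, ...) quantcpp supports,
    # so only the required ones are used here.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt: str, model: str = "llama3.2:1b",
        base_url: str = "http://localhost:8080") -> str:
    # POST to the endpoint the README says `quantcpp serve` exposes.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers nest the reply at choices[0].message.content.
    return body["choices"][0]["message"]["content"]

# Usage (requires a running server):
#   answer = ask("What is gravity?")
```

Because the payload and response shapes are the standard OpenAI ones, any existing OpenAI client library pointed at `http://localhost:8080/v1` should work the same way.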