24 changes: 14 additions & 10 deletions README.md
@@ -1,4 +1,4 @@
-# Screenshot Translator (Gemma-4-E4B-It)
+# Screenshot Translator (Gemma-4-26B-A4B-It)

 <img width="640" height="521" alt="Image" src="https://github.com/user-attachments/assets/f24ef322-08b5-48e6-aa54-71b9e06d7401" />

@@ -17,9 +17,9 @@
 - CUDA-capable GPU (e.g. CUDA 13 / nvcc 13.0.88)
 - `uv` (Python package manager) installed on the host
 - Place the following two model files in the local `models/` directory
-  - `models/gemma-4-E4B-it-UD-Q4_K_XL.gguf` (default)
-  - `models/mmproj-F16.gguf` (default)
-  - Source: [unsloth/gemma-4-E4B-it-GGUF](https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/tree/main)
+  - `models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf` (default)
+  - `models/mmproj-F16.gguf` (for 26B-A4B, default)
+  - Source: [unsloth/gemma-4-26B-A4B-it-GGUF](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/tree/main)
 - **Text-to-speech (TTS)**:
   - `Kokoro-82M` (about 300 MB) is downloaded automatically when the backend starts.
   - Audio playback requires `libportaudio2` and `aplay` (ALSA) on the host (usually preinstalled on Ubuntu Desktop).
@@ -37,32 +37,36 @@
 ```bash
 ./start.sh
 ```
-- Default: Gemma 4 E4B (`UD-Q4_K_XL`) + `mmproj-F16`, llama-server 8009, Web UI 8012, ctx=8192, parallel=1.
+- Default: Gemma 4 26B-A4B (`UD-Q4_K_XL`) + `mmproj-F16`, llama-server 8009, Web UI 8012, ctx=8192, parallel=1.
 - If VRAM is limited, lower `LLAMA_CTX` at startup (e.g. `LLAMA_CTX=4096 ./start.sh`).
 - To use an existing llama-server: `SKIP_LLAMACPP=1 LLAMA_SERVER_URL=http://127.0.0.1:8009 ./start.sh`
-- With the Gemma 4 default, `LLAMA_THINK_BUDGET=0` is applied automatically.
+- With the Gemma 4 default, `LLAMA_THINK_BUDGET=0` and `--reasoning off` are applied automatically.
+- To keep E4B alongside 26B-A4B, save the E4B projector under a different name so that it does not collide with the 26B-A4B `models/mmproj-F16.gguf`. Example: `models/mmproj-F16_gemma4E4B.gguf`
+- To use E4B, specify it explicitly, e.g. `LLAMA_MODEL=models/gemma-4-E4B-it-UD-Q4_K_XL.gguf LLAMA_MMPROJ=models/mmproj-F16_gemma4E4B.gguf LLAMA_MODEL_NAME=Gemma-4-E4B-It ./start.sh`.
 - To use Qwen3.5, specify it explicitly, e.g. `LLAMA_MODEL=models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf LLAMA_MMPROJ=models/mmproj-F32.gguf ./start.sh`.
 - For Qwen3.5, specify the Qwen3.5 `mmproj`; it cannot be shared with the Gemma 4 `mmproj`. If the file names collide, save it under a different name and set `LLAMA_MMPROJ` to that path.

 ## Main environment variables
 - `WEB_PORT` (default: 8012)
 - `LLAMA_PORT` (default: 8009)
-- `LLAMA_MODEL` (default: `models/gemma-4-E4B-it-UD-Q4_K_XL.gguf`)
+- `LLAMA_MODEL` (default: `models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf`)
 - `LLAMA_MMPROJ` (default: `models/mmproj-F16.gguf`)
-- `LLAMA_MODEL_NAME` (default: `Gemma-4-E4B-It`)
+- `LLAMA_MODEL_NAME` (default: `Gemma-4-26B-A4B-It`)
 - `LLAMA_CTX` (default: 8192)
 - `LLAMA_PARALLEL` (default: 1)
 - `LLAMA_BIN` (default: ./llama.cpp/build/bin/llama-server)
 - `LLAMA_CHAT_TEMPLATE_FILE` (template path passed to `--chat-template-file`)
+- `LLAMA_REASONING` (value passed to `--reasoning`; defaults to `off` for Gemma 4 models when unset)
 - `LLAMA_THINK_BUDGET` (value passed to `--reasoning-budget`; automatically `0` with the Gemma 4 / Qwen3.5 defaults)
 - `LLAMA_ARG_CHAT_TEMPLATE_FILE` / `LLAMA_ARG_THINK_BUDGET` are also accepted as compatibility inputs
 - `SKIP_LLAMACPP`=1 skips launching llama-server

 ### Gemma 4 default configuration
-- The default configuration is `gemma-4-E4B-it-UD-Q4_K_XL.gguf` with `mmproj-F16.gguf`.
+- The default configuration is `gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf` with `mmproj-F16.gguf`.
+- `mmproj-F16.gguf` is assumed to be the 26B-A4B projector. To keep E4B alongside it, save the E4B projector under a different name and set `LLAMA_MMPROJ` explicitly.
 - The chat template built into the Gemma 4 model is used as is.
 - `--parallel 1` is the default, assuming a single user.
-- To suppress thinking, `0` is applied automatically when `LLAMA_THINK_BUDGET` is unset.
+- To suppress thinking, Gemma 4 models automatically get `0` when `LLAMA_THINK_BUDGET` is unset and `off` when `LLAMA_REASONING` is unset.

 ### Qwen3.5 template operation (compatibility)
 - The tracked template is `app/chat_templates/qwen3.5-35b-a3b.chat_template.jinja`.
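
The defaulting rules in the README hunk above can be illustrated with a minimal Python sketch. This is not the repository's implementation: the real logic lives in `start.sh`, which is not part of the shown diff. The helper names `_first_set` and `build_llama_args` are hypothetical, detecting the model family from the file name is an assumption, and the flag names are the ones quoted in the README.

```python
import os
from typing import List, Mapping, Optional


def _first_set(env: Mapping[str, str], *names: str) -> Optional[str]:
    """Return the first non-empty value among the given variable names."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return None


def build_llama_args(env: Mapping[str, str]) -> List[str]:
    """Resolve README defaults into a llama-server argument list (sketch)."""
    model = env.get("LLAMA_MODEL") or "models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf"
    mmproj = env.get("LLAMA_MMPROJ") or "models/mmproj-F16.gguf"
    args = [
        "--model", model,
        "--mmproj", mmproj,
        "--port", env.get("LLAMA_PORT") or "8009",
        "--ctx-size", env.get("LLAMA_CTX") or "8192",
        "--parallel", env.get("LLAMA_PARALLEL") or "1",
    ]

    # Assumption: the model family is inferred from the file name.
    name = os.path.basename(model).lower()
    is_gemma4 = name.startswith("gemma-4")
    is_qwen35 = name.startswith("qwen3.5")

    # LLAMA_ARG_* variants are accepted as compatibility inputs.
    budget = _first_set(env, "LLAMA_THINK_BUDGET", "LLAMA_ARG_THINK_BUDGET")
    if budget is None and (is_gemma4 or is_qwen35):
        budget = "0"
    if budget is not None:
        args += ["--reasoning-budget", budget]

    # Only Gemma 4 models get `--reasoning off` when nothing is specified.
    reasoning = _first_set(env, "LLAMA_REASONING")
    if reasoning is None and is_gemma4:
        reasoning = "off"
    if reasoning is not None:
        args += ["--reasoning", reasoning]

    template = _first_set(
        env, "LLAMA_CHAT_TEMPLATE_FILE", "LLAMA_ARG_CHAT_TEMPLATE_FILE"
    )
    if template is not None:
        args += ["--chat-template-file", template]
    return args


if __name__ == "__main__":
    print(" ".join(build_llama_args(dict(os.environ))))
```

Passing the environment as a mapping keeps the sketch testable without touching the real process environment.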
2 changes: 1 addition & 1 deletion app/config.py
@@ -11,7 +11,7 @@ class Settings:
     def __init__(self) -> None:
         self.api_base = os.getenv("LLAMA_SERVER_URL", "http://127.0.0.1:8009")
         self.ctx_size = int(os.getenv("LLAMA_CTX", "8192"))
-        self.model_name = os.getenv("LLAMA_MODEL_NAME", "Gemma-4-E4B-It")
+        self.model_name = os.getenv("LLAMA_MODEL_NAME", "Gemma-4-26B-A4B-It")
         self.system_prompt = (
             "You are a precise OCR + translation engine."
             " Output the FULL text exactly as seen."
84 changes: 63 additions & 21 deletions app/llama_client.py
@@ -70,11 +70,18 @@ async def ocr_translate_with_grounding(
         clean_png: bytes,
         return_roi_fallback: bool,
         timeout_sec: int,
+        extra_instruction: Optional[str] = None,
+        translation_only: bool = False,
     ) -> str:
         # Check if guide and clean are identical (Monitor Mode / optimized client)
         is_single_image = (guide_png == clean_png)

-        prompt_text = self._build_grounding_prompt(return_roi_fallback, is_single_image)
+        prompt_text = self._build_grounding_prompt(
+            return_roi_fallback,
+            is_single_image,
+            extra_instruction=extra_instruction,
+            translation_only=translation_only,
+        )
         clean_b64 = base64.b64encode(clean_png).decode()

         content_list = [{"type": "text", "text": prompt_text}]
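
The hunk above stops right after the text part of `content_list` is created, and the rest of the method is not shown. The following is only a guess at how the image parts might be appended, using OpenAI-style `image_url` content parts with base64 data URLs. `build_content_list` is a hypothetical helper, and the order (guide image first, clean image second) is an assumption based on the prompt's "Image A" / "Image B" wording.

```python
import base64
from typing import Any, Dict, List


def build_content_list(
    prompt_text: str, guide_png: bytes, clean_png: bytes
) -> List[Dict[str, Any]]:
    """Assemble a multimodal message body: one text part plus image parts."""
    content: List[Dict[str, Any]] = [{"type": "text", "text": prompt_text}]

    # Mirrors the diff: identical guide and clean images mean single-image mode.
    is_single_image = guide_png == clean_png
    images = [clean_png] if is_single_image else [guide_png, clean_png]

    for png in images:
        b64 = base64.b64encode(png).decode()
        content.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            }
        )
    return content
```

Sending the image only once in single-image mode avoids paying the vision-token cost twice for identical input.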
@@ -148,7 +155,12 @@ async def get_status(self) -> str:


     @staticmethod
-    def _build_grounding_prompt(return_roi_fallback: bool, is_single_image: bool = False) -> str:
+    def _build_grounding_prompt(
+        return_roi_fallback: bool,
+        is_single_image: bool = False,
+        extra_instruction: Optional[str] = None,
+        translation_only: bool = False,
+    ) -> str:
         if is_single_image:
             base = (
                 "You will receive an image.\n"
@@ -158,14 +170,18 @@ def _build_grounding_prompt(return_roi_fallback: bool, is_single_image: bool = F
                 "2) Scan the entire image area from top-left to bottom-right. Do not miss any independent text blocks.\n"
                 "3) Include all text columns, headers, and footers. Do not focus only on the main body.\n"
                 "4) Preserve code blocks and inline code verbatim; do NOT translate code.\n"
-                "5) Do NOT summarize or omit any content. Translate every line faithfully.\n"
-                "6) If a character is unreadable, use [UNK].\n\n"
+                "5) Do NOT summarize or omit any content. Translate every natural-language line faithfully.\n"
+                "6) Do not merge, reorder, or duplicate lines or sections.\n"
+                "7) If a line mixes natural language with code-like tokens, translate only the natural-language parts and keep the code-like parts unchanged.\n"
+                "8) If unsure whether text should be translated or preserved verbatim, prefer preserving the original text.\n"
+                "9) If a character is unreadable, use [UNK].\n\n"
                 "Tasks:\n"
                 "1) Treat the WHOLE image as the target.\n"
                 "2) Return ONE bounding box that covers ALL text in the image. Do not create a partial box.\n"
                 "3) OCR all text inside the image EXACTLY as visible.\n"
                 "4) すべての内容を日本語に正確に翻訳してください。なお、コードはそのまま出力してください。\n"
-                "5) If the image is ambiguous or text is unreadable, still return best-effort bbox and mark uncertainty in notes.\n"
+                "5) Do not duplicate lines, paragraphs, or sections.\n"
+                "6) If the image is ambiguous or text is unreadable, still return best-effort bbox and mark uncertainty in notes.\n"
                 "Output STRICTLY as JSON (no markdown, no extra text).\n"
             )
         else:
@@ -178,33 +194,59 @@ def _build_grounding_prompt(return_roi_fallback: bool, is_single_image: bool = F
                 "2) Scan the entire image area from top-left to bottom-right. Do not miss any independent text blocks.\n"
                 "3) Include all text columns, headers, and footers. Do not focus only on the main body.\n"
                 "4) Preserve code blocks and inline code verbatim; do NOT translate code.\n"
-                "5) Do NOT summarize or omit any content. Translate every line faithfully.\n"
-                "6) If a character is unreadable, use [UNK].\n\n"
+                "5) Do NOT summarize or omit any content. Translate every natural-language line faithfully.\n"
+                "6) Do not merge, reorder, or duplicate lines or sections.\n"
+                "7) If a line mixes natural language with code-like tokens, translate only the natural-language parts and keep the code-like parts unchanged.\n"
+                "8) If unsure whether text should be translated or preserved verbatim, prefer preserving the original text.\n"
+                "9) If a character is unreadable, use [UNK].\n\n"
                 "Tasks:\n"
                 "1) The guide stroke (Image A) indicates that the USER SELECTED THE ENTIRE IMAGE AREA. Treat the whole image as the target.\n"
                 "2) Return ONE bounding box that covers ALL text in the image. Do not create a partial box.\n"
                 "3) OCR all text inside the image EXACTLY as visible.\n"
                 "4) すべての内容を日本語に正確に翻訳してください。なお、コードはそのまま出力してください。\n"
-                "5) If the box is ambiguous or text is unreadable, still return best-effort bbox and mark uncertainty in notes.\n"
+                "5) Do not duplicate lines, paragraphs, or sections.\n"
+                "6) If the box is ambiguous or text is unreadable, still return best-effort bbox and mark uncertainty in notes.\n"
                 "Output STRICTLY as JSON (no markdown, no extra text).\n"
             )

-        schema = (
-            '{\n'
-            ' "target_bbox": {"x1": <int>, "y1": <int>, "x2": <int>, "y2": <int>},\n'
-            ' "detected_language": "<string>",\n'
-            ' "ocr_text": "<string>",\n'
-            ' "ja_translation": "<string>",\n'
-            ' "confidence": <number between 0 and 1>,\n'
-            ' "notes": "<string>"\n'
-            '}'
-        )
-        if return_roi_fallback:
-            schema = schema[:-2] + ',\n "roi_fallback": {"ocr_text": "<string>", "ja_translation": "<string>"}\n}'
+        if extra_instruction:
+            base += (
+                "\nAdditional user instruction:\n"
+                f"{extra_instruction.strip()}\n"
+                "Follow this instruction unless it conflicts with the required output format.\n"
+            )
+
+        if translation_only:
+            schema = (
+                '{\n'
+                ' "ja_translation": "<string>",\n'
+                ' "notes": "<string>"\n'
+                '}'
+            )
+        else:
+            schema = (
+                '{\n'
+                ' "target_bbox": {"x1": <int>, "y1": <int>, "x2": <int>, "y2": <int>},\n'
+                ' "detected_language": "<string>",\n'
+                ' "ocr_text": "<string>",\n'
+                ' "ja_translation": "<string>",\n'
+                ' "confidence": <number between 0 and 1>,\n'
+                ' "notes": "<string>"\n'
+                '}'
+            )
+            if return_roi_fallback:
+                schema = schema[:-2] + ',\n "roi_fallback": {"ocr_text": "<string>", "ja_translation": "<string>"}\n}'

         # Schema is same for both

-        if return_roi_fallback:
+        if translation_only:
+            base += (
+                "\nFor this request, you only need to return the final Japanese translation. "
+                "Do not include OCR text, bounding boxes, detected language, or any extra fields. "
+                "Put the full translated text in ja_translation. "
+                "Preserve the original line breaks and keep any code-like lines verbatim.\n"
+            )
+        elif return_roi_fallback:
             # Roi fallback text also needs slight adjustment or is it generic?
             # "If roi_fallback is requested, it must contain the FULL OCR... of the entire ROI (Image B)"
             # In single image mode, there is only "The Image".
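
The schema selection and the extra-instruction suffix introduced in this hunk can be isolated into two small functions, sketched below. `select_schema` and `append_extra_instruction` are hypothetical names that do not exist in the PR. The schema strings are copied from the diff, and treating a whitespace-only instruction as absent is a deliberate variation on the shown code, which would append an empty instruction line in that case.

```python
from typing import Optional

TRANSLATION_ONLY_SCHEMA = (
    '{\n'
    ' "ja_translation": "<string>",\n'
    ' "notes": "<string>"\n'
    '}'
)

FULL_SCHEMA = (
    '{\n'
    ' "target_bbox": {"x1": <int>, "y1": <int>, "x2": <int>, "y2": <int>},\n'
    ' "detected_language": "<string>",\n'
    ' "ocr_text": "<string>",\n'
    ' "ja_translation": "<string>",\n'
    ' "confidence": <number between 0 and 1>,\n'
    ' "notes": "<string>"\n'
    '}'
)

ROI_FALLBACK_FIELD = (
    ',\n "roi_fallback": {"ocr_text": "<string>", "ja_translation": "<string>"}\n}'
)


def select_schema(translation_only: bool, return_roi_fallback: bool) -> str:
    """Pick the output schema; translation-only mode ignores the ROI fallback."""
    if translation_only:
        return TRANSLATION_ONLY_SCHEMA
    if return_roi_fallback:
        # FULL_SCHEMA ends with "\n}", so [:-2] drops the newline and the
        # closing brace before the extra field re-closes the object.
        return FULL_SCHEMA[:-2] + ROI_FALLBACK_FIELD
    return FULL_SCHEMA


def append_extra_instruction(base: str, extra_instruction: Optional[str]) -> str:
    """Append the user's instruction, skipping None and whitespace-only input."""
    cleaned = (extra_instruction or "").strip()
    if not cleaned:
        return base
    return base + (
        "\nAdditional user instruction:\n"
        f"{cleaned}\n"
        "Follow this instruction unless it conflicts with the required output format.\n"
    )
```

Because the `schema[:-2]` slice depends on the schema ending in exactly `"\n}"`, keeping that string as a named constant makes the dependency easier to spot if the schema layout changes.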