Skip to content

Getting Started

DeVenLucaz edited this page Jun 19, 2026 · 4 revisions

This page walks through exactly what happens from your first llamdrop command to your first AI conversation.


Step 1 — First Launch

Type llamdrop and press Enter. On your very first launch, llamdrop runs a Dynamic Backend Probing micro-benchmark and shows a Device Profile card. This is a one-time welcome screen that tells you exactly what it detected about your hardware:

  • RAM total and available
  • CPU chip name (80+ Android SoC names translated, e.g. MT6853V → Dimensity 720)
  • CPU core layout (big.LITTLE aware — only fast cores are used)
  • CPU flags (AVX2, AVX512, NEON)
  • GPU vendor and whether GPU acceleration is active
  • Storage free space
  • OS and architecture
  • Your device tier (Micro / Low / Low-Mid / Mid / High)
  • Which backend will be used (llama.cpp CPU, Vulkan, CUDA, ROCm, or Ollama)
  • Recommended starting models for your hardware

This screen runs once and only once. Hardware detection at launch takes about 2 seconds, runs exactly once, and the result is used everywhere.


Step 2 — Main Menu

Navigate with arrow keys, select with Enter.

Option What it does
🚀 Start chatting Launch a downloaded model
⬇️ Browse & download Verified model catalog — filtered to your device tier and RAM
🔎 Search HuggingFace Search any GGUF model live
📂 My downloaded models View models you have + scan phone for existing GGUFs
💾 Resume saved session Continue a previous conversation
🔧 Device info Your full hardware profile, tier, and backend decision
🩺 Doctor Health check — diagnose and auto-fix install issues
⚙️ Config View your settings and their source
🆙 Update llamdrop Pull latest version from GitHub
⚙️ Update Engine Update the underlying llama.cpp binary separately
🌐 Language / भाषा Change display language (English, Hindi, Spanish, Portuguese, Arabic)
❓ Help Quick reference

The menu header shows live RAM (colour-coded green/yellow/red), battery level, llama.cpp status, GPU status, and any available updates.


Step 3 — Browse and Pick a Model

Select Browse & download. You'll see a list filtered to models that fit your device tier and current RAM. Models that are too small to be useful on your hardware, or too large to ever run, are hidden. Each entry shows:

  • Model name, parameter count, quantization level
  • Download size and RAM required
  • Fit status: Great fit / Good fit / Tight on RAM
  • ⚡ t/s benchmark score (shown after your first chat with that model, rolling average of last 5 runs)

Filters:

  • Press [C] to filter by category: chat, coding, reasoning, multilingual, etc.
  • Press [P] to filter by Provider (e.g. Meta, Qwen, Microsoft, Google).
  • Press [U] to toggle hiding/showing Unsupported Models (power users only).

Recommended starting points:

  • Under 2GB available RAM: Qwen2.5 0.5B or SmolLM2 1.7B
  • 2–4GB available RAM: Qwen2.5 1.5B Q4_K_M — good quality, stable
  • 4–8GB available RAM: Mistral 7B or Llama 3.1 8B
  • 8GB+ available RAM: Gemma 3 12B, Qwen3 8B, or better

Step 4 — Download

Press Enter on a model. llamdrop:

  1. Re-checks your live RAM right now and picks the best quantization for what actually fits (Q5_K_M → Q4_K_M → IQ3_M → Q2_K, depending on available RAM)
  2. Shows download size and reason for the choice
  3. Downloads with a progress bar — auto-resumes if connection drops
  4. Verifies the file via SHA-256 checksum after download to catch corruption
  5. If you cancel mid-download (Ctrl+C), the partial file is deleted immediately — it will never show up as a valid model

Step 5 — Chat

After download, press Enter to launch. You'll see the chat header with model name, RAM free, and context size.

You: hello

🦙  Hello! How can I help you today?

🟢 RAM: 2.4GB free
You: 

Chat commands

Command Action
/save Save conversation to resume later
/export Export chat as markdown to Downloads folder
/trim Manually trim old context to free RAM
/clear Clear conversation history
/ram Show current RAM
/quit Exit chat

Automatic & Advanced features

  • Context trimming — auto-trims when RAM gets low. Always preserves your first exchange — only the middle of long conversations gets trimmed, never your opening task or persona.
  • RAM monitor — background thread watches RAM during inference, triggers trim if critical
  • Battery tracking — shows % drop per response on Android with charge-level icons (🪫 🔴 🟡 🔋)
  • Session auto-save — saves every 5 exchanges (10 messages) automatically
  • File context — attach a file (TXT, MD, PDF, CSV, JSON) to your conversation before chatting to chat with your document.
  • Doctor Auto-healingllamdrop doctor now includes an auto-fix [F] key to automatically repair missing binaries, broken configs, and corrupted directories.

Ollama backend (Linux)

If you have Ollama running, llamdrop auto-detects it and routes inference through Ollama's HTTP API. An Ollama Chat option appears in the menu.


Step 6 — Resume Sessions

Go to Resume saved session. Select a session number to resume, or type D2 to delete session 2. The model loads with full conversation history as context.


Next Steps

🦙 LLAMdrop Wiki

📂 Resource Center

🆘 Support & Plans


Tip: Running on budget hardware? Check the Model Catalog for Tier 1 models.

Clone this wiki locally