
BotModels - AI Inference Service

Version: 1.0.0
Purpose: Multimodal AI inference service for General Bots


Overview

BotModels is a Python-based AI inference service that provides multimodal capabilities to the General Bots platform. It serves as a companion to botserver (Rust), specializing in cutting-edge AI/ML models from the Python ecosystem including image generation, video creation, speech synthesis, and vision/captioning.

While botserver handles business logic, networking, and systems-level operations, BotModels exists solely to leverage the extensive Python AI/ML ecosystem for inference tasks that are impractical to implement in Rust.

For comprehensive documentation, including detailed guides, API references, and tutorials, see docs.pragmatismo.com.br or the BotBook.


Features

  • Image Generation: Generate images from text prompts using Stable Diffusion
  • Video Generation: Create short videos from text descriptions using Zeroscope
  • Speech Synthesis: Text-to-speech using Coqui TTS
  • Speech Recognition: Audio transcription using OpenAI Whisper
  • Vision/Captioning: Image and video description using BLIP2

Quick Start

Installation

# Clone the repository, then enter it
cd botmodels

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
.\venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

Configuration

Copy the example environment file and configure:

cp .env.example .env

Edit .env with your settings:

HOST=0.0.0.0
PORT=8085
API_KEY=your-secret-key
DEVICE=cuda
IMAGE_MODEL_PATH=./models/stable-diffusion-v1-5
VIDEO_MODEL_PATH=./models/zeroscope-v2
VISION_MODEL_PATH=./models/blip2

Running the Server

# Development mode
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --reload

# Production mode
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --workers 4

# With HTTPS (production)
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --ssl-keyfile key.pem --ssl-certfile cert.pem

🐍 Philosophy & Scope

Why Python?

  • Rust vs. Python Rule:
    • If logic is deterministic, systems-level, or performance-critical: Do it in Rust (botserver)
    • If logic requires cutting-edge ML models, rapid experimentation with HuggingFace, or specific Python-only libraries: Do it here

Architecture Principles

  • Inference Only: This service should NOT hold business state. It accepts inputs, runs inference, and returns predictions.
  • Stateless: Runs as a sidecar to botserver; no session state survives between requests.
  • API First: Exposes strict HTTP/REST endpoints consumed by botserver.

🛠️ Technology Stack

  • Runtime: Python 3.10+
  • Web Framework: FastAPI (preferred over Flask for async/performance)
  • ML Frameworks: PyTorch, HuggingFace Transformers, Diffusers
  • Quality: ruff (linting), black (formatting), mypy (typing)

📑 API Endpoints

All endpoints require the X-API-Key header for authentication.

Image Generation

POST /api/image/generate
Content-Type: application/json
X-API-Key: your-api-key

{
  "prompt": "a cute cat playing with yarn",
  "steps": 30,
  "width": 512,
  "height": 512,
  "guidance_scale": 7.5,
  "seed": 42
}
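
As a sketch, the request above can be issued from Python with only the standard library. The endpoint path, header, and JSON fields are taken from this README; the `build_image_request` helper name is ours, not part of the service:

```python
import json
import urllib.request


def build_image_request(
    base_url: str, api_key: str, prompt: str, **params: object
) -> urllib.request.Request:
    """Build the POST shown above; extras like steps or seed merge into the JSON body."""
    body = json.dumps({"prompt": prompt, **params}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/image/generate",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


# Sending it (requires a running botmodels server):
# req = build_image_request("http://localhost:8085", "your-api-key",
#                           "a cute cat playing with yarn", steps=30, seed=42)
# with urllib.request.urlopen(req) as resp:
#     result = resp.read()
```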

Video Generation

POST /api/video/generate
Content-Type: application/json
X-API-Key: your-api-key

{
  "prompt": "a rocket launching into space",
  "num_frames": 24,
  "fps": 8,
  "steps": 50
}

Speech Generation (TTS)

POST /api/speech/generate
Content-Type: application/json
X-API-Key: your-api-key

{
  "prompt": "Hello, welcome to our service!",
  "voice": "default",
  "language": "en"
}

Speech to Text

POST /api/speech/totext
Content-Type: multipart/form-data
X-API-Key: your-api-key

file: <audio_file>
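
Upload endpoints like this one take multipart/form-data rather than JSON. A minimal sketch of assembling such a body with the standard library (the `build_multipart` helper is illustrative, not part of the service):

```python
import uuid


def build_multipart(
    field_name: str, filename: str, payload: bytes, content_type: str
) -> tuple[bytes, str]:
    """Assemble a single-file multipart/form-data body and its Content-Type header value."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


# body, ctype = build_multipart("file", "clip.wav", wav_bytes, "audio/wav")
# POST body to /api/speech/totext with headers:
#   {"Content-Type": ctype, "X-API-Key": "your-api-key"}
```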

Image Description

POST /api/vision/describe
Content-Type: multipart/form-data
X-API-Key: your-api-key

file: <image_file>
prompt: "What is in this image?" (optional)

Video Description

POST /api/vision/describe_video
Content-Type: multipart/form-data
X-API-Key: your-api-key

file: <video_file>
num_frames: 8 (optional)

Visual Question Answering

POST /api/vision/vqa
Content-Type: multipart/form-data
X-API-Key: your-api-key

file: <image_file>
question: "How many people are in this image?"

Health Check

GET /api/health

Interactive API documentation:

  • Swagger UI: http://localhost:8085/api/docs
  • ReDoc: http://localhost:8085/api/redoc

🔗 Integration with BotServer

Configuration (config.csv)

key,value
botmodels-enabled,true
botmodels-host,0.0.0.0
botmodels-port,8085
botmodels-api-key,your-secret-key
botmodels-https,false
image-generator-model,../../../../data/diffusion/sd_turbo_f16.gguf
image-generator-steps,4
image-generator-width,512
image-generator-height,512
video-generator-model,../../../../data/diffusion/zeroscope_v2_576w
video-generator-frames,24
video-generator-fps,8

BASIC Script Keywords

// Generate an image
file = IMAGE "a beautiful sunset over mountains"
SEND FILE TO user, file

// Generate a video
video = VIDEO "waves crashing on a beach"
SEND FILE TO user, video

// Generate speech
audio = AUDIO "Welcome to General Bots!"
SEND FILE TO user, audio

// Get image/video description
caption = SEE "/path/to/image.jpg"
TALK caption

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     HTTPS      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  botserver  β”‚ ────────────▢  β”‚  botmodels  β”‚
β”‚   (Rust)    β”‚                β”‚  (Python)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      β”‚                              β”‚
      β”‚ BASIC Keywords               β”‚ AI Models
      β”‚ - IMAGE                      β”‚ - Stable Diffusion
      β”‚ - VIDEO                      β”‚ - Zeroscope
      β”‚ - AUDIO                      β”‚ - TTS/Whisper
      β”‚ - SEE                        β”‚ - BLIP2
      β–Ό                              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   config    β”‚                β”‚   outputs   β”‚
β”‚   .csv      β”‚                β”‚  (files)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

⚡ Development Guidelines

Modern Model Usage

  • Deprecate Legacy: Move away from outdated libs (e.g., old allennlp) in favor of HuggingFace Transformers and Diffusers
  • Quantization: Always consider quantized models (bitsandbytes, GGUF) to reduce VRAM usage

Performance & Loading

  • Lazy Loading: Do NOT load 10GB models at module import time. Load them during the startup lifecycle or on first request, guarded by a lock
  • GPU Handling: Detect CUDA and MPS (Apple Silicon) robustly and fall back to CPU gracefully
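
A minimal sketch of both guidelines; the `LazyModel` and `pick_device` names are illustrative, not from the codebase:

```python
import threading
from typing import Any, Callable, Optional


def pick_device() -> str:
    """Prefer CUDA, then Apple MPS, then CPU; degrade gracefully if torch is absent."""
    try:
        import torch

        if torch.cuda.is_available():
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    except ImportError:
        pass
    return "cpu"


class LazyModel:
    """Defer an expensive model load until first use, not module import time."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Optional[Any] = None
        self._lock = threading.Lock()

    def get(self) -> Any:
        if self._model is None:      # fast path: already loaded
            with self._lock:         # slow path: only one thread runs the loader
                if self._model is None:
                    self._model = self._loader()
        return self._model
```

The double-checked lock ensures concurrent first requests trigger exactly one load; a real service would pass something like `lambda: StableDiffusionPipeline.from_pretrained(...)` as the loader.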

Code Quality

  • Type Hints: All functions MUST have type hints
  • Error Handling: No bare except:. Catch precise exceptions and return structured JSON errors to botserver
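
One shape this can take, sketched without the FastAPI plumbing: a helper that maps precise exception types to a status code plus a JSON-serializable body. The error names and status codes here are illustrative, not a documented contract:

```python
from typing import Dict, Tuple


def error_payload(exc: Exception) -> Tuple[int, Dict[str, str]]:
    """Map precise exception types to an HTTP status and a structured JSON error body."""
    if isinstance(exc, ValueError):
        return 422, {"error": "invalid_input", "detail": str(exc)}
    if isinstance(exc, FileNotFoundError):
        return 404, {"error": "model_not_found", "detail": str(exc)}
    if isinstance(exc, MemoryError):
        return 507, {"error": "out_of_memory", "detail": "inference exceeded available memory"}
    # Anything else is an unexpected inference failure; report it, never swallow it.
    return 500, {"error": "inference_failed", "detail": str(exc)}
```

In practice this would back a FastAPI exception handler so botserver always receives a machine-readable error instead of a stack trace.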

Project Structure

botmodels/
├── src/
│   ├── api/
│   │   ├── v1/
│   │   │   └── endpoints/
│   │   │       ├── image.py
│   │   │       ├── video.py
│   │   │       ├── speech.py
│   │   │       └── vision.py
│   │   └── dependencies.py
│   ├── core/
│   │   ├── config.py
│   │   └── logging.py
│   ├── schemas/
│   │   └── generation.py
│   ├── services/
│   │   ├── image_service.py
│   │   ├── video_service.py
│   │   ├── speech_service.py
│   │   └── vision_service.py
│   └── main.py
├── outputs/
├── models/
├── tests/
├── requirements.txt
└── README.md

🧪 Testing

pytest tests/

🔒 Security

  1. Always use HTTPS in production
  2. Use strong, unique API keys
  3. Restrict network access to the service
  4. Consider running on a separate GPU server
  5. Monitor resource usage and set appropriate limits

📚 Documentation

For complete documentation, guides, and API references, see docs.pragmatismo.com.br or the BotBook.


📦 Requirements

  • Python 3.10+
  • CUDA-capable GPU (recommended, 8GB+ VRAM)
  • 16GB+ RAM


🔑 Remember

  • Inference Only: No business state, just predictions
  • Modern Models: Use HuggingFace Transformers, Diffusers
  • Type Safety: All functions must have type hints
  • Lazy Loading: Don't load models at import time
  • GPU Detection: Graceful fallback to CPU
  • Version 1.0.0 - Do not change without approval
  • GIT WORKFLOW - ALWAYS push to ALL repositories (github, pragmatismo)

📄 License

See LICENSE file for details.