@aid-on/llm-throttle

Precise dual rate limiting for LLM APIs (RPM + TPM)

Overview

@aid-on/llm-throttle is a high-precision rate-limiting library built for LLM API calls. It enforces RPM (Requests Per Minute) and TPM (Tokens Per Minute) limits simultaneously, so you can use your quota efficiently without exceeding it.

Features

  • Dual Rate Limiting: Simultaneously manages both RPM and TPM
  • Token Bucket Algorithm: Smoothed rate limiting with burst handling
  • Real-time Adjustment: Reconciles estimates with actual token consumption after each call
  • Detailed Metrics: Usage visualization and efficiency tracking
  • Full TypeScript Support: Type-safe development experience
  • Zero Dependencies: Lightweight design with no external library dependencies

Installation

npm install @aid-on/llm-throttle

Basic Usage

import { LLMThrottle } from '@aid-on/llm-throttle';

// Configure rate limits
const limiter = new LLMThrottle({
  rpm: 60,     // 60 requests per minute
  tpm: 10000   // 10,000 tokens per minute
});

// Check before request
const requestId = 'unique-request-id';
const estimatedTokens = 1500;

if (limiter.consume(requestId, estimatedTokens)) {
  // Execute API call
  const response = await callLLMAPI();
  
  // Adjust with actual token usage
  const actualTokens = response.usage.total_tokens;
  limiter.adjustConsumption(requestId, actualTokens);
} else {
  console.log('Rate limit reached');
}

Advanced Usage

Burst Limit Configuration

const limiter = new LLMThrottle({
  rpm: 60,
  tpm: 10000,
  burstRPM: 120,    // Allow up to 120 requests in short bursts
  burstTPM: 20000   // Allow up to 20,000 tokens in short bursts
});
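
Because the limiter uses a token bucket, burst capacity lets an idle limiter absorb a short spike before throttling back to the steady rate. A minimal sketch of that behavior, assuming each bucket starts full at its burst capacity:

// With burstRPM: 120, a fresh limiter can admit a short spike of
// requests; sustained traffic is still capped at rpm: 60.
for (let i = 0; i < 150; i++) {
  if (!limiter.consume(`burst-${i}`, 100)) {
    console.log(`Request ${i} throttled; waiting for the bucket to refill`);
    break;
  }
}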

Error Handling

import { RateLimitError } from '@aid-on/llm-throttle';

try {
  limiter.consumeOrThrow(requestId, estimatedTokens);
  // API call processing
} catch (error) {
  if (error instanceof RateLimitError) {
    console.log(`Limit reason: ${error.reason}`);
    console.log(`Available in: ${error.availableIn}ms`);
  }
}

Getting Metrics

const metrics = limiter.getMetrics();

console.log('RPM usage:', metrics.rpm.percentage + '%');
console.log('TPM usage:', metrics.tpm.percentage + '%');
console.log('Average tokens/request:', metrics.consumptionHistory.averageTokensPerRequest);
console.log('Estimation accuracy:', metrics.efficiency);
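
Metrics can also drive backpressure. A small sketch (the 90% threshold is an arbitrary choice, not part of the library):

// Pause intake when either bucket is nearly exhausted.
function underPressure(limiter: LLMThrottle): boolean {
  const m = limiter.getMetrics();
  return m.rpm.percentage > 90 || m.tpm.percentage > 90;
}

if (underPressure(limiter)) {
  // Defer non-urgent work until usage drops.
}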

Pre-check

const check = limiter.canProcess(estimatedTokens);

if (check.allowed) {
  // Can process
  limiter.consume(requestId, estimatedTokens);
} else {
  console.log(`Limit reason: ${check.reason}`);
  console.log(`Available in: ${check.availableIn}ms`);
}
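
When a check fails, availableIn can drive a simple wait-and-retry loop. The acquire helper below is illustrative, not part of the library:

// Wait until the limiter has capacity, then reserve it.
async function acquire(requestId: string, tokens: number): Promise<void> {
  for (;;) {
    if (limiter.consume(requestId, tokens)) return;
    const check = limiter.canProcess(tokens);
    await new Promise((resolve) => setTimeout(resolve, check.availableIn ?? 100));
  }
}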

API Reference

LLMThrottle

Constructor

new LLMThrottle(config: DualRateLimitConfig)

Methods

  • canProcess(estimatedTokens: number): RateLimitCheckResult - Check whether a request could be admitted right now
  • consume(requestId: string, estimatedTokens: number, metadata?: Record<string, unknown>): boolean - Reserve capacity for a request; returns false if a limit would be exceeded
  • consumeOrThrow(requestId: string, estimatedTokens: number, metadata?: Record<string, unknown>): void - Like consume, but throws RateLimitError on failure
  • adjustConsumption(requestId: string, actualTokens: number): void - Reconcile a prior reservation with actual token consumption
  • getMetrics(): RateLimitMetrics - Get usage metrics
  • getConsumptionHistory(): ConsumptionRecord[] - Get consumption history
  • reset(): void - Reset limit state
  • setHistoryRetention(ms: number): void - Set history retention period
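
The housekeeping methods (setHistoryRetention, getConsumptionHistory, reset) compose naturally; a quick sketch:

// Keep one hour of history, inspect it, then clear all limiter state.
limiter.setHistoryRetention(60 * 60 * 1000);

const history = limiter.getConsumptionHistory();
console.log(`Tracked requests: ${history.length}`);

limiter.reset();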

Type Definitions

interface DualRateLimitConfig {
  rpm: number;
  tpm: number;
  burstRPM?: number;
  burstTPM?: number;
  clock?: () => number;
}

interface RateLimitCheckResult {
  allowed: boolean;
  reason?: 'rpm_limit' | 'tpm_limit';
  availableIn?: number;
  availableTokens?: {
    rpm: number;
    tpm: number;
  };
}

interface RateLimitMetrics {
  rpm: {
    used: number;
    available: number;
    limit: number;
    percentage: number;
  };
  tpm: {
    used: number;
    available: number;
    limit: number;
    percentage: number;
  };
  efficiency: number;
  consumptionHistory: {
    count: number;
    averageTokensPerRequest: number;
    totalTokens: number;
  };
}
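
The optional clock field makes time injectable, which is useful for deterministic tests. A sketch, assuming clock returns milliseconds like Date.now():

// Drive the limiter from a fake clock instead of real time.
let now = 0;
const limiter = new LLMThrottle({
  rpm: 60,
  tpm: 10000,
  clock: () => now,
});

limiter.consume('req-1', 5000);
now += 30_000; // Advance 30 seconds; the buckets should partially refill
console.log(limiter.canProcess(5000).allowed);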

Practical Examples

Integration with OpenAI API

import OpenAI from 'openai';
import { LLMThrottle } from '@aid-on/llm-throttle';

const openai = new OpenAI();
const limiter = new LLMThrottle({
  rpm: 500,    // Example OpenAI Tier 1 limits
  tpm: 10000
});

async function chatCompletion(messages: any[], requestId: string) {
  const estimatedTokens = estimateTokens(messages); // Custom estimation logic (sketched below)
  
  if (!limiter.consume(requestId, estimatedTokens)) {
    throw new Error('Rate limit reached');
  }
  
  try {
    const response = await openai.chat.completions.create({
      model: 'gpt-3.5-turbo',
      messages
    });
    
    // Adjust with actual usage
    const actualTokens = response.usage?.total_tokens ?? estimatedTokens;
    limiter.adjustConsumption(requestId, actualTokens);
    
    return response;
  } catch (error) {
    // Refund the reservation on error: no tokens were actually consumed
    limiter.adjustConsumption(requestId, 0);
    throw error;
  }
}
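
estimateTokens is left to you; a real tokenizer such as tiktoken is the better choice, but the common "about four characters per token" heuristic makes a serviceable placeholder:

// Rough estimate: ~4 characters per token for English text, plus a small
// per-message overhead. Prefer a real tokenizer in production.
function estimateTokens(messages: { content: string }[]): number {
  return messages.reduce(
    (sum, m) => sum + Math.ceil(m.content.length / 4) + 4,
    0
  );
}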

Multi-service Integration

import { LLMThrottle, RateLimitError } from '@aid-on/llm-throttle';

class APIManager {
  private limiters = new Map<string, LLMThrottle>();
  
  constructor() {
    // Service-specific limit configuration
    this.limiters.set('openai', new LLMThrottle({
      rpm: 500, tpm: 10000
    }));
    this.limiters.set('anthropic', new LLMThrottle({
      rpm: 1000, tpm: 20000
    }));
  }
  
  async callAPI(service: string, requestId: string, estimatedTokens: number) {
    const limiter = this.limiters.get(service);
    if (!limiter) throw new Error(`Unknown service: ${service}`);
    
    const check = limiter.canProcess(estimatedTokens);
    if (!check.allowed) {
      throw new RateLimitError(
        `Rate limit exceeded for ${service}: ${check.reason}`,
        check.reason!,
        check.availableIn!
      );
    }
    
    limiter.consume(requestId, estimatedTokens);
    // API call processing...
  }
}
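
Usage then routes each call through the matching limiter:

const manager = new APIManager();
await manager.callAPI('openai', 'req-42', 1200);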

Testing

npm test

License

MIT License
