Chinese LLMs have reached the top tier in capabilities, yet in real production environments they face a lack of infrastructure. An unavoidable dilemma is:
When using OpenAI SDK/LiteLLM with vendor-provided compatible interfaces, unsupported native parameters are silently ignored, leading to uncontrollable results and missing features; using vendor proprietary SDKs requires extra field parsing and structure transformation. When workflows involve multiple models from different vendors, different code adaptations are needed for each model, resulting in increased engineering workload and maintenance costs.
CNLLM provides a unified OpenAI-compatible interface layer and a set of standardized parameter rules and response format specifications. CNLLM achieves bidirectional mapping of requests and responses through standardized YAML configuration files tailored for each vendor, mapping CNLLM standard parameters to vendor-accepted parameter names, passing through other native parameters, and finally automatically encapsulating heterogeneous model responses into OpenAI standard responses.
This implementation path uniformly defines CNLLM standard parameters, aligns with OpenAI standard response structures, preserves the complete capabilities of Chinese LLMs, and ensures scalability for integrating more vendors. Compared to OpenAI SDK and vendor proprietary SDKs, CNLLM also implements systematic enhancements for key field parsing, frontend streaming rendering, and engineering batch processing scenarios.
Through CNLLM, developers can seamlessly use Chinese LLMs in the OpenAI ecosystem — LangChain, LlamaIndex, AutoGen, Haystack, DeepEval and other mainstream large language model application frameworks. Especially in development and application scenarios requiring multi-model collaboration, using CNLLM can significantly reduce adaptation, parsing, feature implementation, and maintenance workload, and effectively lower token consumption in AI agent development.
- Unified Interface - One set of interfaces and parameters to call different Chinese LLMs, returns OpenAI API standard format
- Mainstream Framework Integration - Built-in integration with LangChain Runnable, and adaptable with other OpenAI-compatible frameworks.
- Parameter Observability - Validation and explicit feedback for all parameters, especially vendor-specific ones.
- Streaming Enrichment - Streaming lifecycle inspection via repr, and properties for automated accumulation.
- Batch Capability Enhancement - Independent configuration for single requests in batch tasks, real-time statistics, and cofigurable failure policy and memory management.
Welcome developers to participate in CNLLM's development. Please submit an Issue to discuss your solution before creating a Pull Request.
Or contact us at: wangkancheng1122@163.com
| Direction | Description |
|---|---|
| 🌐 New Vendor Adapters | Integrate more Chinese LLMs (Alibaba Qwen, Baidu Wenxin, Tencent Hunyuan, etc.) |
| 🔗 Framework Integration | Deepen integration with LlamaIndex, LiteLLM, and other frameworks |
| 🐛 Capability Expansion | Adapter framework development for multimodal capabilities |
| 📖 Documentation | Add use cases and improve development guides |
| 💡 Feature Suggestions | Share your ideas and requirements |
Project Documentation:
- ✨ New Vendors
- Qwen: qwen3.6/qwen3.5/qwen-plus/qwen-turbo/qwen-max and 13 models total + Embedding models
- Baidu: ernie-5.1/ernie-4.5/ernie-speed/ernie-lite/ernie-x1 and 11 models total + Embeddings models
- Hunyuan: hy3-preview/hunyuan-2.0-thinking/hunyuan-2.0-instruct
- ✨ LangChain Integration
LangChainRunnable(BaseChatModel)adds support forbind_tools()/with_structured_output()methods- New
LangChainEmbeddings: adaptslangchain_core.embeddings.Embeddings, supportsembed_documents()/embed_query()
- ✨ MiniMax Dual Interface Adaptation
- Added
MiniMaxNativeAdapter: CNLLM now adapts MiniMax models with both native and OpenAI-compatible dual interfaces - Native interface supports
stream_options,group_idvendor-specific pass-through parameters, and CNLLM returns OpenAI standard responses - In OpenAI-compatible interface, the
.thinkproperty extracts and accumulates model's thinking content in real-time
- Added
- 🔧 Framework Use Case Tests
- Added
tests/key_needed/frameworkdirectory, containing test cases for CNLLM integration with langchain, llamaindex, autogen, haystack, deepeval frameworks in production scenarios
- Added
- 🔧 Refactoring
- Removed
StreamChunks, merged intoStreamAccumulator - Removed seamless async support (
_SyncProxyand 5 other classes), now async clients must use async syntax StreamAccumulator._accumulate()caching,from_chunks()class method etc.
- Removed
- ✨
keepparameter — Storage Controlbatch()addskeepparameter to control persistent storage of batch response fields- All fields in batch responses can be accessed in real-time during iteration, with results updated and accumulated in real-time; after iteration, accessing fields not specified in
keepreturns empty container + warning - Default strategy (when
keepis not configured):chat.batch()responses default to keeping key fieldsstill/think/toolsand batch metadata, releasing other redundant fieldsembeddings.batch()responses default to keeping key fieldvectorsand batch metadata, releasing other redundant fields
- ✨
drop_paramsparameter — Unknown Parameter Handling Strategycreate()andbatch()adddrop_paramsparameter, supporting three-tier parameter handling strategies:drop_params="warn": warns that parameters are not taking effect, ignores and continues, default strategydrop_params="ignore": silently ignores unknown parameters and continues executiondrop_params="strict": throws exception, terminates request execution
- ✨
usagefield — Usage Statisticsbatch()response now includesusagefield, storing full Token consumption statistics for batch processing, accessed via.usage
- ✨ batch embeddings response format
embeddings.batch()response now includesvectorsfield, storing embedding vectors returned from batch requests, accessed via.vectorsembeddings.batch()response now includesbatch_infofield, storing batch metadata likebatch_size, accessed via.batch_info
- DeepSeek
deepseek-chat,deepseek-reasoner,deepseek-v4-pro,deepseek-v4-flash
- KIMI (Moonshot AI)
kimi-k2.6,kimi-k2.5,moonshot-v1-128k(moonshot-v1),moonshot-v1-8k,moonshot-v1-32k,moonshot-v1-vision-preview
- Doubao
doubao-seed-2-0-pro-260215(doubao-seed-2-0-pro),doubao-seed-2-0-mini-260215(doubao-seed-2-0-mini),doubao-seed-2-0-lite-260215(doubao-seed-2-0-lite),doubao-seed-2-0-code-preview-260215(doubao-seed-2-0-code),doubao-seed-1-8-251228(doubao-seed-1-8),doubao-seed-1-6-251015(doubao-seed-1-6),doubao-seed-1-6-flash-250828(doubao-seed-1-6-flash),doubao-seed-1-6-vision-250815(doubao-seed-1-6-vision),doubao-1-5-vision-pro-32k-250115(doubao-1-5-vision-pro),doubao-seed-1-5-lite-32k-250115(doubao-seed-1-5-lite),doubao-seed-1-5-pro-32k-250115(doubao-seed-1-5-pro-32k),doubao-seed-1-5-pro-256k-250115(doubao-seed-1-5-pro)
- GLM
glm-4.6,glm-4.7,glm-4.7-flash,glm-4.7-flashx,glm-5,glm-5-turbo,glm-5.1,glm-4.5,glm-4.5-x,glm-4.5-air,glm-4.5-airx,glm-4.5-flash,glm-5v-turbo,glm-4.5v,glm-4.6v,glm-4.6v-flash
- Xiaomi mimo
mimo-v2-pro,mimo-v2-omni,mimo-v2-flash,mimo-v2.5-pro,mimo-v2.5
- MiniMax
MiniMax-M2,MiniMax-M2.1,MiniMax-M2.5,MiniMax-M2.5-highspeed,MiniMax-M2.7,MiniMax-M2.7-highspeed
- Qwen
qwen3.6-max-preview,qwen3.6-plus,qwen3.6-flash,qwen3.5-plus,qwen3.5-flash,qwen3.5-397b-a17b,qwen3.5-122b-a10b,qwen3.5-27b,qwen3.5-35b-a3b
- Baidu
ernie-5.1,ernie-5.0,ernie-5.0-thinking-perview,ernie-4.5-8k-preview,ernie-4.5-turbo-128k(ernie-4.5-turbo),ernie-4.5-turbo-32k,ernie-4.5-turbo-vl,ernie-4.5-turbo-vl-32k,ernie-4.5-0.3b,ernie-speed-pro-128k(ernie-speed-pro),ernie-lite-pro-128k(ernie-lite-pro),ernie-x1.1,ernie-x1-turbo-32k(ernie-x1-turbo)
- Hunyuan
hy3-preview,hunyuan-2.0-thinking-20251109(hunyuan-2.0-thinking),hunyuan-2.0-instruct-20251111(hunyuan-2.0-instruct)
- GLM:
embedding-2,embedding-3,embedding-3-pro - Qwen:
text-embedding-v4,text-embedding-v3,text-embedding-v2,text-embedding-v1 - Baidu:
embedding-v1,bge-large-zh,bge-large-en
pip install cnllmOne-Click Install:
npx skills add https://github.com/kanchengw/cnllmOr manually copy the SKILL.md file from the project root to your agent's skill directory. When calling Chinese LLMs, CNLLM will be used as the preferred option.
from cnllm import CNLLM
client = CNLLM(model="minimax-m2.7", api_key="your_api_key")
resp = client.chat.create(...)Async clients need to be called via await, and streaming responses are iterated via async for:
from cnllm import asyncCNLLM
import asyncio
async def main():
client = asyncCNLLM(
model="minimax-m2.7", api_key="your_api_key")
resp = await client.chat.create(...)
print(resp)
asyncio.run(main())Two context management modes are supported:
- Persistent Session maintains session state across multiple calls, suitable for applications that need to maintain context
- Temporary Session is single-use, does not maintain session state, auto-closes
Persistent Session:
client = CNLLM(
model="minimax-m2.7", api_key="your_api_key")
resp = client.chat.create(...)
client.close() # Manual close; async client uses client.aclose()Temporary Session:
with CNLLM(
model="deepseek-chat", api_key="your_api_key") as client:
resp = client.chat.create(...) # Auto-closes sessionAll methods support both sync and async clients:
| Type | Scenario | Method | Return Type |
|---|---|---|---|
| chat completions | Non-streaming single | chat.create() |
Dict |
| Streaming single | chat.create(stream=True) |
Iterator[Dict] |
|
| Non-streaming batch | chat.batch() |
BatchResponse |
|
| Streaming batch | chat.batch(stream=True) |
Iterator[Dict] |
|
| Mixed streaming batch | chat.batch(requests=[{"stream": True}, {"stream": False}]) |
BatchResponse |
|
| embeddings | Embeddings single | embeddings.create() |
Dict |
| Embeddings batch | embeddings.batch() |
EmbeddingResponse |
Three calling methods are supported, with the simplest being one line of code, one parameter:
Simplified Call:
Does not support any parameters other than strings (streaming can be configured at client level with stream=True parameter).
resp = client("Introduce yourself in one sentence")Standard Call:
resp = client.chat.create(prompt="Introduce yourself in one sentence", stream=True)Full Call:
resp = client.chat.create(
messages=[
{"role": "user", "content": "Introduce yourself in one sentence"},
{"role": "assistant", "content": "I am an intelligent assistant"},
{"role": "user", "content": "Hello"},
]
)resp = client.chat.create(
messages=[{"role": "user", "content": "Introduce yourself in one sentence"}],
)resp = client.chat.create(
prompt="Introduce yourself in one sentence",
stream=True
)
for chunk in resp:
print(resp.still) # Real-time accumulated model response text
print(resp.raw) # Complete accumulated model native responseIn streaming calls, access via for loop with real-time accumulation for responses or the following key fields; non-streaming calls do not support for iteration, and access returns complete field content:
| Response Field | Access Method | Return Format | Example |
|---|---|---|---|
| resp: standard response | resp |
Dict/List[Dict] |
{non-streaming standard response}/[streaming chunks list] |
think: reasoning_content |
resp.think |
str |
"reasoning content..." |
still: content |
resp.still |
str |
"response content..." |
tools: tool_calls |
resp.tools |
Dict[int, Dict] |
{0: {"id": "...", "function": {...}}, 1: {...} |
| raw: model native response | resp.raw |
Dict |
{"id": "...", "choices": [...], ...} |
repr(): During streaming, displays real-time merged chunks and accumulated field results, not the real-time streaming chunks list; does not change the streaming response object type, which is an iterator containing all standard streaming chunks.
for chunk in resp:
print(resp)
# {'id': '...', 'object': '...', 'created': '...', 'model': '...', 'choices': [{'delta': {'content': 'real-time accumulated model response', 'reasoning_content': 'real-time accumulated reasoning process'}, 'finish_reason': 'None'}]}You can use prompt and messages parameters for quick global configuration, or use requests parameter for independent configuration of individual requests.
prompt parameter:
resp = client.chat.batch(
prompt=["Hello", "How's the weather today", "Who are you"],
stream=True
)messages parameter:
resp = client.chat.batch(
messages=[
[{"role": "user", "content": "How's the weather in Beijing?"},
{"role": "assistant", "content": "It's sunny in Beijing"},
{"role": "user", "content": "What about Shanghai?"}],
[{"role": "user", "content": "How's the weather in Shanghai?"}],
],
tools=[get_weather]
)requests parameter:
Configure independent strategy for individual requests within batch, global parameters are inherited when not configured per-request, supports using requests.messages parameter to manage context.
resp = client.chat.batch(
requests=[
{"prompt": "How's the weather in Beijing?", "tools": [get_weather], "stream": True}, # Inherits thinking parameter from global config
{"prompt": "What is 1+1?", "tools": [calc], "thinking": False}, # Does not inherit any global parameters
{"prompt": "How's the weather in Guangzhou?", "model": "deepseek-chat", "api_key": "key"} # Inherits tools and thinking parameters from global config
],
# Global parameters (used when per-request not configured):
tools=[default_tool],
thinking=True,
max_concurrent=2 # Max concurrent: batch-level parameter, not inherited by individual requests
)BatchResponse outer structure, where each response under results[request_id] is in OpenAI standard streaming/non-streaming response structure:
{
"status": {"elapsed": "3.42s", "success_count": 2, "fail_count": 1, "total": 3}, # Statistics
"usage": {"prompt_tokens": 5, "total_tokens": 5}, # Batch processing total usage info
"errors": {"request_2": "error message"}, # Mapping of all failed requests' request_id and error messages
"results": { # Mapping of all successful requests' request_id and standard responses
"request_0": {...},
"request_1": {...}
},
"think": {"request_0": "...", "request_1": "..."},
"still": {"request_0": "...", "request_1": "..."},
"tools": {"request_0": [...], "request_1": [...]},
"raw": {"request_0": {...}, "request_1": {...}}
}Supports iterative access to response results, metadata, and key field contents within for loop, with content real-time accumulation and updates:
- In batch streaming calls, updates build chunk by chunk; in batch non-streaming calls and batch calls with mixed streaming strategies (see
requestsparameter), updates build request by request. - In batch non-streaming calls and batch calls with mixed streaming strategies, if real-time access to batch response fields is not needed, you can access complete results directly, skipping the
forloop. - Supports access by
request_idor by integer index.
Access methods:
resp = client.chat.batch(
prompt=["Hello", "How's the weather today", "Who are you"]
)
for r in resp:
print(resp.status) # Real-time statistics, request by request real-time update
print(resp.still) # Response content for all requests in batch task
# Or access via client.chat.batch_result:
for r in client.chat.batch(
prompt=["Hello", "How's the weather today", "Who are you"], stream=True
):
print(client.chat.batch_result.results) # OpenAI standard streaming responses for all requests in batch task, chunk by chunk real-time accumulation
print(client.chat.batch_result.think["request_0"]) # Reasoning content for first request in batch task, or use .think[0] integer index accessAccess fields:
| Category | Field Description | Access Method | Return Format | Example |
|---|---|---|---|---|
| Metadata | Real-time statistics | resp.status / batch_result.status |
Dict |
{"success_count": 2, "fail_count": 0, "total": 2, "elapsed": "3.42s"} |
| Real-time Token usage | resp.usage / batch_result.usage |
Dict[str, int] |
{"prompt_tokens": 50, "completion_tokens": 100, "total_tokens": 150} |
|
| errors | Error information for failed requests | resp.errors / batch_result.errors |
Dict[str, str] |
{"request_0": "error message","request_1": "error message"} |
| Error information for single request | resp.errors[0] / batch_result.errors[0] |
str |
"error message" |
|
| results | Standard response for successful requests | resp.results / batch_result.results |
Dict[str, Dict] |
{"request_0": {...}, "request_1": {...}} |
| Standard response for each request | resp.results[0] / batch_result.results[0] |
Dict |
{"id": "...", "choices": [...], ...} |
|
| think | Reasoning process content | resp.think / batch_result.think |
Dict[str, str] |
{"request_0": "...", "request_1": "..."} |
| Reasoning content for single request | resp.think[0] / batch_result.think[0] |
str |
"reasoning content..." |
|
| still | Response content | resp.still / batch_result.still |
Dict[str, str] |
{"request_0": "...", "request_1": "..."} |
| Response content for single request | resp.still[0] / batch_result.still[0] |
str |
"response content..." |
|
| tools | Tool calls | resp.tools / batch_result.tools |
Dict[str, Dict[int, Dict]] |
{"request_0": {...}, "request_1": {...}} |
| Tool calls for single request | resp.tools[0] |
Dict[int, Dict] |
{0: {"id": "...", "function": {...}}, 1: {...} |
|
| raw | Model native response | resp.raw / batch_result.raw |
Dict[str, Dict] |
{"request_0": {...}, "request_1": {...}} |
| Model native response for single request | resp.raw[0] / batch_result.raw[0] |
Dict |
{"id": "...", "choices": [...], ...} |
repr(): Displays batch processing metadata fields or response content:
print(resp)
# BatchResponse(status={...}, usage={...})
print(resp.results)prompt parameter:
resp = client.embeddings.batch(
input=["Hello", "World", "你好"],
)
print(resp.vectors) # Embedding vectors for all requests
print(resp.status) # Statistics
print(resp.usage) # Token usage statisticscustom_ids parameter:
resp = client.embeddings.batch(
input=["Text 1", "Text 2", "Text 3"],
custom_ids=["doc_001", "doc_002", "doc_003"]
)
resp.results["doc_001"] # Get response for doc_001
resp.vectors["doc_002"] # Get embedding vector for doc_002to_dict(): Converts response to dictionary:
resp.to_dict() # Default: keeps vectors field + metadata (status/usage/batch_info)
resp.to_dict(results=True) # Keeps results field + metadata (status/usage/batch_info)Batch calls support retry strategy, concurrency control parameter configuration:
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_size |
int |
Dynamic | Batch size, only supported for Embeddings calls |
max_concurrent |
int |
12/3 |
Max concurrent, Embeddings default 12, Chat completions default 3 |
rps |
float |
10/2 |
Requests per second, Embeddings default 10, Chat completions default 2 |
timeout |
int |
30 | Per-request timeout (seconds) |
max_retries |
int |
3 | Max retry times |
retry_delay |
float |
1.0 | Retry delay (seconds) |
batch_size: Only supported for batch Embeddings calls, defaults to adaptive calculation based on request count, manual configuration not recommended.
Both batch chat completions/Embeddings calls support progress callbacks, custom request IDs, stop on error, field storage control, unknown parameter handling strategy.
Use custom_ids parameter to specify custom IDs for batch requests, which will replace the original request_id in batch responses.
resp = client.embeddings.batch(
input=["Text 1", "Text 2", "Text 3"],
custom_ids=["doc_001", "doc_002", "doc_003"]
)
resp.results["doc_001"] # Get response for doc_001
resp.vectors["doc_002"] # Get embedding vector for doc_002Callbacks are invoked when each request completes, which can be used for:
- Real-time display of processing progress
- Recording completed tasks
- Dynamically adjusting subsequent tasks
- ...
def on_complete(request_id, status): # Callback function example, supports customization
print(f"[{request_id}] {status}")
resp = client.chat.batch(
requests,
callbacks=[on_complete]
)When a batch request encounters the first error, it immediately throws an exception and interrupts subsequent tasks. If there are successful requests in the batch, it also returns a batch object containing already processed request results, which can be accessed normally:
resp = client.embeddings.batch(
input=requests,
stop_on_error=True
)
# Error message: {request_id} request failed, reason: {error}
# If there are successful requests in the batch, you can access the batch object normally:
resp.status
resp.vectorsBatch calls (Chat / Embeddings) can access all fields within the for loop. After iteration ends, some redundant fields are automatically released to save memory.
The keep parameter specifies which fields need to be retained after iteration:
Default behavior (when keep parameter is not specified):
| Call Type | Default Retention | Auto-released after Iteration |
|---|---|---|
client.chat.batch() |
still/think/tools and metadata |
results/errors/raw |
client.embeddings.batch() |
vectors and metadata |
results/errors |
Notes:
- When
keep=[], all fields are released after iteration, only metadata is retained; whenkeep=["*"], all fields are retained after iteration. - In
chat.batch(), metadata fields includestatus/usage; inembeddings.batch(), metadata fields includestatus/usage/batch_info.
Usage:
resp = client.embeddings.batch(
input=["Text 1", "Text 2", "Text 3"],
keep=["vectors"] # Only retain vectors field after iteration
)
for _ in resp:
print(resp.results) # Any field can be accessed during iteration, request by request real-time accumulation
resp.vectors["request_0"] # Accessible after iteration
resp.results["request_0"] # Not accessible after iteration, returns warningCan also set global default at client initialization:
client = CNLLM(..., keep=["vectors"])Use drop_params to control the handling behavior of incompatible parameters and other unknown parameters held by the client during actual calls. The default strategy is warn mode.
| Strategy | Configuration | Behavior |
|---|---|---|
| Warning mode (default) | drop_params="warn" |
Prints warning log, parameter is discarded, request continues |
| Strict mode | drop_params="strict" |
Throws TypeError, request terminated |
| Silent ignore mode | drop_params="ignore" |
Silently discards unknown parameters, no logs generated |
Notes:
-
When doing batch calls, if global parameters contain unknown parameters,
drop_params="strict"directly throws an exception without actually starting the batch task; If a single request within the batch task contains unknown parameters,drop_params="strict"directly puts that request into theerrorsfield without actually executing that request, and continues executing subsequent batch tasks. -
Specifically, when configured with
drop_params="strict"andstop_on_error=True, the first error encountered in batch requests immediately interrupts the batch task while returning already processed request results. See Stop on Error. -
The
drop_paramsparameter supports client configuration and all calling methods (includingcreatesingle-call method).
CNLLM's streaming, non-streaming, and Embeddings response formats for single requests are fully aligned with OpenAI standard structure.
{
"id": "chatcmpl-xxx",
"object": "chat.completion",
"created": 1234567890,
"model": "minimax-m2.7",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello, I am MiniMax-M2.7...",
"reasoning_content": "reasoning process content..." # Model reasoning process, if any
"tool_calls": [{ # Tool calls, if any
"id": "call_xxx",
"type": "function",
"function": {"name": "get_weather", "arguments": "{\"location\":\"Beijing\"}"}
}]
},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 10,
"completion_tokens": 20,
"total_tokens": 30,
"prompt_tokens_details": {
"cached_tokens": 0
},
"completion_tokens_details": {
"reasoning_tokens": 0
}
}
}{'id': 'chatcmpl-xxx', 'object': 'chat.completion.chunk', 'created': 1234567890, 'model': 'minimax-m2.7', 'choices': [{'index': 0, 'delta': {'role': 'assistant'}, 'finish_reason': None}]}
# reasoning_content chunks (model reasoning process, if any):
{'id': 'chatcmpl-xxx', 'object': 'chat.completion.chunk', 'created': 1234567890, 'model': 'minimax-m2.7', 'choices': [{'index': 0, 'delta': {'reasoning_content': 'reasoning..'}, 'finish_reason': None}]}
# tool_calls chunks (tool calls, if any):
{'id': 'chatcmpl-xxx', 'object': 'chat.completion.chunk', 'created': 1234567890, 'model': 'minimax-m2.7', 'choices': [{'index': 0, 'delta': {'tool_calls': [{'index': 0, 'id': 'call_xxx', 'type': 'function', 'function': {'name': 'get_weather', 'arguments': '...'}}]}, 'finish_reason': None}]}
{'id': 'chatcmpl-xxx', 'object': 'chat.completion.chunk', 'created': 1234567890, 'model': 'minimax-m2.7', 'choices': [{'index': 0, 'delta': {'content': 'Hello...'}, 'finish_reason': None}]}
# ... chunks
{'id': 'chatcmpl-xxx', 'object': 'chat.completion.chunk', 'created': 1234567890, 'model': 'minimax-m2.7', 'choices': [{'index': 0, 'delta': {}, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 10, 'completion_tokens': 20, 'total_tokens': 30}}{
"object": "list",
"data": [{
"object": "embedding",
"embedding": [0.1, 0.2, ...],
"index": 0
}],
"model": "embedding-2",
"usage": {
"prompt_tokens": 5,
"total_tokens": 5
}
}Except for parameters specially noted below, other parameters can be configured at both client initialization and call entry. Call entry configuration will override client initialization configuration.
CNLLM request parameters are basically consistent with OpenAI standard parameters, with slight extensions based on domestic vendor situations. For uncovered parameters, vendor naming is used and passed through.
Note: Not all supported models support all request parameters. Please refer to vendor official documentation for confirmation, or configure drop_params="ignore" to ignore unsupported parameters.
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str |
- | Model name, required at client initialization, can be overridden at call entry |
api_key |
str |
- | API key |
base_url |
str |
Auto-adapted | Customizable API address |
messages |
list[dict]/list[list[dict]] |
- | chat() input parameter, supports context management/image recognition (call entry configuration only) |
prompt |
str/list[str] |
- | chat() input parameter (call entry configuration only) |
requests |
list[dict] |
- | chat.batch() input parameter, supports per-request independent configuration (call entry configuration only) |
input |
str/list[str] |
- | embeddings() input parameter (call entry configuration only) |
stream |
bool |
False |
Streaming response |
thinking ¹ |
bool/dict |
Determined by model endpoint, most default to False |
Thinking mode, supports True/False, some models support "auto" |
tools |
list |
- | Tool/function definition list |
¹ thinking mapping:
- GLM, DeepSeek, Baidu, Hunyuan, Xiaomi, Kimi:
True→{"type": "enabled"},False→{"type": "disabled"} - Doubao:
True→"enabled",False→"disabled","auto"→"auto" - Qwen:
True→enable_thinking: true,False→enable_thinking: false
| Parameter | Type | Default | Description |
|---|---|---|---|
temperature |
float |
Determined by model endpoint | Generation randomness |
max_completion_tokens |
int |
Determined by model endpoint | Max generated token count (including thinking chain) |
max_tokens |
int |
Determined by model endpoint | Max generated token count (excluding thinking chain) |
top_p |
float |
Determined by model endpoint | Nucleus sampling threshold |
stop |
str/list |
- | Stop sequence |
reasoning_effort |
str |
Determined by model endpoint | Reasoning depth control |
tool_choice |
str/dict |
- | Tool selection strategy |
response_format |
dict |
Determined by model endpoint, most default to {"type": "text"} |
Response format |
n |
int |
1 |
Number of generated candidates |
presence_penalty |
float |
- | Presence penalty |
frequency_penalty |
float |
- | Frequency penalty |
logit_bias |
dict |
- | Token-level bias |
user ¹ |
str |
- | User identifier |
seed |
int |
- | Random seed, same seed can reproduce results |
stream_options |
dict |
- | Streaming output config, such as {"include_usage": true} |
logprobs |
bool |
False |
Whether to return log probabilities of output tokens |
top_logprobs |
int |
0 |
Number of highest probability candidate tokens to return for each position |
¹ user mapping:
- GLM:
user→user_id
Parameters supported by models but not covered in 4.1.1/4.1.2 will be passed through by CNLLM to the model endpoint.
| Vendor | Pass-through Parameters |
|---|---|
| KIMI | prompt_cache_key, safety_identifier, stream_options |
| Doubao | service_tier, stream_options |
| GLM | do_sample, request_id, tool_stream, dimensions |
| MiniMax | stream_options, group_id |
| Qwen | enable_thinking, preserve_thinking, thinking_budget, top_k, repetition_penalty, vl_high_resolution_images, enable_code_interpreter, enable_search, search_options, parallel_tool_calls, dimensions |
| Baidu | enable_thinking, thinking_budget, thinking_strategy, penalty_score, repetition_penalty, parallel_tool_calls, web_search, metadata |
Parameters defined internally by CNLLM to control internal execution behavior or strategy, not transmitted to API endpoint.
| Parameter | Type | Default | Description |
|---|---|---|---|
timeout |
int |
60 |
Request timeout (seconds) |
max_retries |
int |
3 |
Max retry times |
retry_delay |
float |
1.0 |
Retry delay (seconds) |
fallback_models¹ |
dict |
- | Fallback models (client initialization only), see below for details |
drop_params |
str |
"warn" |
See Unknown Parameter Handling Strategy |
¹fallback_models model fallback strategy:
Fallback models are only supported at client initialization. If the primary model does not respond successfully, it will sequentially try the provided fallback_models. For application robustness, it is recommended to configure this option and set drop_params="ignore" to avoid parameter compatibility issues.
fallback_models = {
"deepseek-chat": {
"api_key": "ds-key-456", # required
"base_url": "https://api.deepseek.com/v1",
},
"qwen-plus": {
"api_key": "my-key", # when base_url is not configured, default URL is used
},
}Notes:
- Specifying
modelagain at the call entry overrides the client's primary model configuration. When the call entry'smodelfails, it will still tryfallback_models - In
chat.batch(), fallback is tried per-req independently - Non-retryable errors (model not found, missing params, content filtered) are raised directly without triggering fallback
- When all models fail,
FallbackErroris raised, aggregating all failure information
Only effective for chat.batch() and embeddings.batch() calls:
| Parameter | Type | Default | Description |
|---|---|---|---|
max_concurrent |
int |
Chat: 3 / Embeddings: 12 |
Max concurrent |
rps |
float |
Chat: 2 / Embeddings: 10 |
Requests per second limit |
batch_size |
int |
Dynamic calculation | Batch size, only supported by Embeddings |
stop_on_error |
bool |
False |
Stop subsequent requests on error, return already processed results |
callbacks |
list |
- | Progress callback function list |
custom_ids |
list[str] |
- | Custom request ID list |
keep |
set/list |
See Field Storage Control | Data fields to retain after iteration |
LangChainRunnable inherits BaseChatModel, natively supports invoke/stream/batch as well as bind_tools/with_structured_output.
from cnllm import CNLLM
from cnllm.core.framework import LangChainRunnable, LangChainEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from pydantic import BaseModel, Field
import asyncio
# Create CNLLM client
client = CNLLM(model="deepseek-chat", api_key="your_key")
# Create Runnable instance
runnable = LangChainRunnable(client)
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant"),
("human", "{input}")
])
# Build LangChain chain
chain = prompt | runnable
# Sync calls with invoke/stream/batch
resp = chain.invoke({"input": "What is 2+2?"})
print(resp.content)
for chunk in chain.stream({"input": "Count to 5"}):
print(chunk.content, end="", flush=True)
resp = chain.batch([{"input": "Hello"}, {"input": "How are you?"}])
for r in resp:
print(r.content)
# bind_tools — tool calling
@tool
def get_weather(city: str) -> str:
"""Get weather for a city"""
return "Sunny 20°C"
llm_with_tools = runnable.bind_tools([get_weather])
resp = llm_with_tools.invoke("Weather in Beijing")
print(resp.content)
# with_structured_output — structured output
# deepseek-v4 series requires thinking=False to receive tool_choice from with_structured_output(); other models/vendors do not have this requirement
class Person(BaseModel):
name: str = Field(description="Name")
age: int = Field(description="Age")
structured = runnable.with_structured_output(Person)
result = structured.invoke("Zhang San is 28 years old")
print(result) # → Person(name="Zhang San", age=28)
# LangChainEmbeddings — embeddings
embeddings = LangChainEmbeddings(client)
vectors = embeddings.embed_documents(["Hello", "World"])
query_vec = embeddings.embed_query("query")
# Async calls with ainvoke/astream/abatch
async def main():
async with client:
resp = await chain.ainvoke({"input": "What is 2+2?"})
print(resp.content)
async for chunk in chain.astream({"input": "Count to 5"}):
print(chunk.content, end="", flush=True)
results = await chain.abatch([{"input": "A"}, {"input": "B"}])
for r in results:
print(r.content)
asyncio.run(main())CNLLM responses can be used to construct LlamaIndex's ChatMessage:
from cnllm import CNLLM
from llama_index.core.llms import ChatMessage, MessageRole
client = CNLLM(model="deepseek-chat", api_key="your_key")
resp = client.chat.create(prompt="Introduce yourself in one sentence")
msg = ChatMessage(role=MessageRole.ASSISTANT, content=resp.still)
print(msg.content)CNLLM integrates with AutoGen via OpenAI-compatible responses:
from cnllm import CNLLM
from autogen_agentchat.messages import TextMessage
client = CNLLM(model="deepseek-chat", api_key="your_key")
resp = client.chat.create(prompt="1+1=?")
msg = TextMessage(content=resp.still, source="assistant")
print(msg.content)CNLLM embeddings feed into Haystack Document, chat output constructs ChatMessage:
from cnllm import CNLLM
from haystack import Document
from haystack.dataclasses import ChatMessage
client = CNLLM(model="deepseek-chat", api_key="your_key")
# embedding → Document
text = "CNLLM is a Chinese LLM adapter"
resp = client.embeddings.create(input=text)
doc = Document(content=text, embedding=resp.vectors)
print(f"Vector dimension: {len(doc.embedding)}")
# chat → ChatMessage
resp = client.chat.create(prompt="1+1=?")
msg = ChatMessage.from_assistant(resp.still)
print(msg.text)CNLLM output feeds into DeepEval evaluation:
from cnllm import CNLLM
from deepeval.test_case import LLMTestCase
client = CNLLM(model="deepseek-chat", api_key="your_key")
resp = client.chat.create(messages=[{"role": "user", "content": "1+1=?"}])
test_case = LLMTestCase(
input="1+1=?", actual_output=resp.still, expected_output="2",
)
print(test_case.actual_output)Apache License 2.0 - See LICENSE file
- GitHub Issues: https://github.com/kanchengw/cnllm/issues
- Author Email: wangkancheng1122@163.com