🚀 The feature, motivation and pitch
Context
nvidia/NVIDIA-Nemotron-Nano-9B-v2 ships an out-of-tree tool-call parser plugin (nemotron_toolcall_parser_no_streaming.py) that NVIDIA's own vLLM cookbook tells users to load via:
--enable-auto-tool-choice
--tool-parser-plugin "<repo>/nemotron_toolcall_parser_no_streaming.py"
--tool-call-parser nemotron_json
The cookbook pins vLLM to commit 75531a6c… (2025-08-15). The plugin file in NVIDIA's HF model repo has not been updated since.
What breaks on v0.20.x
Three import paths in the plugin no longer resolve, plus the ToolParser.__init__ calling convention changed:
| Symbol / surface |
Old (Aug-2025 vLLM) |
v0.20.1 |
ChatCompletionRequest |
vllm.entrypoints.openai.protocol |
vllm.entrypoints.openai.chat_completion.protocol |
FunctionCall, ToolCall, DeltaFunctionCall, DeltaToolCall, DeltaMessage, ExtractedToolCallInformation |
vllm.entrypoints.openai.protocol |
vllm.entrypoints.openai.engine.protocol |
ToolParser, ToolParserManager |
vllm.entrypoints.openai.tool_parsers.abstract_tool_parser |
vllm.tool_parsers.abstract_tool_parser |
AnyTokenizer |
vllm.transformers_utils.tokenizer |
renamed to TokenizerLike in vllm.tokenizers.protocol |
ToolParser.__init__(tokenizer) |
one positional arg |
now called as tool_parser(tokenizer, request.tools) (see vllm/entrypoints/serve/render/serving.py) — subclasses must accept the second arg |
Result against current vLLM: server fails to start with KeyError: 'invalid tool call parser: nemotron_json' (plugin can't be imported), and even after fixing imports the parser raises TypeError: __init__() takes 2 positional arguments but 3 were given on the first request that carries tools=[…].
Patched plugin (works against v0.20.1)
Only imports + AnyTokenizer -> TokenizerLike rename + __init__ accepts tools; parsing logic is identical to NVIDIA's upstream.
nemotron_parser.py
# SPDX-License-Identifier: Apache-2.0
import json
import re
from typing import Union
from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
from vllm.entrypoints.openai.engine.protocol import (
DeltaMessage,
ExtractedToolCallInformation,
FunctionCall,
ToolCall,
)
from vllm.tool_parsers.abstract_tool_parser import ToolParser, ToolParserManager
from vllm.logger import init_logger
from vllm.tokenizers.protocol import TokenizerLike
logger = init_logger(__name__)
@ToolParserManager.register_module("nemotron_json")
class NemotronJSONToolParser(ToolParser):
def __init__(self, tokenizer: TokenizerLike, tools=None):
super().__init__(tokenizer, tools)
self.tool_call_start_token = "<TOOLCALL>"
self.tool_call_end_token = "</TOOLCALL>"
self.tool_call_regex = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)
def extract_tool_calls(
self, model_output: str, request: ChatCompletionRequest
) -> ExtractedToolCallInformation:
if self.tool_call_start_token not in model_output:
return ExtractedToolCallInformation(
tools_called=False, tool_calls=[], content=model_output
)
try:
str_calls = self.tool_call_regex.findall(model_output)[0].strip()
if not str_calls.startswith("["):
str_calls = "[" + str_calls
if not str_calls.endswith("]"):
str_calls = str_calls + "]"
tool_calls = []
for tc in json.loads(str_calls):
try:
args = tc["arguments"]
tool_calls.append(ToolCall(
type="function",
function=FunctionCall(
name=tc["name"],
arguments=json.dumps(args, ensure_ascii=False)
if isinstance(args, dict) else args,
),
))
except Exception:
continue
content = model_output[:model_output.rfind(self.tool_call_start_token)]
return ExtractedToolCallInformation(
tools_called=True, tool_calls=tool_calls,
content=content if content else None,
)
except Exception:
logger.exception("Error extracting tool call from: %s", model_output)
return ExtractedToolCallInformation(
tools_called=False, tool_calls=[], content=model_output
)
def extract_tool_calls_streaming(self, *_args, **_kwargs) -> Union[DeltaMessage, None]:
raise NotImplementedError("Streaming not supported")
Proposal
Either
- accept this as a built-in
nemotron_json parser under vllm/tool_parsers/ (the format <TOOLCALL>[{"name": ..., "arguments": ...}, ...]</TOOLCALL> is baked into the model's chat template, so it's a stable target), or
- coordinate with NVIDIA to refresh the plugin in their HF model repo.
Happy with whichever. Flagging because the current state is silently broken for anyone following NVIDIA's official cookbook against current vLLM.
Reproduction
vLLM 0.20.1 + vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 --enable-auto-tool-choice --tool-parser-plugin <upstream-plugin> --tool-call-parser nemotron_json with the upstream plugin file → ImportError chain ending in KeyError: 'invalid tool call parser: nemotron_json'. After patching imports, first request with tools=[…] raises TypeError: NemotronJSONToolParser.__init__() takes 2 positional arguments but 3 were given.
Alternatives
No response
Additional context
No response
Before submitting a new issue...
🚀 The feature, motivation and pitch
Context
nvidia/NVIDIA-Nemotron-Nano-9B-v2ships an out-of-tree tool-call parser plugin (nemotron_toolcall_parser_no_streaming.py) that NVIDIA's own vLLM cookbook tells users to load via:The cookbook pins vLLM to commit
75531a6c…(2025-08-15). The plugin file in NVIDIA's HF model repo has not been updated since.What breaks on v0.20.x
Three import paths in the plugin no longer resolve, plus the
ToolParser.__init__calling convention changed:ChatCompletionRequestvllm.entrypoints.openai.protocolvllm.entrypoints.openai.chat_completion.protocolFunctionCall, ToolCall, DeltaFunctionCall, DeltaToolCall, DeltaMessage, ExtractedToolCallInformationvllm.entrypoints.openai.protocolvllm.entrypoints.openai.engine.protocolToolParser, ToolParserManagervllm.entrypoints.openai.tool_parsers.abstract_tool_parservllm.tool_parsers.abstract_tool_parserAnyTokenizervllm.transformers_utils.tokenizerTokenizerLikeinvllm.tokenizers.protocolToolParser.__init__(tokenizer)tool_parser(tokenizer, request.tools)(seevllm/entrypoints/serve/render/serving.py) — subclasses must accept the second argResult against current vLLM: server fails to start with
KeyError: 'invalid tool call parser: nemotron_json'(plugin can't be imported), and even after fixing imports the parser raisesTypeError: __init__() takes 2 positional arguments but 3 were givenon the first request that carriestools=[…].Patched plugin (works against v0.20.1)
Only imports +
AnyTokenizer -> TokenizerLikerename +__init__acceptstools; parsing logic is identical to NVIDIA's upstream.nemotron_parser.py
Proposal
Either
nemotron_jsonparser undervllm/tool_parsers/(the format<TOOLCALL>[{"name": ..., "arguments": ...}, ...]</TOOLCALL>is baked into the model's chat template, so it's a stable target), orHappy with whichever. Flagging because the current state is silently broken for anyone following NVIDIA's official cookbook against current vLLM.
Reproduction
vLLM 0.20.1 +
vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 --enable-auto-tool-choice --tool-parser-plugin <upstream-plugin> --tool-call-parser nemotron_jsonwith the upstream plugin file → ImportError chain ending inKeyError: 'invalid tool call parser: nemotron_json'. After patching imports, first request withtools=[…]raisesTypeError: NemotronJSONToolParser.__init__() takes 2 positional arguments but 3 were given.Alternatives
No response
Additional context
No response
Before submitting a new issue...