[Feature]: Add nemotron_json as built-in tool parser (NVIDIA Nemotron-Nano-9B-v2 plugin breaks against v0.20.x module reorg)

### 🚀 The feature, motivation and pitch

## Context

`nvidia/NVIDIA-Nemotron-Nano-9B-v2` ships an out-of-tree tool-call parser plugin (`nemotron_toolcall_parser_no_streaming.py`) that NVIDIA's own [vLLM cookbook][cookbook] tells users to load via:

    --enable-auto-tool-choice
    --tool-parser-plugin "<repo>/nemotron_toolcall_parser_no_streaming.py"
    --tool-call-parser nemotron_json

The cookbook pins vLLM to commit `75531a6c…` (2025-08-15). The plugin file in NVIDIA's HF model repo has not been updated since.

[cookbook]: https://github.com/NVIDIA-NeMo/Nemotron/blob/main/usage-cookbook/Nemotron-Nano-9B-v2/vllm_cookbook.ipynb

## What breaks on v0.20.x

Three import paths in the plugin no longer resolve, plus the `ToolParser.__init__` calling convention changed:

| Symbol / surface | Old (Aug-2025 vLLM) | v0.20.1 |
|---|---|---|
| `ChatCompletionRequest` | `vllm.entrypoints.openai.protocol` | `vllm.entrypoints.openai.chat_completion.protocol` |
| `FunctionCall, ToolCall, DeltaFunctionCall, DeltaToolCall, DeltaMessage, ExtractedToolCallInformation` | `vllm.entrypoints.openai.protocol` | `vllm.entrypoints.openai.engine.protocol` |
| `ToolParser, ToolParserManager` | `vllm.entrypoints.openai.tool_parsers.abstract_tool_parser` | `vllm.tool_parsers.abstract_tool_parser` |
| `AnyTokenizer` | `vllm.transformers_utils.tokenizer` | renamed to `TokenizerLike` in `vllm.tokenizers.protocol` |
| `ToolParser.__init__(tokenizer)` | one positional arg | now called as `tool_parser(tokenizer, request.tools)` (see `vllm/entrypoints/serve/render/serving.py`) — subclasses must accept the second arg |

Result against current vLLM: server fails to start with `KeyError: 'invalid tool call parser: nemotron_json'` (plugin can't be imported), and even after fixing imports the parser raises `TypeError: __init__() takes 2 positional arguments but 3 were given` on the first request that carries `tools=[…]`.

## Patched plugin (works against v0.20.1)

Only imports + `AnyTokenizer -> TokenizerLike` rename + `__init__` accepts `tools`; parsing logic is identical to NVIDIA's upstream.

<details>
<summary>nemotron_parser.py</summary>

```python
# SPDX-License-Identifier: Apache-2.0

import json
import re
from typing import Union

from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
from vllm.entrypoints.openai.engine.protocol import (
    DeltaMessage,
    ExtractedToolCallInformation,
    FunctionCall,
    ToolCall,
)
from vllm.tool_parsers.abstract_tool_parser import ToolParser, ToolParserManager
from vllm.logger import init_logger
from vllm.tokenizers.protocol import TokenizerLike

logger = init_logger(__name__)


@ToolParserManager.register_module("nemotron_json")
class NemotronJSONToolParser(ToolParser):
    def __init__(self, tokenizer: TokenizerLike, tools=None):
        super().__init__(tokenizer, tools)
        self.tool_call_start_token = "<TOOLCALL>"
        self.tool_call_end_token = "</TOOLCALL>"
        self.tool_call_regex = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)

    def extract_tool_calls(
        self, model_output: str, request: ChatCompletionRequest
    ) -> ExtractedToolCallInformation:
        if self.tool_call_start_token not in model_output:
            return ExtractedToolCallInformation(
                tools_called=False, tool_calls=[], content=model_output
            )
        try:
            str_calls = self.tool_call_regex.findall(model_output)[0].strip()
            if not str_calls.startswith("["):
                str_calls = "[" + str_calls
            if not str_calls.endswith("]"):
                str_calls = str_calls + "]"
            tool_calls = []
            for tc in json.loads(str_calls):
                try:
                    args = tc["arguments"]
                    tool_calls.append(ToolCall(
                        type="function",
                        function=FunctionCall(
                            name=tc["name"],
                            arguments=json.dumps(args, ensure_ascii=False)
                                if isinstance(args, dict) else args,
                        ),
                    ))
                except Exception:
                    continue
            content = model_output[:model_output.rfind(self.tool_call_start_token)]
            return ExtractedToolCallInformation(
                tools_called=True, tool_calls=tool_calls,
                content=content if content else None,
            )
        except Exception:
            logger.exception("Error extracting tool call from: %s", model_output)
            return ExtractedToolCallInformation(
                tools_called=False, tool_calls=[], content=model_output
            )

    def extract_tool_calls_streaming(self, *_args, **_kwargs) -> Union[DeltaMessage, None]:
        raise NotImplementedError("Streaming not supported")
```

</details>

## Proposal

Either

- accept this as a built-in `nemotron_json` parser under `vllm/tool_parsers/` (the format `<TOOLCALL>[{"name": ..., "arguments": ...}, ...]</TOOLCALL>` is baked into the model's chat template, so it's a stable target), or
- coordinate with NVIDIA to refresh the plugin in their HF model repo.

Happy with whichever. Flagging because the current state is silently broken for anyone following NVIDIA's official cookbook against current vLLM.

## Reproduction

vLLM 0.20.1 + `vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 --enable-auto-tool-choice --tool-parser-plugin <upstream-plugin> --tool-call-parser nemotron_json` with the upstream plugin file → ImportError chain ending in `KeyError: 'invalid tool call parser: nemotron_json'`. After patching imports, first request with `tools=[…]` raises `TypeError: NemotronJSONToolParser.__init__() takes 2 positional arguments but 3 were given`.

### Alternatives

_No response_

### Additional context

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

Symbol / surface	Old (Aug-2025 vLLM)	v0.20.1
`ChatCompletionRequest`	`vllm.entrypoints.openai.protocol`	`vllm.entrypoints.openai.chat_completion.protocol`
`FunctionCall, ToolCall, DeltaFunctionCall, DeltaToolCall, DeltaMessage, ExtractedToolCallInformation`	`vllm.entrypoints.openai.protocol`	`vllm.entrypoints.openai.engine.protocol`
`ToolParser, ToolParserManager`	`vllm.entrypoints.openai.tool_parsers.abstract_tool_parser`	`vllm.tool_parsers.abstract_tool_parser`
`AnyTokenizer`	`vllm.transformers_utils.tokenizer`	renamed to `TokenizerLike` in `vllm.tokenizers.protocol`
`ToolParser.__init__(tokenizer)`	one positional arg	now called as `tool_parser(tokenizer, request.tools)` (see `vllm/entrypoints/serve/render/serving.py`) — subclasses must accept the second arg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Feature]: Add nemotron_json as built-in tool parser (NVIDIA Nemotron-Nano-9B-v2 plugin breaks against v0.20.x module reorg) #42065

🚀 The feature, motivation and pitch

Context

What breaks on v0.20.x

Patched plugin (works against v0.20.1)

Proposal

Reproduction

Alternatives

Additional context

Before submitting a new issue...

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Feature]: Add nemotron_json as built-in tool parser (NVIDIA Nemotron-Nano-9B-v2 plugin breaks against v0.20.x module reorg) #42065

Description

🚀 The feature, motivation and pitch

Context

What breaks on v0.20.x

Patched plugin (works against v0.20.1)

Proposal

Reproduction

Alternatives

Additional context

Before submitting a new issue...

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions