Bug Description
Summary
The @notionhq/notion-mcp-server@1.9.1 package exposes tool schemas derived from an incomplete notion-openapi.json. The MCP server itself performs no schema validation (arguments are passed straight through to the Notion REST API), but the restrictive schemas rendered into the prompt systematically penalize models that strictly follow declared tool schemas, while models that ignore schema constraints bypass the limitations and succeed.
This means MCPMark Notion benchmark scores partly measure a model's willingness to violate tool schemas, rather than its actual task-solving capability.
Affected Component
src/agents/react_agent.py: _render_tools_description() (line ~449)
src/agents/base_agent.py: _create_stdio_server() (line ~173)
The tool schemas from mcp_server.list_tools() are rendered verbatim into the prompt without any correction.
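One way to make these constraints visible before they reach the prompt is to walk each tool's input schema and list every enum and every additionalProperties: false entry. The following is a minimal sketch, not code from MCPMark: the function name restrictive_constraints is hypothetical, the tool is modeled as a plain dict with name and inputSchema keys, and the sample schema is a reduced version of the API-patch-block-children schema quoted in this issue.

```python
from typing import Any, Iterator


def restrictive_constraints(schema: Any, path: str = "") -> Iterator[str]:
    """Yield dotted paths of constraints that can reject otherwise valid arguments."""
    if isinstance(schema, dict):
        if schema.get("additionalProperties") is False:
            yield f"{path}.additionalProperties=false".lstrip(".")
        if "enum" in schema:
            yield f"{path}.enum".lstrip(".")
        for key, value in schema.items():
            yield from restrictive_constraints(value, f"{path}.{key}")
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            yield from restrictive_constraints(item, f"{path}[{index}]")


# Reduced, illustrative tool entry (assumption: not the full upstream schema).
tool = {
    "name": "API-patch-block-children",
    "inputSchema": {
        "type": "object",
        "properties": {
            "children": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "type": {
                            "type": "string",
                            "enum": ["paragraph", "bulleted_list_item"],
                        },
                    },
                    "additionalProperties": False,
                },
            },
        },
    },
}

findings = list(restrictive_constraints(tool["inputSchema"]))
for finding in findings:
    print(f"{tool['name']}: {finding}")
```

A report like this could be logged once per run so that schema-induced failures are easier to tell apart from genuine task failures.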
Root Cause
The upstream @notionhq/notion-mcp-server package uses a notion-openapi.json that only partially describes the Notion API. The MCP server's architecture (parser.ts → proxy.ts → http-client.ts) faithfully converts this incomplete spec into tool schemas but never validates arguments against them; it simply forwards everything to the Notion REST API.
Specific Schema Issues
Issue 1 (Critical): API-patch-block-children declares only 2 of 25+ block types
Schema says:
{
  "children.items.properties.type": {
    "type": "string",
    "enum": ["paragraph", "bulleted_list_item"]
  },
  "children.items.additionalProperties": false
}
API actually accepts: heading_1, heading_2, heading_3, to_do, toggle, callout, quote, divider, table, column_list, code, equation, bookmark, numbered_list_item, table_of_contents, breadcrumb, synced_block, image, video, file, pdf, audio, etc.
Also: nested rich_text items have additionalProperties: false, which blocks annotations (bold, italic, color, etc.).
Impact: Models following the schema cannot create headings, callouts, dividers, toggles, or any formatted text.
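To illustrate the dilemma a schema-following model faces, the sketch below checks two block payloads against the declared constraints. It is a hand-rolled check, not a full JSON Schema validator. DECLARED_BLOCK_KEYS is an assumption about which keys the items schema declares, and schema_violations is a hypothetical helper.

```python
# Constraints as quoted in the issue for children.items.
DECLARED_BLOCK_TYPES = {"paragraph", "bulleted_list_item"}
# Assumption: the declared properties are "type" plus one key per declared block type.
DECLARED_BLOCK_KEYS = {"type", "paragraph", "bulleted_list_item"}


def schema_violations(block: dict) -> list:
    """Return human-readable reasons why a block breaks the declared schema."""
    problems = []
    if block.get("type") not in DECLARED_BLOCK_TYPES:
        problems.append(f"type {block.get('type')!r} not in enum")
    extra = sorted(set(block) - DECLARED_BLOCK_KEYS)
    if extra:
        # additionalProperties: false forbids any undeclared key.
        problems.append(f"undeclared keys: {extra}")
    return problems


rich_text = [{"type": "text", "text": {"content": "Hello"}}]
paragraph = {"type": "paragraph", "paragraph": {"rich_text": rich_text}}
heading = {"type": "heading_1", "heading_1": {"rich_text": rich_text}}

print(schema_violations(paragraph))
print(schema_violations(heading))
```

The heading block is a payload the Notion API accepts according to this report, yet it fails both declared constraints, so a model that treats the schema as binding never sends it.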
Issue 2 (Critical): API-post-page children items typed as string
Schema says:
{ "children.items": { "type": "string" } }
API actually accepts: Block objects (same as patch-block-children)
Impact: Models serialize block objects as JSON strings, which the Notion API rejects with "body.children[0] should be an object, instead was string". This causes 100% failure for all API-post-page calls with inline children.
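The difference between the two payload shapes can be shown in a few lines. The sketch below also includes a hypothetical harness-side workaround, coerce_children, that decodes stringified items back into objects before the call is forwarded. Neither the MCP server nor MCPMark is known to do this; it is shown only to make the failure mode concrete.

```python
import json

block = {
    "type": "paragraph",
    "paragraph": {"rich_text": [{"type": "text", "text": {"content": "Hello"}}]},
}

# What a schema-following model sends when items are typed as "string".
children_as_strings = [json.dumps(block)]
# What the Notion API expects, according to this report.
children_as_objects = [block]


def coerce_children(children: list) -> list:
    """Hypothetical workaround: decode JSON-string items into block objects."""
    decoded = []
    for child in children:
        if isinstance(child, str):
            child = json.loads(child)
        decoded.append(child)
    return decoded


print(type(children_as_strings[0]).__name__)
print(coerce_children(children_as_strings) == children_as_objects)
```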
Issue 3 (High): API-post-page parent requires page_id, doesn't declare database_id
Schema says:
{ "parent.required": ["page_id"] }
No database_id property is declared.
API actually accepts: { "database_id": "<uuid>" } (without page_id) for creating database entries.
Impact: Schema-compliant models cannot add entries to databases. They try dozens of page_id permutations (dummy UUIDs, empty strings, etc.) and exhaust their turn budget.
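A corrected parent schema could accept either identifier. The following is a sketch under stated assumptions: declared_parent is a reduced stand-in for the upstream schema, allow_database_parent is a hypothetical helper, and the patched shape covers only the two parent kinds discussed in this issue.

```python
import copy

# Reduced stand-in for the declared parent schema (assumption).
declared_parent = {
    "type": "object",
    "properties": {"page_id": {"type": "string"}},
    "required": ["page_id"],
}


def allow_database_parent(parent_schema: dict) -> dict:
    """Return a copy that accepts exactly one of page_id or database_id."""
    patched = copy.deepcopy(parent_schema)
    patched.pop("required", None)
    patched.setdefault("properties", {})["database_id"] = {"type": "string"}
    # oneOf: a parent must carry one identifier, not both.
    patched["oneOf"] = [
        {"required": ["page_id"]},
        {"required": ["database_id"]},
    ]
    return patched


patched_parent = allow_database_parent(declared_parent)
print(sorted(patched_parent["properties"]))
```

The original schema is left untouched, so the harness could render either version and compare results.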
Issue 4 (High): API-post-page / API-patch-page properties set additionalProperties: false
Schema says:
{
  "properties.additionalProperties": false,
  "properties.properties": { "title": {...}, "type": {...} }
}
API actually accepts: Any property name/type (select, number, rich_text, date, checkbox, url, formula, relation, etc.)
Impact: Models can only set title properties. Cannot populate any custom database columns.
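One possible correction is to drop every additionalProperties: false entry before the schema is rendered. The sketch below does this recursively. open_additional_properties is a hypothetical helper, and the title and type sub-schemas are left empty because the issue elides their contents.

```python
from typing import Any


def open_additional_properties(schema: Any) -> Any:
    """Return a copy of schema without any additionalProperties: false entries."""
    if isinstance(schema, dict):
        return {
            key: open_additional_properties(value)
            for key, value in schema.items()
            if not (key == "additionalProperties" and value is False)
        }
    if isinstance(schema, list):
        return [open_additional_properties(item) for item in schema]
    return schema


# Reduced stand-in for the declared properties schema (assumption).
declared_properties = {
    "type": "object",
    "additionalProperties": False,
    "properties": {"title": {}, "type": {}},
}

opened = open_additional_properties(declared_properties)
print(opened)
```

Note that this sketch would also remove a property that happens to be named additionalProperties with the value false, so a production version would need to distinguish schema keywords from property names.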
Issue 5 (Moderate): API-create-a-database properties use a oneOf that only allows title
Schema says:
{
  "properties.additionalProperties.oneOf": [{
    "required": ["title"],
    "additionalProperties": false
  }]
}
API actually accepts: Any property type: title, rich_text, number, select, multi_select, date, checkbox, url, email, phone_number, formula, relation, rollup, status, etc.
Impact: Models following the schema can only create databases with title-type properties.
Impact on Benchmark Fairness
Without this fix, the Notion benchmark conflates two unrelated capabilities:
- Task-solving ability (understanding Notion structures, computing correct data, etc.)
- Schema violation willingness (ignoring additionalProperties: false, required, and enum constraints)
Models that are more instruction-following (treating tool schemas as contracts) are systematically penalized, while models that treat schemas as advisory guidance are rewarded. This undermines the benchmark's ability to measure actual MCP tool-use competence.
Recurrence Steps
No response
Expected Behavior
No response
Additional Information
No response