feat: preserve message entities as markdown by ohld · Pull Request #63 · chigwell/telegram-mcp

ohld · 2026-02-19T17:07:30Z

Summary

Added message_to_markdown() function that converts Telethon message entities to markdown format
Fixed UTF-16 offset handling for emoji characters (Telegram uses UTF-16 code units, Python uses Unicode code points)
Updated all 10 message retrieval functions to use the new converter

What changed

Previously, all message retrieval functions (get_messages, list_messages, search_messages, get_history, get_pinned_messages, get_chat, get_last_interaction, get_message_context, format_message) used msg.message, which returns plain text only. All Telegram formatting entities were lost:

Hyperlinks (MessageEntityTextUrl) — visible text returned, URL lost entirely
Bold (MessageEntityBold) — formatting stripped
Italic (MessageEntityItalic) — formatting stripped
Strikethrough, code, pre — all stripped

Now entities are converted to markdown:

MessageEntityBold → **text**
MessageEntityItalic → _text_
MessageEntityStrike → ~~text~~
MessageEntityCode → `text`
MessageEntityPre → ```lang\ntext\n```
MessageEntityTextUrl → [text](url)
MessageEntityMentionName → [text](tg://user?id=X)
MessageEntityUrl, MessageEntityMention, MessageEntityHashtag — kept as-is (already visible in text)
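
The mapping above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual message_to_markdown() implementation: the Entity dataclass and its kind strings are hypothetical stand-ins for Telethon's MessageEntity* classes, offsets are assumed to be already converted to Python code points, and entities are assumed not to overlap or nest.

```python
from dataclasses import dataclass
from typing import List, Optional

FENCE = "`" * 3  # triple backtick, built at runtime to keep this example readable


@dataclass
class Entity:
    """Hypothetical stand-in for a Telethon MessageEntity* object."""
    kind: str                      # "bold", "italic", "strike", "code", "pre", "text_url", "mention_name", ...
    offset: int                    # start index, assumed to be in Python code points
    length: int
    url: Optional[str] = None      # used by "text_url"
    language: Optional[str] = None # used by "pre"
    user_id: Optional[int] = None  # used by "mention_name"


def wrap(ent: Entity, text: str) -> str:
    """Wrap the covered text in the markdown form listed in the PR description."""
    if ent.kind == "bold":
        return f"**{text}**"
    if ent.kind == "italic":
        return f"_{text}_"
    if ent.kind == "strike":
        return f"~~{text}~~"
    if ent.kind == "code":
        return f"`{text}`"
    if ent.kind == "pre":
        return f"{FENCE}{ent.language or ''}\n{text}\n{FENCE}"
    if ent.kind == "text_url":
        return f"[{text}]({ent.url})"
    if ent.kind == "mention_name":
        return f"[{text}](tg://user?id={ent.user_id})"
    # Plain URLs, @mentions, and hashtags are already visible in the text.
    return text


def to_markdown(text: str, entities: List[Entity]) -> str:
    """Apply entities from last to first so earlier offsets stay valid."""
    out = text
    for ent in sorted(entities, key=lambda e: e.offset, reverse=True):
        start, end = ent.offset, ent.offset + ent.length
        out = out[:start] + wrap(ent, out[start:end]) + out[end:]
    return out


print(to_markdown(
    "hello world link",
    [Entity("bold", 0, 5), Entity("text_url", 12, 4, url="https://example.com")],
))
```

Working from the last entity backwards is a design choice for this sketch: inserting markers changes the string length, so processing right to left avoids having to re-adjust the offsets of entities that have not been applied yet.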

UTF-16 fix

Telegram entities use UTF-16 code units for offsets. Characters outside the BMP (most emoji, such as 👨 and 🔗) take 2 UTF-16 code units but only 1 Python character. Without conversion, every entity boundary after such a character shifts, so formatting lands mid-word (e.g., с**вязка** instead of **связка**).

Added _utf16_to_python_offsets() to correctly convert between the two coordinate systems.
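
To illustrate the offset mismatch, here is a minimal sketch of one way to map a UTF-16 offset to a Python string index. The function names utf16_to_python_offset and entity_slice are hypothetical and chosen for this example; the PR's _utf16_to_python_offsets() may have a different signature and implementation.

```python
def utf16_to_python_offset(text: str, utf16_offset: int) -> int:
    """Return the Python index that corresponds to a UTF-16 code-unit offset."""
    units = 0
    for index, char in enumerate(text):
        if units >= utf16_offset:
            return index
        # Characters outside the BMP need a surrogate pair (2 code units).
        units += 2 if ord(char) > 0xFFFF else 1
    return len(text)


def entity_slice(text: str, offset: int, length: int) -> str:
    """Slice the text covered by an entity given in UTF-16 offset and length."""
    start = utf16_to_python_offset(text, offset)
    end = utf16_to_python_offset(text, offset + length)
    return text[start:end]


text = "🔗 связка слов"
# Telegram would report the word "связка" at UTF-16 offset 3, length 6,
# because the emoji occupies two code units.
print(repr(text[3:9]))                 # naive slice with raw UTF-16 offsets
print(repr(entity_slice(text, 3, 6)))  # slice after converting the offsets
```

The naive slice starts one character too late because the single emoji before the word counts as two UTF-16 code units; each additional non-BMP character before an entity adds one more position of drift.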

Test plan

Unit tests with plain text (no entities)
Unit tests with bold, italic, links
Unit tests with emoji characters (UTF-16 surrogate pairs)
Tested on real Telegram channel with ~30 posts
Verified hyperlinks preserved in real posts

🤖 Generated with Claude Code

Previously all message retrieval functions used msg.message (plain text), losing formatting entities. Now message_to_markdown() converts entities to markdown: **bold**, _italic_, ~~strike~~, `code`, [text](url). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Telegram entities use UTF-16 code units for offsets, but Python strings use Unicode code points. Emoji characters (outside BMP) take 2 UTF-16 units but 1 Python char, causing entity boundaries to shift mid-word. Added _utf16_to_python_offsets() to correctly convert between the two. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ohld and others added 2 commits February 19, 2026 23:29

ohld wants to merge 2 commits into chigwell:main from ohld:main
