feat: preserve message entities as markdown #63

Open
ohld wants to merge 2 commits into chigwell:main from ohld:main


ohld commented Feb 19, 2026

Summary

  • Added message_to_markdown() function that converts Telethon message entities to markdown format
  • Fixed UTF-16 offset handling for emoji characters (Telegram uses UTF-16 code units, Python uses Unicode code points)
  • Updated all 10 message retrieval functions to use the new converter

What changed

Previously, all message retrieval functions (get_messages, list_messages, search_messages, get_history, get_pinned_messages, get_chat, get_last_interaction, get_message_context, format_message) used msg.message, which returns plain text only. As a result, all Telegram formatting entities were lost:

  • Hyperlinks (MessageEntityTextUrl) — visible text returned, URL lost entirely
  • Bold (MessageEntityBold) — formatting stripped
  • Italic (MessageEntityItalic) — formatting stripped
  • Strikethrough, code, pre — all stripped

Now entities are converted to markdown:

  • MessageEntityBold → **text**
  • MessageEntityItalic → _text_
  • MessageEntityStrike → ~~text~~
  • MessageEntityCode → `text`
  • MessageEntityPre → ```lang\ntext\n```
  • MessageEntityTextUrl → [text](url)
  • MessageEntityMentionName → [text](tg://user?id=X)
  • MessageEntityUrl, MessageEntityMention, MessageEntityHashtag — kept as-is (already visible in text)
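
The mapping above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Entity`, `wrap`, and `to_markdown` are hypothetical names, `Entity` is a plain stand-in for Telethon's `MessageEntity*` classes, offsets are assumed to be Python string indices already, and entities are assumed not to overlap or nest.

```python
from dataclasses import dataclass
from typing import List, Optional

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable


@dataclass
class Entity:
    """Hypothetical stand-in for a Telethon MessageEntity* object."""
    kind: str                      # "bold", "italic", "strike", "code", "pre", "text_url", "mention_name"
    offset: int                    # start index (assumed to be a Python index here)
    length: int
    url: Optional[str] = None      # only used for "text_url"
    language: str = ""             # only used for "pre"
    user_id: Optional[int] = None  # only used for "mention_name"


def wrap(ent: Entity, inner: str) -> str:
    """Wrap the entity's text in the markdown listed in the PR description."""
    if ent.kind == "bold":
        return f"**{inner}**"
    if ent.kind == "italic":
        return f"_{inner}_"
    if ent.kind == "strike":
        return f"~~{inner}~~"
    if ent.kind == "code":
        return f"`{inner}`"
    if ent.kind == "pre":
        return f"{FENCE}{ent.language}\n{inner}\n{FENCE}"
    if ent.kind == "text_url":
        return f"[{inner}]({ent.url})"
    if ent.kind == "mention_name":
        return f"[{inner}](tg://user?id={ent.user_id})"
    # URLs, @mentions, hashtags: already visible in the text, keep as-is.
    return inner


def to_markdown(text: str, entities: List[Entity]) -> str:
    """Apply entities from last to first so earlier offsets stay valid."""
    for ent in sorted(entities, key=lambda e: e.offset, reverse=True):
        start, end = ent.offset, ent.offset + ent.length
        text = text[:start] + wrap(ent, text[start:end]) + text[end:]
    return text


example = to_markdown(
    "Hello world, see docs",
    [Entity("bold", 6, 5), Entity("text_url", 17, 4, url="https://example.com")],
)
```

Processing entities back to front is a design choice in this sketch: inserting markers changes the string length, so working from the end keeps the remaining offsets correct without any bookkeeping.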

UTF-16 fix

Telegram entities use UTF-16 code units for offsets. Characters outside the BMP (most emoji, such as 👨 and 🔗) take 2 UTF-16 code units but count as only 1 character in a Python string. Without conversion, every entity after such a character is shifted, so formatting starts and ends in the wrong place, often mid-word (e.g., с**вязка** instead of **связка**).

Added _utf16_to_python_offsets() to correctly convert between the two coordinate systems.
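
A conversion of this kind could look roughly like the sketch below. The name `utf16_span_to_python` and its signature are assumptions made for illustration; the actual `_utf16_to_python_offsets()` in this PR may be structured differently. The idea is to walk the string once, counting 2 code units for each character above U+FFFF and 1 for everything else.

```python
from typing import Tuple


def utf16_span_to_python(text: str, offset: int, length: int) -> Tuple[int, int]:
    """Map a (offset, length) span in UTF-16 code units to Python (start, end) indices.

    Hypothetical helper for illustration; offsets past the end of the text
    are clamped to len(text).
    """
    end_units = offset + length
    start = end = len(text)
    start_found = False
    units = 0  # UTF-16 code units consumed so far
    for index, char in enumerate(text):
        if not start_found and units >= offset:
            start = index
            start_found = True
        if units >= end_units:
            end = index
            break
        # Characters outside the BMP are encoded as a surrogate pair (2 units).
        units += 2 if ord(char) > 0xFFFF else 1
    return start, end


text = "🔗 связка"
# Telegram would report the bold word at UTF-16 offset 3 (emoji = 2 units, space = 1), length 6.
start, end = utf16_span_to_python(text, 3, 6)
fixed = text[:start] + "**" + text[start:end] + "**" + text[end:]
# Using the UTF-16 numbers directly as Python indices reproduces the mid-word bug.
naive = text[:3] + "**" + text[3:9] + "**" + text[9:]
```

Here `fixed` wraps the whole word, while `naive` starts the bold one character too late, which is the shift described above.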

Test plan

  • Unit tests with plain text (no entities)
  • Unit tests with bold, italic, links
  • Unit tests with emoji characters (UTF-16 surrogate pairs)
  • Tested on real Telegram channel with ~30 posts
  • Verified hyperlinks preserved in real posts

🤖 Generated with Claude Code

ohld and others added 2 commits February 19, 2026 23:29
Previously all message retrieval functions used msg.message (plain text),
losing formatting entities. Now message_to_markdown() converts entities
to markdown: **bold**, _italic_, ~~strike~~, `code`, [text](url).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Telegram entities use UTF-16 code units for offsets, but Python strings
use Unicode code points. Emoji characters (outside BMP) take 2 UTF-16
units but 1 Python char, causing entity boundaries to shift mid-word.

Added _utf16_to_python_offsets() to correctly convert between the two.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>