encodings: decode utf-8 with errors='replace' when confident by Rongronggg9 · Pull Request #421 · kurtmckee/feedparser

Rongronggg9 · 2023-12-24T20:29:37Z

"Confident" means "metadata of the document explicitly indicates that the encoding is UTF-8".

Background of the patch

When a UTF-8 feed contains a few invalid byte sequences but is otherwise fine, feedparser gives up on UTF-8 and parses it as iso-8859-2 (or another encoding detected by chardet, if installed), even if both the HTTP and XML headers explicitly indicate that its encoding is utf-8.

To handle this better, we should decode such a feed as UTF-8 with errors='replace'.
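A minimal sketch of the difference, using only Python's built-in `bytes.decode`. The sample payload is made up for illustration and is not taken from the linked feed:

```python
# Illustrative payload: valid UTF-8 text with one stray byte (0xFF) in the middle.
good = "Привет, мир".encode("utf-8")
broken = good[:12] + b"\xff" + good[12:]  # 12 bytes == the 6 Cyrillic letters of "Привет"

# Strict UTF-8 decoding rejects the whole document because of a single byte.
try:
    broken.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# iso-8859-2 accepts any byte sequence, so it "succeeds" but yields mojibake.
mojibake = broken.decode("iso-8859-2")

# errors='replace' keeps the valid text and marks only the bad byte with U+FFFD.
repaired = broken.decode("utf-8", errors="replace")

print(strict_failed)
print(repr(mojibake))
print(repr(repaired))
```

Only the single invalid byte is lost in `repaired`, while `mojibake` no longer contains any of the original Cyrillic text.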

I ran into this problem in Rongronggg9/RSS-to-Telegram-Bot#391 ("On the same site, different recognition of encoding"):
- Feed URL: http://iptvin.ru/component/jcomments/?task=rss&object_id=1000707&object_group=com_content&tmpl=component
- Snapshot of the feed: iptvin.xml.gz
- Snapshot of HTTP headers:

Date: Sun, 24 Dec 2023 16:23:48 GMT
Server: Apache/2.0.59 (Win32) PHP/5.1.6
X-Powered-By: PHP/5.1.6
Cache-Control: no-store, no-cache, must-revalidate
Expires: Sun, 24 Dec 2023 16:38:48 GMT
Set-Cookie: REDACTED
P3P: REDACTED
Access-Control-Allow-Origin: *
Transfer-Encoding: chunked
Content-Type: application/rss+xml; charset=utf-8

butaford · 2024-01-23T08:24:51Z

Please accept this pull request. Everything works as it should with it!

"Confident" means "metadata of the document explicitly indicates that the encoding is UTF-8". This prevents feedparser from falling back to other encodings when there are only tiny errors.

Rongronggg9 · 2024-12-15T16:22:17Z

Hi @kurtmckee, could you take a look at this? I've just rebased my patch.

Nowadays, non-UTF-8 web resources are rare. If a feed declares its encoding as UTF-8, it is very unlikely to actually be in another encoding.

The problem with feedparser's current approach is that iso-8859-2 always acts as a "catch-all" option, so any feed with just a few tiny mistakes falls back to it. This behavior can mess things up in most scenarios.

UTF-8 is a self-synchronizing code. A tiny error in a UTF-8 document is guaranteed to affect only the characters around it and never the whole document. Thus, it is safe to decode it with errors='replace'.
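The self-synchronizing property can be checked with a small experiment. This is only a sketch with made-up sample text: it deletes one byte from the middle of a multi-byte character and confirms that everything before and after the damage still decodes correctly.

```python
text = "naïve café ☕ 日本語"
data = text.encode("utf-8")

# Locate the 3-byte sequence for "☕" and drop its middle byte.
start = data.index("☕".encode("utf-8"))
damaged = data[:start + 1] + data[start + 2:]

decoded = damaged.decode("utf-8", errors="replace")

# The damage stays local: the text on both sides of the broken
# character is untouched, and the broken bytes become U+FFFD.
prefix_ok = decoded.startswith("naïve café ")
suffix_ok = decoded.endswith(" 日本語")
has_marker = "\ufffd" in decoded

print(repr(decoded), prefix_ok, suffix_ok, has_marker)
```

The decoder resynchronizes at the next valid lead byte, which is why the trailing CJK text survives even though the byte stream was shifted.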

My patch aims to adhere to the encoding declaration when it is UTF-8. This should make parsing UTF-8 feeds with tiny mistakes less painful. Non-UTF-8 encoding declarations are not considered because their presence is probably the result of misconfiguration. Most non-UTF-8 encodings are not self-synchronizing, which is another reason for the patch to consider UTF-8 only.
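To illustrate the idea, here is a rough standalone sketch. It is not feedparser's actual code and not the patch itself: `decode_feed`, `normalize`, and `XML_DECL_ENCODING` are hypothetical names, "confident" is assumed here to mean that every encoding declared by the HTTP charset or the XML declaration resolves to UTF-8, and the fallback is reduced to strict UTF-8 followed by iso-8859-2. BOM handling is ignored.

```python
import codecs
import re

# Hypothetical helper: pull the encoding out of a leading XML declaration.
XML_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']'
)


def normalize(name):
    """Return Python's canonical codec name, or None if unknown or missing."""
    if not name:
        return None
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def decode_feed(data, http_charset=None):
    """Return (text, encoding_used) for raw feed bytes."""
    match = XML_DECL_ENCODING.match(data)
    xml_encoding = match.group(1).decode("ascii") if match else None

    declared = {normalize(http_charset), normalize(xml_encoding)} - {None}
    if declared == {"utf-8"}:
        # Confident: the metadata explicitly says UTF-8, so tolerate small errors.
        return data.decode("utf-8", errors="replace"), "utf-8"

    # Not confident: simplified stand-in for the existing fallback chain.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode("iso-8859-2"), "iso-8859-2"


feed = b'<?xml version="1.0" encoding="utf-8"?><rss><title>caf\xc3\xa9 \xff</title></rss>'
print(decode_feed(feed, http_charset="utf-8"))
print(decode_feed(b"<rss>\xff</rss>"))
```

With a UTF-8 declaration the stray `0xFF` byte becomes a single replacement character, while the undeclared document still goes through the catch-all path.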

Rongronggg9 force-pushed the fix/encoding-confidence branch 3 times, most recently from dd2d6bf to 750ca5f Compare December 26, 2023 18:29

Rongronggg9 marked this pull request as ready for review December 27, 2023 01:43

Rongronggg9 mentioned this pull request Sep 24, 2024

Title Strange Characters issue when reading RSS XML files not encoded in utf-8 #478

Closed

encodings: decode utf-8 with errors='replace' when confident

5fc7ed2

"Confident" means "metadata of the document explicitly indicates that the encoding is UTF-8". This prevents feedparser from falling back to other encodings when there are only tiny errors.

Rongronggg9 force-pushed the fix/encoding-confidence branch from 750ca5f to 5fc7ed2 Compare December 15, 2024 15:58

Rongronggg9 wants to merge 1 commit into kurtmckee:main from Rongronggg9:fix/encoding-confidence