encodings: decode utf-8 with errors='replace' when confident #421
Rongronggg9 wants to merge 1 commit into kurtmckee:main
Please accept this pull request. Everything works as it should with it!
"Confident" means "metadata of the document explicitly indicates that the encoding is UTF-8". This prevents feedparser from falling back to other encodings when there are only tiny errors.
Hi @kurtmckee, could you take a look at this? I've just rebased my patch.

Nowadays, non-UTF-8 web resources are rare. If a feed declares its encoding as UTF-8, it is very unlikely to be in any other encoding. The problem with the current methodology in feedparser is that a few invalid bytes are enough to make it fall back to another encoding.

UTF-8 is a self-synchronizing code: an invalid byte can only damage the character it belongs to, so a tiny error never messes up the rest of the document. Thus, it is safe to decode it with errors='replace'.

My patch aims to adhere to the encoding declaration when it is UTF-8. This should make UTF-8 feeds with tiny mistakes less painful to parse. Non-UTF-8 encoding declarations are not considered because their presence is probably related to misconfiguration. Most non-UTF-8 encodings are not self-synchronizing, which is another reason for the patch to consider UTF-8 only.
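The self-synchronizing property can be illustrated with plain Python, independent of feedparser. This is only a sketch: the sample string and the variable names are made up for the demonstration, and the stray 0xFF byte stands in for the "tiny error" described above.

```python
# A UTF-8 document with a single invalid byte (0xFF never occurs in valid UTF-8).
damaged = "héllo w".encode("utf-8") + b"\xff" + "örld, 你好".encode("utf-8")

# Strict decoding rejects the whole document because of that one byte.
try:
    damaged.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='replace' swaps the bad byte for U+FFFD and resynchronizes right after it,
# so every other character survives.
repaired = damaged.decode("utf-8", errors="replace")

# Decoding the same bytes as iso-8859-2 never fails, but every multi-byte
# character turns into mojibake.
mojibake = damaged.decode("iso-8859-2")

print(strict_ok)
print(repaired)
print(mojibake)
```

Only the damaged position is lost when decoding with errors='replace', whereas the iso-8859-2 result corrupts every non-ASCII character in the text.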
"Confident" means "metadata of the document explicitly indicates that the encoding is UTF-8".
Background of the patch
When a UTF-8 feed has a few invalid characters but the rest is fine, feedparser will only parse it as iso-8859-2 (or other encodings detected by chardet, if installed), even if both the HTTP and XML headers explicitly indicate that its encoding is utf-8. To handle it better, we should decode the feed as UTF-8 with errors='replace'.
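A minimal sketch of the idea, not the code in this patch: `is_utf8` and `decode_feed` are hypothetical helpers, and treating "confident" as "every declaration that is present names UTF-8" is an assumption made for illustration. The iso-8859-2 fallback only mirrors the behavior described above.

```python
import codecs


def is_utf8(label):
    """Return True if an encoding label (e.g. "UTF8", "utf-8") names UTF-8."""
    if not label:
        return False
    try:
        return codecs.lookup(label).name == "utf-8"
    except LookupError:
        return False


def decode_feed(data, http_charset=None, xml_encoding=None, fallback="iso-8859-2"):
    """Decode feed bytes; return (text, encoding_used). Hypothetical helper."""
    declared = [label for label in (http_charset, xml_encoding) if label]
    # Assumption: "confident" means all declarations that exist say UTF-8.
    confident = bool(declared) and all(is_utf8(label) for label in declared)
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        if confident:
            # Trust the declaration and tolerate the few invalid bytes.
            return data.decode("utf-8", errors="replace"), "utf-8"
        # No trustworthy declaration: fall back to another encoding.
        return data.decode(fallback, errors="replace"), fallback


feed = "café ".encode("utf-8") + b"\xff" + b"ok"
text, used = decode_feed(feed, http_charset="UTF-8", xml_encoding="utf8")
guessed_text, guessed = decode_feed(feed)
```

With both declarations present, the feed stays UTF-8 and loses only the invalid byte; without any declaration, the sketch falls back exactly as before.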