
fix: 修复ocr编码问题 (Fix OCR encoding issue) #1577

Closed
A-nony-mous wants to merge 2 commits into OneDragon-Anything:main from A-nony-mous:fix/unicode-decode-error

Conversation

@A-nony-mous
Contributor

@A-nony-mous A-nony-mous commented Oct 20, 2025

No description provided.

Summary by CodeRabbit

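Release Notes

  • Bug Fixes
    • Improved character dictionary reading: added a multi-level encoding fallback (UTF-8 → GBK → UTF-8 with errors ignored) that logs the line number and a hexadecimal preview of the content on failure, improving compatibility and diagnosability.
    • Improved runtime fault tolerance: when recognition post-processing fails, the images of the current batch are saved for debugging and detailed error information is logged before the exception is re-raised, making the problem easier to reproduce and investigate.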

@coderabbitai
Contributor

coderabbitai Bot commented Oct 20, 2025

Walkthrough

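Extends the dictionary-reading decode logic in src/onnxocr/rec_postprocess.py (each line is tried as UTF-8 → GBK → UTF-8 with errors="ignore", with line numbers and detailed logs recorded). In src/onnxocr/predict_rec.py, adds handling for post-processing exceptions: the batch images are saved to ocr_error_images, the error is logged, and the exception is re-raised.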

Changes

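Cohort / File(s) | Change summary
Decoding and logging enhancements
src/onnxocr/rec_postprocess.py
Adds imports for os, datetime, and the logging utility; introduces a per-line line_num; decodes each line as UTF-8 first, falls back to GBK on failure, and then to UTF-8 with errors="ignore"; on the fallback and ignore paths, logs a detailed error and a hexadecimal preview of the line; keeps the logic that appends successfully decoded lines and optionally appends a space.
Debug output for post-processing exceptions
src/onnxocr/predict_rec.py
Wraps the post-processing call (postprocess) in try/except; on an exception, creates the ocr_error_images directory, saves every image in the current batch to files named by batch and timestamp for debugging, logs the error and the save path, and then re-raises the exception.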

Sequence Diagram(s)

sequenceDiagram
    participant Reader as Dictionary reader
    participant Decoder as Decode flow
    participant Logger as Logging system

    Reader->>Decoder: Read raw binary line (with line_num)
    alt UTF-8 succeeds
        Decoder->>Reader: Return decoded_line (UTF-8)
        Decoder->>Logger: info: line decoded successfully (line_num)
    else UTF-8 fails
        Decoder->>Decoder: Try GBK decoding
        alt GBK succeeds
            Decoder->>Reader: Return decoded_line (GBK)
            Decoder->>Logger: warning: using GBK fallback (line_num)
        else GBK fails
            Decoder->>Decoder: Try UTF-8 with errors="ignore"
            alt Usable content
                Decoder->>Reader: Return decoded_line (ignore mode)
                Decoder->>Logger: error: using ignore fallback, log hex preview (line_num)
            else No usable decoding
                Decoder->>Logger: error: no usable decoding, log hex preview (line_num)
                Decoder->>Reader: Skip or append placeholder (depending on logic)
            end
        end
    end
sequenceDiagram
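    participant Predictor as Prediction flow
    participant Post as Post-processing (postprocess)
    participant Saver as Debug image saver
    participant Logger as Logging system

    Predictor->>Post: Call post-processing
    alt Post-processing succeeds
        Post-->>Predictor: Return result
    else Post-processing raises
        Post-->>Predictor: Exception
        Predictor->>Saver: Create `ocr_error_images` and save batch images (with timestamp/batch number)
        Saver->>Logger: Log save directory and file names
        Predictor->>Logger: Log detailed exception info and re-raise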
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

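🐰 I listen to the wind within and beyond the characters,
with line numbers for company under the night lamp.
UTF-8 and GBK, I try them at my ease,
and if something is amiss, I tuck it away,
saving the faulty images to meet again at dawn.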

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
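Title Check ✅ Passed The title "fix: 修复ocr编码问题" (fix OCR encoding issue) accurately reflects the main purpose of the changeset. In rec_postprocess.py, the core change implements candidate decoding with UTF-8 and GBK to resolve encoding problems when reading the character dictionary. In predict_rec.py, the added try/except wrapper supports this encoding fix and provides debugging capability. The title is concise, clear, and specific, so team members can quickly see that this is a fix for an encoding issue.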
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/onnxocr/rec_postprocess.py (1)

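25-31: The encoding error handling is sound; consider adding logging to improve observability.

This cascading decode approach handles character dictionary files that may use different encodings. The UTF-8 → GBK → UTF-8 (ignore) fallback strategy is appropriate for a Chinese-language project.

A few suggestions:

  1. Add logging: when the GBK or errors="ignore" fallback is used, log a warning. This will:

    • help surface encoding problems in the dictionary file
    • provide visibility when errors="ignore" silently drops characters
    • make later debugging and maintenance easier
  2. Watch for potential character loss: the last layer uses errors="ignore", which silently drops bytes that cannot be decoded and may affect OCR accuracy. If the dictionary file uses another encoding (such as Big5 or Shift-JIS), this can lead to incorrect character mappings.

Reference implementation: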

                 for line in lines:
                     try:
                         line = line.decode("utf-8").strip("\n").strip("\r\n")
                     except UnicodeDecodeError:
+                        import logging
+                        logger = logging.getLogger(__name__)
                         try:
+                            logger.warning(f"Failed to decode line with UTF-8, trying GBK encoding")
                             line = line.decode("gbk").strip("\n").strip("\r\n")
                         except UnicodeDecodeError:
+                            logger.warning(f"Failed to decode line with GBK, using UTF-8 with ignore. Some characters may be lost.")
                             line = line.decode("utf-8", errors="ignore").strip("\n").strip("\r\n")
                     self.character_str.append(line)

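Optional improvement: if the project needs to support more encodings, consider using the chardet library to detect the encoding automatically; for the current scenario, however, the existing approach is sufficient.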
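
The fallback chain discussed in this comment can be sketched as a small standalone helper. This is a minimal sketch rather than the PR's code: the function name `decode_dict_line`, the logger name, and the returned encoding label are assumptions introduced for illustration.

```python
from __future__ import annotations

import logging

logger = logging.getLogger("dict_decode_sketch")  # hypothetical logger name


def decode_dict_line(raw: bytes, line_num: int) -> tuple[str, str]:
    """Decode one raw dictionary line and return (text, encoding_label).

    Tries strict UTF-8 first, then strict GBK, and finally UTF-8 with
    undecodable bytes dropped.
    """
    for encoding in ("utf-8", "gbk"):
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        if encoding != "utf-8":
            # Make the fallback visible instead of silently accepting it.
            logger.warning("line %d decoded with %s fallback", line_num, encoding)
        return text.rstrip("\r\n"), encoding
    # Last resort: errors="ignore" does not raise, but it silently drops bytes.
    logger.error("line %d undecodable, hex preview: %s", line_num, raw[:50].hex())
    return raw.decode("utf-8", errors="ignore").rstrip("\r\n"), "utf-8-ignore"
```

Returning the encoding label alongside the text lets a caller decide how loudly to report each line, which keeps the decoding step itself free of policy.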

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7397931 and e0087cf.

📒 Files selected for processing (1)
  • src/onnxocr/rec_postprocess.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test-check

@A-nony-mous A-nony-mous force-pushed the fix/unicode-decode-error branch from e0087cf to 275f32b Compare October 20, 2025 12:50
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/onnxocr/rec_postprocess.py (1)

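31-38: Consider logging when GBK decoding succeeds as well.

When UTF-8 decoding fails but GBK decoding succeeds (line 37), the code logs nothing. This can make encoding problems hard to debug, because there is no way to tell which lines were decoded as GBK.

Consider applying this change to log a message when the GBK fallback succeeds: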

                     try:
                         decoded_line = line.decode("utf-8").strip("\n").strip("\r\n")
                     except UnicodeDecodeError:
                         # Try GBK decoding
                         try:
                             decoded_line = line.decode("gbk").strip("\n").strip("\r\n")
+                            log.info(f"字典文件使用 GBK 解码 [{character_dict_path}:{line_num}]")
                         except UnicodeDecodeError:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0087cf and 275f32b.

📒 Files selected for processing (1)
  • src/onnxocr/rec_postprocess.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
src/onnxocr/rec_postprocess.py

42-42: String contains ambiguous `,` (FULLWIDTH COMMA). Did you mean `,` (COMMA)?

(RUF001)


42-42: String contains ambiguous `,` (FULLWIDTH COMMA). Did you mean `,` (COMMA)?

(RUF001)


43-43: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test-check
🔇 Additional comments (2)
src/onnxocr/rec_postprocess.py (2)

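8-9: Importing the logging utility is necessary.

Adding the logging utility import for the new error logging is reasonable, and the code is cleanly formatted.


26-48: The overall encoding fallback strategy is well designed.

The three-level UTF-8 → GBK → UTF-8 (ignore) fallback handles dictionary files with mixed encodings well and is particularly suitable for Chinese OCR scenarios. Line-number tracking and the hexadecimal preview also make debugging easier.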

Comment on lines +39 to +45
# Finally, fall back to ignore mode and log it
try:
    decoded_line = line.decode("utf-8", errors="ignore").strip("\n").strip("\r\n")
    log.error(f"字典文件解码失败,使用 ignore 模式 [{character_dict_path}:{line_num}],行内容(hex): {line[:50].hex()}...")
except Exception as e:
    log.error(f"字典文件无法解码 [{character_dict_path}:{line_num}]: {e}")
    continue
Contributor

⚠️ Potential issue | 🟡 Minor

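Fix the static analysis warnings and improve exception handling.

This snippet has the following problems:

  1. Line 42: the error message uses full-width commas (,) instead of half-width commas. The static analysis tool recommends standard commas for consistency.
  2. Line 43: catching the generic Exception is bad practice. Catch a more specific exception type (such as UnicodeDecodeError), or, if this is the last-resort fallback, at least explain why in a comment.

Consider applying the following change: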

                             try:
                                 decoded_line = line.decode("utf-8", errors="ignore").strip("\n").strip("\r\n")
-                                log.error(f"字典文件解码失败,使用 ignore 模式 [{character_dict_path}:{line_num}],行内容(hex): {line[:50].hex()}...")
-                            except Exception as e:
+                                log.error(f"字典文件解码失败, 使用 ignore 模式 [{character_dict_path}:{line_num}], 行内容(hex): {line[:50].hex()}...")
+                            except (UnicodeDecodeError, AttributeError) as e:
                                 log.error(f"字典文件无法解码 [{character_dict_path}:{line_num}]: {e}")
                                 continue
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-# Finally, fall back to ignore mode and log it
-try:
-    decoded_line = line.decode("utf-8", errors="ignore").strip("\n").strip("\r\n")
-    log.error(f"字典文件解码失败,使用 ignore 模式 [{character_dict_path}:{line_num}],行内容(hex): {line[:50].hex()}...")
-except Exception as e:
-    log.error(f"字典文件无法解码 [{character_dict_path}:{line_num}]: {e}")
-    continue
+# Finally, fall back to ignore mode and log it
+try:
+    decoded_line = line.decode("utf-8", errors="ignore").strip("\n").strip("\r\n")
+    log.error(f"字典文件解码失败, 使用 ignore 模式 [{character_dict_path}:{line_num}], 行内容(hex): {line[:50].hex()}...")
+except (UnicodeDecodeError, AttributeError) as e:
+    log.error(f"字典文件无法解码 [{character_dict_path}:{line_num}]: {e}")
+    continue
🧰 Tools
🪛 Ruff (0.14.0)

42-42: String contains ambiguous `,` (FULLWIDTH COMMA). Did you mean `,` (COMMA)?

(RUF001)


42-42: String contains ambiguous `,` (FULLWIDTH COMMA). Did you mean `,` (COMMA)?

(RUF001)


43-43: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In src/onnxocr/rec_postprocess.py around lines 39 to 45, the error logging and
exception handling need tightening: replace the full-width comma in the log
message with a standard ASCII comma and update the except to catch
UnicodeDecodeError (for decode failures) so you don't catch all Exceptions; if
you need a final fallback, add a separate broad except with a brief comment
explaining it's a last-resort catch and re-use the existing log.error with the
exception details, then continue on error as before.
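
One detail worth checking alongside this comment: in CPython, `bytes.decode` with `errors="ignore"` drops undecodable bytes instead of raising `UnicodeDecodeError`, so an `except` branch around that call is not expected to trigger for `bytes` input. The sketch below illustrates this and shows one way to detect the loss; the helper name `lossy_decode` and its return shape are hypothetical, not part of the PR.

```python
from __future__ import annotations


def lossy_decode(raw: bytes) -> tuple[str, bool]:
    """Decode as UTF-8, dropping invalid bytes, and report whether any were lost."""
    text = raw.decode("utf-8", errors="ignore")  # does not raise for bytes input
    # Valid UTF-8 round-trips exactly, so a mismatch means bytes were dropped.
    lost = text.encode("utf-8") != raw
    return text, lost
```

Checking the round trip gives the caller an explicit signal to log, rather than relying on an exception that the `ignore` handler suppresses.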

@ShadowLemoon
Collaborator

Won't the logging be too frequent?

@A-nony-mous
Contributor Author

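Won't the logging be too frequent?

If it is too frequent, that means there is a problem with the character set, and the character set needs to be replaced.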
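
If per-line logging does turn out to be noisy, one option is to count fallbacks while reading and emit a single summary per file. The following is a minimal sketch under that assumption; `summarize_fallbacks`, its arguments, and the logger name are hypothetical and do not appear in the PR.

```python
from __future__ import annotations

import logging
from collections import Counter

log = logging.getLogger("dict_summary_sketch")  # hypothetical logger name


def summarize_fallbacks(lines: list[bytes], path: str) -> Counter:
    """Count which decoder each line needs and log one summary for the file."""
    counts: Counter = Counter()
    for raw in lines:
        for encoding in ("utf-8", "gbk"):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                continue
            counts[encoding] += 1
            break
        else:
            # Neither strict decoder accepted the line.
            counts["undecodable"] += 1
    fallback_total = counts["gbk"] + counts["undecodable"]
    if fallback_total:
        # One message per file instead of one per line.
        log.warning("%s: %d of %d lines needed a fallback: %s",
                    path, fallback_total, len(lines), dict(counts))
    return counts
```

A clean dictionary file then produces no log output at all, while a badly encoded one produces a single warning that still signals the character set should be replaced.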

@DoctorReid
Collaborator

Is there a sample that reproduces this reliably?

@A-nony-mous
Contributor Author

Image_1761020222741

@DoctorReid
Collaborator

I mean the screenshot that triggered the problem, so that we can confirm this change actually works.

@A-nony-mous
Contributor Author

It can no longer be reproduced reliably at the moment.

@DoctorReid
Collaborator

You could add a try/catch layer and save the screenshot when the problem occurs.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/onnxocr/rec_postprocess.py (2)

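28-54: The multi-encoding fallback strategy is well designed.

The UTF-8 → GBK fallback logic fits the practical needs of a Chinese-language environment, and saving debug files makes troubleshooting easier. Note that the full-width comma issue on line 54 was already flagged in an earlier review.


56-62: Fix the static analysis warnings and improve exception handling.

The problems in this snippet were described in detail in an earlier review:

  1. Line 59: the log message uses full-width commas (,) rather than half-width commas
  2. Line 60: catching the broad Exception is not best practice; catch a more specific exception type

Please apply the change suggested in the earlier review: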

                             try:
                                 decoded_line = line.decode("utf-8", errors="ignore").strip("\n").strip("\r\n")
-                                log.error(f"字典文件解码失败,使用 ignore 模式 [{character_dict_path}:{line_num}],行内容(hex): {line[:50].hex()}...")
-                            except Exception as e:
+                                log.error(f"字典文件解码失败, 使用 ignore 模式 [{character_dict_path}:{line_num}], 行内容(hex): {line[:50].hex()}...")
+                            except (UnicodeDecodeError, AttributeError) as e:
                                 log.error(f"字典文件无法解码 [{character_dict_path}:{line_num}]: {e}")
                                 continue

Based on earlier review comments

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 275f32b and d4182c2.

📒 Files selected for processing (2)
  • src/onnxocr/predict_rec.py (2 hunks)
  • src/onnxocr/rec_postprocess.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/onnxocr/predict_rec.py (1)
src/onnxocr/rec_postprocess.py (1)
  • CTCLabelDecode (131-152)
🪛 Ruff (0.14.1)
src/onnxocr/predict_rec.py

341-341: String contains ambiguous `,` (FULLWIDTH COMMA). Did you mean `,` (COMMA)?

(RUF001)


341-341: String contains ambiguous `,` (FULLWIDTH COMMA). Did you mean `,` (COMMA)?

(RUF001)

src/onnxocr/rec_postprocess.py

54-54: String contains ambiguous `,` (FULLWIDTH COMMA). Did you mean `,` (COMMA)?

(RUF001)


54-54: String contains ambiguous `,` (FULLWIDTH COMMA). Did you mean `,` (COMMA)?

(RUF001)


59-59: String contains ambiguous `,` (FULLWIDTH COMMA). Did you mean `,` (COMMA)?

(RUF001)


59-59: String contains ambiguous `,` (FULLWIDTH COMMA). Did you mean `,` (COMMA)?

(RUF001)


60-60: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test-check
🔇 Additional comments (2)
src/onnxocr/predict_rec.py (1)

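4-5: The import statements are reasonable.

The new imports support error handling and debugging, and fit well with the exception handling logic that follows.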

Also applies to: 10-10

src/onnxocr/rec_postprocess.py (1)

7-10: The import statements are reasonable.

The new imports support the enhanced dictionary decoding logic and debugging.

Comment on lines +323 to +342
try:
    rec_result = self.postprocess_op(preds)
except Exception as e:
    # On an exception, save the images for debugging
    error_dir = os.path.join(os.getcwd(), "ocr_error_images")
    os.makedirs(error_dir, exist_ok=True)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")

    # Save every image in the current batch
    for idx, ino in enumerate(range(beg_img_no, end_img_no)):
        img_to_save = img_list[indices[ino]]
        error_img_path = os.path.join(
            error_dir,
            f"error_{timestamp}_batch{beg_img_no}_img{idx}.png"
        )
        cv2.imwrite(error_img_path, img_to_save)

    log.error(f"OCR后处理失败,已保存图片到 {error_dir},错误: {e}", exc_info=True)
    raise
Contributor

⚠️ Potential issue | 🟡 Minor

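Fix the full-width punctuation in the log message.

The error handling logic is well designed, and saving the failing images makes the problem easier to reproduce and debug. However, the log message uses full-width commas, which should be replaced with half-width commas for consistency.

Apply the following change: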

-                log.error(f"OCR后处理失败,已保存图片到 {error_dir},错误: {e}", exc_info=True)
+                log.error(f"OCR后处理失败, 已保存图片到 {error_dir}, 错误: {e}", exc_info=True)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-try:
-    rec_result = self.postprocess_op(preds)
-except Exception as e:
-    # On an exception, save the images for debugging
-    error_dir = os.path.join(os.getcwd(), "ocr_error_images")
-    os.makedirs(error_dir, exist_ok=True)
-    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
-    # Save every image in the current batch
-    for idx, ino in enumerate(range(beg_img_no, end_img_no)):
-        img_to_save = img_list[indices[ino]]
-        error_img_path = os.path.join(
-            error_dir,
-            f"error_{timestamp}_batch{beg_img_no}_img{idx}.png"
-        )
-        cv2.imwrite(error_img_path, img_to_save)
-    log.error(f"OCR后处理失败,已保存图片到 {error_dir},错误: {e}", exc_info=True)
-    raise
+try:
+    rec_result = self.postprocess_op(preds)
+except Exception as e:
+    # On an exception, save the images for debugging
+    error_dir = os.path.join(os.getcwd(), "ocr_error_images")
+    os.makedirs(error_dir, exist_ok=True)
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
+    # Save every image in the current batch
+    for idx, ino in enumerate(range(beg_img_no, end_img_no)):
+        img_to_save = img_list[indices[ino]]
+        error_img_path = os.path.join(
+            error_dir,
+            f"error_{timestamp}_batch{beg_img_no}_img{idx}.png"
+        )
+        cv2.imwrite(error_img_path, img_to_save)
+    log.error(f"OCR后处理失败, 已保存图片到 {error_dir}, 错误: {e}", exc_info=True)
+    raise
🧰 Tools
🪛 Ruff (0.14.1)

341-341: String contains ambiguous `,` (FULLWIDTH COMMA). Did you mean `,` (COMMA)?

(RUF001)


341-341: String contains ambiguous `,` (FULLWIDTH COMMA). Did you mean `,` (COMMA)?

(RUF001)

🤖 Prompt for AI Agents
In src/onnxocr/predict_rec.py around lines 323 to 342, the error log message
uses full-width (Chinese) commas; update the log string to use standard ASCII
commas instead. Replace the full-width commas in the f-string passed to
log.error with half-width commas so the message becomes e.g. "OCR后处理失败, 已保存图片到
{error_dir}, 错误: {e}" and keep exc_info=True and the subsequent raise unchanged.
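
The save-on-failure pattern reviewed here can be sketched independently of OpenCV. This is a minimal sketch under stated assumptions: `run_with_dump`, the `save` callback, and the file naming are hypothetical, whereas the PR itself writes images with `cv2.imwrite`. The sketch also guards the dump step, so that a failure while saving does not replace the original exception.

```python
import logging
import os
from datetime import datetime
from typing import Any, Callable, Sequence

log = logging.getLogger("ocr_dump_sketch")  # hypothetical logger name


def run_with_dump(step: Callable[[], Any],
                  items: Sequence[Any],
                  save: Callable[[str, Any], None],
                  error_dir: str = "ocr_error_images") -> Any:
    """Run `step`; if it raises, dump `items` via `save` and re-raise."""
    try:
        return step()
    except Exception:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
        try:
            os.makedirs(error_dir, exist_ok=True)
            for idx, item in enumerate(items):
                path = os.path.join(error_dir, f"error_{timestamp}_img{idx}.png")
                save(path, item)
        except Exception:
            # A failed dump should not hide the original error.
            log.exception("could not save debug items to %s", error_dir)
        log.exception("step failed; debug dump directory: %s", error_dir)
        raise  # re-raises the exception from `step`, not the dump failure
```

Passing the saver as a callback keeps the wrapper testable without image libraries; a caller using OpenCV could pass a function that forwards to `cv2.imwrite`.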

@A-nony-mous
Contributor Author

#2106

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Mar 20, 2026