fix: Fix OCR encoding issue #1577
Conversation
Walkthrough
Changes
Sequence Diagram(s)
sequenceDiagram
    participant Reader as Dictionary reader
    participant Decoder as Decoding flow
    participant Logger as Logging system
    Reader->>Decoder: Read raw binary line (with line_num)
    alt UTF-8 succeeds
        Decoder->>Reader: Return decoded_line (UTF-8)
        Decoder->>Logger: info: line decoded successfully (line_num)
    else UTF-8 fails
        Decoder->>Decoder: Try GBK decoding
        alt GBK succeeds
            Decoder->>Reader: Return decoded_line (GBK)
            Decoder->>Logger: warning: GBK fallback used (line_num)
        else GBK fails
            Decoder->>Decoder: Try UTF-8 with errors="ignore"
            alt Usable content
                Decoder->>Reader: Return decoded_line (ignore mode)
                Decoder->>Logger: error: ignore fallback used, hex preview logged (line_num)
            else Nothing decodable
                Decoder->>Logger: error: nothing decodable, hex preview logged (line_num)
                Decoder->>Reader: Skip or append placeholder (depending on logic)
            end
        end
    end
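The decoding cascade in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name `decode_dict_line`, the module-level `log`, and the choice to return `None` when nothing is decodable are all assumptions made for the example.

```python
import logging
from typing import Optional

log = logging.getLogger(__name__)  # assumed logger; the PR uses its own log utility


def decode_dict_line(raw: bytes, line_num: int) -> Optional[str]:
    """Decode one raw dictionary line: UTF-8, then GBK, then UTF-8 with errors="ignore"."""
    try:
        text = raw.decode("utf-8").strip("\r\n")
        log.info("line %d: decoded as UTF-8", line_num)
        return text
    except UnicodeDecodeError:
        pass

    try:
        text = raw.decode("gbk").strip("\r\n")
        log.warning("line %d: decoded with GBK fallback", line_num)
        return text
    except UnicodeDecodeError:
        pass

    # Last resort: drop undecodable bytes and keep a short hex preview for debugging.
    text = raw.decode("utf-8", errors="ignore").strip("\r\n")
    preview = raw[:50].hex()
    if text:
        log.error("line %d: lossy decode (ignore mode), hex: %s", line_num, preview)
        return text
    log.error("line %d: nothing decodable, hex: %s", line_num, preview)
    return None  # hypothetical choice: the caller decides whether to skip or use a placeholder
```

A caller would iterate over the file opened in binary mode, pass each line with its 1-based number, and skip `None` results.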
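sequenceDiagram
    participant Predictor as Prediction flow
    participant Post as Post-processing (postprocess)
    participant Saver as Debug image saver
    participant Logger as Logging system
    Predictor->>Post: Call post-processing
    alt Post-processing succeeds
        Post-->>Predictor: Return result
    else Post-processing raises
        Post-->>Predictor: Exception
        Predictor->>Saver: Create `ocr_error_images` and save the batch images (with timestamp/batch number)
        Saver->>Logger: Log the save directory and file names
        Predictor->>Logger: Log the exception details and re-raise
    end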
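The save-on-failure flow in the diagram above might look like the sketch below. It is an illustration under assumptions: `run_with_image_dump` and its parameters are hypothetical names, and `save_image` stands in for any `(path, image)` writer such as `cv2.imwrite`, so the example stays free of third-party dependencies.

```python
import logging
import os
from datetime import datetime

log = logging.getLogger(__name__)  # assumed logger


def run_with_image_dump(postprocess, preds, images, save_image,
                        error_dir="ocr_error_images", batch_no=0):
    """Run postprocess(preds); on failure, save the batch images, log, and re-raise."""
    try:
        return postprocess(preds)
    except Exception:
        os.makedirs(error_dir, exist_ok=True)
        # Microsecond timestamp keeps file names from separate failures distinct.
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
        for idx, image in enumerate(images):
            path = os.path.join(
                error_dir, f"error_{timestamp}_batch{batch_no}_img{idx}.png"
            )
            save_image(path, image)
        # log.exception records the traceback along with the message.
        log.exception("postprocess failed; saved %d image(s) to %s",
                      len(images), error_dir)
        raise  # the caller still sees the original exception
```

Re-raising after saving keeps the failure visible to the caller while leaving the inputs on disk for later reproduction.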
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/onnxocr/rec_postprocess.py (1)
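25-31: The encoding error handling is sound; consider adding logging to improve observability. This cascade of decoding attempts handles character dictionary files that may use different encodings. The UTF-8 → GBK → UTF-8 (ignore) fallback strategy is appropriate for a Chinese-language project.
Suggestions:
Add logging: when falling back to GBK or to errors="ignore", log a warning. This would:
- help surface encoding problems in the dictionary file
- provide visibility when errors="ignore" silently discards characters
- make later debugging and maintenance easier
Watch for character loss: the last layer uses errors="ignore", which silently drops bytes that cannot be decoded and may hurt OCR accuracy. If the dictionary file uses another encoding (such as Big5 or Shift-JIS), character mappings may come out wrong.
Reference implementation:
 for line in lines:
     try:
         line = line.decode("utf-8").strip("\n").strip("\r\n")
     except UnicodeDecodeError:
+        import logging
+        logger = logging.getLogger(__name__)
         try:
+            logger.warning(f"Failed to decode line with UTF-8, trying GBK encoding")
             line = line.decode("gbk").strip("\n").strip("\r\n")
         except UnicodeDecodeError:
+            logger.warning(f"Failed to decode line with GBK, using UTF-8 with ignore. Some characters may be lost.")
             line = line.decode("utf-8", errors="ignore").strip("\n").strip("\r\n")
     self.character_str.append(line)
Optional improvement: if the project needs to support more encodings, consider using the chardet library to detect the encoding automatically; for the current scenario, the existing approach is sufficient.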
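To make the character-loss concern concrete, here is a small sketch of what errors="ignore" does to bytes that are not valid UTF-8. The helper name `decode_ignoring_errors` and the sample line are assumptions for illustration only.

```python
def decode_ignoring_errors(raw: bytes):
    """Decode as UTF-8 with errors="ignore" and report how many bytes were dropped."""
    text = raw.decode("utf-8", errors="ignore")
    # Valid UTF-8 round-trips exactly, so any length difference is discarded bytes.
    dropped = len(raw) - len(text.encode("utf-8"))
    return text, dropped


# Hypothetical dictionary line: ASCII followed by two GBK-encoded characters.
gbk_line = "abc中文".encode("gbk")
text, dropped = decode_ignoring_errors(gbk_line)
print(repr(text), "dropped bytes:", dropped)
```

The ASCII prefix survives while the GBK-encoded characters vanish without any exception, which is why a log entry at this stage is the only signal that a dictionary entry was altered.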
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/onnxocr/rec_postprocess.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test-check
e0087cf to 275f32b
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/onnxocr/rec_postprocess.py (1)
31-38: Consider also logging when GBK decoding succeeds. When UTF-8 decoding fails but GBK decoding succeeds (line 37), the code logs nothing. That can make encoding problems hard to debug, because there is no way to tell which lines were decoded as GBK.
Consider this change to log when the GBK fallback succeeds:
 try:
     decoded_line = line.decode("utf-8").strip("\n").strip("\r\n")
 except UnicodeDecodeError:
     # Try GBK decoding
     try:
         decoded_line = line.decode("gbk").strip("\n").strip("\r\n")
+        log.info(f"字典文件使用 GBK 解码 [{character_dict_path}:{line_num}]")
     except UnicodeDecodeError:
📜 Review details
📒 Files selected for processing (1)
src/onnxocr/rec_postprocess.py(2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
src/onnxocr/rec_postprocess.py
42-42: String contains ambiguous , (FULLWIDTH COMMA). Did you mean , (COMMA)?
(RUF001)
42-42: String contains ambiguous , (FULLWIDTH COMMA). Did you mean , (COMMA)?
(RUF001)
43-43: Do not catch blind exception: Exception
(BLE001)
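For readers unfamiliar with the RUF001 finding above, the following sketch shows how the full-width comma differs from the ASCII comma and how a message could be normalized. The helper names are hypothetical, and replacing with ", " simply mirrors the style of the reviewer's suggested diff.

```python
import unicodedata

FULLWIDTH_COMMA = "\uff0c"  # the character Ruff flags as ambiguous


def find_fullwidth_commas(message: str):
    """Return the indexes of every full-width comma in the message."""
    return [i for i, ch in enumerate(message) if ch == FULLWIDTH_COMMA]


def to_ascii_commas(message: str) -> str:
    """Replace each full-width comma with an ASCII comma followed by a space."""
    return message.replace(FULLWIDTH_COMMA, ", ")


# The two characters render similarly but have different Unicode names.
print(unicodedata.name(FULLWIDTH_COMMA))
print(unicodedata.name(","))
```

Whether to normalize punctuation inside Chinese log messages is a style decision for the project; the lint rule can also be disabled if full-width punctuation is intentional.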
🔇 Additional comments (2)
src/onnxocr/rec_postprocess.py (2)
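8-9: The logging utility import is necessary. Adding it for the new error logging is reasonable, and the code is cleanly formatted.
26-48: The overall encoding fallback strategy is well designed. The three-tier UTF-8 → GBK → UTF-8 (ignore) fallback handles mixed-encoding dictionary files and suits Chinese OCR in particular. Line-number tracking and the hex preview also help with debugging.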
# Finally, fall back to ignore mode and log it
try:
    decoded_line = line.decode("utf-8", errors="ignore").strip("\n").strip("\r\n")
    log.error(f"字典文件解码失败,使用 ignore 模式 [{character_dict_path}:{line_num}],行内容(hex): {line[:50].hex()}...")
except Exception as e:
    log.error(f"字典文件无法解码 [{character_dict_path}:{line_num}]: {e}")
    continue
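Fix the static analysis warnings and improve exception handling.
This snippet has the following problems:
- Line 42: the error message uses full-width commas (,) instead of half-width commas. The static analysis tool suggests standard commas for consistency.
- Line 43: catching a generic Exception is bad practice. Catch a more specific exception type (such as UnicodeDecodeError), or, if this is the final catch-all, at least explain why in a comment.
Consider the following change: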
 try:
     decoded_line = line.decode("utf-8", errors="ignore").strip("\n").strip("\r\n")
-    log.error(f"字典文件解码失败,使用 ignore 模式 [{character_dict_path}:{line_num}],行内容(hex): {line[:50].hex()}...")
-except Exception as e:
+    log.error(f"字典文件解码失败, 使用 ignore 模式 [{character_dict_path}:{line_num}], 行内容(hex): {line[:50].hex()}...")
+except (UnicodeDecodeError, AttributeError) as e:
     log.error(f"字典文件无法解码 [{character_dict_path}:{line_num}]: {e}")
     continue
🤖 Prompt for AI Agents
In src/onnxocr/rec_postprocess.py around lines 39 to 45, the error logging and
exception handling need tightening: replace the full-width comma in the log
message with a standard ASCII comma and update the except to catch
UnicodeDecodeError (for decode failures) so you don't catch all Exceptions; if
you need a final fallback, add a separate broad except with a brief comment
explaining it's a last-resort catch and re-use the existing log.error with the
exception details, then continue on error as before.
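A narrowed version of the last-resort handler could be sketched as below. The function name `decode_last_resort` and the `None` return are assumptions for illustration. One point worth noting: with errors="ignore", `bytes.decode` skips malformed bytes rather than raising UnicodeDecodeError, so in practice the handler mainly guards against non-bytes input (AttributeError).

```python
import logging

log = logging.getLogger(__name__)  # assumed logger


def decode_last_resort(line, line_num: int):
    """Lossy UTF-8 decode that catches only the exceptions named in the review."""
    try:
        return line.decode("utf-8", errors="ignore").strip("\r\n")
    except (UnicodeDecodeError, AttributeError) as exc:
        # AttributeError covers inputs without .decode, e.g. an already-decoded str.
        log.error("line %d: cannot decode: %s", line_num, exc)
        return None  # hypothetical signal for the caller to skip this line
```

Any other exception type propagates to the caller instead of being swallowed, which is the behavior the BLE001 warning is asking for.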
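Won't the logging be too frequent?
If it's too frequent, that means the character set is wrong and needs to be replaced.
Is there a sample that reproduces this reliably?
I mean the screenshot that triggered the problem, so we can confirm that this change actually works.
It can't be reproduced reliably at the moment.
We could add a try/catch layer that saves the screenshot when something goes wrong.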
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/onnxocr/rec_postprocess.py (2)
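28-54: The multi-encoding fallback strategy is well designed. The UTF-8 → GBK fallback matches the practical needs of a Chinese-language environment, and saving debug files makes troubleshooting easier. Note that the full-width comma on line 54 was already flagged in a previous review.
56-62: Fix the static analysis warnings and improve exception handling. The problems in this snippet were described in detail in a previous review:
- Line 59: the log message uses full-width commas (,) instead of half-width commas
- Line 60: catching a broad Exception is not best practice; catch a more specific exception type
See the change suggested in the previous review:
 try:
     decoded_line = line.decode("utf-8", errors="ignore").strip("\n").strip("\r\n")
-    log.error(f"字典文件解码失败,使用 ignore 模式 [{character_dict_path}:{line_num}],行内容(hex): {line[:50].hex()}...")
-except Exception as e:
+    log.error(f"字典文件解码失败, 使用 ignore 模式 [{character_dict_path}:{line_num}], 行内容(hex): {line[:50].hex()}...")
+except (UnicodeDecodeError, AttributeError) as e:
     log.error(f"字典文件无法解码 [{character_dict_path}:{line_num}]: {e}")
     continue
Based on previous review feedback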
📜 Review details
📒 Files selected for processing (2)
src/onnxocr/predict_rec.py (2 hunks)
src/onnxocr/rec_postprocess.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/onnxocr/predict_rec.py (1)
src/onnxocr/rec_postprocess.py (1)
CTCLabelDecode(131-152)
🪛 Ruff (0.14.1)
src/onnxocr/predict_rec.py
341-341: String contains ambiguous , (FULLWIDTH COMMA). Did you mean , (COMMA)?
(RUF001)
341-341: String contains ambiguous , (FULLWIDTH COMMA). Did you mean , (COMMA)?
(RUF001)
src/onnxocr/rec_postprocess.py
54-54: String contains ambiguous , (FULLWIDTH COMMA). Did you mean , (COMMA)?
(RUF001)
54-54: String contains ambiguous , (FULLWIDTH COMMA). Did you mean , (COMMA)?
(RUF001)
59-59: String contains ambiguous , (FULLWIDTH COMMA). Did you mean , (COMMA)?
(RUF001)
59-59: String contains ambiguous , (FULLWIDTH COMMA). Did you mean , (COMMA)?
(RUF001)
60-60: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (2)
src/onnxocr/predict_rec.py (1)
4-5: The imports are reasonable. The new imports support error handling and debugging and fit the exception handling added later.
Also applies to: 10-10
src/onnxocr/rec_postprocess.py (1)
7-10: The imports are reasonable. The new imports support the enhanced dictionary decoding logic and debugging.
try:
    rec_result = self.postprocess_op(preds)
except Exception as e:
    # Save the images on failure to help debugging
    error_dir = os.path.join(os.getcwd(), "ocr_error_images")
    os.makedirs(error_dir, exist_ok=True)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")

    # Save every image in the current batch
    for idx, ino in enumerate(range(beg_img_no, end_img_no)):
        img_to_save = img_list[indices[ino]]
        error_img_path = os.path.join(
            error_dir,
            f"error_{timestamp}_batch{beg_img_no}_img{idx}.png"
        )
        cv2.imwrite(error_img_path, img_to_save)

    log.error(f"OCR后处理失败,已保存图片到 {error_dir},错误: {e}", exc_info=True)
    raise
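Fix the full-width punctuation in the log message.
The error handling is well designed; saving the failing images helps reproduce and debug the problem. However, the log message uses full-width commas, which should be replaced with half-width commas for consistency.
Apply the following change: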
- log.error(f"OCR后处理失败,已保存图片到 {error_dir},错误: {e}", exc_info=True)
+ log.error(f"OCR后处理失败, 已保存图片到 {error_dir}, 错误: {e}", exc_info=True)
🤖 Prompt for AI Agents
In src/onnxocr/predict_rec.py around lines 323 to 342, the error log message
uses full-width (Chinese) commas; update the log string to use standard ASCII
commas instead. Replace the full-width commas in the f-string passed to
log.error with half-width commas so the message becomes e.g. "OCR后处理失败, 已保存图片到
{error_dir}, 错误: {e}" and keep exc_info=True and the subsequent raise unchanged.

No description provided.
Summary by CodeRabbit
Release notes