AutoCCF is a data collection toolkit for Baidu Tieba. It uses a two-stage workflow: discovery of a user's post index, followed by archival of thread details.
AutoCCF is designed for controlled, auditable data collection. It separates fast user-level discovery from slower thread-level archival so that teams can manage collection scope, credentials, and storage more carefully.
- APoU (All Posts of User) quickly collects a user's public posting index.
- DoPJ (Detail of Posts JSON) fetches detailed thread content based on APoU output.
- An optional Electron desktop interface provides a graphical front end for operational workflows.
This repository is intended only for lawful research, learning, internal tooling, and other authorized data processing scenarios.
Before using this project, make sure all of the following are true:
- You have a lawful basis and sufficient authorization for the target data.
- Your use complies with applicable laws, platform terms, and privacy obligations.
- You collect only the minimum data necessary for the stated purpose.
- You have defined retention, access control, and deletion rules for collected data.
- You do not use this project to bypass access restrictions, abuse controls, or platform security mechanisms.
If any of the above cannot be satisfied, do not use this project.
Collected forum content may contain personal data, user-generated content, identifiers, or other sensitive context.
Recommended operating principles:
- Minimize collection scope.
- Avoid collecting accounts, users, or threads unrelated to your approved purpose.
- Restrict internal access to raw outputs on a least-privilege basis.
- Remove or anonymize sensitive data before redistribution.
- Establish a deletion timeline for intermediate files and archived outputs.
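As one way to act on the "remove or anonymize" principle, the sketch below replaces identifier fields with keyed hashes before a record is shared. The field names (`user_name`, `user_id`, `portrait`) and the `pseudonymize` helper are assumptions made for illustration, not the actual AutoCCF output schema. Keyed hashing is pseudonymization rather than full anonymization, and free-text fields may still contain personal data that needs separate review.

```python
import hashlib
import hmac

# Hypothetical field names; the real AutoCCF output schema may differ.
SENSITIVE_FIELDS = ("user_name", "user_id", "portrait")


def pseudonymize(record: dict, secret: bytes) -> dict:
    """Return a copy of `record` with sensitive fields replaced by keyed hashes."""
    cleaned = dict(record)  # shallow copy so the caller's record is not modified
    for field in SENSITIVE_FIELDS:
        if cleaned.get(field) is not None:
            value = str(cleaned[field]).encode("utf-8")
            digest = hmac.new(secret, value, hashlib.sha256).hexdigest()
            # A truncated digest keeps records linkable without exposing the raw value.
            cleaned[field] = digest[:16]
    return cleaned


if __name__ == "__main__":
    sample = {"user_name": "example_user", "tid": 123, "text": "hello"}
    print(pseudonymize(sample, secret=b"replace-with-a-managed-secret"))
```

Using the same secret keeps the mapping stable across files; rotating or destroying the secret makes the hashed values harder to link back to the original identifiers.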
AutoCCF/
├── APoU.py
├── APoU/
├── DoPJ/
├── AutoCCF/
├── electron/
├── tests/
├── README.md
└── requirements.txt
- APoU: quickly retrieves a user's posting index and writes a JSON list for downstream use.
- DoPJ: reads APoU output and fetches detailed thread content, with retry and concurrency controls.
- Electron GUI: provides an Electron-based graphical wrapper around the project workflow for local operation.
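To illustrate what "retry and concurrency controls" can look like in practice, here is a minimal sketch using `asyncio`. It is not DoPJ's actual implementation: the `fetch` callable, the function names, and the `max_concurrency`, `retries`, and `base_delay` parameters are all hypothetical.

```python
import asyncio
import random


async def fetch_with_retry(fetch, tid, semaphore, retries=3, base_delay=0.5):
    """Call the coroutine function `fetch(tid)` with bounded concurrency and retries."""
    for attempt in range(1, retries + 1):
        try:
            # The semaphore caps how many fetches are in flight at once.
            async with semaphore:
                return await fetch(tid)
        except Exception:
            if attempt == retries:
                raise
            # Exponential backoff with a little jitter; the sleep happens outside
            # the semaphore so a waiting task does not hold a concurrency slot.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


async def fetch_all(fetch, tids, max_concurrency=3, **retry_options):
    """Fetch every thread ID; failures are returned as exception objects."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_with_retry(fetch, tid, semaphore, **retry_options) for tid in tids]
    return await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    async def fake_fetch(tid):
        # Stand-in for a real network call.
        await asyncio.sleep(0.01)
        return {"tid": tid}

    print(asyncio.run(fetch_all(fake_fetch, [101, 102, 103])))
```

Returning exceptions from `gather` instead of raising lets a batch continue when a single thread fails, which suits long archival runs where partial results are still useful.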
- Python 3.10+
- A working network connection to the required upstream services
- Optional: Node.js and npm for the Electron GUI
python APoU.py -u <username>
Expected result: a user post index file suitable for downstream processing.
cd DoPJ
python DoPJ.py -i ../<username>_posts.json -c config/config.json -t 3
Expected result: detailed thread data and related output files under the configured output directory.
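The two commands above can also be chained from a small driver script. The sketch below simply reproduces the documented command lines with `subprocess`; the `build_commands` and `run_pipeline` helpers are hypothetical and not part of AutoCCF, and the input path assumes APoU writes `<username>_posts.json` to the repository root, as the DoPJ command implies.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(username: str, repo_root: Path):
    """Return (argv, working directory) pairs mirroring the two documented commands."""
    stage1 = ([sys.executable, "APoU.py", "-u", username], repo_root)
    stage2 = (
        [
            sys.executable, "DoPJ.py",
            "-i", f"../{username}_posts.json",
            "-c", "config/config.json",
            "-t", "3",
        ],
        repo_root / "DoPJ",
    )
    return [stage1, stage2]


def run_pipeline(username: str, repo_root: Path) -> None:
    """Run stage 1 (APoU), then stage 2 (DoPJ), stopping at the first failure."""
    for argv, cwd in build_commands(username, repo_root):
        # check=True raises CalledProcessError if a stage exits with a non-zero status.
        subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them; call run_pipeline() to execute.
    for argv, cwd in build_commands("example_user", Path(".")):
        print(cwd, " ".join(argv))
```

Stopping at the first failed stage avoids running DoPJ against a missing or partial index file.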
cd electron
npm install
npm start
Typical runtime output structure:
database/
└── <username>/
    ├── apou/
    │   └── posts.json
    └── dopj/
        ├── index.json
        └── <tid>/
Output paths may vary depending on configuration and legacy compatibility paths.
All collected outputs should be treated as controlled data assets.
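For downstream scripts, the typical layout above can be navigated with `pathlib`. This is a sketch that assumes the default `database/<username>/apou` and `database/<username>/dopj` structure; the helper names are hypothetical, and the paths may not match a customized or legacy configuration.

```python
from pathlib import Path


def apou_index_path(root: Path, username: str) -> Path:
    """Path of the APoU post index under the assumed default layout."""
    return root / username / "apou" / "posts.json"


def list_archived_threads(root: Path, username: str) -> list[str]:
    """Names of the per-thread directories (<tid>/) under dopj/, sorted."""
    dopj_dir = root / username / "dopj"
    if not dopj_dir.is_dir():
        return []
    return sorted(p.name for p in dopj_dir.iterdir() if p.is_dir())


if __name__ == "__main__":
    root = Path("database")
    print(apou_index_path(root, "example_user"))
    print(list_archived_threads(root, "example_user"))
```

Because `index.json` is a file rather than a directory, it is skipped when listing thread directories.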
python -m py_compile APoU.py
python -m py_compile DoPJ/DoPJ.py
python -c "from APoU import UserPostsCrawler; print('APoU module OK')"
cd DoPJ
python -c "from utils.url_parser import parse_tieba_url; print(parse_tieba_url('https://tieba.baidu.com/p/123?pid=456'))"
npx playwright test tests/test_ui/test_payload_e2e.js -g "Phase 2: APoU"
- Upstream API behavior may change without notice.
- Platform risk control may reduce the crawl success rate.
- Large-scale collection can be sensitive from both compliance and operational perspectives.
This project is licensed under the MIT License.
This project references and builds on the ideas or implementations of the following repositories and services:
- tb.anova.me - API source used by APoU.
- TiebaArchiver / TiebaScraper - referenced by DoPJ historical implementation.
- aiotieba - async Tieba API library used by this project.
- TiebaReader - companion reader tool for archived data.
If AutoCCF helps your work, please also cite or star the upstream projects above.
Use this project only within authorized, auditable, and compliant boundaries.
Perform legal, privacy, and security review before any production or organization-wide deployment.