WeChat Official Account Article Crawler (微信公众号文章爬虫)
Scrapes published articles from WeChat Official Accounts via the mp.weixin.qq.com backend API. Single-file Python CLI tool (wechat_crawler.py).
# Install dependencies
pip install -r requirements.txt
# Search for an account (get fakeid)
python wechat_crawler.py --cookie "..." --token "..." --search "公众号名称"
# Crawl articles
python wechat_crawler.py --cookie "..." --token "..." --fakeid "..." --max 50
# Crawl with article body content, output as CSV
python wechat_crawler.py --cookie "..." --token "..." --fakeid "..." --content --format csv

Credentials can also be passed via the environment variables WX_COOKIE and WX_TOKEN.
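The fallback from command-line flags to environment variables could be sketched as follows. The helper name resolve_credentials and the precedence (flag first, then environment) are assumptions made for illustration; the text above only states that WX_COOKIE and WX_TOKEN are accepted.

```python
import os


def resolve_credentials(cli_cookie=None, cli_token=None, env=None):
    """Return (cookie, token), preferring CLI values over environment variables.

    Hypothetical helper: the precedence order is an assumption, not taken
    from wechat_crawler.py.
    """
    env = os.environ if env is None else env
    cookie = cli_cookie or env.get("WX_COOKIE", "")
    token = cli_token or env.get("WX_TOKEN", "")
    if not cookie or not token:
        # Fail early rather than sending unauthenticated requests.
        raise ValueError("cookie and token are required (flags or WX_COOKIE/WX_TOKEN)")
    return cookie, token
```

Passing env explicitly keeps the helper easy to exercise without touching the real process environment.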
All logic lives in wechat_crawler.py:
- WeChatCrawler: core class. Holds a requests.Session authenticated with cookie/token.
  - search_account() → calls /cgi-bin/searchbiz to find accounts by name, returns the fakeid
  - get_articles() / get_all_articles() → call /cgi-bin/appmsgpublish to paginate through published articles (page size 5, random 3–8 s delay between pages)
  - get_article_content() → fetches the article HTML and extracts the body text via regex
  - save_to_json() / save_to_csv() → output serialization
- main(): CLI entry point with manual argument parsing (no argparse)
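The pagination behaviour described above (page size 5, a random 3–8 second pause between pages) might look roughly like this sketch. It is not the code from wechat_crawler.py: paginate and fetch_page are hypothetical names, and fetch_page stands in for whatever performs the actual request to /cgi-bin/appmsgpublish.

```python
import random
import time

PAGE_SIZE = 5  # page size stated in the description above


def paginate(fetch_page, max_articles, delay_range=(3.0, 8.0), sleep=time.sleep):
    """Collect up to max_articles items by calling fetch_page(begin, count).

    fetch_page is a hypothetical callable returning a list of articles for
    one page; an empty list signals that there are no more pages.
    """
    articles = []
    begin = 0
    while len(articles) < max_articles:
        page = fetch_page(begin, PAGE_SIZE)
        if not page:
            break
        articles.extend(page)
        begin += PAGE_SIZE
        if len(articles) < max_articles:
            # Random pause between pages to reduce the chance of rate limiting.
            sleep(random.uniform(*delay_range))
    return articles[:max_articles]
```

Injecting sleep as a parameter is a design choice for this sketch only: it lets the loop be tested without actually waiting several seconds per page.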
- Article list endpoint returns publish_list, where each item's publish_info is a JSON string that must be parsed with json.loads()
- Each publish can contain multiple articles (multi-article posts, 多图文) in the appmsgex array
- Rate limit error code 200013 triggers a 60-second backoff
- Cookie/token are session-scoped and expire; they must be re-obtained from the browser
- Python 3.10+ (uses list[dict] and tuple[float, float] type hints)
- Only dependency: requests
- Console output uses bracketed prefixes: [信息] (info), [错误] (error), [进度] (progress), [等待] (waiting), [保存] (saved)
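The response handling noted above (publish_info as a JSON string, several articles per publish in appmsgex, and error code 200013) can be illustrated with a minimal sketch. The sample payload is invented, the title field is a placeholder, and reading the error code from base_resp.ret is an assumption; only publish_list, publish_info, appmsgex and 200013 come from the notes above.

```python
import json

RATE_LIMIT_CODE = 200013  # error code named in the notes above
BACKOFF_SECONDS = 60      # backoff duration named in the notes above


def is_rate_limited(response: dict) -> bool:
    """Assumption: the error code is reported under base_resp.ret."""
    return response.get("base_resp", {}).get("ret") == RATE_LIMIT_CODE


def extract_articles(response: dict) -> list[dict]:
    """Flatten publish_list into a single list of article dicts."""
    articles = []
    for item in response.get("publish_list", []):
        raw = item.get("publish_info")
        if not raw:
            continue  # skip entries without a payload
        info = json.loads(raw)  # publish_info is a JSON string, not an object
        articles.extend(info.get("appmsgex", []))
    return articles


# Invented sample payload for illustration only.
sample = {
    "base_resp": {"ret": 0},
    "publish_list": [
        {"publish_info": json.dumps({"appmsgex": [{"title": "A"}, {"title": "B"}]})},
        {"publish_info": ""},
        {"publish_info": json.dumps({"appmsgex": [{"title": "C"}]})},
    ],
}
titles = [a["title"] for a in extract_articles(sample)]
print(titles)  # ['A', 'B', 'C']
```

A caller would check is_rate_limited(response) first and, if it returns True, wait BACKOFF_SECONDS before retrying the same page.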