Extend `feedparser.parse()` with `archive_url_data:bool` and `request_hooks` by jvanasco · Pull Request #533 · kurtmckee/feedparser

jvanasco · 2025-11-06T23:12:14Z

We needed a way to archive the data that feedparser uses when processing a url, for the purposes of troubleshooting, running tests and regression analysis.

There were two options to achieve that:

1. Download the URL ourselves, then parse that with feedparser.
2. Extend feedparser to save the "raw" data.

This PR is a quick attempt at the latter, since being able to inspect this data while troubleshooting is widely useful. It:

- introduces `archive_url_data:bool` to `feedparser.parse`.
- if set, a `.raw` attribute on the result `FeedParserDict` will contain the "content" and "headers".
- headers are copied to `.raw` BEFORE they are updated by kwargs.
- extends `feedparser.api._open_resource` to return the "type" of data accessed, in addition to the data.
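To illustrate why the copy order matters, here is a minimal standalone sketch. It is not the code in this PR: `build_raw_archive` and its parameters are hypothetical names, and it assumes the "kwargs" update means caller-supplied header overrides being merged over the response headers.

```python
from typing import Any, Mapping, Optional


def build_raw_archive(
    content: bytes,
    response_headers: Mapping[str, str],
    override_headers: Optional[Mapping[str, str]] = None,
) -> dict[str, Any]:
    """Hypothetical helper: snapshot the payload and headers, then apply overrides."""
    # Copy first, so the archive reflects what the server actually sent.
    raw = {"content": content, "headers": dict(response_headers)}
    # Overrides go onto a separate working copy that is used for parsing only.
    effective = dict(response_headers)
    effective.update(override_headers or {})
    return {"raw": raw, "effective_headers": effective}


result = build_raw_archive(
    b"<rss/>",
    {"content-type": "text/xml"},
    {"content-type": "application/rss+xml"},
)
print(result["raw"]["headers"])  # the server's original headers, untouched
```

Snapshotting before the update means the archived headers can be compared against the effective ones when a feed parses differently than expected.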
Additionally, `request_hooks` is added to `parse`. This is a dict containing "hooks" to pass on to `requests.get` for customization. It also supports a "response.postprocess" hook, which is not passed on to requests and can be used to operate on the response before it is lost. This allows for capturing the actual IP address of the remote server, as shown below. (The `response_peername__hook` needs to execute before content is read from the connection.)

I'm happy to achieve this in other ways and work towards an acceptable PR. I'd just like to ensure there is a way to access and operate on the raw data feedparser natively pulls out. We've had issues due to networking/round-robin DNS and throttling that are best identified, and only solved, by examining this info.

import typing

import feedparser
from feedparser.http import RequestHooks

from metadata_parser.requests_extensions import response_peername__hook

if typing.TYPE_CHECKING:
    from requests import Response

    from feedparser.util import FeedParserDict

def process_result(response: "Response", result: "FeedParserDict") -> None:
    # `_mp_peername` is expected to have been set by `response_peername__hook`.
    result.raw["peername"] = response._mp_peername


request_hooks: RequestHooks = {
    "response": response_peername__hook,
    "response.postprocess": process_result,
}

feed = feedparser.parse(
    "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
    archive_url_data=True,
    request_hooks=request_hooks,
)

print("Feed was downloaded from:", feed.raw["peername"])
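The split between hooks that are forwarded to `requests.get` and the "response.postprocess" hook that is kept back could be sketched roughly as follows. This is an assumption about the approach, not the PR's implementation: `split_request_hooks`, `POSTPROCESS_KEY`, and the two stand-in hooks are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Optional

# Key described in the PR text as NOT being forwarded to requests.
POSTPROCESS_KEY = "response.postprocess"

Hook = Callable[..., Any]


def split_request_hooks(
    request_hooks: Optional[dict[str, Hook]],
) -> tuple[dict[str, Hook], Optional[Hook]]:
    """Hypothetical helper: separate requests-native hooks from the postprocess hook."""
    hooks = dict(request_hooks or {})  # copy so the caller's dict is not mutated
    postprocess = hooks.pop(POSTPROCESS_KEY, None)
    return hooks, postprocess


def fake_response_hook(response, *args, **kwargs):
    # Stand-in for a requests "response" hook such as response_peername__hook.
    return response


def fake_postprocess(response, result):
    # Stand-in for a postprocess hook that annotates the parse result.
    result["seen"] = True


forwarded, postprocess = split_request_hooks(
    {"response": fake_response_hook, POSTPROCESS_KEY: fake_postprocess}
)
print(sorted(forwarded))  # ['response']
```

Only `forwarded` would be handed to `requests.get(..., hooks=forwarded)`; `postprocess` would be called later with the response and the parse result.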

Fixes: #289


jvanasco and others added 5 commits November 6, 2025 16:48

introduce archive_url_data to feedparser.parse

406a02e

supporting requests hooks

9d7bf4e

[pre-commit.ci] auto fixes from pre-commit.com hooks

034c5e1

for more information, see https://pre-commit.ci

fix changes from pre-commit.ci -- how did those even happen?!?

b5b6d11

NotRequired is not available on python 3.10

a7a03d6

jvanasco wants to merge 5 commits into kurtmckee:main from jvanasco:feature-save_downloads
