Skip to content

fix: TorridgeDistrictCouncil - parse new relative-date SOAP format#1932

Open
InertiaUK wants to merge 1 commit intorobbrad:masterfrom
InertiaUK:fix/TorridgeDistrictCouncil
Open

fix: TorridgeDistrictCouncil - parse new relative-date SOAP format#1932
InertiaUK wants to merge 1 commit intorobbrad:masterfrom
InertiaUK:fix/TorridgeDistrictCouncil

Conversation

@InertiaUK
Copy link
Copy Markdown

@InertiaUK InertiaUK commented Apr 8, 2026

What this changes

Torridge's SOAP endpoint (getRoundCalendarForUPRN) now returns a completely different payload. Where it used to return explicit dates like Mon 14 Apr, it now returns relative phrases plus an embedded calendar table:

<b>Refuse</b>: Tomorrow then every Mon<br>
<b>Recycling</b>: No Recycling waste collection for this address<br>
...
<table id="CalTab00" class="newCalendarTable">...</table>

The old regex ([A-Za-z]+ \d\d? [A-Za-z]+) returns nothing against "Tomorrow then every Mon", so parse_data silently produced an empty bins list.

What the new parser does

  • Recognises relative phrases: Today, Tomorrow, and <Weekday> (e.g. every Mon -> next Monday).
  • Still handles the old explicit Mon 14 Apr format as a fallback, in case Torridge ever reverts.
  • Gracefully skips lines like No Recycling waste collection for this address (some UPRNs only have a subset of services).
  • Keeps the SOAPAction + charset header from the earlier iteration of this PR.

Testing

Verified via the VPS wrapper against UPRN 10091078762:

{
  "bins": [
    { "type": "Refuse", "collectionDate": "13/04/2026" }
  ]
}

13/04/2026 is the Monday following today (12/04/2026), which matches the server's textual Tomorrow then every Mon hint.

Supersedes

This replaces the previous commit on this branch (header + regex tidy). Force-pushed.

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability of bin collection data parsing for Torridge District Council.
    • Enhanced handling of collection dates, including support for relative dates (today, tomorrow) and weekday names.
    • Strengthened input validation for required parameters.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 8, 2026

📝 Walkthrough

Walkthrough

Updated Torridge District Council's bin collection parser to improve SOAP request handling by adding SOAPAction header, tightened UPRN validation, and refactored bin parsing from hardcoded bin types to generic extraction from bold HTML elements with advanced filtering and date normalization logic.

Changes

Cohort / File(s) Summary
SOAP Request/Response & Bin Parsing Refactor
uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py
Enhanced SOAP request handling with SOAPAction header; stricter UPRN validation via explicit ValueError; replaced conditional bin-type branches with generic loop over bold elements; added filtering logic to skip year-like patterns and invalid entries; implemented new _extract_base_date() method supporting relative dates (today/tomorrow), explicit day/month/year patterns, and weekday phrases with fallback.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Bold elements now dance in a circular way,
Instead of hardcoded bins from yesterday,
SOAP headers fly and dates normalize true,
Generic parsing brings a fresh morning dew,
Tomorrow and today both find their place anew! 📅✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately summarizes the main change: fixing Torridge's SOAP parser to handle the new relative-date format, with SOAPAction header support.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Torridge's SOAP API changed its response from explicit dates ("Mon 14 Apr")
to relative phrases ("Tomorrow then every Mon", "Today then every Tue")
with an embedded calendar table. The old regex-based parser returned empty
because it expected the old format.

The rewritten parser handles Today/Tomorrow/weekday-name phrases, falls back
to the old explicit format if it reappears, and gracefully skips "No X
collection for this address" entries.
@InertiaUK InertiaUK force-pushed the fix/TorridgeDistrictCouncil branch from b5ca9e1 to f1caa13 Compare April 12, 2026 10:35
@InertiaUK InertiaUK changed the title fix: TorridgeDistrictCouncil - add SOAPAction header and tighten date regex fix: TorridgeDistrictCouncil - parse new relative-date SOAP format Apr 12, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`:
- Around line 62-64: The code currently sets inner_html = result.text if result
is not None else "" which hides missing SOAP nodes; change this to fail fast by
checking the SOAP response node (result) and its text before creating the
BeautifulSoup. Specifically, in the parsing routine where variables result and
inner_html are used (the block that assigns inner_html and then calls
BeautifulSoup), raise an explicit exception (e.g., ValueError or a custom
ParseError) when result is None or result.text is empty/None with a clear
message indicating the missing SOAP result for TorridgeDistrictCouncil, instead
of defaulting to an empty string; this will surface upstream format changes
immediately.
- Around line 55-56: The code parses untrusted SOAP XML using
xml.etree.ElementTree.fromstring (vulnerable to XXE/billion-laughs); add
defusedxml to project dependencies (pyproject.toml) and replace imports/usages
of ElementTree in TorridgeDistrictCouncil.py to use defusedxml.ElementTree
(e.g., update the import and keep calling
defusedxml.ElementTree.fromstring(dom_text) and dom.find(...) as before) so
parsing uses the safe defusedxml implementation.
- Around line 49-50: The POST call at resp = requests.post(url, headers=headers,
data=post_data, verify=False) should include a timeout and explicit HTTP error
handling: add a timeout argument (e.g. timeout=10) to the requests.post call,
immediately call resp.raise_for_status() after the response to turn non-2xx
status codes into exceptions, and wrap the request/raise in a try/except that
catches requests.exceptions.RequestException to log or propagate a clear error
(including status/code/message) instead of allowing downstream XML parsing to
fail; update the function containing this call (the method with resp =
requests.post) to handle the exception path cleanly (log and return/raise a
descriptive error).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1ffbf182-90ca-4973-955e-02c2c052c3ce

📥 Commits

Reviewing files that changed from the base of the PR and between 60bd3cc and f1caa13.

📒 Files selected for processing (1)
  • uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py

Comment on lines +49 to 50
resp = requests.post(url, headers=headers, data=post_data, verify=False)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify SOAP calls in council parsers and whether timeout / raise_for_status are present
rg -n --type=py -C2 'requests\.post\(' uk_bin_collection/uk_bin_collection/councils
rg -n --type=py -C2 'raise_for_status\(' uk_bin_collection/uk_bin_collection/councils

Repository: robbrad/UKBinCollectionData

Length of output: 50383


🏁 Script executed:

cat -n uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py | sed -n '40,70p'

Repository: robbrad/UKBinCollectionData

Length of output: 1554


Add timeout and HTTP error handling to the SOAP request.

Line 49 performs a network call without timeout and without raise_for_status(). This can hang indefinitely and fail with confusing XML parsing errors instead of a clear HTTP error.

Suggested patch
-        resp = requests.post(url, headers=headers, data=post_data, verify=False)
+        try:
+            resp = requests.post(
+                url,
+                headers=headers,
+                data=post_data,
+                verify=False,
+                timeout=30,
+            )
+            resp.raise_for_status()
+        except requests.RequestException as ex:
+            raise ValueError("Failed to retrieve bin data from Torridge SOAP endpoint") from ex
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
resp = requests.post(url, headers=headers, data=post_data, verify=False)
try:
resp = requests.post(
url,
headers=headers,
data=post_data,
verify=False,
timeout=30,
)
resp.raise_for_status()
except requests.RequestException as ex:
raise ValueError("Failed to retrieve bin data from Torridge SOAP endpoint") from ex
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`
around lines 49 - 50, The POST call at resp = requests.post(url,
headers=headers, data=post_data, verify=False) should include a timeout and
explicit HTTP error handling: add a timeout argument (e.g. timeout=10) to the
requests.post call, immediately call resp.raise_for_status() after the response
to turn non-2xx status codes into exceptions, and wrap the request/raise in a
try/except that catches requests.exceptions.RequestException to log or propagate
a clear error (including status/code/message) instead of allowing downstream XML
parsing to fail; update the function containing this call (the method with resp
= requests.post) to handle the exception path cleanly (log and return/raise a
descriptive error).

Comment on lines +55 to +56
dom = ElementTree.fromstring(resp.text)
result = dom.find(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify current XML parser usage and whether defusedxml is already used
rg -n --type=py -C2 'from xml\.etree import ElementTree|ElementTree\.fromstring\('
rg -n --type=py -C2 'defusedxml'
fd -i 'pyproject.toml|requirements*.txt|setup.py|Pipfile|poetry.lock'

Repository: robbrad/UKBinCollectionData

Length of output: 1919


🏁 Script executed:

cat pyproject.toml

Repository: robbrad/UKBinCollectionData

Length of output: 2145


🏁 Script executed:

# Check for defusedxml in poetry.lock
rg -n 'defusedxml' poetry.lock | head -20

Repository: robbrad/UKBinCollectionData

Length of output: 671


🏁 Script executed:

rg -n 'ElementTree\.fromstring' --type=py

Repository: robbrad/UKBinCollectionData

Length of output: 320


Add defusedxml dependency and use it for parsing remote SOAP XML to prevent XML attacks.

Line 55 parses untrusted remote SOAP XML with xml.etree.ElementTree.fromstring, which is vulnerable to XML attacks (XXE, billion laughs). Replace with defusedxml.ElementTree. First, add defusedxml to pyproject.toml dependencies, then update the import:

Suggested patch
-from xml.etree import ElementTree
+from defusedxml import ElementTree
🧰 Tools
🪛 Ruff (0.15.9)

[error] 55-55: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents

(S314)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`
around lines 55 - 56, The code parses untrusted SOAP XML using
xml.etree.ElementTree.fromstring (vulnerable to XXE/billion-laughs); add
defusedxml to project dependencies (pyproject.toml) and replace imports/usages
of ElementTree in TorridgeDistrictCouncil.py to use defusedxml.ElementTree
(e.g., update the import and keep calling
defusedxml.ElementTree.fromstring(dom_text) and dom.find(...) as before) so
parsing uses the safe defusedxml implementation.

Comment on lines +62 to +64
inner_html = result.text if result is not None else ""

soup = BeautifulSoup(inner_html, features="html.parser")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fail fast if the SOAP result node is missing or empty.

Line 62 silently defaults to "", which can mask upstream format changes as “no bins”. Raise an explicit exception when result/result.text is missing.

Suggested patch
-        inner_html = result.text if result is not None else ""
+        if result is None or not result.text:
+            raise ValueError(
+                "Torridge SOAP response missing getRoundCalendarForUPRNResult content"
+            )
+        inner_html = result.text

Based on learnings: In uk_bin_collection/**/*.py, when parsing council bin collection data, prefer explicit failures over silent defaults or swallowed errors.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
inner_html = result.text if result is not None else ""
soup = BeautifulSoup(inner_html, features="html.parser")
if result is None or not result.text:
raise ValueError(
"Torridge SOAP response missing getRoundCalendarForUPRNResult content"
)
inner_html = result.text
soup = BeautifulSoup(inner_html, features="html.parser")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`
around lines 62 - 64, The code currently sets inner_html = result.text if result
is not None else "" which hides missing SOAP nodes; change this to fail fast by
checking the SOAP response node (result) and its text before creating the
BeautifulSoup. Specifically, in the parsing routine where variables result and
inner_html are used (the block that assigns inner_html and then calls
BeautifulSoup), raise an explicit exception (e.g., ValueError or a custom
ParseError) when result is None or result.text is empty/None with a clear
message indicating the missing SOAP result for TorridgeDistrictCouncil, instead
of defaulting to an empty string; this will surface upstream format changes
immediately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant