fix: TorridgeDistrictCouncil - parse new relative-date SOAP format#1932
fix: TorridgeDistrictCouncil - parse new relative-date SOAP format#1932InertiaUK wants to merge 1 commit intorobbrad:masterfrom
Conversation
📝 WalkthroughWalkthroughUpdated Torridge District Council's bin collection parser to improve SOAP request handling by adding SOAPAction header, tightened UPRN validation, and refactored bin parsing from hardcoded bin types to generic extraction from bold HTML elements with advanced filtering and date normalization logic. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Torridge's SOAP API changed its response from explicit dates ("Mon 14 Apr")
to relative phrases ("Tomorrow then every Mon", "Today then every Tue")
with an embedded calendar table. The old regex-based parser returned empty
because it expected the old format.
The rewritten parser handles Today/Tomorrow/weekday-name phrases, falls back
to the old explicit format if it reappears, and gracefully skips "No X
collection for this address" entries.
b5ca9e1 to
f1caa13
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`:
- Around line 62-64: The code currently sets inner_html = result.text if result
is not None else "" which hides missing SOAP nodes; change this to fail fast by
checking the SOAP response node (result) and its text before creating the
BeautifulSoup. Specifically, in the parsing routine where variables result and
inner_html are used (the block that assigns inner_html and then calls
BeautifulSoup), raise an explicit exception (e.g., ValueError or a custom
ParseError) when result is None or result.text is empty/None with a clear
message indicating the missing SOAP result for TorridgeDistrictCouncil, instead
of defaulting to an empty string; this will surface upstream format changes
immediately.
- Around line 55-56: The code parses untrusted SOAP XML using
xml.etree.ElementTree.fromstring (vulnerable to XXE/billion-laughs); add
defusedxml to project dependencies (pyproject.toml) and replace imports/usages
of ElementTree in TorridgeDistrictCouncil.py to use defusedxml.ElementTree
(e.g., update the import and keep calling
defusedxml.ElementTree.fromstring(dom_text) and dom.find(...) as before) so
parsing uses the safe defusedxml implementation.
- Around line 49-50: The POST call at resp = requests.post(url, headers=headers,
data=post_data, verify=False) should include a timeout and explicit HTTP error
handling: add a timeout argument (e.g. timeout=10) to the requests.post call,
immediately call resp.raise_for_status() after the response to turn non-2xx
status codes into exceptions, and wrap the request/raise in a try/except that
catches requests.exceptions.RequestException to log or propagate a clear error
(including status/code/message) instead of allowing downstream XML parsing to
fail; update the function containing this call (the method with resp =
requests.post) to handle the exception path cleanly (log and return/raise a
descriptive error).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1ffbf182-90ca-4973-955e-02c2c052c3ce
📒 Files selected for processing (1)
uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py
| resp = requests.post(url, headers=headers, data=post_data, verify=False) | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify SOAP calls in council parsers and whether timeout / raise_for_status are present
rg -n --type=py -C2 'requests\.post\(' uk_bin_collection/uk_bin_collection/councils
rg -n --type=py -C2 'raise_for_status\(' uk_bin_collection/uk_bin_collection/councilsRepository: robbrad/UKBinCollectionData
Length of output: 50383
🏁 Script executed:
cat -n uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py | sed -n '40,70p'Repository: robbrad/UKBinCollectionData
Length of output: 1554
Add timeout and HTTP error handling to the SOAP request.
Line 49 performs a network call without timeout and without raise_for_status(). This can hang indefinitely and fail with confusing XML parsing errors instead of a clear HTTP error.
Suggested patch
- resp = requests.post(url, headers=headers, data=post_data, verify=False)
+ try:
+ resp = requests.post(
+ url,
+ headers=headers,
+ data=post_data,
+ verify=False,
+ timeout=30,
+ )
+ resp.raise_for_status()
+ except requests.RequestException as ex:
+ raise ValueError("Failed to retrieve bin data from Torridge SOAP endpoint") from ex📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| resp = requests.post(url, headers=headers, data=post_data, verify=False) | |
| try: | |
| resp = requests.post( | |
| url, | |
| headers=headers, | |
| data=post_data, | |
| verify=False, | |
| timeout=30, | |
| ) | |
| resp.raise_for_status() | |
| except requests.RequestException as ex: | |
| raise ValueError("Failed to retrieve bin data from Torridge SOAP endpoint") from ex |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`
around lines 49 - 50, The POST call at resp = requests.post(url,
headers=headers, data=post_data, verify=False) should include a timeout and
explicit HTTP error handling: add a timeout argument (e.g. timeout=10) to the
requests.post call, immediately call resp.raise_for_status() after the response
to turn non-2xx status codes into exceptions, and wrap the request/raise in a
try/except that catches requests.exceptions.RequestException to log or propagate
a clear error (including status/code/message) instead of allowing downstream XML
parsing to fail; update the function containing this call (the method with resp
= requests.post) to handle the exception path cleanly (log and return/raise a
descriptive error).
| dom = ElementTree.fromstring(resp.text) | ||
| result = dom.find( |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify current XML parser usage and whether defusedxml is already used
rg -n --type=py -C2 'from xml\.etree import ElementTree|ElementTree\.fromstring\('
rg -n --type=py -C2 'defusedxml'
fd -i 'pyproject.toml|requirements*.txt|setup.py|Pipfile|poetry.lock'Repository: robbrad/UKBinCollectionData
Length of output: 1919
🏁 Script executed:
cat pyproject.tomlRepository: robbrad/UKBinCollectionData
Length of output: 2145
🏁 Script executed:
# Check for defusedxml in poetry.lock
rg -n 'defusedxml' poetry.lock | head -20Repository: robbrad/UKBinCollectionData
Length of output: 671
🏁 Script executed:
rg -n 'ElementTree\.fromstring' --type=pyRepository: robbrad/UKBinCollectionData
Length of output: 320
Add defusedxml dependency and use it for parsing remote SOAP XML to prevent XML attacks.
Line 55 parses untrusted remote SOAP XML with xml.etree.ElementTree.fromstring, which is vulnerable to XML attacks (XXE, billion laughs). Replace with defusedxml.ElementTree. First, add defusedxml to pyproject.toml dependencies, then update the import:
Suggested patch
-from xml.etree import ElementTree
+from defusedxml import ElementTree🧰 Tools
🪛 Ruff (0.15.9)
[error] 55-55: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents
(S314)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`
around lines 55 - 56, The code parses untrusted SOAP XML using
xml.etree.ElementTree.fromstring (vulnerable to XXE/billion-laughs); add
defusedxml to project dependencies (pyproject.toml) and replace imports/usages
of ElementTree in TorridgeDistrictCouncil.py to use defusedxml.ElementTree
(e.g., update the import and keep calling
defusedxml.ElementTree.fromstring(dom_text) and dom.find(...) as before) so
parsing uses the safe defusedxml implementation.
| inner_html = result.text if result is not None else "" | ||
|
|
||
| soup = BeautifulSoup(inner_html, features="html.parser") |
There was a problem hiding this comment.
Fail fast if the SOAP result node is missing or empty.
Line 62 silently defaults to "", which can mask upstream format changes as “no bins”. Raise an explicit exception when result/result.text is missing.
Suggested patch
- inner_html = result.text if result is not None else ""
+ if result is None or not result.text:
+ raise ValueError(
+ "Torridge SOAP response missing getRoundCalendarForUPRNResult content"
+ )
+ inner_html = result.textBased on learnings: In uk_bin_collection/**/*.py, when parsing council bin collection data, prefer explicit failures over silent defaults or swallowed errors.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| inner_html = result.text if result is not None else "" | |
| soup = BeautifulSoup(inner_html, features="html.parser") | |
| if result is None or not result.text: | |
| raise ValueError( | |
| "Torridge SOAP response missing getRoundCalendarForUPRNResult content" | |
| ) | |
| inner_html = result.text | |
| soup = BeautifulSoup(inner_html, features="html.parser") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`
around lines 62 - 64, The code currently sets inner_html = result.text if result
is not None else "" which hides missing SOAP nodes; change this to fail fast by
checking the SOAP response node (result) and its text before creating the
BeautifulSoup. Specifically, in the parsing routine where variables result and
inner_html are used (the block that assigns inner_html and then calls
BeautifulSoup), raise an explicit exception (e.g., ValueError or a custom
ParseError) when result is None or result.text is empty/None with a clear
message indicating the missing SOAP result for TorridgeDistrictCouncil, instead
of defaulting to an empty string; this will surface upstream format changes
immediately.
What this changes
Torridge's SOAP endpoint (
getRoundCalendarForUPRN) now returns a completely different payload. Where it used to return explicit dates likeMon 14 Apr, it now returns relative phrases plus an embedded calendar table:The old regex
([A-Za-z]+ \d\d? [A-Za-z]+)returns nothing against"Tomorrow then every Mon", soparse_datasilently produced an emptybinslist.What the new parser does
Today,Tomorrow, and<Weekday>(e.g.every Mon-> next Monday).Mon 14 Aprformat as a fallback, in case Torridge ever reverts.No Recycling waste collection for this address(some UPRNs only have a subset of services).Testing
Verified via the VPS wrapper against UPRN
10091078762:13/04/2026 is the Monday following today (12/04/2026), which matches the server's textual
Tomorrow then every Monhint.Supersedes
This replaces the previous commit on this branch (header + regex tidy). Force-pushed.
Summary by CodeRabbit