fix: WyreCouncil - skip non-bin .boxed divs and harden date parsing #1952
InertiaUK wants to merge 2 commits into robbrad:master
Conversation
Torridge's SOAP API changed its response from explicit dates ("Mon 14 Apr")
to relative phrases ("Tomorrow then every Mon", "Today then every Tue")
with an embedded calendar table. The old regex-based parser returned empty
because it expected the old format.
The rewritten parser handles Today/Tomorrow/weekday-name phrases, falls back
to the old explicit format if it reappears, and gracefully skips "No X
collection for this address" entries.
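The Today/Tomorrow/weekday handling can be sketched roughly as follows; `extract_base_date` and the exact phrase grammar are assumptions for illustration, not the PR's actual helper:

```python
from datetime import date, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def extract_base_date(phrase, today):
    """Resolve 'Today ...', 'Tomorrow ...' or a leading weekday name to a date."""
    words = phrase.split()
    if not words:
        return None
    first = words[0]
    if first == "Today":
        return today
    if first == "Tomorrow":
        return today + timedelta(days=1)
    if first[:3] in WEEKDAYS:
        # Next occurrence of that weekday; counting today as a match is an assumption.
        offset = (WEEKDAYS.index(first[:3]) - today.weekday()) % 7
        return today + timedelta(days=offset)
    return None  # e.g. "No garden collection for this address"

monday = date(2025, 4, 14)  # a Monday
print(extract_base_date("Tomorrow then every Tue", monday))  # prints 2025-04-15
```

Returning `None` for the "No X collection" phrasing mirrors the skip behaviour described above; the real helper may distinguish that case explicitly.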
The Wyre bin collection page now includes a 'Download calendar' box among the `.boxed` divs. This box has no `h3.bin-collection-tasks__heading`, which caused the scraper to crash on `.text` access.

Changes:
- Skip boxes missing the heading or content container
- Use regex to extract the bin name from 'Your next X collection'
- Collapse whitespace on the date text before `strptime` (`p` tags produced multi-whitespace runs after `.text`)
- Roll the year forward whenever the computed date is in the past, instead of only handling the December->January edge case
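The whitespace collapse and year rollover described above can be sketched like this (a simplified illustration; `parse_wyre_date` is not the PR's actual function, and ordinal suffixes such as "17th" are assumed to be stripped beforehand):

```python
from datetime import datetime

def parse_wyre_date(raw, today):
    # .text across separate <p> tags leaves runs of whitespace; collapse them first.
    date_text = " ".join(raw.split())
    parsed = datetime.strptime(date_text, "%A %d %B")  # year defaults to 1900
    candidate = parsed.replace(year=today.year)
    # Roll forward whenever the date is in the past, not just across December->January.
    if candidate.date() < today.date():
        candidate = candidate.replace(year=today.year + 1)
    return candidate

today = datetime(2025, 12, 20)
print(parse_wyre_date("Friday \n  2   January", today))  # prints 2026-01-02 00:00:00
```

Note that `strptime` parses `%A` without cross-checking it against the day and month, so the weekday text only needs to be a valid name.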
📝 Walkthrough

Two council bin collection parsers receive updates to improve date parsing robustness. TorridgeDistrictCouncil refactors SOAP response handling to support relative date phrases and introduces a new helper method for date extraction. WyreCouncil strengthens date parsing validation and improves year rollover logic through normalization and defensive checks.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`:
- Around line 55-63: The current parsing swallows missing/invalid SOAP body
content by setting inner_html = "" which hides format changes; update the
parsing around ElementTree.fromstring/ result and inner_html in
TorridgeDistrictCouncil (and analogous blocks at the other noted spots) to
validate result is not None and that result.text contains expected content, and
if not raise a clear exception (e.g., ValueError or a custom ParseError) with a
message describing which element was missing or which date phrase was
unrecognized; ensure the exception mentions the operation
(getRoundCalendarForUPRNResult) and include the raw resp.text or the offending
snippet to aid debugging.
- Line 49: Remove the disabled TLS verification on the SOAP call in
TorridgeDistrictCouncil by eliminating the verify=False parameter on the
requests.post call (in the code that constructs resp = requests.post(...)).
Ensure the request uses standard certificate validation (either omit the verify
arg or set verify=True) so the HTTPS connection validates the server
certificate; update any related tests/mocks that assumed insecure behavior if
necessary.
- Line 55: Replace the unsafe stdlib XML parse call dom =
ElementTree.fromstring(resp.text) with the hardened parser from defusedxml (e.g.
from defusedxml import ElementTree and then dom =
ElementTree.fromstring(resp.text) or import defusedxml.ElementTree as
ElementTree) and add defusedxml to project dependencies; update the import in
TorridgeDistrictCouncil.py where ElementTree is referenced so the call in the
code (dom, resp.text) uses defusedxml's ElementTree.fromstring to prevent
XXE/DoS.
In `@uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py`:
- Around line 57-69: In the parsing loop in WyreCouncil.py (the block that finds
title_match and parses collection_date), stop silently continuing on a regex
miss or date parse failure and instead raise a clear exception (e.g., ValueError
or a custom WyreParsingError) that includes context (the failing heading_text,
date_text and any relevant identifiers) so callers/tests can detect format
changes; replace the two continue statements after the title_match check and the
datetime.strptime except block with raised exceptions and add/adjust unit tests
to assert these invalid inputs raise the new error.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 453c99e6-87a8-4980-ac05-d29605bedb27
📒 Files selected for processing (2)
uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py
uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py
```diff
 )
 requests.packages.urllib3.disable_warnings()
-page = requests.post(url, headers=headers, data=post_data)
+resp = requests.post(url, headers=headers, data=post_data, verify=False)
```
Do not ship this SOAP call with TLS verification disabled.
`verify=False` disables certificate validation and leaves the connection vulnerable to man-in-the-middle attacks. The endpoint currently has a valid certificate, so there is no technical reason to disable verification. Remove `verify=False` and enable proper TLS validation.
```diff
 dom = ElementTree.fromstring(resp.text)
 result = dom.find(
     "./soap:Body"
     "/a:getRoundCalendarForUPRNResponse"
     "/a:getRoundCalendarForUPRNResult",
     namespaces,
 )
-# Make a BS4 object
-soup = BeautifulSoup(page.text, features="html.parser")
-soup.prettify()
+inner_html = result.text if result is not None else ""
```
Unexpected SOAP/date formats should raise, not disappear.
A missing getRoundCalendarForUPRNResult or an unrecognized date phrase currently degrades to empty/partial output. That makes upstream format changes look like “no bins found” instead of a broken scraper.
Proposed change

```diff
 dom = ElementTree.fromstring(resp.text)
 result = dom.find(
     "./soap:Body"
     "/a:getRoundCalendarForUPRNResponse"
     "/a:getRoundCalendarForUPRNResult",
     namespaces,
 )
-inner_html = result.text if result is not None else ""
+if result is None or result.text is None:
+    raise ValueError(
+        "Torridge SOAP response missing getRoundCalendarForUPRNResult"
+    )
+inner_html = result.text
@@
 base_date = self._extract_base_date(value, today)
 if base_date is None:
-    continue
+    raise ValueError(f"Unexpected Torridge date format: {value!r}")
```

Based on learnings: in uk_bin_collection/**/*.py, when parsing council bin collection data, prefer explicit failures (raise exceptions on unexpected formats) over silent defaults or swallowed errors. This ensures format changes are detected early. Use clear exception types, document error causes, and add tests to verify that invalid inputs raise as expected.
Also applies to: 89-91, 140-140
🧰 Tools
🪛 Ruff (0.15.9)
[error] 55-55: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents
(S314)
```diff
 }
-dom = ElementTree.fromstring(page.text)
-page = dom.find(
+dom = ElementTree.fromstring(resp.text)
```
Use a hardened XML parser for the SOAP response.
ElementTree.fromstring(resp.text) parses network input with the stdlib parser, which is vulnerable to XXE and DoS attacks. Switch to defusedxml.ElementTree.fromstring(...) (will require adding defusedxml to project dependencies).
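A self-contained sketch of the SOAP extraction against a fabricated envelope; it uses the stdlib parser so it runs standalone, with the recommended swap noted inline (the `a:` namespace URI here is an assumption):

```python
import xml.etree.ElementTree as ElementTree
# Review recommendation: `import defusedxml.ElementTree as ElementTree` is a
# drop-in swap that hardens fromstring() against XXE and entity-expansion DoS.

SAMPLE = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <getRoundCalendarForUPRNResponse xmlns="http://tempuri.org/">
      <getRoundCalendarForUPRNResult>&lt;div&gt;Wk1 Mon&lt;/div&gt;</getRoundCalendarForUPRNResult>
    </getRoundCalendarForUPRNResponse>
  </soap:Body>
</soap:Envelope>"""

namespaces = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "a": "http://tempuri.org/",  # assumed service namespace
}
dom = ElementTree.fromstring(SAMPLE)
result = dom.find(
    "./soap:Body/a:getRoundCalendarForUPRNResponse/a:getRoundCalendarForUPRNResult",
    namespaces,
)
# Fail loudly if the element is missing, per the other review comment.
if result is None or result.text is None:
    raise ValueError("SOAP response missing getRoundCalendarForUPRNResult")
inner_html = result.text
print(inner_html)  # prints <div>Wk1 Mon</div>
```

The escaped payload in the result element is unescaped by the parser, which is why `inner_html` comes back as HTML ready for BeautifulSoup.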
```python
title_match = re.search(
    r"Your next\s+(.+?)\s+collection", heading_text, re.IGNORECASE
)
if not title_match:
    continue
collection_title = title_match.group(1).strip()

date_text = " ".join(content.get_text(" ", strip=True).split())
date_text = remove_ordinal_indicator_from_date_string(date_text)
try:
    collection_date = datetime.strptime(date_text, "%A %d %B")
except ValueError:
    continue
```
Fail loudly once a box looks like a real bin entry.
After the missing-node guard on Lines 51-54, a regex miss or date-parse miss means Wyre changed the expected bin markup again. `continue` here will quietly return partial data instead of surfacing the scraper breakage.
Proposed change

```diff
 title_match = re.search(
     r"Your next\s+(.+?)\s+collection", heading_text, re.IGNORECASE
 )
 if not title_match:
-    continue
+    raise ValueError(f"Unexpected Wyre heading format: {heading_text!r}")
 collection_title = title_match.group(1).strip()

 date_text = " ".join(content.get_text(" ", strip=True).split())
 date_text = remove_ordinal_indicator_from_date_string(date_text)
 try:
     collection_date = datetime.strptime(date_text, "%A %d %B")
-except ValueError:
-    continue
+except ValueError as exc:
+    raise ValueError(f"Unexpected Wyre date format: {date_text!r}") from exc
```

Based on learnings: in uk_bin_collection/**/*.py, when parsing council bin collection data, prefer explicit failures (raise exceptions on unexpected formats) over silent defaults or swallowed errors. This ensures format changes are detected early. Use clear exception types, document error causes, and add tests to verify that invalid inputs raise as expected.
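The "add tests" part of this suggestion could look like the following stdlib-only sketch; `parse_collection_date` is a hypothetical stand-in for the parser's date step, not code from this PR:

```python
from datetime import datetime

def parse_collection_date(date_text):
    # Hypothetical stand-in: normalize, parse, and raise with context on failure
    # instead of silently skipping the entry.
    normalized = " ".join(date_text.split())
    try:
        return datetime.strptime(normalized, "%A %d %B")
    except ValueError as exc:
        raise ValueError(f"Unexpected Wyre date format: {normalized!r}") from exc

# A format change should surface as an error, not vanish from the output.
try:
    parse_collection_date("Download collection calendar")
except ValueError as exc:
    assert "Unexpected Wyre date format" in str(exc)
else:
    raise AssertionError("expected ValueError")
print("invalid input raised as expected")
```

In a pytest suite the same check would be a one-liner with `pytest.raises(ValueError, match="Unexpected Wyre date format")`.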
What this changes

The `/bincollections?uprn=` endpoint now returns an extra `.boxed` div at the bottom of the list for a "Download collection calendar" link. That div has no `h3.bin-collection-tasks__heading`, so the scraper crashed on the `.text` access.

Changes
- Skip any `.boxed` div that doesn't have both the heading and the `bin-collection-tasks__content` container; this covers the new calendar-download panel and any other promotional boxes the council might add.
- Extract the bin name with a regex (`Your next X collection`) instead of positional `split()[2:4]`, which breaks when visually-hidden text shifts the token positions.
- Collapse whitespace before `strptime` (`Friday` and `17th April` are in separate `<p>` tags and `.text` left multi-whitespace runs that broke `%A %d %B`).

Test
Verified against UPRN `10003519994`.