Skip to content

fix: WyreCouncil - skip non-bin .boxed divs and harden date parsing#1952

Open
InertiaUK wants to merge 2 commits intorobbrad:masterfrom
InertiaUK:fix/wyre-skip-non-bin-boxes
Open

fix: WyreCouncil - skip non-bin .boxed divs and harden date parsing#1952
InertiaUK wants to merge 2 commits intorobbrad:masterfrom
InertiaUK:fix/wyre-skip-non-bin-boxes

Conversation

@InertiaUK
Copy link
Copy Markdown

@InertiaUK InertiaUK commented Apr 12, 2026

What this changes

The /bincollections?uprn= endpoint now returns an extra .boxed div at the bottom of the list for a "Download collection calendar" link. That div has no h3.bin-collection-tasks__heading, so the scraper crashed with:

AttributeError: 'NoneType' object has no attribute 'text'

Changes

  • Skip any .boxed div that doesn't have both the heading and the bin-collection-tasks__content container — covers the new calendar-download panel and any other promotional boxes the council might add.
  • Extract the bin type with a regex (Your next X collection) instead of positional split()[2:4], which breaks when visually-hidden text shifts the token positions.
  • Collapse whitespace on the date text before strptime (Friday and 17th April are in separate <p> tags and .text left multi-whitespace runs that broke %A %d %B).
  • Roll the year forward whenever the computed date is in the past, not just the December->January edge case.

Test

Verified against UPRN 10003519994:

{
  "bins": [
    { "type": "Blue bin",  "collectionDate": "17/04/2026" },
    { "type": "Green bin", "collectionDate": "17/04/2026" },
    { "type": "Grey bin",  "collectionDate": "24/04/2026" },
    { "type": "Red bin",   "collectionDate": "01/05/2026" }
  ]
}

Summary by CodeRabbit

  • Bug Fixes
    • Improved Torridge District Council date parsing to handle relative dates ("today", "tomorrow", weekday names) and calendar content
    • Enhanced Wyre Council collection data extraction with more robust handling of missing or malformed data

Torridge's SOAP API changed its response from explicit dates ("Mon 14 Apr")
to relative phrases ("Tomorrow then every Mon", "Today then every Tue")
with an embedded calendar table. The old regex-based parser returned empty
because it expected the old format.

The rewritten parser handles Today/Tomorrow/weekday-name phrases, falls back
to the old explicit format if it reappears, and gracefully skips "No X
collection for this address" entries.
The Wyre bin collection page now includes a 'Download calendar' box
among the .boxed divs. This box has no h3.bin-collection-tasks__heading,
which caused the scraper to crash on .text access.

Changes:
- Skip boxes missing the heading or content container
- Use regex to extract the bin name from 'Your next X collection'
- Collapse whitespace on the date text before strptime (p tags produced
  multi-whitespace runs after .text)
- Roll year forward only when the computed date is in the past, instead
  of only handling the December->January edge case
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 12, 2026

📝 Walkthrough

Walkthrough

Two council bin collection parsers receive updates to improve date parsing robustness. TorridgeDistrictCouncil refactors SOAP response handling to support relative date phrases and introduces a new helper method for date extraction. WyreCouncil strengthens date parsing validation and improves year rollover logic through normalization and defensive checks.

Changes

Cohort / File(s) Summary
Torridge District Council SOAP Parser
uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py
Refactored date parsing to handle relative phrases ("today", "tomorrow", weekday names) and explicit date strings. Added _extract_base_date() helper method with WEEKDAYS constant. Updated SOAP request headers, response handling flow (using resp instead of page), and replaced fixed-type bin parsing with generalized loop over HTML elements. Enhanced UPRN validation to use direct conditional check.
Wyre Council Date Extraction
uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py
Improved robustness of collection_title extraction via regex-based parsing instead of fixed index slicing. Added defensive parsing for missing heading/content elements. Enhanced collection_date parsing with text normalization, ordinal stripping, and exception handling. Updated year rollover logic from month-based check to direct date comparison.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • dp247

Poem

🐰 Hop-hop, the dates now dance with rhyme,
"Tomorrow" and "today" keep perfect time,
Relative phrases bloom like garden greens,
While Wyre's checks make parsing pristine,
No broken bins when we're done! 🗑️✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: WyreCouncil - skip non-bin .boxed divs and harden date parsing' accurately reflects the main changes in the WyreCouncil.py file (the primary focus of the PR), being specific and descriptive of the actual modifications made.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`:
- Around line 55-63: The current parsing swallows missing/invalid SOAP body
content by setting inner_html = "" which hides format changes; update the
parsing around ElementTree.fromstring/ result and inner_html in
TorridgeDistrictCouncil (and analogous blocks at the other noted spots) to
validate result is not None and that result.text contains expected content, and
if not raise a clear exception (e.g., ValueError or a custom ParseError) with a
message describing which element was missing or which date phrase was
unrecognized; ensure the exception mentions the operation
(getRoundCalendarForUPRNResult) and include the raw resp.text or the offending
snippet to aid debugging.
- Line 49: Remove the disabled TLS verification on the SOAP call in
TorridgeDistrictCouncil by eliminating the verify=False parameter on the
requests.post call (in the code that constructs resp = requests.post(...)).
Ensure the request uses standard certificate validation (either omit the verify
arg or set verify=True) so the HTTPS connection validates the server
certificate; update any related tests/mocks that assumed insecure behavior if
necessary.
- Line 55: Replace the unsafe stdlib XML parse call dom =
ElementTree.fromstring(resp.text) with the hardened parser from defusedxml (e.g.
from defusedxml import ElementTree and then dom =
ElementTree.fromstring(resp.text) or import defusedxml.ElementTree as
ElementTree) and add defusedxml to project dependencies; update the import in
TorridgeDistrictCouncil.py where ElementTree is referenced so the call in the
code (dom, resp.text) uses defusedxml's ElementTree.fromstring to prevent
XXE/DoS.

In `@uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py`:
- Around line 57-69: In the parsing loop in WyreCouncil.py (the block that finds
title_match and parses collection_date), stop silently continuing on a regex
miss or date parse failure and instead raise a clear exception (e.g., ValueError
or a custom WyreParsingError) that includes context (the failing heading_text,
date_text and any relevant identifiers) so callers/tests can detect format
changes; replace the two continue statements after the title_match check and the
datetime.strptime except block with raised exceptions and add/adjust unit tests
to assert these invalid inputs raise the new error.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 453c99e6-87a8-4980-ac05-d29605bedb27

📥 Commits

Reviewing files that changed from the base of the PR and between 60bd3cc and 4f9a101.

📒 Files selected for processing (2)
  • uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py

)
requests.packages.urllib3.disable_warnings()
page = requests.post(url, headers=headers, data=post_data)
resp = requests.post(url, headers=headers, data=post_data, verify=False)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify whether the endpoint's certificate is actually the reason this was added.
curl -Iv https://collections-torridge.azurewebsites.net/WebService2.asmx 2>&1 | sed -n '1,40p'

Repository: robbrad/UKBinCollectionData

Length of output: 1772


Do not ship this SOAP call with TLS verification disabled.

verify=False disables certificate validation and leaves the connection vulnerable to man-in-the-middle attacks. The endpoint currently has a valid certificate, so there is no technical reason to disable verification. Remove verify=False and enable proper TLS validation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py` at
line 49, Remove the disabled TLS verification on the SOAP call in
TorridgeDistrictCouncil by eliminating the verify=False parameter on the
requests.post call (in the code that constructs resp = requests.post(...)).
Ensure the request uses standard certificate validation (either omit the verify
arg or set verify=True) so the HTTPS connection validates the server
certificate; update any related tests/mocks that assumed insecure behavior if
necessary.

Comment on lines +55 to 63
dom = ElementTree.fromstring(resp.text)
result = dom.find(
"./soap:Body"
"/a:getRoundCalendarForUPRNResponse"
"/a:getRoundCalendarForUPRNResult",
namespaces,
)
# Make a BS4 object
soup = BeautifulSoup(page.text, features="html.parser")
soup.prettify()
inner_html = result.text if result is not None else ""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Unexpected SOAP/date formats should raise, not disappear.

A missing getRoundCalendarForUPRNResult or an unrecognized date phrase currently degrades to empty/partial output. That makes upstream format changes look like “no bins found” instead of a broken scraper.

Proposed change
         dom = ElementTree.fromstring(resp.text)
         result = dom.find(
             "./soap:Body"
             "/a:getRoundCalendarForUPRNResponse"
             "/a:getRoundCalendarForUPRNResult",
             namespaces,
         )
-        inner_html = result.text if result is not None else ""
+        if result is None or result.text is None:
+            raise ValueError(
+                "Torridge SOAP response missing getRoundCalendarForUPRNResult"
+            )
+        inner_html = result.text
@@
             base_date = self._extract_base_date(value, today)
             if base_date is None:
-                continue
+                raise ValueError(f"Unexpected Torridge date format: {value!r}")
@@
-        return None
+        return None

Based on learnings, in uk_bin_collection/**/*.py, when parsing council bin collection data, prefer explicit failures (raise exceptions on unexpected formats) over silent defaults or swallowed errors. This ensures format changes are detected early. Use clear exception types, document error causes, and add tests to verify that invalid inputs raise as expected.

Also applies to: 89-91, 140-140

🧰 Tools
🪛 Ruff (0.15.9)

[error] 55-55: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents

(S314)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`
around lines 55 - 63, The current parsing swallows missing/invalid SOAP body
content by setting inner_html = "" which hides format changes; update the
parsing around ElementTree.fromstring/ result and inner_html in
TorridgeDistrictCouncil (and analogous blocks at the other noted spots) to
validate result is not None and that result.text contains expected content, and
if not raise a clear exception (e.g., ValueError or a custom ParseError) with a
message describing which element was missing or which date phrase was
unrecognized; ensure the exception mentions the operation
(getRoundCalendarForUPRNResult) and include the raw resp.text or the offending
snippet to aid debugging.

}
dom = ElementTree.fromstring(page.text)
page = dom.find(
dom = ElementTree.fromstring(resp.text)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd /tmp && find . -name "TorridgeDistrictCouncil.py" 2>/dev/null | head -5

Repository: robbrad/UKBinCollectionData

Length of output: 53


🏁 Script executed:

find . -path "*/TorridgeDistrictCouncil.py" 2>/dev/null

Repository: robbrad/UKBinCollectionData

Length of output: 144


🏁 Script executed:

find . -name "*Torridge*" -o -name "*torridge*" 2>/dev/null

Repository: robbrad/UKBinCollectionData

Length of output: 144


🏁 Script executed:

git ls-files | grep -i torridge

Repository: robbrad/UKBinCollectionData

Length of output: 142


🏁 Script executed:

cat -n uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py | sed -n '1,70p'

Repository: robbrad/UKBinCollectionData

Length of output: 3189


🏁 Script executed:

find . -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" | head -10

Repository: robbrad/UKBinCollectionData

Length of output: 135


🏁 Script executed:

cat pyproject.toml

Repository: robbrad/UKBinCollectionData

Length of output: 2145


🏁 Script executed:

cat uk_bin_collection_api_server/requirements.txt

Repository: robbrad/UKBinCollectionData

Length of output: 133


🏁 Script executed:

find . -name "requirements.txt" -exec cat {} + | head -50

Repository: robbrad/UKBinCollectionData

Length of output: 133


🏁 Script executed:

rg "defusedxml" --type py

Repository: robbrad/UKBinCollectionData

Length of output: 53


🏁 Script executed:

rg "ElementTree.fromstring" --type py

Repository: robbrad/UKBinCollectionData

Length of output: 314


🏁 Script executed:

rg "xml.etree" --type py | head -20

Repository: robbrad/UKBinCollectionData

Length of output: 366


Use a hardened XML parser for the SOAP response.

ElementTree.fromstring(resp.text) parses network input with the stdlib parser, which is vulnerable to XXE and DoS attacks. Switch to defusedxml.ElementTree.fromstring(...) (will require adding defusedxml to project dependencies).

🧰 Tools
🪛 Ruff (0.15.9)

[error] 55-55: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents

(S314)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py` at
line 55, Replace the unsafe stdlib XML parse call dom =
ElementTree.fromstring(resp.text) with the hardened parser from defusedxml (e.g.
from defusedxml import ElementTree and then dom =
ElementTree.fromstring(resp.text) or import defusedxml.ElementTree as
ElementTree) and add defusedxml to project dependencies; update the import in
TorridgeDistrictCouncil.py where ElementTree is referenced so the call in the
code (dom, resp.text) uses defusedxml's ElementTree.fromstring to prevent
XXE/DoS.

Comment on lines +57 to +69
title_match = re.search(
r"Your next\s+(.+?)\s+collection", heading_text, re.IGNORECASE
)
if not title_match:
continue
collection_title = title_match.group(1).strip()

date_text = " ".join(content.get_text(" ", strip=True).split())
date_text = remove_ordinal_indicator_from_date_string(date_text)
try:
collection_date = datetime.strptime(date_text, "%A %d %B")
except ValueError:
continue
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fail loudly once a box looks like a real bin entry.

After the missing-node guard on Lines 51-54, a regex miss or date-parse miss means Wyre changed the expected bin markup again. continue here will quietly return partial data instead of surfacing the scraper breakage.

Proposed change
             title_match = re.search(
                 r"Your next\s+(.+?)\s+collection", heading_text, re.IGNORECASE
             )
             if not title_match:
-                continue
+                raise ValueError(f"Unexpected Wyre heading format: {heading_text!r}")
             collection_title = title_match.group(1).strip()
 
             date_text = " ".join(content.get_text(" ", strip=True).split())
             date_text = remove_ordinal_indicator_from_date_string(date_text)
             try:
                 collection_date = datetime.strptime(date_text, "%A %d %B")
-            except ValueError:
-                continue
+            except ValueError as exc:
+                raise ValueError(f"Unexpected Wyre date format: {date_text!r}") from exc

Based on learnings, in uk_bin_collection/**/*.py, when parsing council bin collection data, prefer explicit failures (raise exceptions on unexpected formats) over silent defaults or swallowed errors. This ensures format changes are detected early. Use clear exception types, document error causes, and add tests to verify that invalid inputs raise as expected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py` around lines 57
- 69, In the parsing loop in WyreCouncil.py (the block that finds title_match
and parses collection_date), stop silently continuing on a regex miss or date
parse failure and instead raise a clear exception (e.g., ValueError or a custom
WyreParsingError) that includes context (the failing heading_text, date_text and
any relevant identifiers) so callers/tests can detect format changes; replace
the two continue statements after the title_match check and the
datetime.strptime except block with raised exceptions and add/adjust unit tests
to assert these invalid inputs raise the new error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant