fix: WyreCouncil - skip non-bin .boxed divs and harden date parsing #1952
InertiaUK wants to merge 2 commits into robbrad:master
Conversation
Torridge's SOAP API changed its response from explicit dates ("Mon 14 Apr")
to relative phrases ("Tomorrow then every Mon", "Today then every Tue")
with an embedded calendar table. The old regex-based parser returned empty
because it expected the old format.
The rewritten parser handles Today/Tomorrow/weekday-name phrases, falls back
to the old explicit format if it reappears, and gracefully skips "No X
collection for this address" entries.
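The Today/Tomorrow/weekday handling can be sketched roughly as follows; `extract_base_date` and the exact phrase grammar are assumptions for illustration, not the PR's actual helper:

```python
from datetime import date, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def extract_base_date(phrase, today):
    """Resolve 'Today ...', 'Tomorrow ...' or a leading weekday name to a date."""
    words = phrase.split()
    if not words:
        return None
    first = words[0]
    if first == "Today":
        return today
    if first == "Tomorrow":
        return today + timedelta(days=1)
    if first[:3] in WEEKDAYS:
        # Next occurrence of that weekday; counting today as a match is an assumption.
        offset = (WEEKDAYS.index(first[:3]) - today.weekday()) % 7
        return today + timedelta(days=offset)
    return None  # e.g. "No garden collection for this address"

monday = date(2025, 4, 14)  # a Monday
print(extract_base_date("Tomorrow then every Tue", monday))  # prints 2025-04-15
```

Returning `None` for the "No X collection" phrasing mirrors the skip behaviour described above; the real helper may distinguish that case explicitly.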
The Wyre bin collection page now includes a 'Download calendar' box among the `.boxed` divs. This box has no `h3.bin-collection-tasks__heading`, which caused the scraper to crash on `.text` access.

Changes:
- Skip boxes missing the heading or content container
- Use regex to extract the bin name from 'Your next X collection'
- Collapse whitespace on the date text before `strptime` (`p` tags produced multi-whitespace runs after `.text`)
- Roll the year forward whenever the computed date is in the past, instead of only handling the December->January edge case
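The whitespace collapse and year rollover described above can be sketched like this (a simplified illustration; `parse_wyre_date` is not the PR's actual function, and ordinal suffixes such as "17th" are assumed to be stripped beforehand):

```python
from datetime import datetime

def parse_wyre_date(raw, today):
    # .text across separate <p> tags leaves runs of whitespace; collapse them first.
    date_text = " ".join(raw.split())
    parsed = datetime.strptime(date_text, "%A %d %B")  # year defaults to 1900
    candidate = parsed.replace(year=today.year)
    # Roll forward whenever the date is in the past, not just across December->January.
    if candidate.date() < today.date():
        candidate = candidate.replace(year=today.year + 1)
    return candidate

today = datetime(2025, 12, 20)
print(parse_wyre_date("Friday \n  2   January", today))  # prints 2026-01-02 00:00:00
```

Note that `strptime` parses `%A` without cross-checking it against the day and month, so the weekday text only needs to be a valid name.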
📝 Walkthrough

Two council bin collection parsers receive updates to improve date parsing robustness. TorridgeDistrictCouncil refactors SOAP response handling to support relative date phrases and introduces a new helper method for date extraction. WyreCouncil strengthens date parsing validation and improves year rollover logic through normalization and defensive checks.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`:
- Around line 55-63: The current parsing swallows missing/invalid SOAP body
content by setting inner_html = "" which hides format changes; update the
parsing around ElementTree.fromstring/ result and inner_html in
TorridgeDistrictCouncil (and analogous blocks at the other noted spots) to
validate result is not None and that result.text contains expected content, and
if not raise a clear exception (e.g., ValueError or a custom ParseError) with a
message describing which element was missing or which date phrase was
unrecognized; ensure the exception mentions the operation
(getRoundCalendarForUPRNResult) and include the raw resp.text or the offending
snippet to aid debugging.
- Line 49: Remove the disabled TLS verification on the SOAP call in
TorridgeDistrictCouncil by eliminating the verify=False parameter on the
requests.post call (in the code that constructs resp = requests.post(...)).
Ensure the request uses standard certificate validation (either omit the verify
arg or set verify=True) so the HTTPS connection validates the server
certificate; update any related tests/mocks that assumed insecure behavior if
necessary.
- Line 55: Replace the unsafe stdlib XML parse call dom =
ElementTree.fromstring(resp.text) with the hardened parser from defusedxml (e.g.
from defusedxml import ElementTree and then dom =
ElementTree.fromstring(resp.text) or import defusedxml.ElementTree as
ElementTree) and add defusedxml to project dependencies; update the import in
TorridgeDistrictCouncil.py where ElementTree is referenced so the call in the
code (dom, resp.text) uses defusedxml's ElementTree.fromstring to prevent
XXE/DoS.
In `@uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py`:
- Around line 57-69: In the parsing loop in WyreCouncil.py (the block that finds
title_match and parses collection_date), stop silently continuing on a regex
miss or date parse failure and instead raise a clear exception (e.g., ValueError
or a custom WyreParsingError) that includes context (the failing heading_text,
date_text and any relevant identifiers) so callers/tests can detect format
changes; replace the two continue statements after the title_match check and the
datetime.strptime except block with raised exceptions and add/adjust unit tests
to assert these invalid inputs raise the new error.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 453c99e6-87a8-4980-ac05-d29605bedb27
📒 Files selected for processing (2)
uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py
uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py
```diff
 )
 requests.packages.urllib3.disable_warnings()
-page = requests.post(url, headers=headers, data=post_data)
+resp = requests.post(url, headers=headers, data=post_data, verify=False)
```
Do not ship this SOAP call with TLS verification disabled.
`verify=False` disables certificate validation and leaves the connection vulnerable to man-in-the-middle attacks. The endpoint currently has a valid certificate, so there is no technical reason to disable verification. Remove `verify=False` and enable proper TLS validation.
```diff
 dom = ElementTree.fromstring(resp.text)
 result = dom.find(
     "./soap:Body"
     "/a:getRoundCalendarForUPRNResponse"
     "/a:getRoundCalendarForUPRNResult",
     namespaces,
 )
-# Make a BS4 object
-soup = BeautifulSoup(page.text, features="html.parser")
-soup.prettify()
+inner_html = result.text if result is not None else ""
```
Unexpected SOAP/date formats should raise, not disappear.
A missing getRoundCalendarForUPRNResult or an unrecognized date phrase currently degrades to empty/partial output. That makes upstream format changes look like “no bins found” instead of a broken scraper.
Proposed change

```diff
 dom = ElementTree.fromstring(resp.text)
 result = dom.find(
     "./soap:Body"
     "/a:getRoundCalendarForUPRNResponse"
     "/a:getRoundCalendarForUPRNResult",
     namespaces,
 )
-inner_html = result.text if result is not None else ""
+if result is None or result.text is None:
+    raise ValueError(
+        "Torridge SOAP response missing getRoundCalendarForUPRNResult"
+    )
+inner_html = result.text
@@
 base_date = self._extract_base_date(value, today)
 if base_date is None:
-    continue
+    raise ValueError(f"Unexpected Torridge date format: {value!r}")
```

Based on learnings: in uk_bin_collection/**/*.py, when parsing council bin collection data, prefer explicit failures (raise exceptions on unexpected formats) over silent defaults or swallowed errors. This ensures format changes are detected early. Use clear exception types, document error causes, and add tests to verify that invalid inputs raise as expected.
Also applies to: 89-91, 140-140
🧰 Tools
🪛 Ruff (0.15.9)
[error] 55-55: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents
(S314)
```diff
 }
-dom = ElementTree.fromstring(page.text)
-page = dom.find(
+dom = ElementTree.fromstring(resp.text)
```
Use a hardened XML parser for the SOAP response.
ElementTree.fromstring(resp.text) parses network input with the stdlib parser, which is vulnerable to XXE and DoS attacks. Switch to defusedxml.ElementTree.fromstring(...) (will require adding defusedxml to project dependencies).
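A self-contained sketch of the SOAP extraction against a fabricated envelope; it uses the stdlib parser so it runs standalone, with the recommended swap noted inline (the `a:` namespace URI here is an assumption):

```python
import xml.etree.ElementTree as ElementTree
# Review recommendation: `import defusedxml.ElementTree as ElementTree` is a
# drop-in swap that hardens fromstring() against XXE and entity-expansion DoS.

SAMPLE = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <getRoundCalendarForUPRNResponse xmlns="http://tempuri.org/">
      <getRoundCalendarForUPRNResult>&lt;div&gt;Wk1 Mon&lt;/div&gt;</getRoundCalendarForUPRNResult>
    </getRoundCalendarForUPRNResponse>
  </soap:Body>
</soap:Envelope>"""

namespaces = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "a": "http://tempuri.org/",  # assumed service namespace
}
dom = ElementTree.fromstring(SAMPLE)
result = dom.find(
    "./soap:Body/a:getRoundCalendarForUPRNResponse/a:getRoundCalendarForUPRNResult",
    namespaces,
)
# Fail loudly if the element is missing, per the other review comment.
if result is None or result.text is None:
    raise ValueError("SOAP response missing getRoundCalendarForUPRNResult")
inner_html = result.text
print(inner_html)  # prints <div>Wk1 Mon</div>
```

The escaped payload in the result element is unescaped by the parser, which is why `inner_html` comes back as HTML ready for BeautifulSoup.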
```python
title_match = re.search(
    r"Your next\s+(.+?)\s+collection", heading_text, re.IGNORECASE
)
if not title_match:
    continue
collection_title = title_match.group(1).strip()

date_text = " ".join(content.get_text(" ", strip=True).split())
date_text = remove_ordinal_indicator_from_date_string(date_text)
try:
    collection_date = datetime.strptime(date_text, "%A %d %B")
except ValueError:
    continue
```
Fail loudly once a box looks like a real bin entry.
After the missing-node guard on Lines 51-54, a regex miss or date-parse miss means Wyre changed the expected bin markup again. `continue` here will quietly return partial data instead of surfacing the scraper breakage.
Proposed change

```diff
 title_match = re.search(
     r"Your next\s+(.+?)\s+collection", heading_text, re.IGNORECASE
 )
 if not title_match:
-    continue
+    raise ValueError(f"Unexpected Wyre heading format: {heading_text!r}")
 collection_title = title_match.group(1).strip()

 date_text = " ".join(content.get_text(" ", strip=True).split())
 date_text = remove_ordinal_indicator_from_date_string(date_text)
 try:
     collection_date = datetime.strptime(date_text, "%A %d %B")
-except ValueError:
-    continue
+except ValueError as exc:
+    raise ValueError(f"Unexpected Wyre date format: {date_text!r}") from exc
```

Based on learnings: in uk_bin_collection/**/*.py, when parsing council bin collection data, prefer explicit failures (raise exceptions on unexpected formats) over silent defaults or swallowed errors. This ensures format changes are detected early. Use clear exception types, document error causes, and add tests to verify that invalid inputs raise as expected.
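The "add tests" part of this suggestion could look like the following stdlib-only sketch; `parse_collection_date` is a hypothetical stand-in for the parser's date step, not code from this PR:

```python
from datetime import datetime

def parse_collection_date(date_text):
    # Hypothetical stand-in: normalize, parse, and raise with context on failure
    # instead of silently skipping the entry.
    normalized = " ".join(date_text.split())
    try:
        return datetime.strptime(normalized, "%A %d %B")
    except ValueError as exc:
        raise ValueError(f"Unexpected Wyre date format: {normalized!r}") from exc

# A format change should surface as an error, not vanish from the output.
try:
    parse_collection_date("Download collection calendar")
except ValueError as exc:
    assert "Unexpected Wyre date format" in str(exc)
else:
    raise AssertionError("expected ValueError")
print("invalid input raised as expected")
```

In a pytest suite the same check would be a one-liner with `pytest.raises(ValueError, match="Unexpected Wyre date format")`.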
What this changes

The `/bincollections?uprn=` endpoint now returns an extra `.boxed` div at the bottom of the list for a "Download collection calendar" link. That div has no `h3.bin-collection-tasks__heading`, so the scraper crashed on the `.text` access.

Changes
- Skip any `.boxed` div that doesn't have both the heading and the `bin-collection-tasks__content` container; this covers the new calendar-download panel and any other promotional boxes the council might add.
- Extract the bin name with a regex (`Your next X collection`) instead of positional `split()[2:4]`, which breaks when visually-hidden text shifts the token positions.
- Collapse whitespace before `strptime` (`Friday` and `17th April` are in separate `<p>` tags and `.text` left multi-whitespace runs that broke `%A %d %B`).

Test
Verified against UPRN `10003519994`.