fix: MidlothianCouncil - rewrite for MyMidlothian Granicus fillform#1956
InertiaUK wants to merge 6 commits into robbrad:master from
Conversation
Torridge's SOAP API changed its response from explicit dates ("Mon 14 Apr")
to relative phrases ("Tomorrow then every Mon", "Today then every Tue")
with an embedded calendar table. The old regex-based parser returned empty
because it expected the old format.
The rewritten parser handles Today/Tomorrow/weekday-name phrases, falls back
to the old explicit format if it reappears, and gracefully skips "No X
collection for this address" entries.
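The relative-phrase handling described above could be sketched roughly like this. `resolve_relative_day` is a hypothetical name, and the rule that a bare weekday name means "the next occurrence, counting today" is an assumption for illustration, not necessarily the parser's actual behaviour:

```python
from datetime import date, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def resolve_relative_day(phrase: str, today: date) -> date:
    """Resolve 'Today', 'Tomorrow', or a weekday name ('Mon') to a date.
    Sketch only; assumes the phrase's first token carries the relative day."""
    token = phrase.strip().split()[0]
    if token == "Today":
        return today
    if token == "Tomorrow":
        return today + timedelta(days=1)
    # Next occurrence of the named weekday, with today counting as 0 days ahead
    target = WEEKDAYS.index(token[:3])
    return today + timedelta(days=(target - today.weekday()) % 7)
```

For example, given Monday 14 Apr 2025, "Tomorrow then every Mon" resolves to the 15th and "Fri" to the 18th.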
The Wyre bin collection page now includes a 'Download calendar' box among the `.boxed` divs. This box has no `h3.bin-collection-tasks__heading`, which caused the scraper to crash on `.text` access. Changes:
- Skip boxes missing the heading or content container
- Use regex to extract the bin name from 'Your next X collection'
- Collapse whitespace on the date text before `strptime` (`p` tags produced multi-whitespace runs after `.text`)
- Roll the year forward whenever the computed date is in the past, instead of only handling the December -> January edge case
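The year-rollover rule in the last bullet can be sketched as a small helper; `apply_collection_year` is a hypothetical name and the real scraper may structure this differently:

```python
from datetime import datetime

def apply_collection_year(parsed: datetime, today: datetime) -> datetime:
    """Attach the current year to a year-less parsed date (strptime defaults
    to 1900), rolling forward to next year only when the result is already
    in the past. Sketch of the behaviour described above, not the exact code."""
    candidate = parsed.replace(year=today.year)
    if candidate.date() < today.date():
        candidate = candidate.replace(year=today.year + 1)
    return candidate
```

This handles the December -> January wrap as a special case of the general "date is in the past" rule rather than testing for month 12 explicitly.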
…e results The checkyourbinday page is now:
- Gated by Cloudflare Turnstile (requires non-headless UC)
- Using WASTECOLLECTIONCALENDARV7_* element IDs (was V5)
- Rendering collection dates inline as `.gi-summary-blocklist__row` divs after address selection; no separate NEXT/submit step

Changes:
- Update all element IDs from V5 to V7
- Add a Cloudflare challenge wait loop (up to 50s)
- Dismiss cookie consent before interaction (it was blocking button clicks)
- Replace the old table-based parsing with `.gi-summary-blocklist__row` scraping
- Use `Select.select_by_visible_text` with a stale-retry instead of a manual `option.click()` loop (which crashed on AJAX re-renders)
- Remove the `smart_select_address` helper's dead fuzzy/strict split
…anner Stockton's AchieveForms form name changed from LOOKUPBINDATESBYADDRESSSKIPOUTOFREGION to LOOKUPBINDATESBYADDRESSSKIPOUTOFREGIONV2, so every element ID needed the V2 suffix added. Also added a cookie-banner dismissal step — the banner was covering the search button and intercepting clicks.
The MyWestSuffolk.aspx IIS endpoint returns a 404 page to requests without a User-Agent header. Adding a realistic Chrome UA restores the full response (57KB), letting the existing parser pick up the bin collection panel correctly.
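A minimal sketch of the fix, with the fetch parametrized by a `get` callable so the logic can be exercised without the network. `fetch_page` and `CHROME_UA` are illustrative names; the real scraper calls `requests.get` directly. The explicit status check is included because a 4xx/5xx body would otherwise be parsed as if it were the bin panel:

```python
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
)

def fetch_page(url, get):
    """get stands in for requests.get; pass requests.get in real use."""
    response = get(url, headers={"User-Agent": CHROME_UA})
    response.raise_for_status()  # surface 4xx/5xx instead of parsing an error page
    return response.text
```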
📝 Walkthrough
Updated six UK council bin collection scrapers with modernized integrations: Chichester and Midlothian migrated to Selenium-based form automation with enhanced robustness (Cloudflare detection, cookie dismissal, dropdown retry logic); Stockton updated to V2 element IDs; Torridge refactored SOAP/HTML parsing with relative date handling; WestSuffolk added a User-Agent header; Wyre hardened against a missing heading and date drift.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
The directory_search.php postcode endpoint the old scraper used is gone; Midlothian migrated to a MyMidlothian Granicus fillform iframe at my.midlothian.gov.uk. Rewrote the scraper to:
- Load the Bin_Collection_Dates service page
- Switch into the fillform-frame-1 iframe
- Fill the postcode field (the dropdown auto-populates on change)
- Select the matching address, which auto-fills six per-bin date fields (dateRecycling, dateFood, dateGarden, dateCard, dateGlass, dateResidual)
- Read the dates directly from those fields and parse them as dd/mm/yyyy
- No submit button needed

The fillform iframe detects headless Chrome and vanilla Selenium and refuses to populate the dropdown, so when a DISPLAY is available the scraper now uses undetected_chromedriver in non-headless mode.
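The DISPLAY-gated driver selection can be sketched as follows. The factory callables stand in for the real `uc.Chrome(...)` and `create_webdriver(...)` constructors, and `choose_driver` is an illustrative name, not the scraper's actual function:

```python
import os

def choose_driver(web_driver, make_undetected, make_standard, env=os.environ):
    """Prefer undetected_chromedriver (non-headless) when a DISPLAY is
    available and no remote web_driver URL was given; otherwise fall back
    to the standard create_webdriver path. Sketch only."""
    driver = None
    if env.get("DISPLAY") and web_driver is None:
        try:
            driver = make_undetected()
        except Exception:
            driver = None  # uc missing or failed to launch; use the fallback
    if driver is None:
        driver = make_standard()
    return driver
```

Passing the environment and factories in makes the branch logic testable without launching a browser.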
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py (1)
73-83: ⚠️ Potential issue | 🟠 Major
Avoid interpolating raw address text into XPath.
This selector is built from `user_paon` directly. A house name containing `'` will produce an invalid XPath, and `contains(., ...)` can also select the wrong option when one address string is a substring of another. Prefer reading the `<select>` options in Python via `Select` instead of concatenating user input into XPath.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py` around lines 73 - 83, The current XPath concatenates user_paon into a locator which breaks on quotes and can match substrings; replace the WebDriverWait+XPath block that references user_paon with Selenium's Select API: find the select element by its ID (LOOKUPBINDATESBYADDRESSSKIPOUTOFREGIONV2_ADDRESSLOOKUPADDRESS), create a Select from it, iterate its options and choose the one whose visible text exactly matches user_paon (or sanitize/normalize whitespace before comparing), then click/select that option; keep the WebDriverWait for the select element to be clickable/visible before creating the Select to preserve timing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/ChichesterDistrictCouncil.py`:
- Around line 116-122: The _wait_for_cloudflare helper currently returns
silently after 25 polls even if the page title still shows the Cloudflare
challenge; change it so that after the polling loop, if the title still contains
"Just a moment" or "Attention" it raises a
selenium.common.exceptions.TimeoutException (with a clear message like
"Cloudflare challenge not cleared after timeout") instead of returning, so
callers immediately get a meaningful failure; update the
_wait_for_cloudflare(driver) function to perform the check after the loop and
raise TimeoutException when still blocked.
- Around line 190-193: The loop in ChichesterDistrictCouncil.py that checks
options (for opt in options: text = opt.text.strip(); if target in text.lower():
return text) is too permissive because substring matches like "1" will match
"10" or "21"; change the logic in the function that performs this fuzzy address
pass to use token-boundary matching (e.g., match whole tokens or use regex word
boundaries around target against text.lower()) or, alternatively, collect all
fuzzy matches into a list first and only return when there is exactly one unique
fuzzy hit; update the code that references options/opt/text/target accordingly
so ambiguous substring matches are not returned silently.
In `@uk_bin_collection/uk_bin_collection/councils/MidlothianCouncil.py`:
- Around line 117-127: The code currently swallows parsing errors and missing
data; change the logic in the block using driver.find_element(By.ID, field_id)
and datetime.strptime(raw, "%d/%m/%Y") so that you only continue when the
element is truly missing or the value is genuinely blank (empty/whitespace). Do
not continue on a ValueError from datetime.strptime; instead raise an exception
(or re-raise a ValueError) that includes the field_id and raw value so scraper
drift is visible. Keep the try/except around driver.find_element to handle a
missing element, but remove the broad except around parsing and replace it with
an explicit raise that surfaces field_id and raw when parsing fails.
- Around line 154-157: The current fuzzy fallback in MidlothianCouncil.py loops
over select.options (variables opt, text) and returns the first option whose
text.upper() contains target, which erroneously matches house_identifier "1"
inside "10" or "21"; change the logic in that loop to either 1) perform a
boundary-aware match (use token or regex word/prefix boundary against target on
text) or 2) collect all fuzzy matches and only return when exactly one match is
found (otherwise continue searching or raise/skip), and reference the variables
select.options, opt, text, target and the house_identifier usage to locate where
to apply the stricter match.
- Around line 56-67: The fallback forcibly passes a non-headless flag to
create_webdriver; change the hardcoded False to use the actual headless setting
(e.g. kwargs.get("headless")) so the fallback honors the caller's headless
intent. Locate the branch that sets driver (the undetected_chromedriver
try/except and the subsequent call create_webdriver(web_driver, False, None,
__name__)) and replace the False with the headless value derived from kwargs
(with a sensible default such as True if absent) so create_webdriver(web_driver,
headless, None, __name__) is called.
In `@uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py`:
- Around line 38-49: The try/except around the cookie dismissal currently
swallows all exceptions; change it to only catch the missing-banner case by
catching selenium.common.exceptions.NoSuchElementException when calling
driver.find_element(By.XPATH, ...), so that if the element is not found we
silently continue but any failures from
driver.execute_script("arguments[0].click();", cookie_btn) or click interception
surface as errors; locate the cookie handling block using the symbols
driver.find_element, By.XPATH, cookie_btn and driver.execute_script in
StocktonOnTeesCouncil.py and replace the broad except Exception with an explicit
NoSuchElementException catch (importing it if needed).
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`:
- Around line 55-63: The code silently defaults inner_html to "" when
getRoundCalendarForUPRNResult is missing which hides upstream format changes;
modify the parsing around ElementTree.fromstring(...) and the result lookup so
that if result is None you raise an explicit exception (or return a failure)
including useful context such as resp.status_code and a trimmed resp.text (or
the raw envelope) and the XPath you attempted (getRoundCalendarForUPRNResult) so
callers can diagnose the changed SOAP envelope instead of treating it as "no
bins".
- Line 55: Replace the unsafe stdlib XML parse call by using defusedxml's secure
parser: import defusedxml.ElementTree (e.g., from defusedxml import ElementTree
as DefusedET) and change the parsing call at the site where dom =
ElementTree.fromstring(resp.text) (in TorridgeDistrictCouncil.py) to use
DefusedET.fromstring on the response bytes/content; ensure you import defusedxml
and parse resp.content (not raw text) so the SOAP XML is parsed safely against
XXE attacks.
In `@uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py`:
- Around line 18-26: The code performs requests.get(api_url, headers=headers)
and then parses the body without checking HTTP status, which can lead to silent
failures; update the WestSuffolkCouncil fetch flow by calling
response.raise_for_status() immediately after response = requests.get(api_url,
headers=headers) so upstream 4xx/5xx errors raise and are visible instead of
producing an empty collection_tag and returning {"bins": []}; ensure this change
is applied in the method that builds headers/response and that any surrounding
try/except handles the raised HTTPError appropriately.
In `@uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py`:
- Around line 64-69: When parsing the date in WyreCouncil.py, do not silently
continue on ValueError; instead raise an error that includes the context so
drift is visible. Replace the silent "continue" in the try/except around
datetime.strptime(date_text, "%A %d %B") with an exception raise that includes
heading_text and date_text (and optionally the original ValueError) so
callers/logs show which heading and raw date failed; reference the variables
date_text, collection_date, heading_text and the helper
remove_ordinal_indicator_from_date_string to locate the parsing logic and update
the except block accordingly.
---
Outside diff comments:
In `@uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py`:
- Around line 73-83: The current XPath concatenates user_paon into a locator
which breaks on quotes and can match substrings; replace the WebDriverWait+XPath
block that references user_paon with Selenium's Select API: find the select
element by its ID
(LOOKUPBINDATESBYADDRESSSKIPOUTOFREGIONV2_ADDRESSLOOKUPADDRESS), create a Select
from it, iterate its options and choose the one whose visible text exactly
matches user_paon (or sanitize/normalize whitespace before comparing), then
click/select that option; keep the WebDriverWait for the select element to be
clickable/visible before creating the Select to preserve timing.
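The Select-based replacement the comment suggests could look like this matching helper. `pick_option_text` operates on plain option texts (in real code, `[o.text for o in Select(element).options]`) and is an illustrative sketch, not the repository's implementation:

```python
def pick_option_text(option_texts, user_paon):
    """Pick the option whose whitespace-normalized visible text exactly
    matches user_paon, instead of interpolating user_paon into an XPath.
    Raises when no exact match exists so a wrong property is never chosen."""
    def norm(s):
        return " ".join(s.split())
    wanted = norm(user_paon)
    for text in option_texts:
        if norm(text) == wanted:
            return text
    raise ValueError(
        f"Address {user_paon!r} not found among {len(option_texts)} options"
    )
```

The returned text would then be fed to `Select.select_by_visible_text`, avoiding both the quote-escaping problem and the substring-match problem of `contains(., ...)`.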
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0cd83b4f-2a0c-4e28-a678-82b4c57294da
📒 Files selected for processing (6)
- uk_bin_collection/uk_bin_collection/councils/ChichesterDistrictCouncil.py
- uk_bin_collection/uk_bin_collection/councils/MidlothianCouncil.py
- uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py
- uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py
- uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py
- uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py
```python
def _wait_for_cloudflare(self, driver):
    for _ in range(25):
        title = driver.title or ""
        if "Just a moment" not in title and "Attention" not in title:
            return
        time.sleep(2)
```
Raise when Cloudflare never clears.
This helper returns after 25 polls even if the page is still on the challenge screen, so the real failure turns into a later “element not found”. End the loop with a TimeoutException while the title still indicates Cloudflare.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/ChichesterDistrictCouncil.py`
around lines 116 - 122, The _wait_for_cloudflare helper currently returns
silently after 25 polls even if the page title still shows the Cloudflare
challenge; change it so that after the polling loop, if the title still contains
"Just a moment" or "Attention" it raises a
selenium.common.exceptions.TimeoutException (with a clear message like
"Cloudflare challenge not cleared after timeout") instead of returning, so
callers immediately get a meaningful failure; update the
_wait_for_cloudflare(driver) function to perform the check after the loop and
raise TimeoutException when still blocked.
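A sketch of the suggested fix. `CloudflareTimeout` stands in for selenium's `TimeoutException` so the example carries no selenium dependency, and `wait_for_cloudflare` only needs an object exposing a `title` attribute:

```python
import time

class CloudflareTimeout(TimeoutError):
    """Stand-in for selenium.common.exceptions.TimeoutException."""

def wait_for_cloudflare(driver, polls=25, delay=2):
    """Poll as before, but check again after the loop and raise instead of
    returning silently while the challenge page is still showing."""
    for _ in range(polls):
        title = driver.title or ""
        if "Just a moment" not in title and "Attention" not in title:
            return
        time.sleep(delay)
    title = driver.title or ""
    if "Just a moment" in title or "Attention" in title:
        raise CloudflareTimeout("Cloudflare challenge not cleared after timeout")
```

The caller now gets a named failure at the real point of breakage rather than a later, unrelated "element not found".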
```python
for opt in options:
    text = opt.text.strip()
    if target in text.lower():
        return text
```
The fuzzy address pass is too permissive.
`if target in text.lower()` will happily map `1` to `10 ...` or `21 ...` when the strict prefix match misses. That's a silent wrong-property result, not just a missed match. Prefer token-boundary matching or require a unique fuzzy hit before returning it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/ChichesterDistrictCouncil.py`
around lines 190 - 193, The loop in ChichesterDistrictCouncil.py that checks
options (for opt in options: text = opt.text.strip(); if target in text.lower():
return text) is too permissive because substring matches like "1" will match
"10" or "21"; change the logic in the function that performs this fuzzy address
pass to use token-boundary matching (e.g., match whole tokens or use regex word
boundaries around target against text.lower()) or, alternatively, collect all
fuzzy matches into a list first and only return when there is exactly one unique
fuzzy hit; update the code that references options/opt/text/target accordingly
so ambiguous substring matches are not returned silently.
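The token-boundary variant suggested above can be sketched as a pure function over the option texts; `fuzzy_address_match` is an illustrative name, and it also applies the unique-hit rule so ambiguity returns nothing rather than a guess:

```python
import re

def fuzzy_address_match(option_texts, target):
    """Match target only on whole-token boundaries, so '1' no longer matches
    '10' or '21', and return a hit only when it is unique."""
    pattern = re.compile(r"(?<!\w)" + re.escape(target.lower()) + r"(?!\w)")
    hits = [t for t in option_texts if pattern.search(t.lower())]
    return hits[0] if len(hits) == 1 else None
```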
```python
if os.environ.get("DISPLAY") and web_driver is None:
    try:
        import undetected_chromedriver as uc
        uc_opts = uc.ChromeOptions()
        uc_opts.add_argument("--no-sandbox")
        uc_opts.add_argument("--disable-dev-shm-usage")
        uc_opts.add_argument("--window-size=1920,1080")
        driver = uc.Chrome(options=uc_opts, version_main=146)
    except Exception:
        driver = None
if driver is None:
    driver = create_webdriver(web_driver, False, None, __name__)
```
The fallback path is still forced non-headless.
In `uk_bin_collection/uk_bin_collection/common.py:321-360`, the second `create_webdriver()` argument controls headless mode. Passing `False` unconditionally here ignores `kwargs["headless"]` and makes the no-DISPLAY fallback try to launch a visible browser anyway, which is exactly the environment this branch is meant to cover.
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 64-64: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/MidlothianCouncil.py` around
lines 56 - 67, The fallback forcibly passes a non-headless flag to
create_webdriver; change the hardcoded False to use the actual headless setting
(e.g. kwargs.get("headless")) so the fallback honors the caller's headless
intent. Locate the branch that sets driver (the undetected_chromedriver
try/except and the subsequent call create_webdriver(web_driver, False, None,
__name__)) and replace the False with the headless value derived from kwargs
(with a sensible default such as True if absent) so create_webdriver(web_driver,
headless, None, __name__) is called.
```python
try:
    value = driver.find_element(By.ID, field_id).get_attribute("value")
except Exception:
    continue
if not value:
    continue
raw = value.strip()
try:
    parsed = datetime.strptime(raw, "%d/%m/%Y")
except ValueError:
    continue
```
Only genuinely blank optional fields should be skipped.
A missing element or a non-empty value that no longer matches `%d/%m/%Y` is scraper drift, not "no collection". The current `continue` can silently drop one or more bins and still report success. Raise with `field_id`/`raw` for those cases, and only ignore truly blank values. Based on learnings: prefer explicit failures over silent defaults or swallowed errors when parsing council bin collection data.
🧰 Tools
🪛 Ruff (0.15.9)
[error] 119-120: try-except-continue detected, consider logging the exception
(S112)
[warning] 119-119: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/MidlothianCouncil.py` around
lines 117 - 127, The code currently swallows parsing errors and missing data;
change the logic in the block using driver.find_element(By.ID, field_id) and
datetime.strptime(raw, "%d/%m/%Y") so that you only continue when the element is
truly missing or the value is genuinely blank (empty/whitespace). Do not
continue on a ValueError from datetime.strptime; instead raise an exception (or
re-raise a ValueError) that includes the field_id and raw value so scraper drift
is visible. Keep the try/except around driver.find_element to handle a missing
element, but remove the broad except around parsing and replace it with an
explicit raise that surfaces field_id and raw when parsing fails.
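The stricter handling the comment asks for can be sketched as a small helper; `parse_bin_date` is a hypothetical name, and the field value is passed in directly so the logic is testable without a driver:

```python
from datetime import datetime

def parse_bin_date(field_id, value):
    """Skip only genuinely blank fields; raise with context when a non-blank
    value fails to parse, so scraper drift is visible instead of silent."""
    if value is None or not value.strip():
        return None  # no collection date populated for this bin
    raw = value.strip()
    try:
        return datetime.strptime(raw, "%d/%m/%Y")
    except ValueError as exc:
        raise ValueError(f"Unparseable date in {field_id}: {raw!r}") from exc
```

A blank `dateGarden` (no garden service at the address) still yields `None`, but a format change like `14 Apr 2025` now fails loudly with the offending field named.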
```python
for opt in select.options:
    text = opt.text.strip()
    if target in text.upper():
        return text
```
The substring fallback can select the wrong property.
If `house_identifier` is `1`, this pass also matches `10`, `21`, etc. That can return another address's dates with no error. Tighten the fuzzy match to a token/prefix boundary, or require exactly one fuzzy hit before returning it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/MidlothianCouncil.py` around
lines 154 - 157, The current fuzzy fallback in MidlothianCouncil.py loops over
select.options (variables opt, text) and returns the first option whose
text.upper() contains target, which erroneously matches house_identifier "1"
inside "10" or "21"; change the logic in that loop to either 1) perform a
boundary-aware match (use token or regex word/prefix boundary against target on
text) or 2) collect all fuzzy matches and only return when exactly one match is
found (otherwise continue searching or raise/skip), and reference the variables
select.options, opt, text, target and the house_identifier usage to locate where
to apply the stricter match.
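Option 2 from the comment (keep the substring test but demand a unique hit) can be sketched like this; `unique_fuzzy_match` is an illustrative name operating on plain option texts:

```python
def unique_fuzzy_match(option_texts, house_identifier):
    """Return a substring match only when exactly one option matches, so '1'
    can never silently resolve while '10' and '21' are also candidates."""
    target = house_identifier.upper()
    hits = [t.strip() for t in option_texts if target in t.strip().upper()]
    return hits[0] if len(hits) == 1 else None
```

Ambiguity then falls through to whatever stricter pass or error handling follows, instead of picking the first candidate.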
```python
# Dismiss cookie consent banner (clicks get intercepted otherwise)
try:
    time.sleep(2)
    cookie_btn = driver.find_element(
        By.XPATH,
        "//button[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',"
        "'abcdefghijklmnopqrstuvwxyz'),'accept')]",
    )
    driver.execute_script("arguments[0].click();", cookie_btn)
    time.sleep(1)
except Exception:
    pass
```
Catch only the "banner not present" exception; let click failures surface.
The broad `except Exception: pass` masks real click or script execution failures during cookie banner dismissal. If the banner exists but the `execute_script()` call fails (e.g., due to click interception), the error is silently swallowed and execution continues. This makes production debugging harder; subsequent timeout errors won't reveal that the root cause was a cookie banner interaction failure. Catch `NoSuchElementException` explicitly when the banner is missing, and let click/script failures propagate to surface at the actual point of failure.
🧰 Tools
🪛 Ruff (0.15.9)
[error] 48-49: try-except-pass detected, consider logging the exception
(S110)
[warning] 48-48: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py` around
lines 38 - 49, The try/except around the cookie dismissal currently swallows all
exceptions; change it to only catch the missing-banner case by catching
selenium.common.exceptions.NoSuchElementException when calling
driver.find_element(By.XPATH, ...), so that if the element is not found we
silently continue but any failures from
driver.execute_script("arguments[0].click();", cookie_btn) or click interception
surface as errors; locate the cookie handling block using the symbols
driver.find_element, By.XPATH, cookie_btn and driver.execute_script in
StocktonOnTeesCouncil.py and replace the broad except Exception with an explicit
NoSuchElementException catch (importing it if needed).
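The narrower handling can be sketched with the driver calls abstracted as callables; the `NoSuchElementException` class here is a stand-in for selenium's, and `dismiss_cookie_banner` is an illustrative name:

```python
class NoSuchElementException(Exception):
    """Stand-in for selenium.common.exceptions.NoSuchElementException."""

def dismiss_cookie_banner(find_element, click):
    """Swallow only the missing-banner case; let click/script failures
    propagate. find_element and click stand in for the driver calls."""
    try:
        banner = find_element()
    except NoSuchElementException:
        return  # no banner present; nothing to dismiss
    click(banner)  # failures here surface instead of being swallowed
```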
```python
dom = ElementTree.fromstring(resp.text)
result = dom.find(
    "./soap:Body"
    "/a:getRoundCalendarForUPRNResponse"
    "/a:getRoundCalendarForUPRNResult",
    namespaces,
)
# Make a BS4 object
soup = BeautifulSoup(page.text, features="html.parser")
soup.prettify()
inner_html = result.text if result is not None else ""
```
Don't convert a missing SOAP result into an empty schedule.
If `getRoundCalendarForUPRNResult` disappears or the service returns a different envelope, `inner_html = ""` makes the scraper report "no bins" instead of surfacing the upstream break. Raise here with enough response context to diagnose the format change. Based on learnings: prefer explicit failures over silent defaults or swallowed errors when parsing council bin collection data.
🧰 Tools
🪛 Ruff (0.15.9)
[error] 55-55: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents
(S314)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`
around lines 55 - 63, The code silently defaults inner_html to "" when
getRoundCalendarForUPRNResult is missing which hides upstream format changes;
modify the parsing around ElementTree.fromstring(...) and the result lookup so
that if result is None you raise an explicit exception (or return a failure)
including useful context such as resp.status_code and a trimmed resp.text (or
the raw envelope) and the XPath you attempted (getRoundCalendarForUPRNResult) so
callers can diagnose the changed SOAP envelope instead of treating it as "no
bins".
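The explicit-failure version could look roughly like this. It uses stdlib `ElementTree` to keep the sketch dependency-free (the separate review comment recommends `defusedxml` in production), and the `a:` namespace URI shown is an assumption for illustration, not necessarily Torridge's real one:

```python
import xml.etree.ElementTree as ElementTree

NAMESPACES = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "a": "http://tempuri.org/",  # assumed service namespace for this sketch
}

def extract_round_calendar(body_text):
    """Raise with context when getRoundCalendarForUPRNResult is absent,
    instead of silently defaulting the calendar HTML to ''."""
    dom = ElementTree.fromstring(body_text)
    result = dom.find(
        "./soap:Body"
        "/a:getRoundCalendarForUPRNResponse"
        "/a:getRoundCalendarForUPRNResult",
        NAMESPACES,
    )
    if result is None:
        raise ValueError(
            "getRoundCalendarForUPRNResult missing from SOAP response: "
            + body_text[:200]
        )
    return result.text or ""
```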
```diff
 }
-dom = ElementTree.fromstring(page.text)
-page = dom.find(
+dom = ElementTree.fromstring(resp.text)
```
Use a hardened XML parser for the SOAP envelope.
`ElementTree.fromstring(resp.text)` is parsing untrusted remote XML from an external SOAP API. This is vulnerable to XXE (XML External Entity) attacks. Use `defusedxml` instead of the stdlib parser to safely parse the XML response.
🧰 Tools
🪛 Ruff (0.15.9)
[error] 55-55: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents
(S314)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py` at
line 55, Replace the unsafe stdlib XML parse call by using defusedxml's secure
parser: import defusedxml.ElementTree (e.g., from defusedxml import ElementTree
as DefusedET) and change the parsing call at the site where dom =
ElementTree.fromstring(resp.text) (in TorridgeDistrictCouncil.py) to use
DefusedET.fromstring on the response bytes/content; ensure you import defusedxml
and parse resp.content (not raw text) so the SOAP XML is parsed safely against
XXE attacks.
```python
# WestSuffolk's IIS returns 404 to requests without a User-Agent,
# so send a realistic browser UA.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
    )
}
response = requests.get(api_url, headers=headers)
```
Add explicit HTTP status check before parsing.
Line 26 fetches the response without validating the status code. If West Suffolk returns a 4xx/5xx error, line 28 parses the error page HTML, which won't contain the expected structure. This causes `collection_tag` to be empty (line 50), and the method silently returns `{"bins": []}` with no indication of failure. Add `response.raise_for_status()` immediately after line 26 to fail explicitly on upstream HTTP errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py` around
lines 18 - 26, The code performs requests.get(api_url, headers=headers) and then
parses the body without checking HTTP status, which can lead to silent failures;
update the WestSuffolkCouncil fetch flow by calling response.raise_for_status()
immediately after response = requests.get(api_url, headers=headers) so upstream
4xx/5xx errors raise and are visible instead of producing an empty
collection_tag and returning {"bins": []}; ensure this change is applied in the
method that builds headers/response and that any surrounding try/except handles
the raised HTTPError appropriately.
```python
date_text = " ".join(content.get_text(" ", strip=True).split())
date_text = remove_ordinal_indicator_from_date_string(date_text)
try:
    collection_date = datetime.strptime(date_text, "%A %d %B")
except ValueError:
    continue
```
Don't silently drop a matched bin when its date stops parsing.
Once `heading_text` matched `Your next ... collection`, a `ValueError` here means the council changed the rendered date format. `continue` turns that into partial data instead of a clear scraper failure. Raise with `heading_text`/`date_text` so the drift is visible immediately. Based on learnings: prefer explicit failures over silent defaults or swallowed errors when parsing council bin collection data.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py` around lines 64
- 69, When parsing the date in WyreCouncil.py, do not silently continue on
ValueError; instead raise an error that includes the context so drift is
visible. Replace the silent "continue" in the try/except around
datetime.strptime(date_text, "%A %d %B") with an exception raise that includes
heading_text and date_text (and optionally the original ValueError) so
callers/logs show which heading and raw date failed; reference the variables
date_text, collection_date, heading_text and the helper
remove_ordinal_indicator_from_date_string to locate the parsing logic and update
the except block accordingly.
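The raise-with-context version can be sketched as a helper; `parse_collection_date` is a hypothetical name taking the already-cleaned date text:

```python
from datetime import datetime

def parse_collection_date(heading_text, date_text):
    """Once a heading matched, a date that no longer fits '%A %d %B' is
    scraper drift; raise with both pieces of context instead of skipping."""
    try:
        return datetime.strptime(date_text, "%A %d %B")
    except ValueError as exc:
        raise ValueError(
            f"Unparseable date for {heading_text!r}: {date_text!r}"
        ) from exc
```

Note that `strptime` accepts `%A` without checking it against the parsed day, so this validates the format shape, not weekday consistency.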
What this changes
Midlothian's `directory_search.php?directoryID=35` endpoint is gone. The bin collection service has moved to a new MyMidlothian self-service portal at https://my.midlothian.gov.uk/service/Bin_Collection_Dates, which is a Granicus fillform rendered inside an iframe.

New flow
The old scraper's multi-step search-results -> address-link -> `ul.data-table` navigation no longer applies. The new layout:
- Switch into the `fillform-frame-1` iframe.
- Fill the `postcode` input; the `listAddress` `<select>` dropdown auto-populates via AJAX (no explicit search button).
- Selecting an address auto-fills six per-bin date fields: `dateRecycling`, `dateFood`, `dateGarden`, `dateCard`, `dateGlass`, `dateResidual`. No submit step.
- Read each date from the field's `value` attribute; they're already in `dd/mm/yyyy` format.

Bot detection note
The fillform iframe detects both headless Chrome and vanilla Selenium (presumably via `navigator.webdriver`) and refuses to populate the address dropdown. The scraper now uses `undetected_chromedriver` in non-headless mode when a `DISPLAY` environment variable is available (e.g. Xvfb on a headless server), and falls back to `create_webdriver` otherwise.

Test
Verified via the VPS wrapper against EH19 2EB #52:

Summary by CodeRabbit