fix: MidlothianCouncil - rewrite for MyMidlothian Granicus fillform#1956

Open
InertiaUK wants to merge 6 commits into robbrad:master from InertiaUK:fix/midlothian-fillform-iframe

Conversation

InertiaUK commented Apr 12, 2026

What this changes

Midlothian's directory_search.php?directoryID=35 endpoint is gone. The bin collection service has moved to a new MyMidlothian self-service portal at https://my.midlothian.gov.uk/service/Bin_Collection_Dates, which is a Granicus fillform rendered inside an iframe.

New flow

The old scraper's multi-step search-results -> address-link -> ul.data-table navigation no longer applies. The new layout:

  1. Load the service page and switch into the fillform-frame-1 iframe.
  2. Type the postcode into the postcode input — the listAddress <select> dropdown auto-populates via AJAX (no explicit search button).
  3. Select the matching address by visible text. Selecting an address auto-fills six per-bin date fields in the iframe: dateRecycling, dateFood, dateGarden, dateCard, dateGlass, dateResidual. No submit step.
  4. Read each date field's value attribute — they're already in dd/mm/yyyy format.
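The four steps above can be sketched as follows. The service URL, iframe ID, listAddress select, and the six date-field IDs come from this PR; the postcode input locator, the waits, and the helper names are illustrative assumptions.

```python
from datetime import date, datetime

# Date-field IDs named in this PR; their values are dd/mm/yyyy strings
DATE_FIELDS = ["dateRecycling", "dateFood", "dateGarden",
               "dateCard", "dateGlass", "dateResidual"]


def parse_bin_date(raw: str) -> date:
    """The fields already hold dd/mm/yyyy, so parsing is a single strptime."""
    return datetime.strptime(raw.strip(), "%d/%m/%Y").date()


def fetch_midlothian_dates(driver, postcode: str, address_text: str) -> dict:
    """Hypothetical walk through the four steps; selectors other than the
    ones named in this PR are assumptions, not the scraper's actual code."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select, WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver.get("https://my.midlothian.gov.uk/service/Bin_Collection_Dates")
    wait = WebDriverWait(driver, 30)
    # 1. Switch into the Granicus fillform iframe
    wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "fillform-frame-1")))
    # 2. Typing the postcode triggers the AJAX address lookup (no search button)
    wait.until(EC.presence_of_element_located((By.ID, "postcode"))).send_keys(postcode)
    # 3. Selecting an address auto-fills the six per-bin date fields
    wait.until(lambda d: len(Select(d.find_element(By.ID, "listAddress")).options) > 1)
    Select(driver.find_element(By.ID, "listAddress")).select_by_visible_text(address_text)
    # 4. Read the values straight off the pre-filled date inputs
    return {f: parse_bin_date(driver.find_element(By.ID, f).get_attribute("value"))
            for f in DATE_FIELDS}
```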

Bot detection note

The fillform iframe detects both headless Chrome and vanilla Selenium (presumably via navigator.webdriver) and refuses to populate the address dropdown. The scraper now uses undetected_chromedriver in non-headless mode when a DISPLAY environment variable is available (e.g. Xvfb on a headless server), and falls back to create_webdriver otherwise.
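The driver choice above reduces to a small branch: prefer the undetected driver when a display exists, fall back otherwise. A dependency-free sketch with the two driver factories injected (the real code calls uc.Chrome and this repo's create_webdriver):

```python
def choose_driver(display, uc_factory, fallback_factory):
    """Prefer the undetected driver only when a real display is available
    (e.g. Xvfb); if DISPLAY is unset or the factory fails, fall back to
    the standard driver."""
    if display:
        try:
            return uc_factory()
        except Exception:
            pass  # undetected_chromedriver missing or failed to start
    return fallback_factory()
```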

Test

Verified via the VPS wrapper against EH19 2EB #52:

{
  "bins": [
    { "type": "Recycling",     "collectionDate": "13/04/2026" },
    { "type": "Food waste",    "collectionDate": "15/04/2026" },
    { "type": "Garden waste",  "collectionDate": "17/04/2026" },
    { "type": "Cardboard",     "collectionDate": "20/04/2026" },
    { "type": "Glass",         "collectionDate": "24/04/2026" },
    { "type": "General waste", "collectionDate": "27/04/2026" }
  ]
}

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced collection data retrieval reliability across multiple council integrations (Chichester, Midlothian, Stockton-on-Tees, Torridge, and Wyre).
    • Improved address selection and web form interaction robustness.
    • Strengthened input validation and error handling for missing data.

Torridge's SOAP API changed its response from explicit dates ("Mon 14 Apr")
to relative phrases ("Tomorrow then every Mon", "Today then every Tue")
with an embedded calendar table. The old regex-based parser returned empty
because it expected the old format.

The rewritten parser handles Today/Tomorrow/weekday-name phrases, falls back
to the old explicit format if it reappears, and gracefully skips "No X
collection for this address" entries.
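The Today/Tomorrow/weekday handling boils down to resolving a phrase against today's date. A minimal sketch (the helper name, the exact phrase grammar, and the "today counts as the next occurrence" choice are assumptions):

```python
import re
from datetime import date, timedelta

WEEKDAYS = ("mon", "tue", "wed", "thu", "fri", "sat", "sun")


def extract_base_date(phrase: str, today: date):
    """Resolve 'Today then every Tue' / 'Tomorrow then every Mon' /
    'Wed then every Wed' to the next concrete collection date.
    Returns None for unrecognised text such as 'No X collection ...'."""
    p = phrase.strip().lower()
    if p.startswith("today"):
        return today
    if p.startswith("tomorrow"):
        return today + timedelta(days=1)
    m = re.match(r"([a-z]{3})", p)
    if m and m.group(1) in WEEKDAYS:
        # Days until the next occurrence of that weekday (today counts)
        delta = (WEEKDAYS.index(m.group(1)) - today.weekday()) % 7
        return today + timedelta(days=delta)
    return None
```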
The Wyre bin collection page now includes a 'Download calendar' box
among the .boxed divs. This box has no h3.bin-collection-tasks__heading,
which caused the scraper to crash on .text access.

Changes:
- Skip boxes missing the heading or content container
- Use regex to extract the bin name from 'Your next X collection'
- Collapse whitespace on the date text before strptime (p tags produced
  multi-whitespace runs after .text)
- Roll year forward only when the computed date is in the past, instead
  of only handling the December->January edge case
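The whitespace collapse and the generalized year roll-forward can be sketched as one pure helper (the name and exact date format are assumptions based on the description above; ordinal stripping is omitted):

```python
import re
from datetime import date, datetime


def parse_wyre_date(date_text: str, today: date) -> date:
    """Parse 'Monday 13 April'-style text (possibly containing whitespace
    runs from BeautifulSoup .text), attach the current year, and roll one
    year forward whenever the computed date is already past, rather than
    special-casing only the December->January boundary."""
    cleaned = re.sub(r"\s+", " ", date_text).strip()
    parsed = datetime.strptime(cleaned, "%A %d %B").replace(year=today.year)
    if parsed.date() < today:
        parsed = parsed.replace(year=today.year + 1)
    return parsed.date()
```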
…e results

The checkyourbinday page is now:
- Gated by Cloudflare Turnstile (requires non-headless UC)
- Using WASTECOLLECTIONCALENDARV7_* element IDs (was V5)
- Rendering collection dates inline as .gi-summary-blocklist__row
  divs after address selection — no separate NEXT/submit step

Changes:
- Update all element IDs from V5 to V7
- Add Cloudflare challenge wait loop (up to 50s)
- Dismiss cookie consent before interaction (was blocking button clicks)
- Replace old table-based parsing with .gi-summary-blocklist__row scraping
- Use Select.select_by_visible_text with stale-retry instead of manual
  option.click() loop (which crashed on AJAX re-renders)
- Remove the smart_select_address helper's dead fuzzy/strict split
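The stale-retry selection described above generalizes to a tiny wrapper; here the exception type and the action are injected so the shape is clear without a browser:

```python
import time


def retry_on(exc_type, action, attempts=3, delay=0.0):
    """Run `action`, retrying up to `attempts` times when it raises
    `exc_type`. Crucially, `action` re-does the element lookup on every
    attempt, so an AJAX re-render that staled the old node is harmless."""
    for attempt in range(attempts):
        try:
            return action()
        except exc_type:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```

In the scraper this would be called with StaleElementReferenceException and a lambda that re-locates the select, e.g. `retry_on(StaleElementReferenceException, lambda: Select(driver.find_element(By.ID, sel_id)).select_by_visible_text(addr))` (names illustrative).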
…anner

Stockton's AchieveForms form name changed from LOOKUPBINDATESBYADDRESSSKIPOUTOFREGION
to LOOKUPBINDATESBYADDRESSSKIPOUTOFREGIONV2, so every element ID needed
the V2 suffix added. Also added a cookie-banner dismissal step — the
banner was covering the search button and intercepting clicks.
The MyWestSuffolk.aspx IIS endpoint returns a 404 page to requests
without a User-Agent header. Adding a realistic Chrome UA restores the
full response (57KB), letting the existing parser pick up the bin
collection panel correctly.
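The fix is a one-header change; a sketch (the UA string is illustrative, since any realistic Chrome UA works, and raise_for_status is an addition here so 4xx/5xx would surface rather than being parsed):

```python
def chrome_headers() -> dict:
    """A realistic desktop-Chrome User-Agent; without one the IIS endpoint
    serves its 404 page instead of the bin collection panel."""
    return {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        )
    }


def fetch_page(api_url: str) -> str:
    import requests  # local import keeps the helper above dependency-free

    response = requests.get(api_url, headers=chrome_headers(), timeout=30)
    response.raise_for_status()
    return response.text
```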

coderabbitai bot commented Apr 12, 2026

📝 Walkthrough

Updated five UK council bin collection scrapers with modernized integrations: Chichester and Midlothian migrated to Selenium-based form automation with enhanced robustness (Cloudflare detection, cookie dismissal, dropdown retry logic); Stockton updated to V2 element IDs; Torridge refactored SOAP/HTML parsing with relative date handling; WestSuffolk added User-Agent header.

Changes

Cohort / File(s) | Summary

• Chichester Modernization (uk_bin_collection/.../ChichesterDistrictCouncil.py)
  Migrated element IDs from V5 to V7; added Cloudflare Turnstile detection and cookie banner dismissal; refactored address selection with fuzzy matching via _pick_option; replaced table-based result parsing with div.gi-summary-blocklist structure; added stale element retry logic (3 attempts) on dropdown selection.

• Midlothian HTTP→Selenium Migration (uk_bin_collection/.../MidlothianCouncil.py)
  Replaced HTTP+BeautifulSoup directory search with Selenium iframe form-filling; removed URL construction and HTML helpers; added postcode/address selection via _pick_address_option; reads bin dates from specific field IDs; added optional undetected_chromedriver fallback; deterministic driver cleanup in finally block; raises ValueError for missing required fields.

• Stockton Locator Updates (uk_bin_collection/.../StocktonOnTeesCouncil.py)
  Added cookie consent banner dismissal (non-blocking); updated Selenium locator IDs from LOOKUPBINDATESBYADDRESSSKIPOUTOFREGION_* to ...V2_* variants; retained user_paon matching logic in address XPath.

• Torridge Date Parsing Refactor (uk_bin_collection/.../TorridgeDistrictCouncil.py)
  Added explicit SOAPAction header to SOAP requests; replaced hardcoded bin-label extraction (regex over specific labels) with a generalized loop over <b>/<B> elements with filtering (sentinel sections, year patterns, "No collection" entries); introduced _extract_base_date helper to parse relative dates ("today", weekday phrases) and explicit "DayNum MonthName" patterns (current/next year fallback).

• WestSuffolk User-Agent (uk_bin_collection/.../WestSuffolkCouncil.py)
  Added browser-like User-Agent header to the HTTP request to MyWestSuffolk.aspx.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related PRs

  • March 2026 Release #1883: Council-specific Selenium and parsing enhancements (MidlothianCouncil, Bromley, Wakefield, and multiple council scrapers) that overlap with this March 2026 Release aggregation.

Suggested reviewers

  • dp247

Poem

🐰 Five councils now scrape with grace,
Selenium fills the form's embrace,
Cloudflare waits, banners flee,
Bins sorted by date—wild and free!
Fuzzy matches and date parsing spree,
Our hoppy bot did set them glee! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

• Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

• Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
• Title Check (✅ Passed): the PR title directly references MidlothianCouncil and the MyMidlothian Granicus fillform rewrite, which matches the main objective of replacing the old scraper with a new flow targeting the MyMidlothian service page.


The directory_search.php postcode endpoint the old scraper used is gone;
Midlothian migrated to a MyMidlothian Granicus fillform iframe at
my.midlothian.gov.uk. Rewrote the scraper to:

- Load the Bin_Collection_Dates service page
- Switch into the fillform-frame-1 iframe
- Fill the postcode field (dropdown auto-populates on change)
- Select the matching address — this auto-fills six per-bin date fields
  (dateRecycling, dateFood, dateGarden, dateCard, dateGlass, dateResidual)
- Read the dates directly from those fields, parse them as dd/mm/yyyy
- No submit button needed

The fillform iframe detects headless Chrome and vanilla Selenium and
refuses to populate the dropdown, so when a DISPLAY is available the
scraper now uses undetected_chromedriver in non-headless mode.
InertiaUK force-pushed the fix/midlothian-fillform-iframe branch from c932580 to f89b6e2 on April 12, 2026 13:32

coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py (1)

73-83: ⚠️ Potential issue | 🟠 Major

Avoid interpolating raw address text into XPath.

This selector is built from user_paon directly. A house name containing ' will produce an invalid XPath, and contains(., ...) can also select the wrong option when one address string is a substring of another. Prefer reading the <select> options in Python via Select instead of concatenating user input into XPath.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py` around
lines 73 - 83, The current XPath concatenates user_paon into a locator which
breaks on quotes and can match substrings; replace the WebDriverWait+XPath block
that references user_paon with Selenium's Select API: find the select element by
its ID (LOOKUPBINDATESBYADDRESSSKIPOUTOFREGIONV2_ADDRESSLOOKUPADDRESS), create a
Select from it, iterate its options and choose the one whose visible text
exactly matches user_paon (or sanitize/normalize whitespace before comparing),
then click/select that option; keep the WebDriverWait for the select element to
be clickable/visible before creating the Select to preserve timing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/ChichesterDistrictCouncil.py`:
- Around line 116-122: The _wait_for_cloudflare helper currently returns
silently after 25 polls even if the page title still shows the Cloudflare
challenge; change it so that after the polling loop, if the title still contains
"Just a moment" or "Attention" it raises a
selenium.common.exceptions.TimeoutException (with a clear message like
"Cloudflare challenge not cleared after timeout") instead of returning, so
callers immediately get a meaningful failure; update the
_wait_for_cloudflare(driver) function to perform the check after the loop and
raise TimeoutException when still blocked.
- Around line 190-193: The loop in ChichesterDistrictCouncil.py that checks
options (for opt in options: text = opt.text.strip(); if target in text.lower():
return text) is too permissive because substring matches like "1" will match
"10" or "21"; change the logic in the function that performs this fuzzy address
pass to use token-boundary matching (e.g., match whole tokens or use regex word
boundaries around target against text.lower()) or, alternatively, collect all
fuzzy matches into a list first and only return when there is exactly one unique
fuzzy hit; update the code that references options/opt/text/target accordingly
so ambiguous substring matches are not returned silently.

In `@uk_bin_collection/uk_bin_collection/councils/MidlothianCouncil.py`:
- Around line 117-127: The code currently swallows parsing errors and missing
data; change the logic in the block using driver.find_element(By.ID, field_id)
and datetime.strptime(raw, "%d/%m/%Y") so that you only continue when the
element is truly missing or the value is genuinely blank (empty/whitespace). Do
not continue on a ValueError from datetime.strptime; instead raise an exception
(or re-raise a ValueError) that includes the field_id and raw value so scraper
drift is visible. Keep the try/except around driver.find_element to handle a
missing element, but remove the broad except around parsing and replace it with
an explicit raise that surfaces field_id and raw when parsing fails.
- Around line 154-157: The current fuzzy fallback in MidlothianCouncil.py loops
over select.options (variables opt, text) and returns the first option whose
text.upper() contains target, which erroneously matches house_identifier "1"
inside "10" or "21"; change the logic in that loop to either 1) perform a
boundary-aware match (use token or regex word/prefix boundary against target on
text) or 2) collect all fuzzy matches and only return when exactly one match is
found (otherwise continue searching or raise/skip), and reference the variables
select.options, opt, text, target and the house_identifier usage to locate where
to apply the stricter match.
- Around line 56-67: The fallback forcibly passes a non-headless flag to
create_webdriver; change the hardcoded False to use the actual headless setting
(e.g. kwargs.get("headless")) so the fallback honors the caller's headless
intent. Locate the branch that sets driver (the undetected_chromedriver
try/except and the subsequent call create_webdriver(web_driver, False, None,
__name__)) and replace the False with the headless value derived from kwargs
(with a sensible default such as True if absent) so create_webdriver(web_driver,
headless, None, __name__) is called.

In `@uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py`:
- Around line 38-49: The try/except around the cookie dismissal currently
swallows all exceptions; change it to only catch the missing-banner case by
catching selenium.common.exceptions.NoSuchElementException when calling
driver.find_element(By.XPATH, ...), so that if the element is not found we
silently continue but any failures from
driver.execute_script("arguments[0].click();", cookie_btn) or click interception
surface as errors; locate the cookie handling block using the symbols
driver.find_element, By.XPATH, cookie_btn and driver.execute_script in
StocktonOnTeesCouncil.py and replace the broad except Exception with an explicit
NoSuchElementException catch (importing it if needed).

In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`:
- Around line 55-63: The code silently defaults inner_html to "" when
getRoundCalendarForUPRNResult is missing which hides upstream format changes;
modify the parsing around ElementTree.fromstring(...) and the result lookup so
that if result is None you raise an explicit exception (or return a failure)
including useful context such as resp.status_code and a trimmed resp.text (or
the raw envelope) and the XPath you attempted (getRoundCalendarForUPRNResult) so
callers can diagnose the changed SOAP envelope instead of treating it as "no
bins".
- Line 55: Replace the unsafe stdlib XML parse call by using defusedxml's secure
parser: import defusedxml.ElementTree (e.g., from defusedxml import ElementTree
as DefusedET) and change the parsing call at the site where dom =
ElementTree.fromstring(resp.text) (in TorridgeDistrictCouncil.py) to use
DefusedET.fromstring on the response bytes/content; ensure you import defusedxml
and parse resp.content (not raw text) so the SOAP XML is parsed safely against
XXE attacks.

In `@uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py`:
- Around line 18-26: The code performs requests.get(api_url, headers=headers)
and then parses the body without checking HTTP status, which can lead to silent
failures; update the WestSuffolkCouncil fetch flow by calling
response.raise_for_status() immediately after response = requests.get(api_url,
headers=headers) so upstream 4xx/5xx errors raise and are visible instead of
producing an empty collection_tag and returning {"bins": []}; ensure this change
is applied in the method that builds headers/response and that any surrounding
try/except handles the raised HTTPError appropriately.

In `@uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py`:
- Around line 64-69: When parsing the date in WyreCouncil.py, do not silently
continue on ValueError; instead raise an error that includes the context so
drift is visible. Replace the silent "continue" in the try/except around
datetime.strptime(date_text, "%A %d %B") with an exception raise that includes
heading_text and date_text (and optionally the original ValueError) so
callers/logs show which heading and raw date failed; reference the variables
date_text, collection_date, heading_text and the helper
remove_ordinal_indicator_from_date_string to locate the parsing logic and update
the except block accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0cd83b4f-2a0c-4e28-a678-82b4c57294da

📥 Commits

Reviewing files that changed from the base of the PR and between 60bd3cc and f89b6e2.

📒 Files selected for processing (6)
  • uk_bin_collection/uk_bin_collection/councils/ChichesterDistrictCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/MidlothianCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py

Comment on lines +116 to +122
def _wait_for_cloudflare(self, driver):
    for _ in range(25):
        title = driver.title or ""
        if "Just a moment" not in title and "Attention" not in title:
            return
        time.sleep(2)


⚠️ Potential issue | 🟠 Major

Raise when Cloudflare never clears.

This helper returns after 25 polls even if the page is still on the challenge screen, so the real failure turns into a later “element not found”. End the loop with a TimeoutException while the title still indicates Cloudflare.

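The suggested fix can be sketched with the title lookup and sleep injected so it runs without a browser; the real helper would read driver.title and raise selenium's TimeoutException, for which TimeoutError stands in here:

```python
def wait_for_cloudflare(get_title, polls=25, sleep=lambda s: None):
    """Poll until the Cloudflare interstitial clears; raise instead of
    returning silently so the real failure is not deferred to a later,
    misleading 'element not found'."""
    for _ in range(polls):
        title = get_title() or ""
        if "Just a moment" not in title and "Attention" not in title:
            return
        sleep(2)
    raise TimeoutError("Cloudflare challenge not cleared after timeout")
```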

Comment on lines +190 to +193
for opt in options:
    text = opt.text.strip()
    if target in text.lower():
        return text

⚠️ Potential issue | 🟠 Major

The fuzzy address pass is too permissive.

if target in text.lower() will happily map 1 to 10 ... or 21 ... when the strict prefix match misses. That's a silent wrong-property result, not just a missed match. Prefer token-boundary matching or require a unique fuzzy hit before returning it.

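Both remedies the review names, boundary-aware matching and requiring a unique hit, fit in a few lines. A sketch over plain option texts (function name hypothetical):

```python
import re


def fuzzy_pick(option_texts, target):
    """Match `target` only at token boundaries, and only when exactly one
    option qualifies; otherwise return None so the caller can fail loudly.
    With this, '1' matches '1 High St' but never '10 High St'."""
    pattern = re.compile(r"(?<!\w)" + re.escape(target.lower()) + r"(?!\w)")
    hits = [text for text in option_texts if pattern.search(text.lower())]
    return hits[0] if len(hits) == 1 else None
```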

Comment on lines +56 to +67
if os.environ.get("DISPLAY") and web_driver is None:
    try:
        import undetected_chromedriver as uc

        uc_opts = uc.ChromeOptions()
        uc_opts.add_argument("--no-sandbox")
        uc_opts.add_argument("--disable-dev-shm-usage")
        uc_opts.add_argument("--window-size=1920,1080")
        driver = uc.Chrome(options=uc_opts, version_main=146)
    except Exception:
        driver = None
if driver is None:
    driver = create_webdriver(web_driver, False, None, __name__)

⚠️ Potential issue | 🟠 Major

The fallback path is still forced non-headless.

In uk_bin_collection/uk_bin_collection/common.py:321-360, the second create_webdriver() argument controls headless mode. Passing False unconditionally here ignores kwargs["headless"] and makes the no-DISPLAY fallback try to launch a visible browser anyway, which is exactly the environment this branch is meant to cover.

🧰 Tools
🪛 Ruff (0.15.9)

[warning] 64-64: Do not catch blind exception: Exception

(BLE001)


Comment on lines +117 to +127
try:
    value = driver.find_element(By.ID, field_id).get_attribute("value")
except Exception:
    continue
if not value:
    continue
raw = value.strip()
try:
    parsed = datetime.strptime(raw, "%d/%m/%Y")
except ValueError:
    continue

⚠️ Potential issue | 🟠 Major

Only genuinely blank optional fields should be skipped.

A missing element or a non-empty value that no longer matches %d/%m/%Y is scraper drift, not “no collection”. The current continue can silently drop one or more bins and still report success. Raise with field_id/raw for those cases, and only ignore truly blank values. Based on learnings: prefer explicit failures over silent defaults or swallowed errors when parsing council bin collection data.

🧰 Tools
🪛 Ruff (0.15.9)

[error] 119-120: try-except-continue detected, consider logging the exception

(S112)


[warning] 119-119: Do not catch blind exception: Exception

(BLE001)

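The distinction the review draws, blank is fine but malformed is drift, fits in a small helper (hypothetical name; in the scraper this logic lives inline in the per-field loop):

```python
from datetime import datetime


def parse_bin_field(field_id: str, raw):
    """Return None only for genuinely blank optional fields; re-raise parse
    failures with the field id and raw value attached, so scraper drift is
    visible instead of a bin silently vanishing from the results."""
    if raw is None or not raw.strip():
        return None
    try:
        return datetime.strptime(raw.strip(), "%d/%m/%Y").date()
    except ValueError as exc:
        raise ValueError(f"Unexpected date in {field_id!r}: {raw!r}") from exc
```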

Comment on lines +154 to +157
for opt in select.options:
    text = opt.text.strip()
    if target in text.upper():
        return text

⚠️ Potential issue | 🟠 Major

The substring fallback can select the wrong property.

If house_identifier is 1, this pass also matches 10, 21, etc. That can return another address's dates with no error. Tighten the fuzzy match to a token/prefix boundary, or require exactly one fuzzy hit before returning it.


Comment on lines +38 to +49
# Dismiss cookie consent banner (clicks get intercepted otherwise)
try:
    time.sleep(2)
    cookie_btn = driver.find_element(
        By.XPATH,
        "//button[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',"
        "'abcdefghijklmnopqrstuvwxyz'),'accept')]",
    )
    driver.execute_script("arguments[0].click();", cookie_btn)
    time.sleep(1)
except Exception:
    pass

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd StocktonOnTeesCouncil.py

Repository: robbrad/UKBinCollectionData

Length of output: 140


🏁 Script executed:

cat -n uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py | sed -n '1,100p'

Repository: robbrad/UKBinCollectionData

Length of output: 4467


🏁 Script executed:

cat -n uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py | sed -n '38,160p'

Repository: robbrad/UKBinCollectionData

Length of output: 5772


🏁 Script executed:

rg "NoSuchElementException|find_element" uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py

Repository: robbrad/UKBinCollectionData

Length of output: 120


🏁 Script executed:

cat -n uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py | tail -20

Repository: robbrad/UKBinCollectionData

Length of output: 953


Catch only the "banner not present" exception; let click failures surface.

The broad except Exception: pass masks real click or script execution failures during cookie banner dismissal. If the banner exists but the execute_script() call fails (e.g., due to click interception), the error is silently swallowed and execution continues. This makes production debugging harder—subsequent timeout errors won't reveal the root cause was a cookie banner interaction failure. Catch NoSuchElementException explicitly when the banner is missing, and let click/script failures propagate to surface at the actual point of failure.

🧰 Tools
🪛 Ruff (0.15.9)

[error] 48-49: try-except-pass detected, consider logging the exception

(S110)


[warning] 48-48: Do not catch blind exception: Exception

(BLE001)

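The shape of the suggested fix, with the lookup and click injected so it runs without a browser; in the scraper, find_banner would be driver.find_element raising NoSuchElementException, for which LookupError stands in here:

```python
def dismiss_cookie_banner(find_banner, click):
    """Swallow only the banner-absent case; a failing click propagates so
    the error surfaces at the actual point of failure instead of as a
    later, unrelated timeout."""
    try:
        banner = find_banner()
    except LookupError:
        return  # no banner on this page, nothing to dismiss
    click(banner)
```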

Comment on lines +55 to 63
dom = ElementTree.fromstring(resp.text)
result = dom.find(
    "./soap:Body"
    "/a:getRoundCalendarForUPRNResponse"
    "/a:getRoundCalendarForUPRNResult",
    namespaces,
)
# Make a BS4 object
soup = BeautifulSoup(page.text, features="html.parser")
soup.prettify()
inner_html = result.text if result is not None else ""


⚠️ Potential issue | 🟠 Major

Don't convert a missing SOAP result into an empty schedule.

If getRoundCalendarForUPRNResult disappears or the service returns a different envelope, inner_html = "" makes the scraper report “no bins” instead of surfacing the upstream break. Raise here with enough response context to diagnose the format change. Based on learnings: prefer explicit failures over silent defaults or swallowed errors when parsing council bin collection data.

🧰 Tools
🪛 Ruff (0.15.9)

[error] 55-55: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents

(S314)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`
around lines 55-63, the code silently defaults inner_html to "" when
getRoundCalendarForUPRNResult is missing, which hides upstream format changes.
Modify the parsing around ElementTree.fromstring(...) and the result lookup so
that if result is None you raise an explicit exception (or return a failure)
including useful context, such as resp.status_code, a trimmed resp.text (or
the raw envelope), and the XPath you attempted (getRoundCalendarForUPRNResult), so
callers can diagnose the changed SOAP envelope instead of treating it as "no
bins".
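One way to make that failure explicit, sketched under the assumption that `resp.text` and `namespaces` match the snippet above (the helper name and the error-message format are illustrative; a separate comment below recommends defusedxml for untrusted input):

```python
import xml.etree.ElementTree as ElementTree

XPATH = (
    "./soap:Body"
    "/a:getRoundCalendarForUPRNResponse"
    "/a:getRoundCalendarForUPRNResult"
)


def extract_round_calendar(resp_text: str, namespaces: dict) -> str:
    dom = ElementTree.fromstring(resp_text)
    result = dom.find(XPATH, namespaces)
    if result is None:
        # Surface the upstream break instead of reporting "no bins".
        raise ValueError(
            f"SOAP result not found at {XPATH!r}; "
            f"response starts: {resp_text[:500]!r}"
        )
    return result.text or ""
```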

```diff
 }
-dom = ElementTree.fromstring(page.text)
-page = dom.find(
+dom = ElementTree.fromstring(resp.text)
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed:

```shell
# Locate and read the file
find . -name "TorridgeDistrictCouncil.py" -type f

# Once found, read the file to see the context around line 55
wc -l ./uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py

# Read lines around 55 with context
sed -n '1,70p' ./uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py | cat -n

# Check imports and defusedxml usage in the file
head -30 ./uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py | cat -n

# Search for defusedxml in the entire codebase
rg "defusedxml|from_string|ElementTree" --type py -i

# Check for requirements or dependencies files
find . -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" -o -name "Pipfile" | head -10

# Check common dependency files
cat ./requirements.txt 2>/dev/null || cat ./setup.py 2>/dev/null || cat ./pyproject.toml 2>/dev/null | head -50

# Search for any defusedxml usage in the entire codebase
rg "defusedxml" --type py

# Get the full dependencies section from pyproject.toml
cat ./pyproject.toml | grep -A 100 "dependencies"

# Check if defusedxml might be imported elsewhere
rg "import.*xml" --type py | head -20
```

Repository: robbrad/UKBinCollectionData


Use a hardened XML parser for the SOAP envelope.

ElementTree.fromstring(resp.text) is parsing untrusted remote XML from an external SOAP API. This is vulnerable to XXE (XML External Entity) attacks. Use defusedxml instead of the stdlib parser to safely parse the XML response.

🧰 Tools
🪛 Ruff (0.15.9)

[error] 55-55: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents

(S314)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py` at
line 55, replace the unsafe stdlib XML parse with defusedxml's secure
parser: import defusedxml.ElementTree (e.g. from defusedxml import ElementTree
as DefusedET) and change the call site dom = ElementTree.fromstring(resp.text)
to use DefusedET.fromstring on the response bytes/content; ensure you import
defusedxml and parse resp.content (not raw text) so the SOAP XML is parsed
safely against XXE attacks.
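For illustration, a minimal reproduction of what Ruff's S314 is warning about; the payload here is synthetic, and defusedxml.ElementTree.fromstring would raise EntitiesForbidden on the same input instead of expanding it:

```python
import xml.etree.ElementTree as ET

# An inline DTD defining an entity: the building block of billion-laughs /
# XXE payloads. The stdlib parser expands it without complaint.
payload = '<?xml version="1.0"?><!DOCTYPE d [<!ENTITY a "boom">]><d>&a;</d>'
print(ET.fromstring(payload).text)
```

The suggested fix leaves the call site unchanged apart from the import: `from defusedxml import ElementTree` rejects such documents up front.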

Comment on lines +18 to +26
```python
# WestSuffolk's IIS returns 404 to requests without a User-Agent,
# so send a realistic browser UA.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
    )
}
response = requests.get(api_url, headers=headers)
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed:

```shell
cat -n uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py | head -50

cat -n uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py | sed -n '26,80p'

cat -n uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py | tail -20
```

Repository: robbrad/UKBinCollectionData


Add explicit HTTP status check before parsing.

Line 26 fetches the response without validating the status code. If West Suffolk returns a 4xx/5xx error, line 28 parses the error page HTML, which won't contain the expected structure. This causes collection_tag to be empty (line 50), and the method silently returns {"bins": []} with no indication of failure. Add response.raise_for_status() immediately after line 26 to fail explicitly on upstream HTTP errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py` around
lines 18-26, the code calls requests.get(api_url, headers=headers) and then
parses the body without checking the HTTP status, which can lead to silent failures.
Call response.raise_for_status()
immediately after response = requests.get(api_url, headers=headers) so upstream
4xx/5xx errors raise and are visible instead of producing an empty
collection_tag and returning {"bins": []}. Apply this change in the
method that builds headers/response, and ensure any surrounding try/except handles
the raised HTTPError appropriately.
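A sketch of the guarded fetch; `fetch_page` and the `timeout` are illustrative additions, while the headers come from the snippet above:

```python
import requests

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
    )
}


def fetch_page(api_url: str) -> str:
    response = requests.get(api_url, headers=HEADERS, timeout=30)
    # An upstream error page still parses as HTML, so without this guard a
    # 4xx/5xx silently becomes {"bins": []}; with it, we get an HTTPError.
    response.raise_for_status()
    return response.text
```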

Comment on lines +64 to +69
```python
date_text = " ".join(content.get_text(" ", strip=True).split())
date_text = remove_ordinal_indicator_from_date_string(date_text)
try:
    collection_date = datetime.strptime(date_text, "%A %d %B")
except ValueError:
    continue
```

⚠️ Potential issue | 🟠 Major

Don't silently drop a matched bin when its date stops parsing.

Once heading_text matched Your next ... collection, a ValueError here means the council changed the rendered date format. continue turns that into partial data instead of a clear scraper failure. Raise with heading_text/date_text so the drift is visible immediately. Based on learnings: prefer explicit failures over silent defaults or swallowed errors when parsing council bin collection data.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py` around lines
64-69, do not silently continue on ValueError when parsing the date; instead
raise an error that includes the context so drift is visible. Replace the silent
continue in the try/except around datetime.strptime(date_text, "%A %d %B") with
a raise that includes heading_text and date_text (and optionally the original
ValueError) so callers/logs show which heading and raw date failed. Reference
the variables date_text, collection_date, heading_text and the helper
remove_ordinal_indicator_from_date_string to locate the parsing logic, and
update the except block accordingly.
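The suggested behaviour, sketched as a helper; `parse_collection_date` and the message format are illustrative, and the strptime format is the one in the snippet above:

```python
from datetime import datetime


def parse_collection_date(heading_text: str, date_text: str) -> datetime:
    try:
        return datetime.strptime(date_text, "%A %d %B")
    except ValueError as exc:
        # The heading matched but the date format drifted: fail loudly with
        # enough context to see which bin and which raw string broke.
        raise ValueError(
            f"Could not parse date {date_text!r} for {heading_text!r}"
        ) from exc
```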
