fix: MidlothianCouncil - rewrite for MyMidlothian Granicus fillform#1956
InertiaUK wants to merge 6 commits into robbrad:master from
Conversation
Torridge's SOAP API changed its response from explicit dates ("Mon 14 Apr")
to relative phrases ("Tomorrow then every Mon", "Today then every Tue")
with an embedded calendar table. The old regex-based parser returned empty
because it expected the old format.
The rewritten parser handles Today/Tomorrow/weekday-name phrases, falls back
to the old explicit format if it reappears, and gracefully skips "No X
collection for this address" entries.
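The relative-phrase handling described above could be sketched roughly like this. `resolve_relative_day` is a hypothetical name, and the rule that a bare weekday name means "the next occurrence, counting today" is an assumption for illustration, not necessarily the parser's actual behaviour:

```python
from datetime import date, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def resolve_relative_day(phrase: str, today: date) -> date:
    """Resolve 'Today', 'Tomorrow', or a weekday name ('Mon') to a date.
    Sketch only; assumes the phrase's first token carries the relative day."""
    token = phrase.strip().split()[0]
    if token == "Today":
        return today
    if token == "Tomorrow":
        return today + timedelta(days=1)
    # Next occurrence of the named weekday, with today counting as 0 days ahead
    target = WEEKDAYS.index(token[:3])
    return today + timedelta(days=(target - today.weekday()) % 7)
```

For example, given Monday 14 Apr 2025, "Tomorrow then every Mon" resolves to the 15th and "Fri" to the 18th.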
The Wyre bin collection page now includes a 'Download calendar' box among the `.boxed` divs. This box has no `h3.bin-collection-tasks__heading`, which caused the scraper to crash on `.text` access. Changes:
- Skip boxes missing the heading or content container
- Use regex to extract the bin name from 'Your next X collection'
- Collapse whitespace on the date text before `strptime` (`p` tags produced multi-whitespace runs after `.text`)
- Roll the year forward whenever the computed date is in the past, instead of only handling the December -> January edge case
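The year-rollover rule in the last bullet can be sketched as a small helper; `apply_collection_year` is a hypothetical name and the real scraper may structure this differently:

```python
from datetime import datetime

def apply_collection_year(parsed: datetime, today: datetime) -> datetime:
    """Attach the current year to a year-less parsed date (strptime defaults
    to 1900), rolling forward to next year only when the result is already
    in the past. Sketch of the behaviour described above, not the exact code."""
    candidate = parsed.replace(year=today.year)
    if candidate.date() < today.date():
        candidate = candidate.replace(year=today.year + 1)
    return candidate
```

This handles the December -> January wrap as a special case of the general "date is in the past" rule rather than testing for month 12 explicitly.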
…e results The checkyourbinday page is now:
- Gated by Cloudflare Turnstile (requires non-headless UC)
- Using WASTECOLLECTIONCALENDARV7_* element IDs (was V5)
- Rendering collection dates inline as `.gi-summary-blocklist__row` divs after address selection; no separate NEXT/submit step

Changes:
- Update all element IDs from V5 to V7
- Add a Cloudflare challenge wait loop (up to 50s)
- Dismiss cookie consent before interaction (it was blocking button clicks)
- Replace the old table-based parsing with `.gi-summary-blocklist__row` scraping
- Use `Select.select_by_visible_text` with a stale-retry instead of a manual `option.click()` loop (which crashed on AJAX re-renders)
- Remove the `smart_select_address` helper's dead fuzzy/strict split
…anner Stockton's AchieveForms form name changed from LOOKUPBINDATESBYADDRESSSKIPOUTOFREGION to LOOKUPBINDATESBYADDRESSSKIPOUTOFREGIONV2, so every element ID needed the V2 suffix added. Also added a cookie-banner dismissal step — the banner was covering the search button and intercepting clicks.
The MyWestSuffolk.aspx IIS endpoint returns a 404 page to requests without a User-Agent header. Adding a realistic Chrome UA restores the full response (57KB), letting the existing parser pick up the bin collection panel correctly.
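A minimal sketch of the fix, with the fetch parametrized by a `get` callable so the logic can be exercised without the network. `fetch_page` and `CHROME_UA` are illustrative names; the real scraper calls `requests.get` directly. The explicit status check is included because a 4xx/5xx body would otherwise be parsed as if it were the bin panel:

```python
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
)

def fetch_page(url, get):
    """get stands in for requests.get; pass requests.get in real use."""
    response = get(url, headers={"User-Agent": CHROME_UA})
    response.raise_for_status()  # surface 4xx/5xx instead of parsing an error page
    return response.text
```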
📝 Walkthrough
Updated six UK council bin collection scrapers with modernized integrations: Chichester and Midlothian migrated to Selenium-based form automation with enhanced robustness (Cloudflare detection, cookie dismissal, dropdown retry logic); Stockton updated to V2 element IDs; Torridge refactored SOAP/HTML parsing with relative date handling; WestSuffolk added a User-Agent header; Wyre hardened against a missing heading and date drift.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
The directory_search.php postcode endpoint the old scraper used is gone; Midlothian migrated to a MyMidlothian Granicus fillform iframe at my.midlothian.gov.uk. Rewrote the scraper to:
- Load the Bin_Collection_Dates service page
- Switch into the fillform-frame-1 iframe
- Fill the postcode field (the dropdown auto-populates on change)
- Select the matching address, which auto-fills six per-bin date fields (dateRecycling, dateFood, dateGarden, dateCard, dateGlass, dateResidual)
- Read the dates directly from those fields and parse them as dd/mm/yyyy
- No submit button needed

The fillform iframe detects headless Chrome and vanilla Selenium and refuses to populate the dropdown, so when a DISPLAY is available the scraper now uses undetected_chromedriver in non-headless mode.
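The DISPLAY-gated driver selection can be sketched as follows. The factory callables stand in for the real `uc.Chrome(...)` and `create_webdriver(...)` constructors, and `choose_driver` is an illustrative name, not the scraper's actual function:

```python
import os

def choose_driver(web_driver, make_undetected, make_standard, env=os.environ):
    """Prefer undetected_chromedriver (non-headless) when a DISPLAY is
    available and no remote web_driver URL was given; otherwise fall back
    to the standard create_webdriver path. Sketch only."""
    driver = None
    if env.get("DISPLAY") and web_driver is None:
        try:
            driver = make_undetected()
        except Exception:
            driver = None  # uc missing or failed to launch; use the fallback
    if driver is None:
        driver = make_standard()
    return driver
```

Passing the environment and factories in makes the branch logic testable without launching a browser.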
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py (1)
73-83: ⚠️ Potential issue | 🟠 Major
Avoid interpolating raw address text into XPath.
This selector is built from `user_paon` directly. A house name containing `'` will produce an invalid XPath, and `contains(., ...)` can also select the wrong option when one address string is a substring of another. Prefer reading the `<select>` options in Python via `Select` instead of concatenating user input into XPath.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py` around lines 73 - 83, The current XPath concatenates user_paon into a locator which breaks on quotes and can match substrings; replace the WebDriverWait+XPath block that references user_paon with Selenium's Select API: find the select element by its ID (LOOKUPBINDATESBYADDRESSSKIPOUTOFREGIONV2_ADDRESSLOOKUPADDRESS), create a Select from it, iterate its options and choose the one whose visible text exactly matches user_paon (or sanitize/normalize whitespace before comparing), then click/select that option; keep the WebDriverWait for the select element to be clickable/visible before creating the Select to preserve timing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/ChichesterDistrictCouncil.py`:
- Around line 116-122: The _wait_for_cloudflare helper currently returns
silently after 25 polls even if the page title still shows the Cloudflare
challenge; change it so that after the polling loop, if the title still contains
"Just a moment" or "Attention" it raises a
selenium.common.exceptions.TimeoutException (with a clear message like
"Cloudflare challenge not cleared after timeout") instead of returning, so
callers immediately get a meaningful failure; update the
_wait_for_cloudflare(driver) function to perform the check after the loop and
raise TimeoutException when still blocked.
- Around line 190-193: The loop in ChichesterDistrictCouncil.py that checks
options (for opt in options: text = opt.text.strip(); if target in text.lower():
return text) is too permissive because substring matches like "1" will match
"10" or "21"; change the logic in the function that performs this fuzzy address
pass to use token-boundary matching (e.g., match whole tokens or use regex word
boundaries around target against text.lower()) or, alternatively, collect all
fuzzy matches into a list first and only return when there is exactly one unique
fuzzy hit; update the code that references options/opt/text/target accordingly
so ambiguous substring matches are not returned silently.
In `@uk_bin_collection/uk_bin_collection/councils/MidlothianCouncil.py`:
- Around line 117-127: The code currently swallows parsing errors and missing
data; change the logic in the block using driver.find_element(By.ID, field_id)
and datetime.strptime(raw, "%d/%m/%Y") so that you only continue when the
element is truly missing or the value is genuinely blank (empty/whitespace). Do
not continue on a ValueError from datetime.strptime; instead raise an exception
(or re-raise a ValueError) that includes the field_id and raw value so scraper
drift is visible. Keep the try/except around driver.find_element to handle a
missing element, but remove the broad except around parsing and replace it with
an explicit raise that surfaces field_id and raw when parsing fails.
- Around line 154-157: The current fuzzy fallback in MidlothianCouncil.py loops
over select.options (variables opt, text) and returns the first option whose
text.upper() contains target, which erroneously matches house_identifier "1"
inside "10" or "21"; change the logic in that loop to either 1) perform a
boundary-aware match (use token or regex word/prefix boundary against target on
text) or 2) collect all fuzzy matches and only return when exactly one match is
found (otherwise continue searching or raise/skip), and reference the variables
select.options, opt, text, target and the house_identifier usage to locate where
to apply the stricter match.
- Around line 56-67: The fallback forcibly passes a non-headless flag to
create_webdriver; change the hardcoded False to use the actual headless setting
(e.g. kwargs.get("headless")) so the fallback honors the caller's headless
intent. Locate the branch that sets driver (the undetected_chromedriver
try/except and the subsequent call create_webdriver(web_driver, False, None,
__name__)) and replace the False with the headless value derived from kwargs
(with a sensible default such as True if absent) so create_webdriver(web_driver,
headless, None, __name__) is called.
In `@uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py`:
- Around line 38-49: The try/except around the cookie dismissal currently
swallows all exceptions; change it to only catch the missing-banner case by
catching selenium.common.exceptions.NoSuchElementException when calling
driver.find_element(By.XPATH, ...), so that if the element is not found we
silently continue but any failures from
driver.execute_script("arguments[0].click();", cookie_btn) or click interception
surface as errors; locate the cookie handling block using the symbols
driver.find_element, By.XPATH, cookie_btn and driver.execute_script in
StocktonOnTeesCouncil.py and replace the broad except Exception with an explicit
NoSuchElementException catch (importing it if needed).
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`:
- Around line 55-63: The code silently defaults inner_html to "" when
getRoundCalendarForUPRNResult is missing which hides upstream format changes;
modify the parsing around ElementTree.fromstring(...) and the result lookup so
that if result is None you raise an explicit exception (or return a failure)
including useful context such as resp.status_code and a trimmed resp.text (or
the raw envelope) and the XPath you attempted (getRoundCalendarForUPRNResult) so
callers can diagnose the changed SOAP envelope instead of treating it as "no
bins".
- Line 55: Replace the unsafe stdlib XML parse call by using defusedxml's secure
parser: import defusedxml.ElementTree (e.g., from defusedxml import ElementTree
as DefusedET) and change the parsing call at the site where dom =
ElementTree.fromstring(resp.text) (in TorridgeDistrictCouncil.py) to use
DefusedET.fromstring on the response bytes/content; ensure you import defusedxml
and parse resp.content (not raw text) so the SOAP XML is parsed safely against
XXE attacks.
In `@uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py`:
- Around line 18-26: The code performs requests.get(api_url, headers=headers)
and then parses the body without checking HTTP status, which can lead to silent
failures; update the WestSuffolkCouncil fetch flow by calling
response.raise_for_status() immediately after response = requests.get(api_url,
headers=headers) so upstream 4xx/5xx errors raise and are visible instead of
producing an empty collection_tag and returning {"bins": []}; ensure this change
is applied in the method that builds headers/response and that any surrounding
try/except handles the raised HTTPError appropriately.
In `@uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py`:
- Around line 64-69: When parsing the date in WyreCouncil.py, do not silently
continue on ValueError; instead raise an error that includes the context so
drift is visible. Replace the silent "continue" in the try/except around
datetime.strptime(date_text, "%A %d %B") with an exception raise that includes
heading_text and date_text (and optionally the original ValueError) so
callers/logs show which heading and raw date failed; reference the variables
date_text, collection_date, heading_text and the helper
remove_ordinal_indicator_from_date_string to locate the parsing logic and update
the except block accordingly.
---
Outside diff comments:
In `@uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py`:
- Around line 73-83: The current XPath concatenates user_paon into a locator
which breaks on quotes and can match substrings; replace the WebDriverWait+XPath
block that references user_paon with Selenium's Select API: find the select
element by its ID
(LOOKUPBINDATESBYADDRESSSKIPOUTOFREGIONV2_ADDRESSLOOKUPADDRESS), create a Select
from it, iterate its options and choose the one whose visible text exactly
matches user_paon (or sanitize/normalize whitespace before comparing), then
click/select that option; keep the WebDriverWait for the select element to be
clickable/visible before creating the Select to preserve timing.
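The Select-based replacement the comment suggests could look like this matching helper. `pick_option_text` operates on plain option texts (in real code, `[o.text for o in Select(element).options]`) and is an illustrative sketch, not the repository's implementation:

```python
def pick_option_text(option_texts, user_paon):
    """Pick the option whose whitespace-normalized visible text exactly
    matches user_paon, instead of interpolating user_paon into an XPath.
    Raises when no exact match exists so a wrong property is never chosen."""
    def norm(s):
        return " ".join(s.split())
    wanted = norm(user_paon)
    for text in option_texts:
        if norm(text) == wanted:
            return text
    raise ValueError(
        f"Address {user_paon!r} not found among {len(option_texts)} options"
    )
```

The returned text would then be fed to `Select.select_by_visible_text`, avoiding both the quote-escaping problem and the substring-match problem of `contains(., ...)`.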
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0cd83b4f-2a0c-4e28-a678-82b4c57294da
📒 Files selected for processing (6)
- uk_bin_collection/uk_bin_collection/councils/ChichesterDistrictCouncil.py
- uk_bin_collection/uk_bin_collection/councils/MidlothianCouncil.py
- uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py
- uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py
- uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py
- uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py
```python
def _wait_for_cloudflare(self, driver):
    for _ in range(25):
        title = driver.title or ""
        if "Just a moment" not in title and "Attention" not in title:
            return
        time.sleep(2)
```
Raise when Cloudflare never clears.
This helper returns after 25 polls even if the page is still on the challenge screen, so the real failure turns into a later “element not found”. End the loop with a TimeoutException while the title still indicates Cloudflare.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/ChichesterDistrictCouncil.py`
around lines 116 - 122, The _wait_for_cloudflare helper currently returns
silently after 25 polls even if the page title still shows the Cloudflare
challenge; change it so that after the polling loop, if the title still contains
"Just a moment" or "Attention" it raises a
selenium.common.exceptions.TimeoutException (with a clear message like
"Cloudflare challenge not cleared after timeout") instead of returning, so
callers immediately get a meaningful failure; update the
_wait_for_cloudflare(driver) function to perform the check after the loop and
raise TimeoutException when still blocked.
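A sketch of the suggested fix. `CloudflareTimeout` stands in for selenium's `TimeoutException` so the example carries no selenium dependency, and `wait_for_cloudflare` only needs an object exposing a `title` attribute:

```python
import time

class CloudflareTimeout(TimeoutError):
    """Stand-in for selenium.common.exceptions.TimeoutException."""

def wait_for_cloudflare(driver, polls=25, delay=2):
    """Poll as before, but check again after the loop and raise instead of
    returning silently while the challenge page is still showing."""
    for _ in range(polls):
        title = driver.title or ""
        if "Just a moment" not in title and "Attention" not in title:
            return
        time.sleep(delay)
    title = driver.title or ""
    if "Just a moment" in title or "Attention" in title:
        raise CloudflareTimeout("Cloudflare challenge not cleared after timeout")
```

The caller now gets a named failure at the real point of breakage rather than a later, unrelated "element not found".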
```python
for opt in options:
    text = opt.text.strip()
    if target in text.lower():
        return text
```
The fuzzy address pass is too permissive.
`if target in text.lower()` will happily map `1` to `10 ...` or `21 ...` when the strict prefix match misses. That's a silent wrong-property result, not just a missed match. Prefer token-boundary matching or require a unique fuzzy hit before returning it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/ChichesterDistrictCouncil.py`
around lines 190 - 193, The loop in ChichesterDistrictCouncil.py that checks
options (for opt in options: text = opt.text.strip(); if target in text.lower():
return text) is too permissive because substring matches like "1" will match
"10" or "21"; change the logic in the function that performs this fuzzy address
pass to use token-boundary matching (e.g., match whole tokens or use regex word
boundaries around target against text.lower()) or, alternatively, collect all
fuzzy matches into a list first and only return when there is exactly one unique
fuzzy hit; update the code that references options/opt/text/target accordingly
so ambiguous substring matches are not returned silently.
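The token-boundary variant suggested above can be sketched as a pure function over the option texts; `fuzzy_address_match` is an illustrative name, and it also applies the unique-hit rule so ambiguity returns nothing rather than a guess:

```python
import re

def fuzzy_address_match(option_texts, target):
    """Match target only on whole-token boundaries, so '1' no longer matches
    '10' or '21', and return a hit only when it is unique."""
    pattern = re.compile(r"(?<!\w)" + re.escape(target.lower()) + r"(?!\w)")
    hits = [t for t in option_texts if pattern.search(t.lower())]
    return hits[0] if len(hits) == 1 else None
```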
```python
if os.environ.get("DISPLAY") and web_driver is None:
    try:
        import undetected_chromedriver as uc
        uc_opts = uc.ChromeOptions()
        uc_opts.add_argument("--no-sandbox")
        uc_opts.add_argument("--disable-dev-shm-usage")
        uc_opts.add_argument("--window-size=1920,1080")
        driver = uc.Chrome(options=uc_opts, version_main=146)
    except Exception:
        driver = None
if driver is None:
    driver = create_webdriver(web_driver, False, None, __name__)
```
The fallback path is still forced non-headless.
In `uk_bin_collection/uk_bin_collection/common.py:321-360`, the second `create_webdriver()` argument controls headless mode. Passing `False` unconditionally here ignores `kwargs["headless"]` and makes the no-DISPLAY fallback try to launch a visible browser anyway, which is exactly the environment this branch is meant to cover.
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 64-64: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/MidlothianCouncil.py` around
lines 56 - 67, The fallback forcibly passes a non-headless flag to
create_webdriver; change the hardcoded False to use the actual headless setting
(e.g. kwargs.get("headless")) so the fallback honors the caller's headless
intent. Locate the branch that sets driver (the undetected_chromedriver
try/except and the subsequent call create_webdriver(web_driver, False, None,
__name__)) and replace the False with the headless value derived from kwargs
(with a sensible default such as True if absent) so create_webdriver(web_driver,
headless, None, __name__) is called.
```python
try:
    value = driver.find_element(By.ID, field_id).get_attribute("value")
except Exception:
    continue
if not value:
    continue
raw = value.strip()
try:
    parsed = datetime.strptime(raw, "%d/%m/%Y")
except ValueError:
    continue
```
Only genuinely blank optional fields should be skipped.
A missing element or a non-empty value that no longer matches `%d/%m/%Y` is scraper drift, not "no collection". The current `continue` can silently drop one or more bins and still report success. Raise with `field_id`/`raw` for those cases, and only ignore truly blank values. Based on learnings: prefer explicit failures over silent defaults or swallowed errors when parsing council bin collection data.
🧰 Tools
🪛 Ruff (0.15.9)
[error] 119-120: try-except-continue detected, consider logging the exception
(S112)
[warning] 119-119: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/MidlothianCouncil.py` around
lines 117 - 127, The code currently swallows parsing errors and missing data;
change the logic in the block using driver.find_element(By.ID, field_id) and
datetime.strptime(raw, "%d/%m/%Y") so that you only continue when the element is
truly missing or the value is genuinely blank (empty/whitespace). Do not
continue on a ValueError from datetime.strptime; instead raise an exception (or
re-raise a ValueError) that includes the field_id and raw value so scraper drift
is visible. Keep the try/except around driver.find_element to handle a missing
element, but remove the broad except around parsing and replace it with an
explicit raise that surfaces field_id and raw when parsing fails.
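The stricter handling the comment asks for can be sketched as a small helper; `parse_bin_date` is a hypothetical name, and the field value is passed in directly so the logic is testable without a driver:

```python
from datetime import datetime

def parse_bin_date(field_id, value):
    """Skip only genuinely blank fields; raise with context when a non-blank
    value fails to parse, so scraper drift is visible instead of silent."""
    if value is None or not value.strip():
        return None  # no collection date populated for this bin
    raw = value.strip()
    try:
        return datetime.strptime(raw, "%d/%m/%Y")
    except ValueError as exc:
        raise ValueError(f"Unparseable date in {field_id}: {raw!r}") from exc
```

A blank `dateGarden` (no garden service at the address) still yields `None`, but a format change like `14 Apr 2025` now fails loudly with the offending field named.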
```python
for opt in select.options:
    text = opt.text.strip()
    if target in text.upper():
        return text
```
The substring fallback can select the wrong property.
If `house_identifier` is `1`, this pass also matches `10`, `21`, etc. That can return another address's dates with no error. Tighten the fuzzy match to a token/prefix boundary, or require exactly one fuzzy hit before returning it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/MidlothianCouncil.py` around
lines 154 - 157, The current fuzzy fallback in MidlothianCouncil.py loops over
select.options (variables opt, text) and returns the first option whose
text.upper() contains target, which erroneously matches house_identifier "1"
inside "10" or "21"; change the logic in that loop to either 1) perform a
boundary-aware match (use token or regex word/prefix boundary against target on
text) or 2) collect all fuzzy matches and only return when exactly one match is
found (otherwise continue searching or raise/skip), and reference the variables
select.options, opt, text, target and the house_identifier usage to locate where
to apply the stricter match.
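Option 2 from the comment (keep the substring test but demand a unique hit) can be sketched like this; `unique_fuzzy_match` is an illustrative name operating on plain option texts:

```python
def unique_fuzzy_match(option_texts, house_identifier):
    """Return a substring match only when exactly one option matches, so '1'
    can never silently resolve while '10' and '21' are also candidates."""
    target = house_identifier.upper()
    hits = [t.strip() for t in option_texts if target in t.strip().upper()]
    return hits[0] if len(hits) == 1 else None
```

Ambiguity then falls through to whatever stricter pass or error handling follows, instead of picking the first candidate.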
```python
# Dismiss cookie consent banner (clicks get intercepted otherwise)
try:
    time.sleep(2)
    cookie_btn = driver.find_element(
        By.XPATH,
        "//button[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',"
        "'abcdefghijklmnopqrstuvwxyz'),'accept')]",
    )
    driver.execute_script("arguments[0].click();", cookie_btn)
    time.sleep(1)
except Exception:
    pass
```
Catch only the "banner not present" exception; let click failures surface.
The broad `except Exception: pass` masks real click or script execution failures during cookie banner dismissal. If the banner exists but the `execute_script()` call fails (e.g., due to click interception), the error is silently swallowed and execution continues. This makes production debugging harder; subsequent timeout errors won't reveal that the root cause was a cookie banner interaction failure. Catch `NoSuchElementException` explicitly when the banner is missing, and let click/script failures propagate to surface at the actual point of failure.
🧰 Tools
🪛 Ruff (0.15.9)
[error] 48-49: try-except-pass detected, consider logging the exception
(S110)
[warning] 48-48: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py` around
lines 38 - 49, The try/except around the cookie dismissal currently swallows all
exceptions; change it to only catch the missing-banner case by catching
selenium.common.exceptions.NoSuchElementException when calling
driver.find_element(By.XPATH, ...), so that if the element is not found we
silently continue but any failures from
driver.execute_script("arguments[0].click();", cookie_btn) or click interception
surface as errors; locate the cookie handling block using the symbols
driver.find_element, By.XPATH, cookie_btn and driver.execute_script in
StocktonOnTeesCouncil.py and replace the broad except Exception with an explicit
NoSuchElementException catch (importing it if needed).
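The narrower handling can be sketched with the driver calls abstracted as callables; the `NoSuchElementException` class here is a stand-in for selenium's, and `dismiss_cookie_banner` is an illustrative name:

```python
class NoSuchElementException(Exception):
    """Stand-in for selenium.common.exceptions.NoSuchElementException."""

def dismiss_cookie_banner(find_element, click):
    """Swallow only the missing-banner case; let click/script failures
    propagate. find_element and click stand in for the driver calls."""
    try:
        banner = find_element()
    except NoSuchElementException:
        return  # no banner present; nothing to dismiss
    click(banner)  # failures here surface instead of being swallowed
```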
```python
dom = ElementTree.fromstring(resp.text)
result = dom.find(
    "./soap:Body"
    "/a:getRoundCalendarForUPRNResponse"
    "/a:getRoundCalendarForUPRNResult",
    namespaces,
)
# Make a BS4 object
soup = BeautifulSoup(page.text, features="html.parser")
soup.prettify()
inner_html = result.text if result is not None else ""
```
Don't convert a missing SOAP result into an empty schedule.
If `getRoundCalendarForUPRNResult` disappears or the service returns a different envelope, `inner_html = ""` makes the scraper report "no bins" instead of surfacing the upstream break. Raise here with enough response context to diagnose the format change. Based on learnings: prefer explicit failures over silent defaults or swallowed errors when parsing council bin collection data.
🧰 Tools
🪛 Ruff (0.15.9)
[error] 55-55: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents
(S314)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py`
around lines 55 - 63, The code silently defaults inner_html to "" when
getRoundCalendarForUPRNResult is missing which hides upstream format changes;
modify the parsing around ElementTree.fromstring(...) and the result lookup so
that if result is None you raise an explicit exception (or return a failure)
including useful context such as resp.status_code and a trimmed resp.text (or
the raw envelope) and the XPath you attempted (getRoundCalendarForUPRNResult) so
callers can diagnose the changed SOAP envelope instead of treating it as "no
bins".
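The explicit-failure version could look roughly like this. It uses stdlib `ElementTree` to keep the sketch dependency-free (the separate review comment recommends `defusedxml` in production), and the `a:` namespace URI shown is an assumption for illustration, not necessarily Torridge's real one:

```python
import xml.etree.ElementTree as ElementTree

NAMESPACES = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "a": "http://tempuri.org/",  # assumed service namespace for this sketch
}

def extract_round_calendar(body_text):
    """Raise with context when getRoundCalendarForUPRNResult is absent,
    instead of silently defaulting the calendar HTML to ''."""
    dom = ElementTree.fromstring(body_text)
    result = dom.find(
        "./soap:Body"
        "/a:getRoundCalendarForUPRNResponse"
        "/a:getRoundCalendarForUPRNResult",
        NAMESPACES,
    )
    if result is None:
        raise ValueError(
            "getRoundCalendarForUPRNResult missing from SOAP response: "
            + body_text[:200]
        )
    return result.text or ""
```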
```diff
 }
-dom = ElementTree.fromstring(page.text)
-page = dom.find(
+dom = ElementTree.fromstring(resp.text)
```
Use a hardened XML parser for the SOAP envelope.
`ElementTree.fromstring(resp.text)` is parsing untrusted remote XML from an external SOAP API. This is vulnerable to XXE (XML External Entity) attacks. Use `defusedxml` instead of the stdlib parser to safely parse the XML response.
🧰 Tools
🪛 Ruff (0.15.9)
[error] 55-55: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents
(S314)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py` at
line 55, Replace the unsafe stdlib XML parse call by using defusedxml's secure
parser: import defusedxml.ElementTree (e.g., from defusedxml import ElementTree
as DefusedET) and change the parsing call at the site where dom =
ElementTree.fromstring(resp.text) (in TorridgeDistrictCouncil.py) to use
DefusedET.fromstring on the response bytes/content; ensure you import defusedxml
and parse resp.content (not raw text) so the SOAP XML is parsed safely against
XXE attacks.
```python
# WestSuffolk's IIS returns 404 to requests without a User-Agent,
# so send a realistic browser UA.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
    )
}
response = requests.get(api_url, headers=headers)
```
Add explicit HTTP status check before parsing.
Line 26 fetches the response without validating the status code. If West Suffolk returns a 4xx/5xx error, line 28 parses the error page HTML, which won't contain the expected structure. This causes `collection_tag` to be empty (line 50), and the method silently returns `{"bins": []}` with no indication of failure. Add `response.raise_for_status()` immediately after line 26 to fail explicitly on upstream HTTP errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py` around
lines 18 - 26, The code performs requests.get(api_url, headers=headers) and then
parses the body without checking HTTP status, which can lead to silent failures;
update the WestSuffolkCouncil fetch flow by calling response.raise_for_status()
immediately after response = requests.get(api_url, headers=headers) so upstream
4xx/5xx errors raise and are visible instead of producing an empty
collection_tag and returning {"bins": []}; ensure this change is applied in the
method that builds headers/response and that any surrounding try/except handles
the raised HTTPError appropriately.
```python
date_text = " ".join(content.get_text(" ", strip=True).split())
date_text = remove_ordinal_indicator_from_date_string(date_text)
try:
    collection_date = datetime.strptime(date_text, "%A %d %B")
except ValueError:
    continue
```
Don't silently drop a matched bin when its date stops parsing.
Once `heading_text` matched `Your next ... collection`, a `ValueError` here means the council changed the rendered date format. `continue` turns that into partial data instead of a clear scraper failure. Raise with `heading_text`/`date_text` so the drift is visible immediately. Based on learnings: prefer explicit failures over silent defaults or swallowed errors when parsing council bin collection data.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py` around lines 64
- 69, When parsing the date in WyreCouncil.py, do not silently continue on
ValueError; instead raise an error that includes the context so drift is
visible. Replace the silent "continue" in the try/except around
datetime.strptime(date_text, "%A %d %B") with an exception raise that includes
heading_text and date_text (and optionally the original ValueError) so
callers/logs show which heading and raw date failed; reference the variables
date_text, collection_date, heading_text and the helper
remove_ordinal_indicator_from_date_string to locate the parsing logic and update
the except block accordingly.
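The raise-with-context version can be sketched as a helper; `parse_collection_date` is a hypothetical name taking the already-cleaned date text:

```python
from datetime import datetime

def parse_collection_date(heading_text, date_text):
    """Once a heading matched, a date that no longer fits '%A %d %B' is
    scraper drift; raise with both pieces of context instead of skipping."""
    try:
        return datetime.strptime(date_text, "%A %d %B")
    except ValueError as exc:
        raise ValueError(
            f"Unparseable date for {heading_text!r}: {date_text!r}"
        ) from exc
```

Note that `strptime` accepts `%A` without checking it against the parsed day, so this validates the format shape, not weekday consistency.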
What this changes
Midlothian's `directory_search.php?directoryID=35` endpoint is gone. The bin collection service has moved to a new MyMidlothian self-service portal at https://my.midlothian.gov.uk/service/Bin_Collection_Dates, which is a Granicus fillform rendered inside an iframe.

New flow
The old scraper's multi-step search-results -> address-link -> `ul.data-table` navigation no longer applies. The new layout:
- Switch into the `fillform-frame-1` iframe.
- Fill the `postcode` input; the `listAddress` `<select>` dropdown auto-populates via AJAX (no explicit search button).
- Selecting an address auto-fills six per-bin date fields: `dateRecycling`, `dateFood`, `dateGarden`, `dateCard`, `dateGlass`, `dateResidual`. No submit step.
- Read each date from the field's `value` attribute; they're already in `dd/mm/yyyy` format.

Bot detection note
The fillform iframe detects both headless Chrome and vanilla Selenium (presumably via `navigator.webdriver`) and refuses to populate the address dropdown. The scraper now uses `undetected_chromedriver` in non-headless mode when a `DISPLAY` environment variable is available (e.g. Xvfb on a headless server), and falls back to `create_webdriver` otherwise.

Test
Verified via the VPS wrapper against EH19 2EB #52:

Summary by CodeRabbit