fix: SouthStaffordshireDistrictCouncil - use objectId query param and new parse #1957

Open
InertiaUK wants to merge 7 commits into robbrad:master from InertiaUK:fix/sstaffs-objectid

Conversation

@InertiaUK

What this changes

This is a sneaky regression. The scraper uses the URL where-i-live?uprn=<UPRN>, which looks reasonable and returns a 200 OK with a div#showCollectionDates — but the response is always the fallback message:

Due to access restrictions, the property selected is serviced by our van collection. For more information regarding your collection dates, please contact us.

The existing scraper correctly detects this fallback and returns an empty bins list, so every address looks like a van-collection property.

Root cause

The ?uprn= query parameter is a deprecated placeholder. The real calendar is now served by where-i-live?objectId=<UPRN> — same UPRN value, different query key. I only found this by tracing the AJAX flow on the public form at /viewyourcollectioncalendar, which submits to ajax_form=1 with the postcode and returns an address dropdown. Picking an address submits the form, which redirects to where-i-live?objectId=<UPRN> — and that page contains the real bin data.
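A minimal sketch of the difference, for anyone copying the URL pattern. The host in `BASE_URL` is a placeholder (this description only names the `where-i-live` path and the two query keys), and both function names are hypothetical rather than taken from the scraper.

```python
from urllib.parse import urlencode

# Placeholder host: only the path and query keys come from this PR.
BASE_URL = "https://council.example/where-i-live"

# Fragment of the fallback message quoted above.
VAN_FALLBACK_MARKER = "serviced by our van collection"


def calendar_url(uprn: str) -> str:
    """Build the working URL: same UPRN value, sent under the objectId key."""
    return f"{BASE_URL}?{urlencode({'objectId': uprn})}"


def is_van_fallback(page_text: str) -> bool:
    """True when the page only carries the van-collection fallback message."""
    return VAN_FALLBACK_MARKER in page_text
```

Checking for the fallback text is still needed after the switch, because genuinely van-serviced properties return the same message under `objectId=`.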

What the new scraper does

  • Does its own requests.get with ?objectId=<UPRN> and a browser User-Agent — the framework's pre-fetch on the old URL can't be re-used because the query key is wrong.
  • Parses the new two-part result layout:
    • p.collection-date + p.collection-type for the "next collection" summary card at the top of div#showCollectionDates.
    • table.leisure-table underneath for subsequent collections.
  • Handles the new date format "Tuesday, 14 April 2026" (comma-separated, not "Tue 14 April 2026").
  • Splits composite bin types like "Recycling & Garden waste" into separate Recycling and Garden waste entries, since those are physically different bins.
  • De-dupes the overlap between the summary card and the first row of the table.
  • Keeps the van-collection short-circuit for properties that really are van-serviced.
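The date parsing, type splitting and de-duplication steps above can be sketched as below. This is an illustration under assumptions, not the code in the PR: it takes already-extracted `(date text, type text)` pairs, leaves out the HTML selection entirely, and all three function names are hypothetical.

```python
from datetime import datetime


def parse_collection_date(text: str) -> datetime:
    # "Tuesday, 14 April 2026" -> datetime(2026, 4, 14).
    # %A and %B are locale-dependent; an English locale is assumed.
    return datetime.strptime(text.strip(), "%A, %d %B %Y")


def split_bin_types(text: str) -> list[str]:
    # "Recycling & Garden waste" -> ["Recycling", "Garden waste"]
    return [part.strip() for part in text.split("&") if part.strip()]


def build_bins(rows):
    """rows: (date_text, type_text) pairs, summary card first, then table rows."""
    seen = set()
    bins = []
    for date_text, type_text in rows:
        date = parse_collection_date(date_text).strftime("%d/%m/%Y")
        for bin_type in split_bin_types(type_text):
            key = (bin_type, date)
            if key in seen:
                # The summary card repeats the first table row; keep one copy.
                continue
            seen.add(key)
            bins.append({"type": bin_type, "collectionDate": date})
    return bins
```

Keying the de-dupe on `(type, date)` rather than on row position means it does not matter whether the summary card or the table is read first.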

Test fixture update

The previous fixture UPRN 200004523954 is a genuine van-collection property (confirmed by hitting the new objectId= URL), so the scraper would still return an empty bins list for it even with this fix. The fixture now uses 100031802117 (1 Kempson Road, Penkridge ST19 5BG), a real residential property with a full fortnightly rotation. The url field is updated to the new query param, and the wiki_note explains the ?uprn= vs ?objectId= gotcha for anyone copying the URL pattern.

Test

Verified via collect_data against the new fixture:

{
  "bins": [
    { "type": "General waste", "collectionDate": "14/04/2026" },
    { "type": "Recycling",     "collectionDate": "21/04/2026" },
    { "type": "Garden waste",  "collectionDate": "21/04/2026" },
    { "type": "General waste", "collectionDate": "28/04/2026" },
    { "type": "Recycling",     "collectionDate": "05/05/2026" },
    { "type": "Garden waste",  "collectionDate": "05/05/2026" },
    { "type": "General waste", "collectionDate": "12/05/2026" }
  ]
}

Torridge's SOAP API changed its response from explicit dates ("Mon 14 Apr")
to relative phrases ("Tomorrow then every Mon", "Today then every Tue")
with an embedded calendar table. The old regex-based parser returned empty
because it expected the old format.

The rewritten parser handles Today/Tomorrow/weekday-name phrases, falls back
to the old explicit format if it reappears, and gracefully skips "No X
collection for this address" entries.
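
A rough sketch of how the relative phrases might be resolved to a date. It is not the parser from this commit: the function name is hypothetical, the old explicit-date fallback is left out, and the assumption that a weekday-name phrase starts with a three-letter day (for example "Fri then every Fri") is a guess at the format.

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def next_collection(phrase: str, today: date):
    """Resolve a relative phrase to a date, or None if there is no collection."""
    text = phrase.strip().lower()
    if text.startswith("no "):
        # "No X collection for this address"
        return None
    if text.startswith("today"):
        return today
    if text.startswith("tomorrow"):
        return today + timedelta(days=1)
    match = re.match(r"(mon|tue|wed|thu|fri|sat|sun)", text)
    if match:
        # Assumed behaviour: a leading weekday means its next occurrence.
        target = WEEKDAYS.index(match.group(1))
        return today + timedelta(days=(target - today.weekday()) % 7)
    return None
```

Passing `today` in explicitly, instead of calling `date.today()` inside the function, keeps the phrase handling testable.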

The Wyre bin collection page now includes a 'Download calendar' box
among the .boxed divs. This box has no h3.bin-collection-tasks__heading,
which caused the scraper to crash on .text access.

Changes:
- Skip boxes missing the heading or content container
- Use regex to extract the bin name from 'Your next X collection'
- Collapse whitespace on the date text before strptime (p tags produced
  multi-whitespace runs after .text)
- Roll year forward only when the computed date is in the past, instead
  of only handling the December->January edge case
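
The whitespace collapse and year roll-forward can be illustrated as follows. The "Weekday D Month" input format and the function name are assumptions for the sketch; the commit does not state the exact date text the page produces.

```python
import re
from datetime import date, datetime


def parse_day_month(raw: str, today: date, fmt: str = "%A %d %B") -> date:
    """Parse a year-less date and roll it into next year if already past."""
    # .text on nested <p> tags leaves newlines and runs of spaces behind.
    cleaned = re.sub(r"\s+", " ", raw).strip()
    # Append the current year before parsing so strptime gets a full date.
    parsed = datetime.strptime(f"{cleaned} {today.year}", f"{fmt} %Y").date()
    if parsed < today:
        # Covers any past date, not only the December -> January case.
        # (29 February would need extra handling here.)
        parsed = parsed.replace(year=today.year + 1)
    return parsed
```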
…e results

The checkyourbinday page is now:
- Gated by Cloudflare Turnstile (requires non-headless UC)
- Using WASTECOLLECTIONCALENDARV7_* element IDs (was V5)
- Rendering collection dates inline as .gi-summary-blocklist__row
  divs after address selection — no separate NEXT/submit step

Changes:
- Update all element IDs from V5 to V7
- Add Cloudflare challenge wait loop (up to 50s)
- Dismiss cookie consent before interaction (was blocking button clicks)
- Replace old table-based parsing with .gi-summary-blocklist__row scraping
- Use Select.select_by_visible_text with stale-retry instead of manual
  option.click() loop (which crashed on AJAX re-renders)
- Remove the smart_select_address helper's dead fuzzy/strict split
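
The stale-retry idea can be shown with a small generic helper. This is a sketch, not the code in the commit: `retry` is a hypothetical name, and it is written without Selenium imports so it runs on its own. In the scraper the retried exception would be Selenium's `StaleElementReferenceException`.

```python
import time


def retry(action, retry_on, attempts=3, delay=0.5):
    """Call action(); if it raises retry_on, wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the original error.
            time.sleep(delay)


# Assumed usage with Selenium (driver, dropdown_id and address are placeholders):
#
#   from selenium.common.exceptions import StaleElementReferenceException
#   from selenium.webdriver.common.by import By
#   from selenium.webdriver.support.ui import Select
#
#   retry(
#       lambda: Select(driver.find_element(By.ID, dropdown_id))
#       .select_by_visible_text(address),
#       StaleElementReferenceException,
#   )
```

Re-locating the element inside the lambda matters: after an AJAX re-render the old element reference stays stale, so each attempt has to look the dropdown up again.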
…anner

Stockton's AchieveForms form name changed from LOOKUPBINDATESBYADDRESSSKIPOUTOFREGION
to LOOKUPBINDATESBYADDRESSSKIPOUTOFREGIONV2, so every element ID needed
the V2 suffix added. Also added a cookie-banner dismissal step — the
banner was covering the search button and intercepting clicks.

The MyWestSuffolk.aspx IIS endpoint returns a 404 page to requests
without a User-Agent header. Adding a realistic Chrome UA restores the
full response (57KB), letting the existing parser pick up the bin
collection panel correctly.
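
A minimal illustration of attaching a browser User-Agent, using the standard library so it runs without extra packages. The host, the UA string and the function name are placeholders, not values from this commit, and the actual scraper may set the header through a different HTTP client.

```python
from urllib.request import Request

# Example Chrome-style UA string; any realistic browser UA should do.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def build_request(url: str) -> Request:
    """Return a GET request that carries a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": CHROME_UA})


# Placeholder host: only the MyWestSuffolk.aspx page name comes from the commit.
req = build_request("https://council.example/MyWestSuffolk.aspx")
# urllib.request.urlopen(req) would then send the header with the request.
```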

The directory_search.php postcode endpoint the old scraper used is gone;
Midlothian migrated to a MyMidlothian Granicus fillform iframe at
my.midlothian.gov.uk. Rewrote the scraper to:

- Load the Bin_Collection_Dates service page
- Switch into the fillform-frame-1 iframe
- Fill the postcode field (dropdown auto-populates on change)
- Select the matching address — this auto-fills six per-bin date fields
  (dateRecycling, dateFood, dateGarden, dateCard, dateGlass, dateResidual)
- Read the dates directly from those fields, parse them as dd/mm/yyyy
- No submit button needed

The fillform iframe detects headless Chrome and vanilla Selenium and
refuses to populate the dropdown, so when a DISPLAY is available the
scraper now uses undetected_chromedriver in non-headless mode.
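
Once the six field values have been read out of the form, turning them into bin entries is plain date parsing. The sketch below assumes the values arrive as a dict keyed by the field names above; the display labels and the function name are hypothetical, and the Selenium steps are left out.

```python
from datetime import datetime

# Field names come from the form; the labels are illustrative.
FIELD_LABELS = {
    "dateRecycling": "Recycling",
    "dateFood": "Food",
    "dateGarden": "Garden",
    "dateCard": "Card",
    "dateGlass": "Glass",
    "dateResidual": "Residual",
}


def bins_from_fields(values: dict) -> list:
    """Convert {field name: 'dd/mm/yyyy'} into date-sorted bin entries."""
    dated = []
    for field, label in FIELD_LABELS.items():
        raw = (values.get(field) or "").strip()
        if not raw:
            continue  # Field left empty for this address.
        try:
            when = datetime.strptime(raw, "%d/%m/%Y")
        except ValueError:
            continue  # Skip anything that is not a dd/mm/yyyy date.
        dated.append((when, label))
    dated.sort(key=lambda item: item[0])
    return [
        {"type": label, "collectionDate": when.strftime("%d/%m/%Y")}
        for when, label in dated
    ]
```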
…ew parse

The old where-i-live?uprn= query parameter is a placeholder that always
returns the 'van collection' fallback message. The real bin calendar is
served by where-i-live?objectId= (same UPRN value, different query key).

This only became visible after tracing the ajax_form=1 postcode lookup
flow on /viewyourcollectioncalendar — the form POST ultimately redirects
to /where-i-live?objectId=<UPRN>.

Changes:
- Scraper now does its own GET with objectId rather than relying on the
  framework's pre-fetched page (which used the wrong query key).
- Parses the new result structure: a 'next collection' summary card
  (p.collection-date / p.collection-type) plus a subsequent-collections
  leisure-table with 'Day, D Month YYYY' date strings.
- Splits composite bin types like 'Recycling & Garden waste' into
  separate entries for each underlying bin.
- De-dupes the summary/table overlap.

Test fixture in input.json needs updating to a real residential UPRN
(e.g. 100031802117 = 1 Kempson Road, ST19 5BG) — the previous test UPRN
200004523954 is a genuine van-collection property.
@coderabbitai
Contributor

coderabbitai bot commented Apr 12, 2026

Warning

Rate limit exceeded

@InertiaUK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 27 minutes and 46 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7d0d19f5-7b07-45ea-909b-f8cff4d08133

📥 Commits

Reviewing files that changed from the base of the PR and between 60bd3cc and 819ea4a.

📒 Files selected for processing (8)
  • uk_bin_collection/tests/input.json
  • uk_bin_collection/uk_bin_collection/councils/ChichesterDistrictCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/MidlothianCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/SouthStaffordshireDistrictCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/StocktonOnTeesCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/TorridgeDistrictCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/WyreCouncil.py
