Add blog post: Building a Custom PySpark Data Source with Spark's Python Data Source API#17
Add blog post: Building a Custom PySpark Data Source with Spark's Python Data Source API#17fusionet24 wants to merge 2 commits into
Conversation
…hon Data Source API Introduces the Spark 4.0 Python Data Source API through a practical example that reads UK Bank Holidays from the gov.uk API into a Spark DataFrame. Mirrors the style of the existing ADF bank holidays post. https://claude.ai/code/session_01Xjvq3wun64343XnME8fvaw
There was a problem hiding this comment.
Pull request overview
Adds a new Quarto blog post introducing Spark 4.0’s Python Data Source API via a worked example that reads the UK Bank Holidays gov.uk API into a Spark DataFrame, aligning with the existing bank-holidays content on the site.
Changes:
- New post explaining the Python Data Source API concepts (
DataSource,DataSourceReader, registration) with examples. - Includes a full copy/pasteable Databricks notebook code block implementing the custom source.
- Provides usage examples for querying/filtering the resulting DataFrame.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| divisions = [region] if region else data.keys() | ||
|
|
There was a problem hiding this comment.
Same issue as earlier: an invalid region option will raise KeyError at data[division]. Validate and raise a user-friendly error listing supported regions.
| divisions = [region] if region else data.keys() | |
| available_divisions = list(data.keys()) | |
| if region: | |
| if region not in available_divisions: | |
| raise ValueError( | |
| f"Invalid region '{region}'. Supported regions are: " | |
| + ", ".join(sorted(available_divisions)) | |
| ) | |
| divisions = [region] | |
| else: | |
| divisions = available_divisions |
| def schema(self): | ||
| return StructType([ | ||
| StructField("division", StringType()), | ||
| StructField("title", StringType()), | ||
| StructField("date", StringType()), | ||
| StructField("notes", StringType()), | ||
| StructField("bunting", BooleanType()), | ||
| ]) |
There was a problem hiding this comment.
Same as earlier: the schema uses StringType for date even though it’s a date. Consider DateType and parsing to improve downstream usability and avoid extra casting in the examples.
| ) | ||
| ``` | ||
|
|
||
| One important detail: the `requests` import is **inside** the `read` method, not at the top of the class. This is because Spark needs to serialise (pickle) the reader and send it to executors. Imports at the class level can break this. |
There was a problem hiding this comment.
The explanation that “Imports at the class level can break” Spark pickling is misleading: top-level imports generally don’t affect pickling/serialization of the reader. If you want to keep the import inside read(), consider rewording this to the more accurate rationale (e.g., ensuring the dependency is only required on executors / avoiding driver-only environments without requests).
| response = requests.get("https://www.gov.uk/bank-holidays.json") | ||
| response.raise_for_status() | ||
| data = response.json() |
There was a problem hiding this comment.
requests.get() is called without a timeout. If the endpoint stalls, Spark tasks can hang indefinitely and tie up executors. Pass a reasonable timeout (and optionally retries/backoff) to make the read more reliable.
| region = self.options.get("region", None) | ||
| divisions = [region] if region else data.keys() | ||
|
|
There was a problem hiding this comment.
If a user passes an invalid region option, data[division] will raise a KeyError with little context. Validate region against the available keys (e.g., england-and-wales/scotland/northern-ireland) and raise a clear error message listing valid values.
| region = self.options.get("region", None) | |
| divisions = [region] if region else data.keys() | |
| available_divisions = data.keys() | |
| region = self.options.get("region", None) | |
| if region is not None and region not in available_divisions: | |
| valid = ", ".join(sorted(available_divisions)) | |
| raise ValueError( | |
| f"Invalid region '{region}'. Valid regions are: {valid}" | |
| ) | |
| divisions = [region] if region else available_divisions |
| def schema(self): | ||
| return StructType([ | ||
| StructField("division", StringType()), | ||
| StructField("title", StringType()), | ||
| StructField("date", StringType()), | ||
| StructField("notes", StringType()), | ||
| StructField("bunting", BooleanType()), | ||
| ]) |
There was a problem hiding this comment.
The schema defines date as StringType, but it represents an ISO date. Using DateType (and parsing to it) avoids repeated casts in examples and ensures correct date semantics for comparisons/joins.
| .count() > 0 | ||
| ) | ||
|
|
There was a problem hiding this comment.
The bank-holiday check uses .count() > 0, which forces a full scan/count. Use an existence check pattern (e.g., limit(1) / head(1) / take(1)) to avoid unnecessary work, especially if the source grows or is slower.
| .count() > 0 | |
| ) | |
| ) | |
| is_bank_holiday = len(is_bank_holiday.limit(1).take(1)) > 0 |
| response = requests.get("https://www.gov.uk/bank-holidays.json") | ||
| response.raise_for_status() | ||
| data = response.json() |
There was a problem hiding this comment.
Same as earlier: requests.get() is called without a timeout in the full code block. Add a timeout (and optionally retries/backoff) so example code doesn’t hang indefinitely on network issues.
Introduces the Spark 4.0 Python Data Source API through a practical example
that reads UK Bank Holidays from the gov.uk API into a Spark DataFrame.
Mirrors the style of the existing ADF bank holidays post.
https://claude.ai/code/session_01Xjvq3wun64343XnME8fvaw