Add blog post: Building a Custom PySpark Data Source with Spark's Python Data Source API #17

Open
fusionet24 wants to merge 2 commits into main from claude/spark-datasource-api-tutorial-6Pp2Q

@fusionet24
Owner

Introduces the Spark 4.0 Python Data Source API through a practical example
that reads UK Bank Holidays from the gov.uk API into a Spark DataFrame.
Mirrors the style of the existing ADF bank holidays post.

https://claude.ai/code/session_01Xjvq3wun64343XnME8fvaw

Copilot AI review requested due to automatic review settings February 7, 2026 15:52
Contributor

Copilot AI left a comment

Pull request overview

Adds a new Quarto blog post introducing Spark 4.0’s Python Data Source API via a worked example that reads the UK Bank Holidays gov.uk API into a Spark DataFrame, aligning with the existing bank-holidays content on the site.

Changes:

  • New post explaining the Python Data Source API concepts (DataSource, DataSourceReader, registration) with examples.
  • Includes a full copy/pasteable Databricks notebook code block implementing the custom source.
  • Provides usage examples for querying/filtering the resulting DataFrame.

Comment on lines +233 to +234
divisions = [region] if region else data.keys()

Copilot AI Feb 7, 2026

Same issue as earlier: an invalid region option will raise KeyError at data[division]. Validate and raise a user-friendly error listing supported regions.

Suggested change
- divisions = [region] if region else data.keys()
+ available_divisions = list(data.keys())
+ if region:
+     if region not in available_divisions:
+         raise ValueError(
+             f"Invalid region '{region}'. Supported regions are: "
+             + ", ".join(sorted(available_divisions))
+         )
+     divisions = [region]
+ else:
+     divisions = available_divisions

Comment on lines +207 to +214
def schema(self):
    return StructType([
        StructField("division", StringType()),
        StructField("title", StringType()),
        StructField("date", StringType()),
        StructField("notes", StringType()),
        StructField("bunting", BooleanType()),
    ])

Copilot AI Feb 7, 2026

Same as earlier: the schema uses StringType for date even though it’s a date. Consider DateType and parsing to improve downstream usability and avoid extra casting in the examples.

One important detail: the `requests` import is **inside** the `read` method, not at the top of the class. This is because Spark needs to serialise (pickle) the reader and send it to executors. Imports at the class level can break this.

Copilot AI Feb 7, 2026

The explanation that “Imports at the class level can break” Spark pickling is misleading: top-level imports generally don’t affect pickling/serialization of the reader. If you want to keep the import inside read(), consider rewording this to the more accurate rationale (e.g., ensuring the dependency is only required on executors / avoiding driver-only environments without requests).

Comment on lines +90 to +92
response = requests.get("https://www.gov.uk/bank-holidays.json")
response.raise_for_status()
data = response.json()

Copilot AI Feb 7, 2026

requests.get() is called without a timeout. If the endpoint stalls, Spark tasks can hang indefinitely and tie up executors. Pass a reasonable timeout (and optionally retries/backoff) to make the read more reliable.
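
As a rough illustration of this suggestion, the fetch could pass an explicit timeout and retry a few times with backoff. This is a minimal sketch, not the post's code: `fetch_bank_holidays`, its parameters, and the injectable `session` argument are hypothetical names introduced here, and the timeout and retry values are arbitrary placeholders.

```python
import time

BANK_HOLIDAYS_URL = "https://www.gov.uk/bank-holidays.json"


def fetch_bank_holidays(session=None, timeout=(5, 30), attempts=3, backoff=1.0):
    """Fetch the bank holidays JSON with a timeout and simple retries.

    All names and default values here are illustrative assumptions.
    """
    if session is None:
        # Imported lazily, mirroring the post's import-inside-read pattern.
        import requests
        session = requests.Session()

    for attempt in range(1, attempts + 1):
        try:
            # timeout=(connect, read) in seconds; without it the call can block.
            response = session.get(BANK_HOLIDAYS_URL, timeout=timeout)
            response.raise_for_status()
            return response.json()
        except OSError:
            # requests' exceptions derive from IOError/OSError, so this covers
            # timeouts, connection errors, and HTTP error statuses.
            if attempt == attempts:
                raise
            # Exponential backoff: backoff, 2*backoff, 4*backoff, ...
            time.sleep(backoff * 2 ** (attempt - 1))
```

Retrying on every `OSError` also retries client errors such as 404, which a real reader would probably want to fail fast on; the loop is kept deliberately simple here.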

Comment on lines +94 to +96
region = self.options.get("region", None)
divisions = [region] if region else data.keys()

Copilot AI Feb 7, 2026

If a user passes an invalid region option, data[division] will raise a KeyError with little context. Validate region against the available keys (e.g., england-and-wales/scotland/northern-ireland) and raise a clear error message listing valid values.

Suggested change
- region = self.options.get("region", None)
- divisions = [region] if region else data.keys()
+ available_divisions = data.keys()
+ region = self.options.get("region", None)
+ if region is not None and region not in available_divisions:
+     valid = ", ".join(sorted(available_divisions))
+     raise ValueError(
+         f"Invalid region '{region}'. Valid regions are: {valid}"
+     )
+ divisions = [region] if region else available_divisions

Comment on lines +62 to +69
def schema(self):
    return StructType([
        StructField("division", StringType()),
        StructField("title", StringType()),
        StructField("date", StringType()),
        StructField("notes", StringType()),
        StructField("bunting", BooleanType()),
    ])

Copilot AI Feb 7, 2026

The schema defines date as StringType, but it represents an ISO date. Using DateType (and parsing to it) avoids repeated casts in examples and ensures correct date semantics for comparisons/joins.
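
One way to act on this, sketched under assumptions: parse the ISO string into a `datetime.date` while building each row, and declare the column as `DateType` in the schema. The helper names `holiday_schema` and `iter_rows` are hypothetical, and the payload shape (`data[division]["events"]`, with each event carrying `title`, `date`, `notes`, and `bunting`) is assumed from the schema's field names rather than shown in this diff.

```python
from datetime import date


def holiday_schema():
    """Schema with `date` as DateType; pyspark is imported lazily."""
    from pyspark.sql.types import (
        BooleanType, DateType, StringType, StructField, StructType,
    )
    return StructType([
        StructField("division", StringType()),
        StructField("title", StringType()),
        StructField("date", DateType()),
        StructField("notes", StringType()),
        StructField("bunting", BooleanType()),
    ])


def iter_rows(data, divisions):
    """Yield one tuple per event, with the date parsed from ISO format."""
    for division in divisions:
        # Assumed shape: each division maps to a dict with an "events" list.
        for event in data[division]["events"]:
            yield (
                division,
                event["title"],
                date.fromisoformat(event["date"]),  # ISO string -> datetime.date
                event.get("notes", ""),
                event.get("bunting"),
            )
```

With the column typed as a date, the later examples could compare against date values directly instead of casting the string column first.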

Comment on lines +173 to +175
.count() > 0
)

Copilot AI Feb 7, 2026

The bank-holiday check uses .count() > 0, which forces a full scan/count. Use an existence check pattern (e.g., limit(1) / head(1) / take(1)) to avoid unnecessary work, especially if the source grows or is slower.

Suggested change
-     .count() > 0
- )
+ )
+ is_bank_holiday = len(is_bank_holiday.limit(1).take(1)) > 0
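
For comparison, the existence check can be written without the intermediate `limit(1)`, since `take(1)` already fetches at most one row. A minimal sketch: `has_rows` is a hypothetical helper, and `df` is assumed to be any PySpark DataFrame, such as the filtered bank-holiday frame from the post.

```python
def has_rows(df):
    """Return True if the DataFrame has at least one row.

    take(1) fetches at most one row, so Spark does not need to count
    every matching row the way `.count() > 0` does.
    """
    return len(df.take(1)) > 0
```

On Spark 3.3 and later, `not df.isEmpty()` expresses the same check.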

Comment on lines +228 to +230
response = requests.get("https://www.gov.uk/bank-holidays.json")
response.raise_for_status()
data = response.json()

Copilot AI Feb 7, 2026

Same as earlier: requests.get() is called without a timeout in the full code block. Add a timeout (and optionally retries/backoff) so example code doesn’t hang indefinitely on network issues.
