Add blog post: Building a Custom PySpark Data Source with Spark's Python Data Source API by fusionet24 · Pull Request #17 · fusionet24/myyearindata

fusionet24 · 2026-02-07T15:52:30Z

Introduces the Spark 4.0 Python Data Source API through a practical example
that reads UK Bank Holidays from the gov.uk API into a Spark DataFrame.
Mirrors the style of the existing ADF bank holidays post.

https://claude.ai/code/session_01Xjvq3wun64343XnME8fvaw

Copilot

Pull request overview

Adds a new Quarto blog post introducing Spark 4.0’s Python Data Source API via a worked example that reads the UK Bank Holidays gov.uk API into a Spark DataFrame, aligning with the existing bank-holidays content on the site.

Changes:

New post explaining the Python Data Source API concepts (DataSource, DataSourceReader, registration) with examples.
Includes a full copy/pasteable Databricks notebook code block implementing the custom source.
Provides usage examples for querying/filtering the resulting DataFrame.

Copilot · 2026-02-07T15:55:37Z

+        divisions = [region] if region else data.keys()
+


Same issue as earlier: an invalid region option will raise KeyError at data[division]. Validate and raise a user-friendly error listing supported regions.

Suggested change

-        divisions = [region] if region else data.keys()
+        available_divisions = list(data.keys())
+        if region:
+            if region not in available_divisions:
+                raise ValueError(
+                    f"Invalid region '{region}'. Supported regions are: "
+                    + ", ".join(sorted(available_divisions))
+                )
+            divisions = [region]
+        else:
+            divisions = available_divisions

Copilot · 2026-02-07T15:55:37Z

+    def schema(self):
+        return StructType([
+            StructField("division", StringType()),
+            StructField("title", StringType()),
+            StructField("date", StringType()),
+            StructField("notes", StringType()),
+            StructField("bunting", BooleanType()),
+        ])


Same as earlier: the schema uses StringType for date even though it’s a date. Consider DateType and parsing to improve downstream usability and avoid extra casting in the examples.

Copilot · 2026-02-07T15:55:38Z

+                )
+```
+
+One important detail: the `requests` import is **inside** the `read` method, not at the top of the class. This is because Spark needs to serialise (pickle) the reader and send it to executors. Imports at the class level can break this.


The explanation that “Imports at the class level can break” Spark pickling is misleading: top-level imports generally don’t affect pickling/serialization of the reader. If you want to keep the import inside read(), consider rewording this to the more accurate rationale (e.g., ensuring the dependency is only required on executors / avoiding driver-only environments without requests).
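The deferred-import rationale can be illustrated without Spark. The sketch below is a minimal, hypothetical stand-in (`LazyImportReader` is not the PySpark `DataSourceReader` base class): it stores only a module name and resolves the dependency inside `read()`, so constructing the object never requires the dependency to be installed. In the post's code the equivalent is simply `import requests` as the first line of `read()`.

```python
import importlib


class LazyImportReader:
    """Hypothetical stand-in for a reader; illustrates deferred imports only."""

    def __init__(self, module_name):
        # Only the module *name* is stored, so building the reader
        # does not import (or require) the dependency.
        self.module_name = module_name

    def read(self, partition=None):
        # The dependency is resolved here, in whichever process runs read().
        module = importlib.import_module(self.module_name)
        yield (module.__name__,)


# Construction succeeds even when the dependency is absent; the failure,
# if any, surfaces only when read() is actually iterated.
reader = LazyImportReader("json")
rows = list(reader.read())
```

Under this pattern a missing dependency raises `ImportError` only where `read()` executes, which is the behaviour the reworded explanation would describe.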

Copilot · 2026-02-07T15:55:38Z

+        response = requests.get("https://www.gov.uk/bank-holidays.json")
+        response.raise_for_status()
+        data = response.json()


requests.get() is called without a timeout. If the endpoint stalls, Spark tasks can hang indefinitely and tie up executors. Pass a reasonable timeout (and optionally retries/backoff) to make the read more reliable.
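A minimal sketch of a bounded fetch with retries and exponential backoff follows. To stay dependency-free it uses the standard library's `urllib` rather than `requests`; with `requests`, the smallest fix is passing `timeout=` to `requests.get()`. The helper name `fetch_json`, its defaults, and the injectable `opener`/`sleep` parameters are assumptions made for illustration, not part of the post.

```python
import json
import time
import urllib.error
import urllib.request

BANK_HOLIDAYS_URL = "https://www.gov.uk/bank-holidays.json"


def fetch_json(url, timeout=10.0, attempts=3, backoff=0.5,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch and decode JSON, bounding every attempt with a timeout."""
    last_error = None
    for attempt in range(attempts):
        try:
            # timeout caps how long a stalled endpoint can block the task.
            with opener(url, timeout=timeout) as response:
                return json.loads(response.read().decode("utf-8"))
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: backoff, 2*backoff, 4*backoff, ...
                sleep(backoff * 2 ** attempt)
    raise RuntimeError(
        f"Failed to fetch {url} after {attempts} attempts"
    ) from last_error
```

Inside `read()`, a call such as `data = fetch_json(BANK_HOLIDAYS_URL)` would then fail with a clear error after a bounded wait instead of hanging an executor.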

Copilot · 2026-02-07T15:55:38Z

+        region = self.options.get("region", None)
+        divisions = [region] if region else data.keys()
+


If a user passes an invalid region option, data[division] will raise a KeyError with little context. Validate region against the available keys (e.g., england-and-wales/scotland/northern-ireland) and raise a clear error message listing valid values.

Suggested change

-        region = self.options.get("region", None)
-        divisions = [region] if region else data.keys()
+        available_divisions = data.keys()
+        region = self.options.get("region", None)
+        if region is not None and region not in available_divisions:
+            valid = ", ".join(sorted(available_divisions))
+            raise ValueError(
+                f"Invalid region '{region}'. Valid regions are: {valid}"
+            )
+        divisions = [region] if region else available_divisions
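The same validation can be pulled into a small standalone helper, which makes it testable outside Spark. This is a sketch only: `resolve_divisions` is a hypothetical name, and the `sample` payload assumes the response is a dictionary keyed by division with an `events` list under each key.

```python
def resolve_divisions(data, region=None):
    """Return the divisions to read, rejecting an unknown region early."""
    available = sorted(data.keys())
    if not region:
        # No region option supplied: read every division.
        return available
    if region not in available:
        raise ValueError(
            f"Invalid region '{region}'. Valid regions are: {', '.join(available)}"
        )
    return [region]


# Hypothetical payload shaped like the API response (events trimmed).
sample = {
    "england-and-wales": {"events": []},
    "scotland": {"events": []},
    "northern-ireland": {"events": []},
}

selected = resolve_divisions(sample, "scotland")
```

Raising `ValueError` with the valid values listed replaces the bare `KeyError` a user would otherwise see from `data[division]`.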

Copilot · 2026-02-07T15:55:38Z

+    def schema(self):
+        return StructType([
+            StructField("division", StringType()),
+            StructField("title", StringType()),
+            StructField("date", StringType()),
+            StructField("notes", StringType()),
+            StructField("bunting", BooleanType()),
+        ])


The schema defines date as StringType, but it represents an ISO date. Using DateType (and parsing to it) avoids repeated casts in examples and ensures correct date semantics for comparisons/joins.
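A sketch of the parsing side of that change: PySpark's `DateType` maps to Python's `datetime.date`, so the reader would yield `date` objects and the schema field would become `StructField("date", DateType())`. The helper name `to_row` and the sample event below are hypothetical; the field order follows the schema in the diff.

```python
from datetime import date


def to_row(division, event):
    """Build one row tuple, parsing the ISO date string into datetime.date."""
    return (
        division,
        event["title"],
        date.fromisoformat(event["date"]),  # "YYYY-MM-DD" -> datetime.date
        event.get("notes", ""),
        bool(event.get("bunting", False)),
    )


# Hypothetical event in the shape the reader iterates over.
sample_event = {
    "title": "Christmas Day",
    "date": "2026-12-25",
    "notes": "",
    "bunting": True,
}
row = to_row("england-and-wales", sample_event)
```

With the column typed as a date, the usage examples can compare against date values directly instead of casting the column in each query.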

Copilot · 2026-02-07T15:55:38Z

+    .count() > 0
+)
+


The bank-holiday check uses .count() > 0, which forces a full scan/count. Use an existence check pattern (e.g., limit(1) / head(1) / take(1)) to avoid unnecessary work, especially if the source grows or is slower.

Suggested change

-    .count() > 0
-)
+)
+is_bank_holiday = len(is_bank_holiday.limit(1).take(1)) > 0

Copilot · 2026-02-07T15:55:39Z

+        response = requests.get("https://www.gov.uk/bank-holidays.json")
+        response.raise_for_status()
+        data = response.json()


Same as earlier: requests.get() is called without a timeout in the full code block. Add a timeout (and optionally retries/backoff) so example code doesn’t hang indefinitely on network issues.

claude added 2 commits February 7, 2026 12:08

Add DailyDatabricks and databricks.news cross-promotion callout

c4107b2

https://claude.ai/code/session_01Xjvq3wun64343XnME8fvaw

Copilot AI review requested due to automatic review settings February 7, 2026 15:52

Copilot started reviewing on behalf of fusionet24 February 7, 2026 15:52

fusionet24 wants to merge 2 commits into main from claude/spark-datasource-api-tutorial-6Pp2Q

3 participants
