feat(tutorials): Document Prefect → Databricks SQL Warehouse integration (incl. prefect-databricks install) #480

@stefanko-ch

Summary

Document the Prefect → Databricks SQL Warehouse integration and close the one real infra gap: Prefect's default container does not bundle prefect-databricks, so the integration is not plug-and-play the way Kestra's is.

Parallel issue to #478 (the same integration for Kestra). The Databricks-side story is identical (host, warehouse HTTP path, PAT from the existing nexus secret scope). The Nexus-side story differs in four points.

Differences from the Kestra issue

| Dimension | Kestra (#478) | Prefect (this issue) |
| --- | --- | --- |
| Plugin install | Bundled in kestra/kestra:latest | Not bundled. prefect-databricks must be installed separately. |
| Flow definition | YAML, declarative | Python, imperative |
| Secret management | Flow env vars (.env via deploy.sh) | Prefect Blocks (persisted in Prefect DB) |
| Retry/cold-start | Flow-level YAML | Task-decorator @task(retries=..., retry_delay_seconds=...) |

What's missing

1. prefect-databricks in the Prefect worker environment

stacks/prefect/docker-compose.yml runs the official prefecthq/prefect:* image, which ships only Prefect core. The Databricks collection needs to land there for any Databricks flow to work.

Three realistic options:

  • (a) Custom Dockerfile: FROM prefecthq/prefect:3-latest-python3.12 with RUN pip install prefect-databricks. Same pattern used for Soda Core today. Recommended: clean, versioned, and reviewable in the repo. Image published under IMAGE_PREFECT or a new IMAGE_PREFECT_WORKER tag.
  • (b) Runtime pip install in a worker-pool init hook — flexible but slower on cold starts and brittle across restarts.
  • (c) User installs it in their own deployment's work pool — works but defeats the "batteries-included" Nexus-Stack promise for classroom use.
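
Whichever option is chosen, the worker environment can be smoke-checked for the collection before any flow runs. A minimal sketch using only the standard library; the helper names `collection_installed` and `installed_version` are illustrative and not part of Prefect or of this repo:

```python
from importlib import metadata, util
from typing import Optional


def collection_installed(module_name: str = "prefect_databricks") -> bool:
    """Return True when the module can be imported in this environment."""
    return util.find_spec(module_name) is not None


def installed_version(dist_name: str = "prefect-databricks") -> Optional[str]:
    """Return the installed distribution version, or None when it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    if collection_installed():
        print(f"prefect-databricks {installed_version()} is available")
    else:
        print("prefect-databricks is missing from this image")
```

Running this inside the worker container (for example as a build-time or post-deploy check) would surface a missing install early instead of at the first student flow run.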

2. A DatabricksCredentials Block auto-provisioned

Prefect stores credentials in its internal "Blocks" registry, not as env vars. Ideal: after spin-up, scripts/deploy.sh calls Prefect's API to seed a DatabricksCredentials block with host + PAT from the same KV/Infisical source that Kestra already uses. Students then call DatabricksCredentials.load("nexus-databricks") from their flows, with no token handling in flow code.

Gated on the Databricks Integrations config being saved; if it's not, skip silently like other optional bootstrap steps.
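
The seeding step could look roughly like the sketch below, which deploy.sh would run inside the Prefect container once the server is reachable. The environment variable names `DATABRICKS_HOST` and `DATABRICKS_TOKEN` and the function name are assumptions for illustration; the real source would be the KV/Infisical values. It also assumes the Prefect API URL is already configured in that environment.

```python
import os


def seed_databricks_block(block_name: str = "nexus-databricks") -> bool:
    """Create or update the block; return False when Databricks is not configured."""
    host = os.environ.get("DATABRICKS_HOST", "").strip()    # hypothetical variable name
    token = os.environ.get("DATABRICKS_TOKEN", "").strip()  # hypothetical variable name
    if not host or not token:
        return False  # optional bootstrap step: skip silently

    # Imported lazily so the skip path works even without the collection installed.
    from prefect_databricks import DatabricksCredentials

    # databricks_instance is the bare hostname, so drop any scheme and trailing slash.
    instance = host.removeprefix("https://").rstrip("/")
    DatabricksCredentials(databricks_instance=instance, token=token).save(
        block_name, overwrite=True
    )
    return True


if __name__ == "__main__":
    seeded = seed_databricks_block()
    print("seeded" if seeded else "skipped: Databricks not configured")
```

Using `overwrite=True` keeps the step idempotent, so re-running deploy.sh after a token rotation updates the block instead of failing on a name clash.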

3. Tutorial under docs/tutorials/prefect/databricks-warehouse.md

Analogous to the Kestra one (see #478 for structure). Prefect-specific content:

from prefect import flow, task
from prefect_databricks import DatabricksCredentials
from prefect_databricks.queries import DatabricksSqlQuery

@task(retries=5, retry_delay_seconds=30)   # handles Free-Edition cold-start
def query_warehouse(creds, warehouse_id):
    return DatabricksSqlQuery(
        databricks_credentials=creds,
        warehouse_id=warehouse_id,
        query="SELECT current_timestamp(), current_catalog()",
    )

@flow
def classroom_flow():
    creds = DatabricksCredentials.load("nexus-databricks")
    result = query_warehouse(creds, warehouse_id="abc123def456")
    print(result)

if __name__ == "__main__":
    classroom_flow()

Free Edition notes are identical to the Kestra ones (cold-start, 25 GB quota, no Jobs-API deep work).

4. Warehouse HTTP path handling

Prefect's DatabricksSqlQuery takes warehouse_id directly (not httpPath); the collection constructs the HTTP path internally. Easier than Kestra — warehouse_id is the last path segment of the HTTP path, so the same HTTP_PATH secret landed by #478 is sufficient (extract the ID server-side or just store both).
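
If only the HTTP path is stored, extracting the ID is a one-liner. A minimal sketch, assuming the usual SQL warehouse path shape `/sql/1.0/warehouses/<id>`; the function name is illustrative:

```python
def warehouse_id_from_http_path(http_path: str) -> str:
    """Return the warehouse ID, i.e. the last segment of a SQL warehouse HTTP path."""
    segments = [s for s in http_path.strip().split("/") if s]
    # Guard against cluster-style paths or an empty secret.
    if len(segments) < 2 or segments[-2] != "warehouses":
        raise ValueError(f"not a SQL warehouse HTTP path: {http_path!r}")
    return segments[-1]


if __name__ == "__main__":
    print(warehouse_id_from_http_path("/sql/1.0/warehouses/abc123def456"))
```

Failing loudly on an unexpected path shape is a deliberate choice here: a silently wrong warehouse_id would otherwise only show up as a confusing query error in a student's flow.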

Relation to #478

Issues should be kept separate because:

  • Different installation mechanics (bundled vs. pip-install).
  • Different credential-model (env vs. Block).
  • Different tutorial target audience (YAML-first vs. Python-first learners).

But both deserve consistent wording on:

  • The Databricks side (host, warehouse, PAT, where they come from).
  • The Free Edition caveats (cold-start retry, quota, no Jobs-API deep-dive).

Once both land, the Kestra and Prefect tutorials should cross-link: "prefer Kestra for YAML-first declarative DE; prefer Prefect for Python-first imperative DE. Both point at the same warehouse with the same credentials."

Acceptance criteria

  • Custom Prefect image with prefect-databricks published under a new tag / env var.
  • scripts/deploy.sh auto-provisions a DatabricksCredentials block when Databricks Integrations is configured.
  • Tutorial at docs/tutorials/prefect/databricks-warehouse.md, linked from docs/tutorials/index.md and docs/tutorials/prefect/index.md (creating the Prefect category landing page if not yet present).
  • Student on a fresh spin-up with Databricks configured can paste the Python flow, save, run, and see query results — no pip install step, no manual credential setup.
  • Cold-start scenario covered via @task(retries=5, retry_delay_seconds=30) in the sample.
