diff --git a/INSTRUMENTATION-EXTRACTION.md b/INSTRUMENTATION-EXTRACTION.md new file mode 100644 index 00000000..37378d9a --- /dev/null +++ b/INSTRUMENTATION-EXTRACTION.md @@ -0,0 +1,57 @@ +# Instrumentation extraction + +The SQL-plan **instrumentation** feature (per-operator `TimedExec` timing) has +been extracted out of this repo into a standalone project that depends on the +published `io.dataflint:dataflint-common` artifact and publishes its own +artifacts to GitHub Packages. + +The new project's seed lives at [`instrumentation-repo/`](instrumentation-repo/) +on this branch — see [`instrumentation-repo/NEW-REPO-SETUP.md`](instrumentation-repo/NEW-REPO-SETUP.md) +for how to lift it into `dataflint/spark-instrumentation`. Once that repo is +live, delete `instrumentation-repo/` from this branch. + +## Why it splits cleanly + +The wiring was already loosely coupled: + +* `DataflintSparkUICommonLoader.registerInstrumentationExtension` adds + `org.apache.spark.dataflint.DataFlintInstrumentationExtension` to + `spark.sql.extensions` **by class name** — no compile dependency on it. +* The extension's only inbound needs from common are the `INSTRUMENT_*` config + constants and `api.ResolvedStageGroup` — both **kept here**. +* `TimedExec`/`MetricsUtils`/`Accelerator`/`GraphDurationAttribution` had no + inbound references from the UI / listener / saas code. + +So at runtime users add the new instrumentation jar to the classpath; common +auto-registers it by name when an `spark.dataflint.instrument.*` flag is set. + +## Removed from this repo + +Moved to the new repo (`core` module): +* `plugin/src/main/scala/org/apache/spark/dataflint/TimedExec.scala` +* `plugin/src/main/scala/org/apache/spark/dataflint/MetricsUtils.scala` +* `plugin/src/main/scala/org/apache/spark/dataflint/Accelerator.scala` +* `plugin/src/main/scala/org/apache/spark/dataflint/GraphDurationAttribution.scala` + +Moved (version-specific extensions): +* `pluginspark3/.../DataFlintInstrumentationExtension.scala` → new `spark3` module +* `pluginspark4/.../DataFlintInstrumentationExtension.scala` → new `spark4` module + +Moved (entire `pluginspark3/src/test` — all of it was instrumentation tests): +* `TimedExecMetricsSpec`, `DataFlintCodegenExecSpec`, `DataFlintCodegenFallbackSpec`, + `DataFlintPythonExecSpec`, `DataFlintSqlNodesSpec`, `DataFlintWindowExecSpec`, + `DataFlintWriteMetricsSpec`, `GraphDurationAttributionSpec`, `GraphStageGroupSpec`, + `SqlMetricTestHelper`, and the `org/apache/spark/sql/execution/ExplicitRepartition.scala` + test helper. + +Build change: +* Removed the `pluginspark4` `Test / unmanagedSources` block that pulled in the + now-moved `DataFlintCodegenFallbackSpec`. + +## Deliberately kept here (the contract) + +* `DataflintSparkUICommonLoader`: the `INSTRUMENT_*` constants and + `registerInstrumentationExtension` (loads the extension by name). +* `api/DurationTypes.scala` (`ResolvedStageGroup`). +* The Delta Lake instrumentation listener (`DeltaLakeInstrumentationListener` & + friends) — out of scope for this split; it is compile-coupled to the loader. diff --git a/instrumentation-repo/.github/workflows/ci.yml b/instrumentation-repo/.github/workflows/ci.yml new file mode 100644 index 00000000..e6bc3665 --- /dev/null +++ b/instrumentation-repo/.github/workflows/ci.yml @@ -0,0 +1,28 @@ +name: CI + +on: + push: + branches: [ main ] + pull_request: + +jobs: + test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - uses: actions/setup-java@v4 + with: + distribution: temurin + java-version: 17 + cache: sbt + + - name: Set up sbt + uses: sbt/setup-sbt@v1 + + - name: Compile all modules + run: sbt core/compile spark3/compile spark4/compile + + # The full instrumentation regression suite runs against Spark 3.5. + - name: Test (Spark 3.x) + run: sbt spark3/test diff --git a/instrumentation-repo/.github/workflows/publish.yml b/instrumentation-repo/.github/workflows/publish.yml new file mode 100644 index 00000000..89a12a55 --- /dev/null +++ b/instrumentation-repo/.github/workflows/publish.yml @@ -0,0 +1,34 @@ +name: Publish to GitHub Packages + +on: + workflow_dispatch: + push: + tags: + - 'v*' + +permissions: + contents: read + packages: write + +jobs: + publish: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - uses: actions/setup-java@v4 + with: + distribution: temurin + java-version: 17 + cache: sbt + + - name: Set up sbt + uses: sbt/setup-sbt@v1 + + # Cross-publishes core (2.12/2.13) + spark3 (2.12/2.13) + spark4 (2.13) + # to https://maven.pkg.github.com/dataflint/spark-instrumentation + - name: Publish + run: sbt +publish + env: + GITHUB_ACTOR: ${{ github.actor }} + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} diff --git a/instrumentation-repo/.gitignore b/instrumentation-repo/.gitignore new file mode 100644 index 00000000..456b1f4f --- /dev/null +++ b/instrumentation-repo/.gitignore @@ -0,0 +1,9 @@ +target/ +project/target/ +project/project/ +.bsp/ +.idea/ +.metals/ +.bloop/ +*.class +*.log diff --git a/instrumentation-repo/LICENSE b/instrumentation-repo/LICENSE new file mode 100644 index 00000000..7a4a3ea2 --- /dev/null +++ b/instrumentation-repo/LICENSE @@ -0,0 +1,202 @@ + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. \ No newline at end of file diff --git a/instrumentation-repo/NEW-REPO-SETUP.md b/instrumentation-repo/NEW-REPO-SETUP.md new file mode 100644 index 00000000..df75cd01 --- /dev/null +++ b/instrumentation-repo/NEW-REPO-SETUP.md @@ -0,0 +1,48 @@ +# Turning this folder into the `dataflint/spark-instrumentation` repo + +This `instrumentation-repo/` directory is a **self-contained seed** for a new, +standalone repository. It is committed inside the `dataflint/spark` branch +`claude/instrumentation-extraction-2K3Ip` only so the work isn't lost — it is +**not** meant to live here permanently. Move it out into its own GitHub repo: + +## 1. Create the GitHub repo + +Create an empty repo `dataflint/spark-instrumentation` on GitHub (no README, +no .gitignore — this seed already has them). + +## 2. Extract the seed into it + +```bash +# from a checkout of dataflint/spark on the extraction branch +cp -r instrumentation-repo /tmp/spark-instrumentation +cd /tmp/spark-instrumentation + +git init -b main +git add . +git commit -m "Initial import: SQL-plan instrumentation extracted from dataflint/spark" +git remote add origin git@github.com:dataflint/spark-instrumentation.git +git push -u origin main +``` + +## 3. Wire up publishing + +* The publish workflow (`.github/workflows/publish.yml`) uses the built-in + `GITHUB_TOKEN` with `packages: write` — no extra secrets needed. +* Confirm `dataflintCommonVersion` in `build.sbt` points at a **released** + `io.dataflint:dataflint-common` on Maven Central (not a `-SNAPSHOT`). The + extraction keeps the `INSTRUMENT_*` constants + `registerInstrumentationExtension` + + `api.ResolvedStageGroup` in `dataflint-common`, so any release at or after + the extraction point works. +* Tag `v0.1.0` to cut the first release. + +## 4. Remove this seed from dataflint/spark + +Once the new repo is live, delete `instrumentation-repo/` from the +`dataflint/spark` branch in a follow-up commit — it has served its purpose as a +handoff. + +## What was removed from dataflint/spark + +See the companion `INSTRUMENTATION-EXTRACTION.md` at the root of the +`dataflint/spark` branch for the exact list of files removed and what was +deliberately kept (the loose-coupling contract). diff --git a/instrumentation-repo/README.md b/instrumentation-repo/README.md new file mode 100644 index 00000000..d57a09a1 --- /dev/null +++ b/instrumentation-repo/README.md @@ -0,0 +1,96 @@ +# dataflint-instrumentation + +SQL-plan instrumentation for [DataFlint](https://github.com/dataflint/spark), +extracted into its own repository so it can ship and version independently of +the core DataFlint plugin. + +This module injects a `SparkSessionExtensions` / `ColumnarRule` that wraps +selected physical operators with `TimedExec`, adding a per-operator `duration` +metric to Spark's SQL UI (and the DataFlint UI). It coexists with native +accelerators (Gluten/Velox, Comet, RAPIDS, Photon) by running in +`postColumnarTransitions` and only wrapping operators no accelerator claimed. + +## How it relates to dataflint/spark + +``` +┌──────────────────────────────┐ ┌────────────────────────────────────┐ +│ dataflint/spark │ │ dataflint/spark-instrumentation │ +│ io.dataflint:dataflint-common│──Maven──▶│ depends on dataflint-common │ +│ • INSTRUMENT_* config keys │ Central │ (provided) │ +│ • registerInstrumentation() │ │ • DataFlintInstrumentationExtension │ +│ loads the extension │◀─runtime─│ (spark3 + spark4) │ +│ BY CLASS NAME │ classpath│ • TimedExec / MetricsUtils │ +│ • api.ResolvedStageGroup │ only │ • Accelerator │ +└──────────────────────────────┘ │ • GraphDurationAttribution │ + └────────────────────────────────────┘ +``` + +* **Build-time:** this repo depends on the published + `io.dataflint:dataflint-common` artifact for the `INSTRUMENT_*` config + constants and `api.ResolvedStageGroup`. +* **Runtime:** there is **no** code dependency from `dataflint-common` back to + this repo. `dataflint-common`'s `registerInstrumentationExtension` adds + `org.apache.spark.dataflint.DataFlintInstrumentationExtension` to + `spark.sql.extensions` **by class name**. As long as this jar is on the + classpath and an `spark.dataflint.instrument.*` flag is enabled, the + extension is picked up automatically. + +## Modules + +| Module | Scala | Spark | Artifact | +|-------------------------------------|-------------|-------|-------------------------------------------| +| `core` | 2.12 / 2.13 | 3.5 | `dataflint-instrumentation-core` | +| `spark3` | 2.12 / 2.13 | 3.x | `dataflint-instrumentation-spark3` (fat) | +| `spark4` | 2.13 | 4.x | `dataflint-instrumentation-spark4` (fat) | + +The `spark3` / `spark4` artifacts are fat JARs that bundle `core` (Scala and +the provided deps are excluded). + +## Usage + +Add the DataFlint plugin (which provides `dataflint-common`) **and** the +matching instrumentation jar to your Spark job, then enable instrumentation: + +``` +spark.plugins io.dataflint.spark.SparkDataflintPlugin +spark.dataflint.instrument.spark.enabled true +``` + +Per-feature flags (any one enables the extension): see `INSTRUMENT_*` keys in +`DataflintSparkUICommonLoader`, e.g. +`spark.dataflint.instrument.spark.sqlNodes.enabled`, +`spark.dataflint.instrument.spark.window.enabled`, +`spark.dataflint.instrument.spark.mapInPandas.enabled`, … + +## Consuming the artifacts (GitHub Packages) + +```scala +resolvers += "dataflint-instrumentation" at + "https://maven.pkg.github.com/dataflint/spark-instrumentation" + +credentials += Credentials( + "GitHub Package Registry", + "maven.pkg.github.com", + sys.env("GITHUB_ACTOR"), + sys.env("GITHUB_TOKEN") // a PAT with read:packages +) + +libraryDependencies += "io.dataflint" %% "dataflint-instrumentation-spark3" % "0.1.0" +``` + +> GitHub Packages requires authentication even for public packages — consumers +> need a token with `read:packages`. + +## Build & test + +```bash +sbt core/compile spark3/compile spark4/compile +sbt spark3/test # full regression suite, runs against Spark 3.5 (Java 17) +sbt +publish # cross-publish to GitHub Packages (CI does this) +``` + +## Releasing + +Publishing happens via `.github/workflows/publish.yml` on a `v*` tag or manual +dispatch, using the built-in `GITHUB_TOKEN`. Bump `versionNum` in `build.sbt` +and tag `vX.Y.Z`. diff --git a/instrumentation-repo/build.sbt b/instrumentation-repo/build.sbt new file mode 100644 index 00000000..27bc091d --- /dev/null +++ b/instrumentation-repo/build.sbt @@ -0,0 +1,137 @@ +import sbtassembly.AssemblyPlugin.autoImport._ + +// --------------------------------------------------------------------------- +// dataflint-instrumentation +// +// SQL-plan instrumentation for DataFlint, extracted from dataflint/spark. +// Depends on the published `io.dataflint:dataflint-common` artifact (Maven +// Central) for the INSTRUMENT_* config contract and api.ResolvedStageGroup, +// and is wired into Spark at runtime by dataflint-common's +// `registerInstrumentationExtension` (which loads +// `org.apache.spark.dataflint.DataFlintInstrumentationExtension` by name). +// +// Like the upstream dataflint/spark build, the version-portable sources in +// `core/src/main/scala` are shared into each version module via +// `unmanagedSourceDirectories` rather than a cross-built project dependency +// (a Scala-2.13-only module can't `dependsOn` a 2.12/2.13 cross-built one). +// +// Artifacts are published to GitHub Packages: +// https://maven.pkg.github.com/dataflint/spark-instrumentation +// --------------------------------------------------------------------------- + +lazy val versionNum: String = "0.1.0" + +lazy val scala212 = "2.12.20" +lazy val scala213 = "2.13.16" +lazy val supportedScalaVersions = List(scala212, scala213) + +// Version of the upstream dataflint-common artifact that ships the INSTRUMENT_* +// constants + registerInstrumentationExtension + api.ResolvedStageGroup. +// Must be a *released* version on Maven Central (not a -SNAPSHOT). +// +// 0.9.9 is the latest release at extraction time and already carries the +// contract — but it STILL bundles TimedExec/Accelerator. Bump this to the +// first dataflint/spark release cut *after* this extraction (>= 0.9.10), which +// drops those classes, so the plugin jar and this jar don't both carry them. +lazy val dataflintCommonVersion = "0.9.9" + +ThisBuild / organization := "io.dataflint" +ThisBuild / version := versionNum + +// ---- GitHub Packages publishing ---------------------------------------------- +ThisBuild / publishMavenStyle := true +ThisBuild / licenses := Seq("APL2" -> url("http://www.apache.org/licenses/LICENSE-2.0.txt")) +ThisBuild / homepage := Some(url("https://github.com/dataflint/spark-instrumentation")) +ThisBuild / publishTo := Some( + "GitHub Packages" at "https://maven.pkg.github.com/dataflint/spark-instrumentation" +) +// In GitHub Actions, GITHUB_ACTOR + GITHUB_TOKEN are provided automatically. +// Locally, export GITHUB_ACTOR and a PAT with write:packages as GITHUB_TOKEN. +ThisBuild / credentials += Credentials( + "GitHub Package Registry", + "maven.pkg.github.com", + sys.env.getOrElse("GITHUB_ACTOR", "dataflint"), + sys.env.getOrElse("GITHUB_TOKEN", "") +) + +// Shared assembly config: publish a fat JAR containing the instrumentation +// core + the version-specific extension. Scala and the provided deps +// (Spark, dataflint-common) are excluded — they are already on the user's +// classpath via Spark and the DataFlint plugin jar. +lazy val instrumentationAssembly = Seq( + assembly / assemblyJarName := s"${name.value}_${scalaBinaryVersion.value}-${version.value}.jar", + assembly / assemblyOption := (assembly / assemblyOption).value.withIncludeScala(false), + assembly / assemblyMergeStrategy := { + case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat + case PathList("META-INF", xs @ _*) => MergeStrategy.discard + case "application.conf" => MergeStrategy.concat + case "reference.conf" => MergeStrategy.concat + case _ => MergeStrategy.first + }, + // Publish the assembled JAR instead of the thin one. + Compile / packageBin := assembly.value, + // Bundle the shared, version-portable core sources (TimedExec, MetricsUtils, + // Accelerator, GraphDurationAttribution) into this module's compile/jar. + Compile / unmanagedSourceDirectories += (LocalRootProject / baseDirectory).value / "core" / "src" / "main" / "scala" +) + +lazy val sparkTestJavaOptions = Seq( + Test / fork := true, + // Suites share a SparkSession via getOrCreate(); run sequentially so one + // suite stopping the session doesn't NPE another. + Test / parallelExecution := false, + Test / javaOptions ++= { + if (sys.props("java.specification.version").startsWith("1.")) Seq.empty + else Seq( + "--add-opens=java.base/java.lang=ALL-UNNAMED", + "--add-opens=java.base/java.nio=ALL-UNNAMED", + "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED", + "--add-opens=java.base/java.util=ALL-UNNAMED", + "--add-opens=java.base/java.io=ALL-UNNAMED" + ) + } +) + +lazy val root = (project in file(".")) + .aggregate(spark3, spark4) + .settings( + name := "dataflint-instrumentation", + crossScalaVersions := Nil, // aggregate must be Nil for cross-building + publish / skip := true + ) + +// Spark 3.x SparkSessionExtensions + ColumnarRule, plus the full regression suite. +lazy val spark3 = (project in file("spark3")) + .enablePlugins(AssemblyPlugin) + .settings(instrumentationAssembly) + .settings(sparkTestJavaOptions) + .settings( + name := "dataflint-instrumentation-spark3", + crossScalaVersions := supportedScalaVersions, + libraryDependencies ++= Seq( + "org.apache.spark" %% "spark-core" % "3.5.1" % "provided", + "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided", + "io.dataflint" %% "dataflint-common" % dataflintCommonVersion % "provided", + // tests + "org.scalatest" %% "scalatest-funsuite" % "3.2.17" % Test, + "org.scalatest" %% "scalatest-shouldmatchers" % "3.2.17" % Test, + "org.apache.spark" %% "spark-core" % "3.5.1" % Test, + "org.apache.spark" %% "spark-sql" % "3.5.1" % Test, + "io.dataflint" %% "dataflint-common" % dataflintCommonVersion % Test + ) + ) + +// Spark 4.x SparkSessionExtensions + ColumnarRule (Scala 2.13 only). +lazy val spark4 = (project in file("spark4")) + .enablePlugins(AssemblyPlugin) + .settings(instrumentationAssembly) + .settings( + name := "dataflint-instrumentation-spark4", + scalaVersion := scala213, + crossScalaVersions := List(scala213), + libraryDependencies ++= Seq( + "org.apache.spark" %% "spark-core" % "4.0.1" % "provided", + "org.apache.spark" %% "spark-sql" % "4.0.1" % "provided", + "io.dataflint" %% "dataflint-common" % dataflintCommonVersion % "provided" + ) + ) diff --git a/spark-plugin/plugin/src/main/scala/org/apache/spark/dataflint/Accelerator.scala b/instrumentation-repo/core/src/main/scala/org/apache/spark/dataflint/Accelerator.scala similarity index 100% rename from spark-plugin/plugin/src/main/scala/org/apache/spark/dataflint/Accelerator.scala rename to instrumentation-repo/core/src/main/scala/org/apache/spark/dataflint/Accelerator.scala diff --git a/spark-plugin/plugin/src/main/scala/org/apache/spark/dataflint/GraphDurationAttribution.scala b/instrumentation-repo/core/src/main/scala/org/apache/spark/dataflint/GraphDurationAttribution.scala similarity index 100% rename from spark-plugin/plugin/src/main/scala/org/apache/spark/dataflint/GraphDurationAttribution.scala rename to instrumentation-repo/core/src/main/scala/org/apache/spark/dataflint/GraphDurationAttribution.scala diff --git a/spark-plugin/plugin/src/main/scala/org/apache/spark/dataflint/MetricsUtils.scala b/instrumentation-repo/core/src/main/scala/org/apache/spark/dataflint/MetricsUtils.scala similarity index 100% rename from spark-plugin/plugin/src/main/scala/org/apache/spark/dataflint/MetricsUtils.scala rename to instrumentation-repo/core/src/main/scala/org/apache/spark/dataflint/MetricsUtils.scala diff --git a/spark-plugin/plugin/src/main/scala/org/apache/spark/dataflint/TimedExec.scala b/instrumentation-repo/core/src/main/scala/org/apache/spark/dataflint/TimedExec.scala similarity index 100% rename from spark-plugin/plugin/src/main/scala/org/apache/spark/dataflint/TimedExec.scala rename to instrumentation-repo/core/src/main/scala/org/apache/spark/dataflint/TimedExec.scala diff --git a/instrumentation-repo/project/build.properties b/instrumentation-repo/project/build.properties new file mode 100644 index 00000000..abbbce5d --- /dev/null +++ b/instrumentation-repo/project/build.properties @@ -0,0 +1 @@ +sbt.version=1.9.8 diff --git a/instrumentation-repo/project/plugins.sbt b/instrumentation-repo/project/plugins.sbt new file mode 100644 index 00000000..d83c8830 --- /dev/null +++ b/instrumentation-repo/project/plugins.sbt @@ -0,0 +1 @@ +addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5") diff --git a/spark-plugin/pluginspark3/src/main/scala/org/apache/spark/dataflint/DataFlintInstrumentationExtension.scala b/instrumentation-repo/spark3/src/main/scala/org/apache/spark/dataflint/DataFlintInstrumentationExtension.scala similarity index 100% rename from spark-plugin/pluginspark3/src/main/scala/org/apache/spark/dataflint/DataFlintInstrumentationExtension.scala rename to instrumentation-repo/spark3/src/main/scala/org/apache/spark/dataflint/DataFlintInstrumentationExtension.scala diff --git a/spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/DataFlintCodegenExecSpec.scala b/instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/DataFlintCodegenExecSpec.scala similarity index 100% rename from spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/DataFlintCodegenExecSpec.scala rename to instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/DataFlintCodegenExecSpec.scala diff --git a/spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/DataFlintCodegenFallbackSpec.scala b/instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/DataFlintCodegenFallbackSpec.scala similarity index 100% rename from spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/DataFlintCodegenFallbackSpec.scala rename to instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/DataFlintCodegenFallbackSpec.scala diff --git a/spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/DataFlintPythonExecSpec.scala b/instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/DataFlintPythonExecSpec.scala similarity index 100% rename from spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/DataFlintPythonExecSpec.scala rename to instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/DataFlintPythonExecSpec.scala diff --git a/spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/DataFlintSqlNodesSpec.scala b/instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/DataFlintSqlNodesSpec.scala similarity index 100% rename from spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/DataFlintSqlNodesSpec.scala rename to instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/DataFlintSqlNodesSpec.scala diff --git a/spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/DataFlintWindowExecSpec.scala b/instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/DataFlintWindowExecSpec.scala similarity index 100% rename from spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/DataFlintWindowExecSpec.scala rename to instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/DataFlintWindowExecSpec.scala diff --git a/spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/DataFlintWriteMetricsSpec.scala b/instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/DataFlintWriteMetricsSpec.scala similarity index 100% rename from spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/DataFlintWriteMetricsSpec.scala rename to instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/DataFlintWriteMetricsSpec.scala diff --git a/spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/GraphDurationAttributionSpec.scala b/instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/GraphDurationAttributionSpec.scala similarity index 100% rename from spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/GraphDurationAttributionSpec.scala rename to instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/GraphDurationAttributionSpec.scala diff --git a/spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/GraphStageGroupSpec.scala b/instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/GraphStageGroupSpec.scala similarity index 100% rename from spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/GraphStageGroupSpec.scala rename to instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/GraphStageGroupSpec.scala diff --git a/spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/SqlMetricTestHelper.scala b/instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/SqlMetricTestHelper.scala similarity index 100% rename from spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/SqlMetricTestHelper.scala rename to instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/SqlMetricTestHelper.scala diff --git a/spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/TimedExecMetricsSpec.scala b/instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/TimedExecMetricsSpec.scala similarity index 100% rename from spark-plugin/pluginspark3/src/test/scala/org/apache/spark/dataflint/TimedExecMetricsSpec.scala rename to instrumentation-repo/spark3/src/test/scala/org/apache/spark/dataflint/TimedExecMetricsSpec.scala diff --git a/spark-plugin/pluginspark3/src/test/scala/org/apache/spark/sql/execution/ExplicitRepartition.scala b/instrumentation-repo/spark3/src/test/scala/org/apache/spark/sql/execution/ExplicitRepartition.scala similarity index 100% rename from spark-plugin/pluginspark3/src/test/scala/org/apache/spark/sql/execution/ExplicitRepartition.scala rename to instrumentation-repo/spark3/src/test/scala/org/apache/spark/sql/execution/ExplicitRepartition.scala diff --git a/spark-plugin/pluginspark4/src/main/scala/org/apache/spark/dataflint/DataFlintInstrumentationExtension.scala b/instrumentation-repo/spark4/src/main/scala/org/apache/spark/dataflint/DataFlintInstrumentationExtension.scala similarity index 100% rename from spark-plugin/pluginspark4/src/main/scala/org/apache/spark/dataflint/DataFlintInstrumentationExtension.scala rename to instrumentation-repo/spark4/src/main/scala/org/apache/spark/dataflint/DataFlintInstrumentationExtension.scala diff --git a/spark-plugin/build.sbt b/spark-plugin/build.sbt index 5715a85a..c00f3d31 100644 --- a/spark-plugin/build.sbt +++ b/spark-plugin/build.sbt @@ -175,17 +175,9 @@ lazy val pluginspark4 = (project in file("pluginspark4")) libraryDependencies += "org.apache.spark" %% "spark-core" % "4.0.1" % Test, libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.1" % Test, - // Share version-portable test sources with pluginspark3. Most pluginspark3 specs - // depend on Spark-3-only internals (Dataset constructor, PythonMapInArrowExec, etc.) - // and don't compile against Spark 4, so we explicitly include only the suites that - // exercise version-stable surface area. + // Test sources can see the shared plugin (common) sources. + // (Instrumentation specs were moved to dataflint/spark-instrumentation.) Test / unmanagedSourceDirectories += (plugin / Compile / sourceDirectory).value / "scala", - Test / unmanagedSources ++= { - val pluginspark3Tests = (pluginspark3 / Test / sourceDirectory).value / "scala" - Seq( - pluginspark3Tests / "org" / "apache" / "spark" / "dataflint" / "DataFlintCodegenFallbackSpec.scala" - ) - }, // Fork JVM for tests; Spark on Java 9+ requires the same --add-opens as pluginspark3. Test / fork := true, diff --git a/spark-plugin/plugin/src/main/scala/org/apache/spark/dataflint/DataflintSparkUICommonLoader.scala b/spark-plugin/plugin/src/main/scala/org/apache/spark/dataflint/DataflintSparkUICommonLoader.scala index 58cbdd23..263cb171 100644 --- a/spark-plugin/plugin/src/main/scala/org/apache/spark/dataflint/DataflintSparkUICommonLoader.scala +++ b/spark-plugin/plugin/src/main/scala/org/apache/spark/dataflint/DataflintSparkUICommonLoader.scala @@ -134,6 +134,11 @@ class DataflintSparkUICommonInstaller extends Logging { object DataflintSparkUICommonLoader extends Logging { + // The SQL-plan instrumentation extension lives in a separate artifact + // (dataflint/spark-instrumentation, io.dataflint:dataflint-instrumentation-sparkN). + // It is loaded BY NAME via spark.sql.extensions, so there is no compile-time + // dependency on it here. The INSTRUMENT_* keys below are the config contract + // that artifact reads; keep them in sync across both repos. private val DATAFLINT_EXTENSION_CLASS = "org.apache.spark.dataflint.DataFlintInstrumentationExtension" val INSTRUMENT_SPARK_ENABLED = "spark.dataflint.instrument.spark.enabled" val INSTRUMENT_MAP_IN_PANDAS_ENABLED = "spark.dataflint.instrument.spark.mapInPandas.enabled"