
Add Spark-based data generation for K8s (replaces MapReduce) #248

Open
wjxiz1992 wants to merge 7 commits into NVIDIA:dev from wjxiz1992:data-gen-k8s

Conversation

wjxiz1992 (Collaborator) commented Feb 13, 2026

Summary

This PR adds Kubernetes-native support for TPC-DS data generation by introducing a new PySpark-based approach (nds_gen_data_spark.py) that replaces the Hadoop MapReduce method. The dsdgen binary is distributed to Spark executors via --archives, enabling parallel data generation on any Spark cluster manager (K8s, YARN, Standalone, local) without requiring YARN or MapReduce.

Changes

Core

  • nds/nds_gen_data_spark.py — New PySpark application that:
    • Distributes dsdgen binary across Spark executors via --archives
    • Runs data generation in parallel (one dsdgen child per Spark task)
    • Streams output line-by-line to avoid memory accumulation
    • Writes to any Hadoop-compatible filesystem (HDFS, S3, GCS, ABFS, local)
    • Renames Hive-style table_name=xxx directories to plain xxx for NDS pipeline compatibility
    • Supports --range for incremental generation, --overwrite, and --update for maintenance datasets
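The per-task flow described above (one dsdgen child per Spark task, output streamed line by line) can be sketched roughly as follows. This is an illustrative model, not the script's actual API: the helper names, dsdgen flags, and filename-parsing rule are assumptions based on common dsdgen behavior.

```python
import os
import subprocess
import tempfile


def table_from_filename(name: str) -> str:
    """Derive the table name from a dsdgen output file such as
    'store_sales_1_4.dat' by dropping the trailing _child_parallel
    suffix (files like 'call_center.dat' pass through unchanged)."""
    stem = name[:-4] if name.endswith(".dat") else name
    parts = stem.split("_")
    if len(parts) > 2 and parts[-1].isdigit() and parts[-2].isdigit():
        parts = parts[:-2]
    return "_".join(parts)


def run_dsdgen_and_stream(child, scale, parallel, dsdgen_dir):
    """Hypothetical per-task generator: run one dsdgen child into a
    scratch directory, then yield (table, line) pairs so no whole
    table file is ever held in executor memory."""
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(
            ["./dsdgen", "-scale", str(scale), "-parallel", str(parallel),
             "-child", str(child), "-dir", workdir, "-force", "Y"],
            cwd=dsdgen_dir, check=True)
        for name in sorted(os.listdir(workdir)):
            with open(os.path.join(workdir, name), encoding="utf-8") as f:
                for line in f:
                    yield table_from_filename(name), line.rstrip("\n")
```

On the Spark side, such a generator would typically be driven by a flatMap over an RDD of child indices, one index per task.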

Dockerfiles

  • nds/tpcds-gen/Dockerfile.dsdgen — Multi-stage build for cross-compiling dsdgen on Linux (handles GCC 10+ -fcommon issue)
  • nds/Dockerfile.k8s-test — Layers datagen scripts onto the official Spark Python image (recommended, uses ARG BASE_IMAGE)
  • nds/Dockerfile.spark-k8s — Standalone full Spark + PySpark image from a Spark distribution (build context = $SPARK_HOME)

Configuration

  • nds/datagen_submit.template — spark-submit template for use with spark-submit-template utility

Documentation

  • nds/README.md — Comprehensive new sections:
    • Spark-based data generation usage and arguments
    • K8s + HDFS prerequisites checklist
    • Cross-compiling dsdgen for Linux with Dockerfile.dsdgen
    • Comparison table of two Dockerfile options with step-by-step build instructions
    • Local K8s testing guide with minikube (PVC-based, fully manual steps)

Testing

Verified on local minikube (Docker driver, ARM64):

  1. Built dsdgen for Linux ARM64 via Dockerfile.dsdgen
  2. Built Spark K8s image with Dockerfile.k8s-test
  3. Ran nds_gen_data_spark.py with local[2] mode inside a K8s pod (scale=1, parallel=2)
  4. All 25 TPC-DS source tables generated successfully (1.2 GB total)
  5. Directory structure verified: plain table names (not Hive-style partitions)
  6. Data content verified: pipe-delimited format matching TPC-DS specification

Migration Notes

  • The existing nds_gen_data.py (MapReduce-based) is unchanged — this PR adds a parallel path, not a replacement
  • For K8s environments or clusters without MapReduce, use nds_gen_data_spark.py
  • For YARN environments with MapReduce available, either approach works

Signed-off-by: Allen Xu <allxu@nvidia.com>
Copilot AI review requested due to automatic review settings February 13, 2026 06:38

greptile-apps Bot commented Feb 13, 2026

Greptile Summary

This PR adds K8s-native TPC-DS data generation support via a new PySpark-based approach (nds_gen_data_spark.py) that replaces Hadoop MapReduce. The implementation distributes the dsdgen binary across Spark executors using the --archives mechanism and supports any Hadoop-compatible filesystem (HDFS, S3, GCS, ABFS, local).

Key additions:

  • nds_gen_data_spark.py — PySpark script for distributed data generation with incremental generation support via --range flag
  • Dockerfile.spark-k8s and Dockerfile.k8s-test — Two Docker build options for K8s deployment (standalone vs. layered)
  • Dockerfile.dsdgen — Cross-compilation support for building Linux dsdgen binaries from macOS
  • datagen_submit.template — Spark-submit template configuration
  • Comprehensive K8s documentation with minikube testing walkthrough

Review findings:
The code addresses most previous review feedback (py4j symlink error handling, PVC wait command, documentation improvements). A few minor items from previous reviews remain unaddressed but are not critical. The implementation is well-structured with good error handling and memory efficiency (streaming line-by-line processing).

Confidence Score: 4/5

  • Safe to merge with minor items from previous reviews still outstanding
  • Score reflects solid implementation quality with comprehensive documentation and testing instructions. Most critical issues from previous reviews have been addressed (py4j error handling, PVC waits, documentation). Remaining items are minor (casing conventions, archive naming consistency) and do not affect functionality. The core Spark-based generation logic is sound with proper error handling, memory-efficient streaming, and good validation.
  • Pay attention to nds/nds_gen_data_spark.py — ensure previous review comments about errorifexists casing and archive format consistency are addressed before final merge

Important Files Changed

Filename Overview
nds/nds_gen_data_spark.py New PySpark-based data generation script; previous review comments flagged errorifexists casing and archive format mismatches, but the core logic is sound with good error handling
nds/Dockerfile.spark-k8s Standalone Spark + PySpark image with improved py4j symlink error handling from previous review feedback; well-documented and production-ready
nds/Dockerfile.k8s-test Simple layered image for quick setup; includes helpful build instructions in comments addressing previous review feedback
nds/README.md Comprehensive K8s data generation documentation with minikube testing instructions; includes PVC wait command from previous review feedback

Sequence Diagram

sequenceDiagram
    participant User
    participant Driver as Spark Driver<br/>(K8s Pod)
    participant Executor as Spark Executor<br/>(K8s Pod)
    participant FS as Filesystem<br/>(HDFS/S3/etc)
    
    User->>Driver: spark-submit with --archives dsdgen.tar.gz#dsdgen
    Driver->>Driver: Parse args (scale, parallel, range)
    Driver->>Driver: Create RDD with child indices
    Driver->>Executor: Distribute archive + tasks
    
    Executor->>Executor: Extract dsdgen.tar.gz to SparkFiles
    Executor->>Executor: run_dsdgen_and_read(child_index)
    Executor->>Executor: subprocess.run(dsdgen -child N)
    Executor->>Executor: Stream .dat files line-by-line
    Executor->>FS: Write partitioned data (table_name=xxx/)
    
    Executor-->>Driver: Task complete
    
    Driver->>Driver: All tasks finished
    Driver->>FS: rename_partition_dirs()
    Driver->>FS: Rename table_name=xxx to xxx/
    FS-->>Driver: Rename complete
    
    Driver->>User: Data generation complete
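The final rename step in the diagram can be illustrated with a minimal local-filesystem sketch. This is a simplified model under stated assumptions: the real script would have to go through Hadoop's FileSystem API so the rename also works on HDFS/S3/ABFS, and the function name here is hypothetical.

```python
import os


def rename_partition_dirs(output_root: str) -> None:
    """Illustrative local-only version: rename Hive-style
    'table_name=web_sales' directories to plain 'web_sales' so the
    output layout matches what the NDS pipeline expects."""
    for entry in os.listdir(output_root):
        if "=" not in entry:
            continue
        src = os.path.join(output_root, entry)
        dst = os.path.join(output_root, entry.split("=", 1)[1])
        # Only rename real directories, and never clobber an existing target.
        if os.path.isdir(src) and not os.path.exists(dst):
            os.rename(src, dst)
```

Checking the rename result (rather than ignoring it) is exactly the "prevent silent failures" point raised in the review comments above.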

Last reviewed commit: 3160717


@greptile-apps greptile-apps Bot left a comment


6 files reviewed, 7 comments

Edit Code Review Agent Settings | Greptile

Comment thread nds/nds_gen_data_spark.py Outdated
Copilot AI (Contributor) left a comment


Pull request overview

This PR introduces a new Spark-based approach for TPC-DS data generation that works with Kubernetes and other Spark cluster managers, complementing the existing Hadoop MapReduce-based approach. The implementation distributes the dsdgen binary via Spark's --archives mechanism and generates data in parallel across Spark executors.

Changes:

  • Added nds_gen_data_spark.py: PySpark application that replaces MapReduce approach for data generation, supporting K8s, YARN, Standalone, and local modes
  • Updated README with documentation and examples for the new Spark-based data generation workflow

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
nds/nds_gen_data_spark.py New PySpark script implementing distributed TPC-DS data generation via Spark executors with support for incremental generation and multiple filesystem backends
nds/README.md Added comprehensive documentation section explaining Spark-based data generation with usage examples for K8s and other cluster managers


Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/README.md Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Signed-off-by: Allen Xu <allxu@nvidia.com>

@greptile-apps greptile-apps Bot left a comment


4 files reviewed, 7 comments

Edit Code Review Agent Settings | Greptile

Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Signed-off-by: Allen Xu <allxu@nvidia.com>

@greptile-apps greptile-apps Bot left a comment


4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread nds/nds_gen_data_spark.py
- Use dsdgen.tar.gz instead of dsdgen.jar for archive format consistency
- Add return value checks in rename_partition_dirs to prevent silent failures
- Use shutil.rmtree for robust temp directory cleanup
- Add encoding="utf-8" to file open calls
- Fix Partitions print alignment

Made-with: Cursor
Comment thread nds/nds_gen_data_spark.py Outdated
if len(parts) != 2:
    print("ERROR: --range must be 'start,end'", file=sys.stderr)
    sys.exit(1)
range_start, range_end = int(parts[0]), int(parts[1])


Missing exception handling for int() conversion - passing non-numeric values like --range abc,def will crash with unclear ValueError

Suggested change

range_start, range_end = int(parts[0]), int(parts[1])

becomes

try:
    range_start, range_end = int(parts[0]), int(parts[1])
except ValueError as e:
    print(f"ERROR: --range values must be integers: {e}", file=sys.stderr)
    sys.exit(1)

- Add tpcds-gen/Dockerfile.dsdgen for cross-compiling dsdgen on Linux
- Add Dockerfile.k8s-test for building Spark K8s image with datagen
- Remove k8s_datagen_smoketest.sh script
- Update README with cross-compilation, image build, and minikube
  testing instructions

Made-with: Cursor
Comment thread nds/Dockerfile.spark-k8s Outdated
# Set up Spark directory structure
ENV SPARK_HOME=/opt/spark
ENV PATH="${SPARK_HOME}/bin:${PATH}"
ENV PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.10.9.7-src.zip:${PYTHONPATH}"


Hardcoded py4j-0.10.9.7-src.zip version assumes Spark 3.5.x - will break with other Spark versions that bundle different py4j versions
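One version-agnostic way to resolve that path (in the spirit of the symlink-and-guard fix the author later added) is to glob for the bundled zip instead of pinning a version. This is a sketch with a hypothetical helper name, not the PR's actual implementation:

```python
import glob
import os


def find_py4j_zip(spark_home: str) -> str:
    """Locate the py4j source zip bundled with any Spark version,
    e.g. $SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip, without
    hardcoding the version string. Raises if none is found, which
    surfaces a broken image at build/startup time instead of at
    first PySpark import."""
    matches = glob.glob(
        os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
    if not matches:
        raise FileNotFoundError(
            f"no py4j source zip under {spark_home}/python/lib")
    return sorted(matches)[-1]
```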

Comment thread nds/Dockerfile.k8s-test Outdated
Comment thread nds/README.md
- Add try/except for --range int() parsing with clear error message
- Remove hardcoded py4j version in Dockerfile.spark-k8s using symlink
- Add base image build instructions and ARG to Dockerfile.k8s-test
- Add kubectl wait for PVC binding in README minikube guide

Made-with: Cursor
Comment thread nds/Dockerfile.spark-k8s Outdated
Comment thread nds/Dockerfile.spark-k8s Outdated
- Add guard in Dockerfile.spark-k8s to fail build if no py4j zip found
- Add usage instructions and build context note to Dockerfile.spark-k8s
- Document both Dockerfile options (k8s-test vs spark-k8s) in README
  with comparison table and step-by-step instructions

Made-with: Cursor
wjxiz1992 changed the title from "[draft]init support for k8s data gen" to "Init support for k8s data gen" Feb 27, 2026
wjxiz1992 changed the title from "Init support for k8s data gen" to "Add Spark-based data generation for K8s (replaces MapReduce)" Feb 27, 2026
wjxiz1992 (Collaborator, Author) commented:

@copilot code review[agent] review
