
Add Spark-based data generation for K8s (replaces MapReduce) #248

Open
wjxiz1992 wants to merge 7 commits into NVIDIA:dev from wjxiz1992:data-gen-k8s

Conversation

wjxiz1992 (Collaborator) commented Feb 13, 2026

Summary

This PR adds Kubernetes-native support for TPC-DS data generation by introducing a new PySpark-based approach (nds_gen_data_spark.py) that replaces the Hadoop MapReduce method. The dsdgen binary is distributed to Spark executors via --archives, enabling parallel data generation on any Spark cluster manager (K8s, YARN, Standalone, local) without requiring YARN or MapReduce.

Changes

Core

  • nds/nds_gen_data_spark.py — New PySpark application that:
    • Distributes dsdgen binary across Spark executors via --archives
    • Runs data generation in parallel (one dsdgen child per Spark task)
    • Streams output line-by-line to avoid memory accumulation
    • Writes to any Hadoop-compatible filesystem (HDFS, S3, GCS, ABFS, local)
    • Renames Hive-style table_name=xxx directories to plain xxx for NDS pipeline compatibility
    • Supports --range for incremental generation, --overwrite, and --update for maintenance datasets
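The per-task flow described above (one dsdgen child per Spark task, output streamed line by line) can be sketched roughly as follows. This is an illustrative model, not the script's actual API: the helper names, dsdgen flags, and filename-parsing rule are assumptions based on common dsdgen behavior.

```python
import os
import subprocess
import tempfile


def table_from_filename(name: str) -> str:
    """Derive the table name from a dsdgen output file such as
    'store_sales_1_4.dat' by dropping the trailing _child_parallel
    suffix (files like 'call_center.dat' pass through unchanged)."""
    stem = name[:-4] if name.endswith(".dat") else name
    parts = stem.split("_")
    if len(parts) > 2 and parts[-1].isdigit() and parts[-2].isdigit():
        parts = parts[:-2]
    return "_".join(parts)


def run_dsdgen_and_stream(child, scale, parallel, dsdgen_dir):
    """Hypothetical per-task generator: run one dsdgen child into a
    scratch directory, then yield (table, line) pairs so no whole
    table file is ever held in executor memory."""
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(
            ["./dsdgen", "-scale", str(scale), "-parallel", str(parallel),
             "-child", str(child), "-dir", workdir, "-force", "Y"],
            cwd=dsdgen_dir, check=True)
        for name in sorted(os.listdir(workdir)):
            with open(os.path.join(workdir, name), encoding="utf-8") as f:
                for line in f:
                    yield table_from_filename(name), line.rstrip("\n")
```

On the Spark side, such a generator would typically be driven by a flatMap over an RDD of child indices, one index per task.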

Dockerfiles

  • nds/tpcds-gen/Dockerfile.dsdgen — Multi-stage build for cross-compiling dsdgen on Linux (handles GCC 10+ -fcommon issue)
  • nds/Dockerfile.k8s-test — Layers datagen scripts onto the official Spark Python image (recommended, uses ARG BASE_IMAGE)
  • nds/Dockerfile.spark-k8s — Standalone full Spark + PySpark image from a Spark distribution (build context = $SPARK_HOME)

Configuration

  • nds/datagen_submit.template — spark-submit template for use with spark-submit-template utility

Documentation

  • nds/README.md — Comprehensive new sections:
    • Spark-based data generation usage and arguments
    • K8s + HDFS prerequisites checklist
    • Cross-compiling dsdgen for Linux with Dockerfile.dsdgen
    • Comparison table of two Dockerfile options with step-by-step build instructions
    • Local K8s testing guide with minikube (PVC-based, fully manual steps)

Testing

Verified on local minikube (Docker driver, ARM64):

  1. Built dsdgen for Linux ARM64 via Dockerfile.dsdgen
  2. Built Spark K8s image with Dockerfile.k8s-test
  3. Ran nds_gen_data_spark.py with local[2] mode inside a K8s pod (scale=1, parallel=2)
  4. All 25 TPC-DS source tables generated successfully (1.2 GB total)
  5. Directory structure verified: plain table names (not Hive-style partitions)
  6. Data content verified: pipe-delimited format matching TPC-DS specification

Migration Notes

  • The existing nds_gen_data.py (MapReduce-based) is unchanged — this PR adds a parallel path, not a replacement
  • For K8s environments or clusters without MapReduce, use nds_gen_data_spark.py
  • For YARN environments with MapReduce available, either approach works

Signed-off-by: Allen Xu <allxu@nvidia.com>
Copilot AI review requested due to automatic review settings February 13, 2026 06:38

greptile-apps Bot commented Feb 13, 2026

Greptile Summary

This PR adds K8s-native TPC-DS data generation support via a new PySpark-based approach (nds_gen_data_spark.py) that replaces Hadoop MapReduce. The implementation distributes the dsdgen binary across Spark executors using the --archives mechanism and supports any Hadoop-compatible filesystem (HDFS, S3, GCS, ABFS, local).

Key additions:

  • nds_gen_data_spark.py — PySpark script for distributed data generation with incremental generation support via --range flag
  • Dockerfile.spark-k8s and Dockerfile.k8s-test — Two Docker build options for K8s deployment (standalone vs. layered)
  • Dockerfile.dsdgen — Cross-compilation support for building Linux dsdgen binaries from macOS
  • datagen_submit.template — Spark-submit template configuration
  • Comprehensive K8s documentation with minikube testing walkthrough

Review findings:
The code addresses most previous review feedback (py4j symlink error handling, PVC wait command, documentation improvements). A few minor items from previous reviews remain unaddressed but are not critical. The implementation is well-structured with good error handling and memory efficiency (streaming line-by-line processing).

Confidence Score: 4/5

  • Safe to merge with minor items from previous reviews still outstanding
  • Score reflects solid implementation quality with comprehensive documentation and testing instructions. Most critical issues from previous reviews have been addressed (py4j error handling, PVC waits, documentation). Remaining items are minor (casing conventions, archive naming consistency) and do not affect functionality. The core Spark-based generation logic is sound with proper error handling, memory-efficient streaming, and good validation.
  • Pay attention to nds/nds_gen_data_spark.py — ensure previous review comments about errorifexists casing and archive format consistency are addressed before final merge

Important Files Changed

Filename Overview
nds/nds_gen_data_spark.py New PySpark-based data generation script; previous review comments flagged errorifexists casing and archive format mismatches, but the core logic is sound with good error handling
nds/Dockerfile.spark-k8s Standalone Spark + PySpark image with improved py4j symlink error handling from previous review feedback; well-documented and production-ready
nds/Dockerfile.k8s-test Simple layered image for quick setup; includes helpful build instructions in comments addressing previous review feedback
nds/README.md Comprehensive K8s data generation documentation with minikube testing instructions; includes PVC wait command from previous review feedback

Sequence Diagram

sequenceDiagram
    participant User
    participant Driver as Spark Driver<br/>(K8s Pod)
    participant Executor as Spark Executor<br/>(K8s Pod)
    participant FS as Filesystem<br/>(HDFS/S3/etc)
    
    User->>Driver: spark-submit with --archives dsdgen.tar.gz#dsdgen
    Driver->>Driver: Parse args (scale, parallel, range)
    Driver->>Driver: Create RDD with child indices
    Driver->>Executor: Distribute archive + tasks
    
    Executor->>Executor: Extract dsdgen.tar.gz to SparkFiles
    Executor->>Executor: run_dsdgen_and_read(child_index)
    Executor->>Executor: subprocess.run(dsdgen -child N)
    Executor->>Executor: Stream .dat files line-by-line
    Executor->>FS: Write partitioned data (table_name=xxx/)
    
    Executor-->>Driver: Task complete
    
    Driver->>Driver: All tasks finished
    Driver->>FS: rename_partition_dirs()
    Driver->>FS: Rename table_name=xxx to xxx/
    FS-->>Driver: Rename complete
    
    Driver->>User: Data generation complete
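The final rename step in the diagram can be illustrated with a minimal local-filesystem sketch. This is a simplified model under stated assumptions: the real script would have to go through Hadoop's FileSystem API so the rename also works on HDFS/S3/ABFS, and the function name here is hypothetical.

```python
import os


def rename_partition_dirs(output_root: str) -> None:
    """Illustrative local-only version: rename Hive-style
    'table_name=web_sales' directories to plain 'web_sales' so the
    output layout matches what the NDS pipeline expects."""
    for entry in os.listdir(output_root):
        if "=" not in entry:
            continue
        src = os.path.join(output_root, entry)
        dst = os.path.join(output_root, entry.split("=", 1)[1])
        # Only rename real directories, and never clobber an existing target.
        if os.path.isdir(src) and not os.path.exists(dst):
            os.rename(src, dst)
```

Checking the rename result (rather than ignoring it) is exactly the "prevent silent failures" point raised in the review comments above.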

Last reviewed commit: 3160717


@greptile-apps greptile-apps Bot left a comment


6 files reviewed, 7 comments

Edit Code Review Agent Settings | Greptile

Comment thread nds/nds_gen_data_spark.py Outdated
Copilot AI (Contributor) left a comment


Pull request overview

This PR introduces a new Spark-based approach for TPC-DS data generation that works with Kubernetes and other Spark cluster managers, complementing the existing Hadoop MapReduce-based approach. The implementation distributes the dsdgen binary via Spark's --archives mechanism and generates data in parallel across Spark executors.

Changes:

  • Added nds_gen_data_spark.py: PySpark application that replaces MapReduce approach for data generation, supporting K8s, YARN, Standalone, and local modes
  • Updated README with documentation and examples for the new Spark-based data generation workflow

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
nds/nds_gen_data_spark.py New PySpark script implementing distributed TPC-DS data generation via Spark executors with support for incremental generation and multiple filesystem backends
nds/README.md Added comprehensive documentation section explaining Spark-based data generation with usage examples for K8s and other cluster managers


Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/README.md Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Signed-off-by: Allen Xu <allxu@nvidia.com>

@greptile-apps greptile-apps Bot left a comment


4 files reviewed, 7 comments

Edit Code Review Agent Settings | Greptile

Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Comment thread nds/nds_gen_data_spark.py Outdated
Signed-off-by: Allen Xu <allxu@nvidia.com>

@greptile-apps greptile-apps Bot left a comment


4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread nds/nds_gen_data_spark.py
- Use dsdgen.tar.gz instead of dsdgen.jar for archive format consistency
- Add return value checks in rename_partition_dirs to prevent silent failures
- Use shutil.rmtree for robust temp directory cleanup
- Add encoding="utf-8" to file open calls
- Fix Partitions print alignment

Made-with: Cursor
Comment thread nds/nds_gen_data_spark.py Outdated
if len(parts) != 2:
    print("ERROR: --range must be 'start,end'", file=sys.stderr)
    sys.exit(1)
range_start, range_end = int(parts[0]), int(parts[1])


Missing exception handling for int() conversion - passing non-numeric values like --range abc,def will crash with unclear ValueError

Suggested change

range_start, range_end = int(parts[0]), int(parts[1])

becomes

try:
    range_start, range_end = int(parts[0]), int(parts[1])
except ValueError as e:
    print(f"ERROR: --range values must be integers: {e}", file=sys.stderr)
    sys.exit(1)

- Add tpcds-gen/Dockerfile.dsdgen for cross-compiling dsdgen on Linux
- Add Dockerfile.k8s-test for building Spark K8s image with datagen
- Remove k8s_datagen_smoketest.sh script
- Update README with cross-compilation, image build, and minikube
  testing instructions

Made-with: Cursor
Comment thread nds/Dockerfile.spark-k8s Outdated
# Set up Spark directory structure
ENV SPARK_HOME=/opt/spark
ENV PATH="${SPARK_HOME}/bin:${PATH}"
ENV PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.10.9.7-src.zip:${PYTHONPATH}"


Hardcoded py4j-0.10.9.7-src.zip version assumes Spark 3.5.x - will break with other Spark versions that bundle different py4j versions
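One version-agnostic way to resolve that path (in the spirit of the symlink-and-guard fix the author later added) is to glob for the bundled zip instead of pinning a version. This is a sketch with a hypothetical helper name, not the PR's actual implementation:

```python
import glob
import os


def find_py4j_zip(spark_home: str) -> str:
    """Locate the py4j source zip bundled with any Spark version,
    e.g. $SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip, without
    hardcoding the version string. Raises if none is found, which
    surfaces a broken image at build/startup time instead of at
    first PySpark import."""
    matches = glob.glob(
        os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
    if not matches:
        raise FileNotFoundError(
            f"no py4j source zip under {spark_home}/python/lib")
    return sorted(matches)[-1]
```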

Comment thread nds/Dockerfile.k8s-test Outdated
Comment thread nds/README.md
- Add try/except for --range int() parsing with clear error message
- Remove hardcoded py4j version in Dockerfile.spark-k8s using symlink
- Add base image build instructions and ARG to Dockerfile.k8s-test
- Add kubectl wait for PVC binding in README minikube guide

Made-with: Cursor
Comment thread nds/Dockerfile.spark-k8s Outdated
Comment thread nds/Dockerfile.spark-k8s Outdated
- Add guard in Dockerfile.spark-k8s to fail build if no py4j zip found
- Add usage instructions and build context note to Dockerfile.spark-k8s
- Document both Dockerfile options (k8s-test vs spark-k8s) in README
  with comparison table and step-by-step instructions

Made-with: Cursor
wjxiz1992 changed the title from "[draft]init support for k8s data gen" to "Init support for k8s data gen" Feb 27, 2026
wjxiz1992 changed the title from "Init support for k8s data gen" to "Add Spark-based data generation for K8s (replaces MapReduce)" Feb 27, 2026
wjxiz1992 (Collaborator, Author) commented:

@copilot code review[agent] review
