
feat: Add Spark Operator for running Spark jobs on Kubernetes#2

Merged
klagrida merged 1 commit into main from feat/add-spark-operator
Dec 14, 2025

Conversation

@klagrida
Contributor

Implement Kubeflow Spark Operator for distributed data processing:

Infrastructure:

  • Helm values configuration (core/compute/spark-operator-values.yaml)
  • Kubeflow Spark Operator with webhook enabled
  • Resource limits: 128Mi/256Mi memory, 100m/200m CPU
  • RBAC and service account configuration
  • Prometheus metrics integration
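
The values file described above can be sketched roughly as follows. This is an illustrative sketch, not the contents of the actual file; key names follow common Kubeflow Spark Operator chart conventions and should be checked against the chart version in use:

```yaml
# Sketch of core/compute/spark-operator-values.yaml (illustrative, structure assumed)
webhook:
  enable: true          # mutating webhook for driver/executor pod customization
metrics:
  enable: true          # expose Prometheus metrics
controller:
  resources:
    requests:
      memory: 128Mi     # request/limit figures taken from this PR's description
      cpu: 100m
    limits:
      memory: 256Mi
      cpu: 200m
serviceAccounts:
  sparkoperator:
    create: true        # RBAC / service account managed by the chart
```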

Installation:

  • Installation script (scripts/install-spark.sh)
  • Uses Kubeflow Spark Operator Helm chart
  • Automatic namespace and RBAC setup
  • Webhook for pod customization

Developer Experience:

  • Makefile targets: install-spark, logs-spark, spark-apps
  • Enhanced status command that reports the Spark operator and running applications
  • Easy Spark job submission and monitoring

Examples:

  • spark-pi.yaml: Basic Spark Pi calculation
  • spark-minio-example.yaml: Spark with MinIO S3 integration
  • example_spark_job.py: Airflow DAG for Spark job orchestration
  • Comprehensive README with usage examples
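
A minimal spark-pi.yaml along the lines listed above might look like this (a sketch, not the file from the PR; the image tag, namespace, and service account name are assumptions):

```yaml
# Illustrative SparkApplication manifest for the Spark Pi example
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default                      # assumed namespace
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0                      # assumed image tag
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark                 # assumed service account name
  executor:
    instances: 1
    cores: 1
    memory: 512m
```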

Integration:

  • SparkApplication CRD for declarative job submission
  • S3A configuration for MinIO integration
  • Airflow orchestration support via KubernetesPodOperator
  • Driver and executor pod templates
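
The S3A-for-MinIO wiring typically reduces to a handful of sparkConf entries on the SparkApplication spec. A hedged sketch follows; the endpoint, bucket, and paths are placeholders, not values taken from this PR:

```yaml
# Illustrative S3A configuration fragment for a SparkApplication spec
spec:
  sparkConf:
    spark.hadoop.fs.s3a.endpoint: http://minio.minio.svc.cluster.local:9000  # placeholder endpoint
    spark.hadoop.fs.s3a.path.style.access: "true"     # MinIO requires path-style addressing
    spark.hadoop.fs.s3a.connection.ssl.enabled: "false"
    spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
  mainApplicationFile: s3a://my-bucket/jobs/etl.py    # placeholder bucket/path
```

Credentials are usually injected via a Kubernetes secret referenced from the driver/executor pod specs rather than hard-coded in sparkConf.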

Features:

  • Declarative Spark application management
  • Automatic driver/executor pod creation
  • Resource quota and limits
  • Job monitoring and lifecycle management
  • S3-compatible storage (MinIO) access

Spark Operator enables:

  • Batch data processing at scale
  • ETL/ELT pipelines
  • Machine learning workloads
  • Real-time stream processing (with Structured Streaming)
  • SQL analytics on distributed data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@klagrida klagrida merged commit d6ef85a into main Dec 14, 2025
3 checks passed
@klagrida klagrida deleted the feat/add-spark-operator branch December 14, 2025 12:08
