Build Databricks Docker image for Container Services #48

@srnnkls

Overview

Build a Docker image for running getml on Databricks Container Services. The image will support on-demand cluster execution for feature training, retraining, and interactive notebook workflows.

User Story

As a data engineer deploying getml on Databricks,
I want a production-ready Docker image for Databricks Container Services,
So that I can run getml feature engineering jobs natively within my Databricks clusters.

Technical Approach

1. Dockerfile

Create docker/databricks/Dockerfile extending the Databricks runtime:

# Extend Databricks standard runtime for LTS compatibility
FROM databricksruntime/standard:16.4-LTS

# Install uv for fast, reliable dependency management
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

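# Copy the dependency manifest and install from it with uv.
# Note: --system targets the first Python found on PATH; confirm this is the
# interpreter that Databricks notebooks use in this base image.
COPY docker/databricks/pyproject.toml ./
RUN uv pip install --system -r pyproject.toml

# Download the getml engine. Version precedence: build arg, then the installed
# getml package, then a hard-coded default.
ARG GETML_VERSION
RUN GETML_VERSION="${GETML_VERSION:-$(uv pip show --system getml 2>/dev/null | awk '/^Version:/ {print $2}')}" && \
    GETML_VERSION="${GETML_VERSION:-1.5.0}" && \
    mkdir -p /opt/getml/.getML && \
    curl -fsSL "https://go.getml.com/static/demo/download/${GETML_VERSION}/getml-${GETML_VERSION}-x64-linux.tar.gz" | \
    tar -C /opt/getml/.getML -xzf -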

# Copy application code (optional, can mount from workspace)
COPY . /opt/getml/app/

# IMPORTANT: CMD and ENTRYPOINT are IGNORED by Databricks
# Databricks controls execution; use init scripts for startup tasks

Key design decisions:

  • No CMD/ENTRYPOINT: Databricks ignores Docker execution primitives entirely
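  • uv for dependencies: Faster installs than pip, driven by the dependencies declared in pyproject.toml
  • Extends official runtime: Recommended by Databricks for compatibility (the base image already provides the JDK, Python, and system utilities a cluster expects)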
  • Engine in /opt/getml: Accessible system-wide for all users
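
The version fallback in the engine download step is easy to get wrong in shell, so it may help to state the intended precedence explicitly. The following is a minimal sketch, not part of the planned image: the function names are hypothetical, and the URL pattern and the 1.5.0 default are copied from the Dockerfile above.

```python
from typing import Optional

# URL pattern copied from the Dockerfile in this issue
ENGINE_URL_TEMPLATE = (
    "https://go.getml.com/static/demo/download/"
    "{version}/getml-{version}-x64-linux.tar.gz"
)
DEFAULT_VERSION = "1.5.0"  # same fallback value as the Dockerfile


def resolve_engine_version(build_arg: Optional[str], installed: Optional[str]) -> str:
    """Pick the engine version: build arg first, then installed package, then default."""
    for candidate in (build_arg, installed):
        if candidate and candidate.strip():
            return candidate.strip()
    return DEFAULT_VERSION


def engine_download_url(version: str) -> str:
    """Fill the version into both placeholders of the download URL."""
    return ENGINE_URL_TEMPLATE.format(version=version)
```

An empty build arg and an empty package lookup should both fall through to the default, which is the case a shell pipeline with a trailing `|| echo` tends to miss.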

2. Init Script

Create docker/databricks/getml-init.sh for cluster startup.

Why init scripts? Databricks ignores Docker ENTRYPOINT/CMD, so startup tasks cannot live in the image itself. Init scripts run after container creation but before the cluster becomes operational.

#!/bin/bash
# getml-init.sh - Databricks init script for getml setup
# This runs on every cluster node at startup

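set -euo pipefail

# Configure getml engine location
export GETML_HOME="/opt/getml/.getML"
export PATH="${GETML_HOME}:${PATH}"

# Persist environment variables for later shells.
# Overwrite rather than append so repeated runs do not duplicate entries.
# Note: /etc/profile.d is only read by login shells; if notebooks do not see
# these variables, set them through the cluster's spark_env_vars instead.
cat > /etc/profile.d/getml.sh <<EOF
export GETML_HOME="${GETML_HOME}"
export PATH="${GETML_HOME}:\${PATH}"
EOF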

# Create project directory on DBFS (shared storage)
mkdir -p /dbfs/getml/projects

# Log initialization
echo "========================================="
echo "getml initialized successfully"
echo "Engine: ${GETML_HOME}"
echo "Projects: /dbfs/getml/projects"
echo "========================================="
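
Since the init script only logs a success banner, a notebook-side check could confirm that its effects are actually visible to user code. This is a hedged sketch under the paths used in this issue; `check_getml_setup` is a hypothetical helper, not part of getml or Databricks.

```python
import os
from pathlib import Path
from typing import List, Mapping


def check_getml_setup(
    env: Mapping[str, str],
    expected_home: str = "/opt/getml/.getML",
    projects_dir: str = "/dbfs/getml/projects",
) -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    home = env.get("GETML_HOME")
    if home != expected_home:
        problems.append(f"GETML_HOME is {home!r}, expected {expected_home!r}")
    # PATH entries are separated by os.pathsep (":" on Linux)
    if expected_home not in env.get("PATH", "").split(os.pathsep):
        problems.append(f"{expected_home} is not on PATH")
    if not Path(expected_home).is_dir():
        problems.append(f"engine directory {expected_home} does not exist")
    if not Path(projects_dir).is_dir():
        problems.append(f"projects directory {projects_dir} does not exist")
    return problems
```

In a notebook cell, `check_getml_setup(os.environ)` would return the list of failed checks for the current process, which helps distinguish "init script did not run" from "variables were not propagated to the notebook".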

3. GitHub Actions Workflow

Create .github/workflows/databricks-docker.yml:

name: Build and Push Databricks Docker Image

on:
  release:
    types: [published]
  workflow_dispatch:
    inputs:
      version:
        description: 'getml version'
        required: true

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

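      - name: Build and Push
        env:
          # Passed through env so user input is not interpolated into the script
          RAW_VERSION: ${{ github.event.inputs.version || github.ref_name }}
        run: |
          # Strip a leading "v" in case release tags look like v1.5.0
          VERSION="${RAW_VERSION#v}"
          IMAGE="getml/getml-databricks"

          docker build \
            --build-arg GETML_VERSION="${VERSION}" \
            -t "${IMAGE}:${VERSION}" \
            -t "${IMAGE}:latest" \
            -f docker/databricks/Dockerfile .

          docker push "${IMAGE}:${VERSION}"
          docker push "${IMAGE}:latest"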
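
The workflow derives the version from either the manual dispatch input or the release ref name, and the same value feeds both the image tag and the engine download URL. If release tags carry a leading "v" (an assumption; the tag naming scheme is not stated in this issue), the value would need normalizing first. A minimal sketch with hypothetical helper names:

```python
from typing import List


def normalize_version(ref_name: str) -> str:
    """Drop one leading 'v' so a tag like 'v1.5.0' becomes '1.5.0'."""
    return ref_name[1:] if ref_name.startswith("v") else ref_name


def image_tags(image: str, ref_name: str) -> List[str]:
    """Return the versioned tag and the floating 'latest' tag for one build."""
    version = normalize_version(ref_name.strip())
    if not version:
        raise ValueError("version must not be empty")
    return [f"{image}:{version}", f"{image}:latest"]
```

For example, `image_tags("getml/getml-databricks", "v1.5.0")` yields the two tags the workflow pushes, with the version matching what the engine download URL expects.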

4. Alternative: GCR Workflow

For GCP-based Databricks workspaces using Google Container Registry, replace the Docker Hub login and build steps with:

      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Configure Docker for GCR
        run: gcloud auth configure-docker gcr.io

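      - name: Build and Push to GCR
        env:
          # Passed through env so user input is not interpolated into the script
          RAW_VERSION: ${{ github.event.inputs.version || github.ref_name }}
          GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }}
        run: |
          # Strip a leading "v" in case release tags look like v1.5.0
          VERSION="${RAW_VERSION#v}"
          IMAGE="gcr.io/${GCP_PROJECT_ID}/getml-databricks"

          docker build \
            --build-arg GETML_VERSION="${VERSION}" \
            -t "${IMAGE}:${VERSION}" \
            -t "${IMAGE}:latest" \
            -f docker/databricks/Dockerfile .

          docker push "${IMAGE}:${VERSION}"
          docker push "${IMAGE}:latest"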

5. Deployment Documentation

Create docker/databricks/README.md with:

  • Prerequisites (workspace admin, container services enabled)
  • Cluster configuration steps
  • Init script installation
  • Usage examples
  • Troubleshooting guide

Files to create:

docker/databricks/
├── Dockerfile
├── pyproject.toml      # Dependencies (uv)
├── getml-init.sh       # Cluster init script
└── README.md
.github/workflows/
└── databricks-docker.yml

Cluster Configuration

To use the custom container on Databricks:

  1. Enable Container Services (workspace admin):

    • Admin Console > Advanced > Enable Databricks Container Services
  2. Create cluster with custom image:

    {
      "cluster_name": "getml-cluster",
      "spark_version": "16.4.x-scala2.12",
      "docker_image": {
        "url": "getml/getml-databricks:latest"
      },
      "init_scripts": [
        {
          "workspace": {
            "destination": "/Shared/init-scripts/getml-init.sh"
          }
        }
      ]
    }
  3. Or via UI:

    • Create Cluster > Docker > Use your own Docker container
    • Enter image URL: getml/getml-databricks:latest
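
When clusters are created programmatically, the JSON above can be generated instead of hand-edited, which keeps the image URL and init script path in one place. This is a sketch only: `build_cluster_spec` is a hypothetical helper that reproduces the fields shown in this issue, and a real request would need further fields that the issue omits, such as node type and cluster size.

```python
import json
from typing import Any, Dict


def build_cluster_spec(
    image_url: str,
    init_script_path: str,
    cluster_name: str = "getml-cluster",
    spark_version: str = "16.4.x-scala2.12",
) -> Dict[str, Any]:
    """Assemble the cluster fields shown in this issue as a plain dict."""
    return {
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "docker_image": {"url": image_url},
        "init_scripts": [
            {"workspace": {"destination": init_script_path}},
        ],
    }


spec = build_cluster_spec(
    "getml/getml-databricks:latest",
    "/Shared/init-scripts/getml-init.sh",
)
print(json.dumps(spec, indent=2))
```

Pinning `image_url` to a versioned tag rather than `latest` would make cluster restarts reproducible.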

Acceptance Criteria

  • Dockerfile builds successfully extending databricksruntime
  • Image pushes to Docker Hub (or an alternative registry such as GCR) via CI/CD
  • Init script runs successfully on cluster startup
  • getml engine starts and can process data
  • Feature training workflow completes successfully
  • Feature retraining workflow completes successfully
  • Documentation covers setup and usage

Testing

  1. Local build test: docker build -t getml-databricks -f docker/databricks/Dockerfile .
  2. Cluster test: Create cluster with custom image
  3. Training test: Run sample feature training notebook
  4. Retraining test: Verify retraining with new data

Constraints & Limitations

  • CMD/ENTRYPOINT IGNORED: Databricks completely ignores Docker execution primitives - use init scripts for startup tasks
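  • Base image: Extend the databricksruntime/* base images; Databricks recommends them, and an image built from scratch must meet the runtime's requirements on its own
  • IP range conflict: Avoid using 172.17.0.0/16 (Docker's default bridge network) in container networking
  • Not supported on: Standard access mode, ML Runtime, AWS Graviton instances
  • Rate limits: Docker Hub rate-limits image pulls; for clusters that start frequently, host the image in a private registry such as GCR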
  • Init script execution: Runs after container creation, before cluster is operational
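
The IP range constraint can be checked mechanically before a network is provisioned. The sketch below uses Python's standard `ipaddress` module; the function name is hypothetical, and the CIDR passed in would come from whatever network the workspace uses.

```python
import ipaddress

# Docker's default bridge network, the range named in the constraint above
DOCKER_BRIDGE = ipaddress.ip_network("172.17.0.0/16")


def conflicts_with_docker_bridge(cidr: str) -> bool:
    """Return True if the given CIDR block overlaps 172.17.0.0/16."""
    # strict=False accepts inputs with host bits set, e.g. "172.17.5.1/24"
    network = ipaddress.ip_network(cidr, strict=False)
    return network.overlaps(DOCKER_BRIDGE)
```

Note that a wider block such as 172.16.0.0/12 also conflicts, because it contains the bridge range even though it does not start with 172.17.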

Dependencies

  • Databricks workspace with Container Services enabled (admin setting)
  • Container registry (Docker Hub, GCR, ECR, or ACR)
  • Docker CLI for local testing

Example Usage

# Build locally
docker build -t getml-databricks -f docker/databricks/Dockerfile .

# Push to Docker Hub
docker tag getml-databricks getml/getml-databricks:latest
docker push getml/getml-databricks:latest

# Configure cluster via Databricks UI
# Create Cluster > Docker > Use your own Docker container
# Image URL: getml/getml-databricks:latest
# Init script: /Shared/init-scripts/getml-init.sh

Implementation Breakdown

This feature will be decomposed into the following tasks (to be created as sub-issues):

  1. Create Dockerfile extending databricksruntime/standard
  2. Create pyproject.toml with getml dependencies
  3. Create getml-init.sh cluster init script
  4. Create GitHub Actions workflow for CI/CD (Docker Hub)
  5. Create alternative GCR workflow (optional)
  6. Write deployment documentation (README.md)
  7. Test end-to-end on Databricks cluster

Related Issues

Documentation
