thisisqubika/data-migration-accelerator

Data Migration Accelerator

A project for testing and validating Snowflake to Databricks migration tools and extractors.

Overview

This project contains Snowflake test objects that can be used to validate extractors and conversion tools that migrate Snowflake database objects to Databricks.


Getting Started for New Clients

This section guides new clients through setting up the Data Migration Accelerator for their own projects.

Repository Setup: Fork the Repository

Forking creates a copy under your GitHub organization while maintaining a connection to the original repository for future updates.

Step 1: Fork the Repository

  1. Navigate to the original repository on GitHub
  2. Click Fork in the top-right corner
  3. Select your organization as the destination
  4. Uncheck "Copy the main branch only" if you want all branches

Step 2: Clone Your Fork

git clone https://github.com/YOUR_ORG/data-migration-accelerator.git
cd data-migration-accelerator

Step 3: Configure Upstream Remote

# Add the original repository as "upstream"
git remote add upstream https://github.com/thisisqubika/data-migration-accelerator.git

# Verify remotes
git remote -v
# origin    https://github.com/YOUR_ORG/data-migration-accelerator.git (fetch)
# origin    https://github.com/YOUR_ORG/data-migration-accelerator.git (push)
# upstream  https://github.com/thisisqubika/data-migration-accelerator.git (fetch)
# upstream  https://github.com/thisisqubika/data-migration-accelerator.git (push)

Step 4: Create a Client Branch

Keep your customizations separate from main for easier upstream merges:

git checkout -b client/your-company-name

Pulling Upstream Updates

When the original accelerator has updates you want to incorporate:

# Fetch upstream changes
git fetch upstream

# Merge upstream main into your branch
git checkout main
git merge upstream/main

# Push updated main to your fork
git push origin main

# Rebase your client branch on updated main
git checkout client/your-company-name
git rebase main

Contributing Back

If you make improvements that could benefit others:

  1. Create a feature branch from main
  2. Make your changes
  3. Push to your fork
  4. Open a Pull Request to the upstream repository

Post-Setup Checklist

After forking the repository:

  • Update databricks.yml with your bundle name
  • Configure GitHub Secrets for CI/CD (see GitHub Secrets)
  • Set up Databricks Secrets scope (see Databricks Secrets)
  • Configure cluster environment variables (see Cluster Environment Variables)
  • Update env.example with your default values
  • Create the required Databricks group (DEVS_GROUP)
  • Test the deployment pipeline
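
For the first checklist item, a minimal databricks.yml bundle fragment might look like the following. This is an illustrative sketch using the standard Databricks Asset Bundle schema; the bundle name and workspace host are placeholders, and the actual databricks.yml in this repository may define additional resources:

```yaml
# Illustrative Asset Bundle fragment - adapt the name and host to your fork.
bundle:
  name: your-company-migration-accelerator

targets:
  dev:
    default: true
    workspace:
      host: https://your-workspace.cloud.databricks.com
```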

Files

  • snowflake_test_objects.sql - Contains sample Snowflake objects (tables, views, procedures, functions, etc.) with the data_migration naming convention for testing migration tools.
  • snowpark.py - Python script using Snowpark API to read and extract all Snowflake objects from the database.
  • CONFIGURATION.md - Detailed guide on configuring Snowflake credentials.

Usage

  1. Execute snowflake_test_objects.sql in your Snowflake environment to create the test objects
  2. Use your migration tool/extractor to convert these Snowflake objects to Databricks
  3. Validate the converted objects in Databricks

Reading Snowflake Objects with Python

To read all Snowflake objects programmatically using Snowpark:

Setup

  1. Set up virtual environment (recommended):

    # Create virtual environment
    python3 -m venv venv
    
    # Activate virtual environment
    # On macOS/Linux:
    source venv/bin/activate
    # On Windows:
    # venv\Scripts\activate
    
    # Install dependencies
    pip install -r requirements.txt

    Note: If the virtual environment has already been created and dependencies installed, you only need to activate it:

    source venv/bin/activate
  2. Configure credentials:

    • Copy the example environment file:
      cp env.example .env
    • Edit .env with your Snowflake credentials (see CONFIGURATION.md for details):
      • Required: SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, SNOWFLAKE_PASSWORD
      • Optional: SNOWFLAKE_DATABASE, SNOWFLAKE_SCHEMA, SNOWFLAKE_WAREHOUSE, SNOWFLAKE_ROLE, SNOWFLAKE_REGION
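
The required/optional split above can be enforced programmatically. The following is a sketch of how a connection config could be assembled from those environment variables; `load_snowflake_config` is a hypothetical helper for illustration, not the actual snowpark.py API:

```python
import os

# Variable names taken from the credential list above.
REQUIRED = ("SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD")
OPTIONAL = ("SNOWFLAKE_DATABASE", "SNOWFLAKE_SCHEMA",
            "SNOWFLAKE_WAREHOUSE", "SNOWFLAKE_ROLE", "SNOWFLAKE_REGION")

def load_snowflake_config(env=os.environ):
    """Collect Snowflake connection settings, failing fast on missing required keys."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    # Snowpark-style config keys are the lowercase names without the prefix.
    cfg = {k.removeprefix("SNOWFLAKE_").lower(): env[k] for k in REQUIRED}
    cfg.update({k.removeprefix("SNOWFLAKE_").lower(): env[k]
                for k in OPTIONAL if env.get(k)})
    return cfg
```

The resulting dict can be passed to a Snowpark `Session.builder.configs(...)` call once the required keys are present.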

Using snowpark.py

# Make sure virtual environment is activated
source venv/bin/activate  # On macOS/Linux

# Run the script
python snowpark.py

Features:

  • Uses Snowpark API for data processing
  • Uses password authentication
  • Reads all objects (tables, views, procedures, functions, sequences, stages, file formats, tasks, streams, pipes)
  • Includes sample data from tables (first 10 rows)
  • Queries specific test objects from snowflake_test_objects.sql
  • Displays a summary
  • Saves results to snowflake_objects_snowpark.json
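
One plausible way to enumerate all of those object types is a SHOW statement per type. The sketch below generates those statements; `list_show_statements` is a hypothetical helper, not the actual snowpark.py code, and the exact SHOW syntax for some object types may differ in your Snowflake edition:

```python
# Object types from the feature list above.
OBJECT_TYPES = ["TABLES", "VIEWS", "PROCEDURES", "FUNCTIONS", "SEQUENCES",
                "STAGES", "FILE FORMATS", "TASKS", "STREAMS", "PIPES"]

def list_show_statements(database, schema):
    """Build one SHOW statement per object type, scoped to a schema."""
    scope = f"{database}.{schema}"
    return [f"SHOW {obj} IN SCHEMA {scope}" for obj in OBJECT_TYPES]
```

Each statement could then be executed with `session.sql(...)` and the rows collected into the JSON output.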

Test Objects

The SQL file includes:

  • Tables: data_migration_source, data_migration_target
  • Views: Various views for data migration summaries and status
  • Stored Procedures: Procedures for querying migration data
  • User-Defined Functions: Scalar and table functions
  • Other Objects: Sequences, stages, file formats, tasks, streams, and pipes

Configuration

  • Database: DATA_MIGRATION_DB
  • Schema: DATA_MIGRATION_SCHEMA
  • Data Retention: 1 day (max retention)

Credentials Configuration

For detailed instructions on configuring Snowflake credentials, see CONFIGURATION.md.

Quick Setup:

  1. Copy the example environment file:

    cp env.example .env
  2. Edit .env with your credentials:

    SNOWFLAKE_ACCOUNT=your_account_identifier
    SNOWFLAKE_USER=your_username
    SNOWFLAKE_PASSWORD=your_password
    SNOWFLAKE_DATABASE=your_database    # Required
    SNOWFLAKE_SCHEMA=your_schema        # Required
    SNOWFLAKE_WAREHOUSE=COMPUTE_WH      # Optional
    SNOWFLAKE_ROLE=SYSADMIN             # Optional
    
    # Unity Catalog - Required
    UC_CATALOG=your_catalog_name
    UC_SCHEMA=migration_accelerator
  3. The .env file is already in .gitignore to protect your credentials

Databricks Deployment

To run on Databricks, configure the following:

Databricks Secrets

Create a secrets scope and add credentials. These secrets are used at runtime by the Databricks jobs:

databricks secrets create-scope migration-accelerator
databricks secrets put-secret migration-accelerator SNOWFLAKE_ACCOUNT
databricks secrets put-secret migration-accelerator SNOWFLAKE_USER
databricks secrets put-secret migration-accelerator SNOWFLAKE_PASSWORD
databricks secrets put-secret migration-accelerator DATABRICKS_HOST
databricks secrets put-secret migration-accelerator DATABRICKS_CLIENT_ID
databricks secrets put-secret migration-accelerator DATABRICKS_CLIENT_SECRET

Note: The Snowflake credentials (SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, SNOWFLAKE_PASSWORD) are required for the ingestion job to connect to Snowflake. The Databricks credentials are used by the Job Executor App for API authentication.
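
A common pattern for code that must run both on a cluster (reading the secrets scope) and locally (reading .env variables) is a fallback reader. This is an illustrative helper, not the project's actual code; it assumes the scope name shown above:

```python
import os

def get_secret(key, scope="migration-accelerator"):
    """Read a secret from the Databricks scope at runtime; fall back to
    environment variables when not running on a Databricks cluster."""
    try:
        # pyspark.dbutils is only importable on Databricks runtimes.
        from pyspark.dbutils import DBUtils
        from pyspark.sql import SparkSession
        dbutils = DBUtils(SparkSession.builder.getOrCreate())
        return dbutils.secrets.get(scope=scope, key=key)
    except Exception:
        value = os.environ.get(key)
        if value is None:
            raise KeyError(f"{key} not found in scope {scope!r} or environment")
        return value
```

On a cluster the secrets scope wins; locally the same call transparently reads the values loaded from .env.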

Cluster Environment Variables

Set in Cluster → Advanced Options → Spark → Environment Variables:

# Required - Unity Catalog configuration
UC_CATALOG=your_catalog_name
UC_SCHEMA=migration_accelerator

# Required - Snowflake source context
SNOWFLAKE_DATABASE=your_database
SNOWFLAKE_SCHEMA=your_schema

# Required - Translation output configuration
DDL_OUTPUT_DIR=/Volumes/your_catalog/migration_accelerator/outputs
DBX_ENDPOINT=databricks-llama-4-maverick

# Optional - Override defaults if needed
# SECRETS_SCOPE=migration-accelerator    # Default: migration-accelerator
# UC_RAW_VOLUME=snowflake_artifacts_raw  # Default: snowflake_artifacts_raw
# SNOWFLAKE_WAREHOUSE=COMPUTE_WH         # Default: COMPUTE_WH
# SNOWFLAKE_ROLE=SYSADMIN                # Default: SYSADMIN
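
The required/optional split and the defaults listed in the comments above can be validated up front. The sketch below shows one way to do that; `resolve_cluster_env` is a hypothetical helper for illustration:

```python
import os

# Required variables and the optional defaults from the block above.
REQUIRED_VARS = ("UC_CATALOG", "UC_SCHEMA", "SNOWFLAKE_DATABASE",
                 "SNOWFLAKE_SCHEMA", "DDL_OUTPUT_DIR", "DBX_ENDPOINT")
DEFAULTS = {"SECRETS_SCOPE": "migration-accelerator",
            "UC_RAW_VOLUME": "snowflake_artifacts_raw",
            "SNOWFLAKE_WAREHOUSE": "COMPUTE_WH",
            "SNOWFLAKE_ROLE": "SYSADMIN"}

def resolve_cluster_env(env=os.environ):
    """Fail fast on missing required variables; merge in defaults for the rest."""
    missing = [v for v in REQUIRED_VARS if not env.get(v)]
    if missing:
        raise RuntimeError("Missing cluster environment variables: " + ", ".join(missing))
    resolved = {v: env[v] for v in REQUIRED_VARS}
    resolved.update({k: env.get(k, default) for k, default in DEFAULTS.items()})
    return resolved
```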

GitHub Secrets (for CI/CD)

These secrets are used by GitHub Actions to deploy the Databricks Asset Bundle (not at runtime):

  • DATABRICKS_HOST: Workspace URL (e.g., https://your-workspace.cloud.databricks.com)
  • DATABRICKS_CLIENT_ID: Service principal OAuth M2M client ID
  • DATABRICKS_CLIENT_SECRET: Service principal OAuth M2M client secret
  • DATABRICKS_CLUSTER_ID: Existing all-purpose cluster ID for running job tasks
  • UC_CATALOG: Unity Catalog name for schema and volume creation
  • DEVS_GROUP: Databricks group name for job and catalog permissions

Note: The DEVS_GROUP (e.g., migration-accelerator-devs) must exist in Databricks before deployment. Create it in Admin Console → Groups → Create Group.

Secrets vs GitHub Secrets: Databricks Secrets (in the scope) are read at runtime by the jobs. GitHub Secrets are used at deploy time by the CI/CD pipeline to authenticate and configure the bundle.

After deployment

Once deployed, get the service principal name from the Databricks App (Compute → Apps → dbx-job-executor-app → Authorization → App Authorization) and the job ID from Jobs & Pipelines → snowflake_ingestion_job → Job Details → Job ID. Then add this service principal to the developer permission group specified by the DEVS_GROUP GitHub Secret.

Handle Results

The results are stored in /Volumes/<databricks_host>/migration_accelerator/outputs/. These are the SQL files that will create the Databricks artifacts once run.

The recommended order of SQL files to run is: Roles → Stages → Tables → Streams → Pipes → Views → UDFs → Procedures → Grants
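
The recommended order above can be applied mechanically when executing the generated files. The sketch below assumes each output file name starts with its object type (e.g. tables.sql); `sort_for_execution` is a hypothetical helper, not part of the project:

```python
# Dependency order from the paragraph above.
RUN_ORDER = ["roles", "stages", "tables", "streams", "pipes",
             "views", "udfs", "procedures", "grants"]

def sort_for_execution(paths):
    """Sort generated SQL file paths into the recommended execution order;
    files with unrecognized names sort last."""
    def rank(path):
        name = path.rsplit("/", 1)[-1].lower()
        for i, kind in enumerate(RUN_ORDER):
            if name.startswith(kind):
                return i
        return len(RUN_ORDER)
    return sorted(paths, key=rank)
```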

Run Locally (translation job)

For a consistent, repeatable local run, use the included helper script or the Makefile target.

Quick (recommended):

# run via make (creates timestamped output dir)
make translate

Direct script (more control):

# run all example input files, batch size 2, produce SQL files
./scripts/run_translation.sh

# pass a custom glob, batch size, or output format (json/sql)
./scripts/run_translation.sh "src/artifact_translation_package/examples/*.json" 4 json

Notes:

  • make translate invokes the script at scripts/run_translation.sh and is the cleanest option.
  • The script exports PYTHONPATH=src so you can run it from the repository root.
  • Output is written to src/artifact_translation_package/out_sql_examples_<timestamp>.

Local vs Databricks output paths

This project uses Databricks-style paths (for example dbfs:/...) as the canonical configuration. To make the same code run locally without changing the canonical config, the runner maps Databricks paths to a local directory when it detects a non-Databricks runtime.

  • Local mapping environment variable: set LOCAL_DBFS_MOUNT to the local directory that should act as the root for dbfs:/ paths. Default: ./ddl_output.
  • Per-run results_dir: when running the translation job locally (for example via make translate), the job pre-creates a timestamped results_dir and propagates it to the translation context. Evaluation results are written under <results_dir>/evaluation_results/. Translation outputs such as translation_results.json and results_summary.json are written into the same results_dir.
  • Fallback behavior: the runner also exports DDL_OUTPUT_DIR=<results_dir> so code paths that read the DDL_OUTPUT_DIR environment variable will also target the per-run folder.
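
The mapping described above can be sketched in a few lines. `map_dbfs_path` is an illustrative helper showing the intended behavior of the LOCAL_DBFS_MOUNT variable; the project's actual runner may differ:

```python
import os

def map_dbfs_path(path, mount=None):
    """Map a canonical dbfs:/ path to a local directory when running
    outside Databricks; local paths pass through unchanged."""
    if not path.startswith("dbfs:/"):
        return path
    # LOCAL_DBFS_MOUNT acts as the local root for dbfs:/ paths (default ./ddl_output).
    mount = mount or os.environ.get("LOCAL_DBFS_MOUNT", "./ddl_output")
    return os.path.join(mount, path[len("dbfs:/"):])
```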

Examples

# run with default local mapping (output under ./ddl_output by default)
make translate

# override where dbfs:/ maps to locally
export LOCAL_DBFS_MOUNT=/tmp/my_local_dbfs
make translate

# inspect latest run outputs
ls -la src/artifact_translation_package/out_sql_examples_*/
ls -la <that-run-folder>/evaluation_results/
