thisisqubika/data-migration-accelerator

Data Migration Accelerator

A project for testing and validating Snowflake to Databricks migration tools and extractors.

Overview

This project contains Snowflake test objects that can be used to validate extractors and conversion tools that migrate Snowflake database objects to Databricks.


Getting Started for New Clients

This section guides new clients through setting up the Data Migration Accelerator for their own projects.

Repository Setup: Fork the Repository

Forking creates a copy under your GitHub organization while maintaining a connection to the original repository for future updates.

Step 1: Fork the Repository

  1. Navigate to the original repository on GitHub
  2. Click Fork in the top-right corner
  3. Select your organization as the destination
  4. Uncheck "Copy the main branch only" if you want all branches

Step 2: Clone Your Fork

git clone https://github.com/YOUR_ORG/data-migration-accelerator.git
cd data-migration-accelerator

Step 3: Configure Upstream Remote

# Add the original repository as "upstream"
git remote add upstream https://github.com/thisisqubika/data-migration-accelerator.git

# Verify remotes
git remote -v
# origin    https://github.com/YOUR_ORG/data-migration-accelerator.git (fetch)
# origin    https://github.com/YOUR_ORG/data-migration-accelerator.git (push)
# upstream  https://github.com/thisisqubika/data-migration-accelerator.git (fetch)
# upstream  https://github.com/thisisqubika/data-migration-accelerator.git (push)

Step 4: Create a Client Branch

Keep your customizations separate from main for easier upstream merges:

git checkout -b client/your-company-name

Pulling Upstream Updates

When the original accelerator has updates you want to incorporate:

# Fetch upstream changes
git fetch upstream

# Merge upstream main into your branch
git checkout main
git merge upstream/main

# Push updated main to your fork
git push origin main

# Rebase your client branch on updated main
git checkout client/your-company-name
git rebase main

Contributing Back

If you make improvements that could benefit others:

  1. Create a feature branch from main
  2. Make your changes
  3. Push to your fork
  4. Open a Pull Request to the upstream repository

Post-Setup Checklist

After forking the repository:

  • Update databricks.yml with your bundle name
  • Configure GitHub Secrets for CI/CD (see GitHub Secrets)
  • Set up Databricks Secrets scope (see Databricks Secrets)
  • Configure cluster environment variables (see Cluster Environment Variables)
  • Update env.example with your default values
  • Create the required Databricks group (DEVS_GROUP)
  • Test the deployment pipeline
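
For the first checklist item, a minimal databricks.yml bundle fragment might look like the following. This is an illustrative sketch using the standard Databricks Asset Bundle schema; the bundle name and workspace host are placeholders, and the actual databricks.yml in this repository may define additional resources:

```yaml
# Illustrative Asset Bundle fragment - adapt the name and host to your fork.
bundle:
  name: your-company-migration-accelerator

targets:
  dev:
    default: true
    workspace:
      host: https://your-workspace.cloud.databricks.com
```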

Files

  • snowflake_test_objects.sql - Contains sample Snowflake objects (tables, views, procedures, functions, etc.) with the data_migration naming convention for testing migration tools.
  • snowpark.py - Python script using Snowpark API to read and extract all Snowflake objects from the database.
  • CONFIGURATION.md - Detailed guide on configuring Snowflake credentials.

Usage

  1. Execute snowflake_test_objects.sql in your Snowflake environment to create the test objects
  2. Use your migration tool/extractor to convert these Snowflake objects to Databricks
  3. Validate the converted objects in Databricks

Reading Snowflake Objects with Python

To read all Snowflake objects programmatically using Snowpark:

Setup

  1. Set up virtual environment (recommended):

    # Create virtual environment
    python3 -m venv venv
    
    # Activate virtual environment
    # On macOS/Linux:
    source venv/bin/activate
    # On Windows:
    # venv\Scripts\activate
    
    # Install dependencies
    pip install -r requirements.txt

    Note: If the virtual environment has already been created and dependencies installed, you only need to activate it:

    source venv/bin/activate
  2. Configure credentials:

    • Copy the example environment file:
      cp env.example .env
    • Edit .env with your Snowflake credentials (see CONFIGURATION.md for details):
      • Required: SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, SNOWFLAKE_PASSWORD
      • Optional: SNOWFLAKE_DATABASE, SNOWFLAKE_SCHEMA, SNOWFLAKE_WAREHOUSE, SNOWFLAKE_ROLE, SNOWFLAKE_REGION
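
The required/optional split above can be enforced programmatically. The following is a sketch of how a connection config could be assembled from those environment variables; `load_snowflake_config` is a hypothetical helper for illustration, not the actual snowpark.py API:

```python
import os

# Variable names taken from the credential list above.
REQUIRED = ("SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD")
OPTIONAL = ("SNOWFLAKE_DATABASE", "SNOWFLAKE_SCHEMA",
            "SNOWFLAKE_WAREHOUSE", "SNOWFLAKE_ROLE", "SNOWFLAKE_REGION")

def load_snowflake_config(env=os.environ):
    """Collect Snowflake connection settings, failing fast on missing required keys."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    # Snowpark-style config keys are the lowercase names without the prefix.
    cfg = {k.removeprefix("SNOWFLAKE_").lower(): env[k] for k in REQUIRED}
    cfg.update({k.removeprefix("SNOWFLAKE_").lower(): env[k]
                for k in OPTIONAL if env.get(k)})
    return cfg
```

The resulting dict can be passed to a Snowpark `Session.builder.configs(...)` call once the required keys are present.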

Using snowpark.py

# Make sure virtual environment is activated
source venv/bin/activate  # On macOS/Linux

# Run the script
python snowpark.py

Features:

  • Uses Snowpark API for data processing
  • Uses password authentication
  • Reads all objects (tables, views, procedures, functions, sequences, stages, file formats, tasks, streams, pipes)
  • Includes sample data from tables (first 10 rows)
  • Queries specific test objects from snowflake_test_objects.sql
  • Displays a summary
  • Saves results to snowflake_objects_snowpark.json
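
One plausible way to enumerate all of those object types is a SHOW statement per type. The sketch below generates those statements; `list_show_statements` is a hypothetical helper, not the actual snowpark.py code, and the exact SHOW syntax for some object types may differ in your Snowflake edition:

```python
# Object types from the feature list above.
OBJECT_TYPES = ["TABLES", "VIEWS", "PROCEDURES", "FUNCTIONS", "SEQUENCES",
                "STAGES", "FILE FORMATS", "TASKS", "STREAMS", "PIPES"]

def list_show_statements(database, schema):
    """Build one SHOW statement per object type, scoped to a schema."""
    scope = f"{database}.{schema}"
    return [f"SHOW {obj} IN SCHEMA {scope}" for obj in OBJECT_TYPES]
```

Each statement could then be executed with `session.sql(...)` and the rows collected into the JSON output.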

Test Objects

The SQL file includes:

  • Tables: data_migration_source, data_migration_target
  • Views: Various views for data migration summaries and status
  • Stored Procedures: Procedures for querying migration data
  • User-Defined Functions: Scalar and table functions
  • Other Objects: Sequences, stages, file formats, tasks, streams, and pipes

Configuration

  • Database: DATA_MIGRATION_DB
  • Schema: DATA_MIGRATION_SCHEMA
  • Data Retention: 1 day (max retention)

Credentials Configuration

For detailed instructions on configuring Snowflake credentials, see CONFIGURATION.md.

Quick Setup:

  1. Copy the example environment file:

    cp env.example .env
  2. Edit .env with your credentials:

    SNOWFLAKE_ACCOUNT=your_account_identifier
    SNOWFLAKE_USER=your_username
    SNOWFLAKE_PASSWORD=your_password
    SNOWFLAKE_DATABASE=your_database    # Required
    SNOWFLAKE_SCHEMA=your_schema        # Required
    SNOWFLAKE_WAREHOUSE=COMPUTE_WH      # Optional
    SNOWFLAKE_ROLE=SYSADMIN             # Optional
    
    # Unity Catalog - Required
    UC_CATALOG=your_catalog_name
    UC_SCHEMA=migration_accelerator
  3. The .env file is already in .gitignore to protect your credentials

Databricks Deployment

To run on Databricks, configure the following:

Databricks Secrets

Create a secrets scope and add credentials. These secrets are used at runtime by the Databricks jobs:

databricks secrets create-scope migration-accelerator
databricks secrets put-secret migration-accelerator SNOWFLAKE_ACCOUNT
databricks secrets put-secret migration-accelerator SNOWFLAKE_USER
databricks secrets put-secret migration-accelerator SNOWFLAKE_PASSWORD
databricks secrets put-secret migration-accelerator DATABRICKS_HOST
databricks secrets put-secret migration-accelerator DATABRICKS_CLIENT_ID
databricks secrets put-secret migration-accelerator DATABRICKS_CLIENT_SECRET

Note: The Snowflake credentials (SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, SNOWFLAKE_PASSWORD) are required for the ingestion job to connect to Snowflake. The Databricks credentials are used by the Job Executor App for API authentication.
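
A common pattern for code that must run both on a cluster (reading the secrets scope) and locally (reading .env variables) is a fallback reader. This is an illustrative helper, not the project's actual code; it assumes the scope name shown above:

```python
import os

def get_secret(key, scope="migration-accelerator"):
    """Read a secret from the Databricks scope at runtime; fall back to
    environment variables when not running on a Databricks cluster."""
    try:
        # pyspark.dbutils is only importable on Databricks runtimes.
        from pyspark.dbutils import DBUtils
        from pyspark.sql import SparkSession
        dbutils = DBUtils(SparkSession.builder.getOrCreate())
        return dbutils.secrets.get(scope=scope, key=key)
    except Exception:
        value = os.environ.get(key)
        if value is None:
            raise KeyError(f"{key} not found in scope {scope!r} or environment")
        return value
```

On a cluster the secrets scope wins; locally the same call transparently reads the values loaded from .env.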

Cluster Environment Variables

Set in Cluster → Advanced Options → Spark → Environment Variables:

# Required - Unity Catalog configuration
UC_CATALOG=your_catalog_name
UC_SCHEMA=migration_accelerator

# Required - Snowflake source context
SNOWFLAKE_DATABASE=your_database
SNOWFLAKE_SCHEMA=your_schema

# Required - Translation output configuration
DDL_OUTPUT_DIR=/Volumes/your_catalog/migration_accelerator/outputs
DBX_ENDPOINT=databricks-llama-4-maverick

# Optional - Override defaults if needed
# SECRETS_SCOPE=migration-accelerator    # Default: migration-accelerator
# UC_RAW_VOLUME=snowflake_artifacts_raw  # Default: snowflake_artifacts_raw
# SNOWFLAKE_WAREHOUSE=COMPUTE_WH         # Default: COMPUTE_WH
# SNOWFLAKE_ROLE=SYSADMIN                # Default: SYSADMIN
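
The required/optional split and the defaults listed in the comments above can be validated up front. The sketch below shows one way to do that; `resolve_cluster_env` is a hypothetical helper for illustration:

```python
import os

# Required variables and the optional defaults from the block above.
REQUIRED_VARS = ("UC_CATALOG", "UC_SCHEMA", "SNOWFLAKE_DATABASE",
                 "SNOWFLAKE_SCHEMA", "DDL_OUTPUT_DIR", "DBX_ENDPOINT")
DEFAULTS = {"SECRETS_SCOPE": "migration-accelerator",
            "UC_RAW_VOLUME": "snowflake_artifacts_raw",
            "SNOWFLAKE_WAREHOUSE": "COMPUTE_WH",
            "SNOWFLAKE_ROLE": "SYSADMIN"}

def resolve_cluster_env(env=os.environ):
    """Fail fast on missing required variables; merge in defaults for the rest."""
    missing = [v for v in REQUIRED_VARS if not env.get(v)]
    if missing:
        raise RuntimeError("Missing cluster environment variables: " + ", ".join(missing))
    resolved = {v: env[v] for v in REQUIRED_VARS}
    resolved.update({k: env.get(k, default) for k, default in DEFAULTS.items()})
    return resolved
```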

GitHub Secrets (for CI/CD)

These secrets are used by GitHub Actions to deploy the Databricks Asset Bundle (not at runtime):

  • DATABRICKS_HOST: Workspace URL (e.g., https://your-workspace.cloud.databricks.com)
  • DATABRICKS_CLIENT_ID: Service principal OAuth M2M client ID
  • DATABRICKS_CLIENT_SECRET: Service principal OAuth M2M client secret
  • DATABRICKS_CLUSTER_ID: Existing all-purpose cluster ID for running job tasks
  • UC_CATALOG: Unity Catalog name for schema and volume creation
  • DEVS_GROUP: Databricks group name for job and catalog permissions

Note: The DEVS_GROUP (e.g., migration-accelerator-devs) must exist in Databricks before deployment. Create it in Admin Console → Groups → Create Group.

Secrets vs GitHub Secrets: Databricks Secrets (in the scope) are read at runtime by the jobs. GitHub Secrets are used at deploy time by the CI/CD pipeline to authenticate and configure the bundle.

After deployment

Once deployed, get the service principal name from the Databricks App (Compute → Apps → dbx-job-executor-app → Authorization → App Authorization) and the job ID from Jobs & Pipelines → snowflake_ingestion_job → Job Details → Job ID. Then add this service principal to the developer permission group specified by the DEVS_GROUP GitHub Secret.

Handle Results

The results are stored in /Volumes/<databricks_host>/migration_accelerator/outputs/. These are the SQL files that will create the Databricks artifacts once run.

The recommended order of SQL files to run is: Roles → Stages → Tables → Streams → Pipes → Views → UDFs → Procedures → Grants
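
The recommended order above can be applied mechanically when executing the generated files. The sketch below assumes each output file name starts with its object type (e.g. tables.sql); `sort_for_execution` is a hypothetical helper, not part of the project:

```python
# Dependency order from the paragraph above.
RUN_ORDER = ["roles", "stages", "tables", "streams", "pipes",
             "views", "udfs", "procedures", "grants"]

def sort_for_execution(paths):
    """Sort generated SQL file paths into the recommended execution order;
    files with unrecognized names sort last."""
    def rank(path):
        name = path.rsplit("/", 1)[-1].lower()
        for i, kind in enumerate(RUN_ORDER):
            if name.startswith(kind):
                return i
        return len(RUN_ORDER)
    return sorted(paths, key=rank)
```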

Run Locally (translation job)

For a consistent, repeatable local run, use the included helper script or the Makefile target.

Quick (recommended):

# run via make (creates timestamped output dir)
make translate

Direct script (more control):

# run all example input files, batch size 2, produce SQL files
./scripts/run_translation.sh

# pass a custom glob, batch size, or output format (json/sql)
./scripts/run_translation.sh "src/artifact_translation_package/examples/*.json" 4 json

Notes:

  • make translate invokes the script at scripts/run_translation.sh and is the cleanest option.
  • The script exports PYTHONPATH=src so you can run it from the repository root.
  • Output is written to src/artifact_translation_package/out_sql_examples_<timestamp>.

Local vs Databricks output paths

This project uses Databricks-style paths (for example dbfs:/...) as the canonical configuration. To make the same code run locally without changing the canonical config, the runner maps Databricks paths to a local directory when it detects a non-Databricks runtime.

  • Local mapping environment variable: set LOCAL_DBFS_MOUNT to the local directory that should act as the root for dbfs:/ paths. Default: ./ddl_output.
  • Per-run results_dir: when running the translation job locally (for example via make translate), the job pre-creates a timestamped results_dir and propagates it to the translation context. Evaluation results are written under <results_dir>/evaluation_results/. Translation outputs such as translation_results.json and results_summary.json are written into the same results_dir.
  • Fallback behavior: the runner also exports DDL_OUTPUT_DIR=<results_dir> so code paths that read the DDL_OUTPUT_DIR environment variable will also target the per-run folder.
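
The mapping described above can be sketched in a few lines. `map_dbfs_path` is an illustrative helper showing the intended behavior of the LOCAL_DBFS_MOUNT variable; the project's actual runner may differ:

```python
import os

def map_dbfs_path(path, mount=None):
    """Map a canonical dbfs:/ path to a local directory when running
    outside Databricks; local paths pass through unchanged."""
    if not path.startswith("dbfs:/"):
        return path
    # LOCAL_DBFS_MOUNT acts as the local root for dbfs:/ paths (default ./ddl_output).
    mount = mount or os.environ.get("LOCAL_DBFS_MOUNT", "./ddl_output")
    return os.path.join(mount, path[len("dbfs:/"):])
```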

Examples

# run with default local mapping (output under ./ddl_output by default)
make translate

# override where dbfs:/ maps to locally
export LOCAL_DBFS_MOUNT=/tmp/my_local_dbfs
make translate

# inspect latest run outputs
ls -la src/artifact_translation_package/out_sql_examples_*/
ls -la <that-run-folder>/evaluation_results/
