A project for testing and validating Snowflake to Databricks migration tools and extractors.
This project contains Snowflake test objects that can be used to validate extractors and conversion tools that migrate Snowflake database objects to Databricks.
This section guides new clients through setting up the Data Migration Accelerator for their own projects.
Forking creates a copy under your GitHub organization while maintaining a connection to the original repository for future updates.
- Navigate to the original repository on GitHub
- Click Fork in the top-right corner
- Select your organization as the destination
- Uncheck "Copy the `main` branch only" if you want all branches
git clone https://github.com/YOUR_ORG/data-migration-accelerator.git
cd data-migration-accelerator

# Add the original repository as "upstream"
git remote add upstream https://github.com/thisisqubika/data-migration-accelerator.git
# Verify remotes
git remote -v
# origin https://github.com/YOUR_ORG/data-migration-accelerator.git (fetch)
# origin https://github.com/YOUR_ORG/data-migration-accelerator.git (push)
# upstream https://github.com/thisisqubika/data-migration-accelerator.git (fetch)
# upstream https://github.com/thisisqubika/data-migration-accelerator.git (push)

Keep your customizations separate from `main` for easier upstream merges:

git checkout -b client/your-company-name

When the original accelerator has updates you want to incorporate:
# Fetch upstream changes
git fetch upstream
# Merge upstream main into your branch
git checkout main
git merge upstream/main
# Push updated main to your fork
git push origin main
# Rebase your client branch on updated main
git checkout client/your-company-name
git rebase main

If you make improvements that could benefit others:

- Create a feature branch from `main`
- Make your changes
- Push to your fork
- Open a Pull Request to the upstream repository
After forking the repository:
- Update `databricks.yml` with your bundle name
- Configure GitHub Secrets for CI/CD (see GitHub Secrets)
- Set up Databricks Secrets scope (see Databricks Secrets)
- Configure cluster environment variables (see Cluster Environment Variables)
- Update `env.example` with your default values
- Create the required Databricks group (`DEVS_GROUP`)
- Test the deployment pipeline
- `snowflake_test_objects.sql` - Contains sample Snowflake objects (tables, views, procedures, functions, etc.) with the `data_migration` naming convention for testing migration tools.
- `snowpark.py` - Python script using the Snowpark API to read and extract all Snowflake objects from the database.
- CONFIGURATION.md - Detailed guide on configuring Snowflake credentials.
- Execute `snowflake_test_objects.sql` in your Snowflake environment to create the test objects
- Use your migration tool/extractor to convert these Snowflake objects to Databricks
- Validate the converted objects in Databricks
To read all Snowflake objects programmatically using Snowpark:
- Set up a virtual environment (recommended):

  # Create virtual environment
  python3 -m venv venv

  # Activate virtual environment
  # On macOS/Linux:
  source venv/bin/activate
  # On Windows:
  # venv\Scripts\activate

  # Install dependencies
  pip install -r requirements.txt

  Note: The virtual environment has already been created and dependencies installed. To activate it:

  source venv/bin/activate

- Configure credentials:

  - Copy the example environment file:

    cp env.example .env

  - Edit `.env` with your Snowflake credentials (see CONFIGURATION.md for details):
    - Required: `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`
    - Optional: `SNOWFLAKE_DATABASE`, `SNOWFLAKE_SCHEMA`, `SNOWFLAKE_WAREHOUSE`, `SNOWFLAKE_ROLE`, `SNOWFLAKE_REGION`
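As an illustration, the required/optional split above can be checked before opening a connection. A minimal sketch, assuming the variable names from CONFIGURATION.md; the `build_config` helper is ours, not part of `snowpark.py`:

```python
import os

# Variable names as documented; the helper itself is a hypothetical illustration.
REQUIRED = ("SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD")
OPTIONAL = ("SNOWFLAKE_DATABASE", "SNOWFLAKE_SCHEMA",
            "SNOWFLAKE_WAREHOUSE", "SNOWFLAKE_ROLE", "SNOWFLAKE_REGION")

def build_config(env=os.environ):
    """Collect Snowflake settings, failing fast if a required one is unset."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")
    config = {name: env[name] for name in REQUIRED}
    # Optional settings are included only when present.
    config.update({name: env[name] for name in OPTIONAL if env.get(name)})
    return config
```

Failing fast here gives a clearer error than a connection timeout from the Snowflake driver.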
# Make sure virtual environment is activated
source venv/bin/activate # On macOS/Linux
# Run the script
python snowpark.py

Features:
- Uses Snowpark API for data processing
- Uses password authentication
- Reads all objects (tables, views, procedures, functions, sequences, stages, file formats, tasks, streams, pipes)
- Includes sample data from tables (first 10 rows)
- Queries specific test objects from `snowflake_test_objects.sql`
- Displays a summary
- Saves results to `snowflake_objects_snowpark.json`
The SQL file includes:
- Tables: `data_migration_source`, `data_migration_target`
- Views: Various views for data migration summaries and status
- Stored Procedures: Procedures for querying migration data
- User-Defined Functions: Scalar and table functions
- Other Objects: Sequences, stages, file formats, tasks, streams, and pipes
- Database: `DATA_MIGRATION_DB`
- Schema: `DATA_MIGRATION_SCHEMA`
- Data Retention: 1 day (max retention)
For detailed instructions on configuring Snowflake credentials, see CONFIGURATION.md.
Quick Setup:
- Copy the example environment file:

  cp env.example .env

- Edit `.env` with your credentials:

  SNOWFLAKE_ACCOUNT=your_account_identifier
  SNOWFLAKE_USER=your_username
  SNOWFLAKE_PASSWORD=your_password
  SNOWFLAKE_DATABASE=your_database  # Required
  SNOWFLAKE_SCHEMA=your_schema  # Required
  SNOWFLAKE_WAREHOUSE=COMPUTE_WH  # Optional
  SNOWFLAKE_ROLE=SYSADMIN  # Optional

  # Unity Catalog - Required
  UC_CATALOG=your_catalog_name
  UC_SCHEMA=migration_accelerator

- The `.env` file is already in `.gitignore` to protect your credentials
To run on Databricks, configure the following:
Create a secrets scope and add credentials. These secrets are used at runtime by the Databricks jobs:
databricks secrets create-scope migration-accelerator
databricks secrets put-secret migration-accelerator SNOWFLAKE_ACCOUNT
databricks secrets put-secret migration-accelerator SNOWFLAKE_USER
databricks secrets put-secret migration-accelerator SNOWFLAKE_PASSWORD
databricks secrets put-secret migration-accelerator DATABRICKS_HOST
databricks secrets put-secret migration-accelerator DATABRICKS_CLIENT_ID
databricks secrets put-secret migration-accelerator DATABRICKS_CLIENT_SECRET

Note: The Snowflake credentials (`SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`) are required for the ingestion job to connect to Snowflake. The Databricks credentials are used by the Job Executor App for API authentication.
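To illustrate the runtime side, a job task would typically read these secrets via `dbutils.secrets.get` on the cluster, and a local run can fall back to environment variables. A sketch under that assumption; the `get_secret` helper is hypothetical, not part of the project:

```python
import os

SCOPE = "migration-accelerator"  # the scope created above

def get_secret(key, dbutils=None, scope=SCOPE):
    """Read a credential from the Databricks secrets scope when running on
    a cluster (dbutils provided), else fall back to environment variables
    for local development."""
    if dbutils is not None:
        return dbutils.secrets.get(scope=scope, key=key)
    value = os.environ.get(key)
    if value is None:
        raise KeyError(f"{key} not found in scope '{scope}' or the environment")
    return value
```

Passing `dbutils` in explicitly keeps the helper importable and testable outside a Databricks notebook.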
Set in Cluster → Advanced Options → Spark → Environment Variables:
# Required - Unity Catalog configuration
UC_CATALOG=your_catalog_name
UC_SCHEMA=migration_accelerator
# Required - Snowflake source context
SNOWFLAKE_DATABASE=your_database
SNOWFLAKE_SCHEMA=your_schema
# Required - Translation output configuration
DDL_OUTPUT_DIR=/Volumes/your_catalog/migration_accelerator/outputs
DBX_ENDPOINT=databricks-llama-4-maverick
# Optional - Override defaults if needed
# SECRETS_SCOPE=migration-accelerator # Default: migration-accelerator
# UC_RAW_VOLUME=snowflake_artifacts_raw # Default: snowflake_artifacts_raw
# SNOWFLAKE_WAREHOUSE=COMPUTE_WH # Default: COMPUTE_WH
# SNOWFLAKE_ROLE=SYSADMIN # Default: SYSADMIN

These secrets are used by GitHub Actions to deploy the Databricks Asset Bundle (not at runtime):
| Secret | Description |
|---|---|
| `DATABRICKS_HOST` | Workspace URL (e.g., https://your-workspace.cloud.databricks.com) |
| `DATABRICKS_CLIENT_ID` | Service principal OAuth M2M client ID |
| `DATABRICKS_CLIENT_SECRET` | Service principal OAuth M2M client secret |
| `DATABRICKS_CLUSTER_ID` | Existing all-purpose cluster ID for running job tasks |
| `UC_CATALOG` | Unity Catalog name for schema and volume creation |
| `DEVS_GROUP` | Databricks group name for job and catalog permissions |
Note: The `DEVS_GROUP` (e.g., `migration-accelerator-devs`) must exist in Databricks before deployment. Create it in Admin Console → Groups → Create Group.
Databricks Secrets vs GitHub Secrets: Databricks Secrets (in the scope) are read at runtime by the jobs; GitHub Secrets are used at deploy time by the CI/CD pipeline to authenticate and configure the bundle.
Once deployed, get the service principal name from the Databricks App under Compute → Apps → dbx-job-executor-app → Authorization → App Authorization, and the job ID from Jobs & Pipelines → snowflake_ingestion_job → Job Details → Job ID. Then add this service principal to the developers permission group specified by the `DEVS_GROUP` GitHub Secret.
The results are stored in `/Volumes/<databricks_host>/migration_accelerator/outputs/`; these are the SQL files that will create the Databricks artifacts once run.
The recommended order of SQL files to run is: Roles → Stages → Tables → Streams → Pipes → Views → UDFs → Procedures → Grants
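That ordering can also be applied programmatically when scripting the run. A sketch that assumes each generated filename contains its object type (e.g. `tables.sql`) — a naming assumption of ours, not a guarantee of the tool:

```python
# Recommended execution order from the documentation above.
ORDER = ["roles", "stages", "tables", "streams", "pipes",
         "views", "udfs", "procedures", "grants"]

def execution_rank(filename):
    """Rank a SQL file by the first object type its name mentions;
    unknown types sort last."""
    name = filename.lower()
    for rank, kind in enumerate(ORDER):
        if kind in name:
            return rank
    return len(ORDER)

def sort_sql_files(filenames):
    """Return the files in the recommended execution order."""
    return sorted(filenames, key=execution_rank)
```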
For a streamlined, repeatable local run, use the included helper script or Makefile target.
Quick (recommended):
# run via make (creates timestamped output dir)
make translate

Direct script (more control):
# run all example input files, batch size 2, produce SQL files
./scripts/run_translation.sh
# pass a custom glob, batch size, or output format (json/sql)
./scripts/run_translation.sh "src/artifact_translation_package/examples/*.json" 4 json

Notes:
- `make translate` invokes the script at `scripts/run_translation.sh` and is the cleanest option.
- The script exports `PYTHONPATH=src` so you can run it from the repository root.
- Output is written to `src/artifact_translation_package/out_sql_examples_<timestamp>`.
This project uses Databricks-style paths (for example dbfs:/...) as the canonical configuration. To make the same code run locally without changing the canonical config, the runner maps Databricks paths to a local directory when it detects a non-Databricks runtime.
- Local mapping environment variable: set `LOCAL_DBFS_MOUNT` to the local directory that should act as the root for `dbfs:/` paths. Default: `./ddl_output`.
- Per-run `results_dir`: when running the translation job locally (for example via `make translate`), the job pre-creates a timestamped `results_dir` and propagates it to the translation context. Evaluation results are written under `<results_dir>/evaluation_results/`. Translation outputs such as `translation_results.json` and `results_summary.json` are written into the same `results_dir`.
- Fallback behavior: the runner also exports `DDL_OUTPUT_DIR=<results_dir>` so code paths that read the `DDL_OUTPUT_DIR` environment variable also target the per-run folder.
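The mapping described above can be sketched as follows. The `resolve_path` helper and the `DATABRICKS_RUNTIME_VERSION` detection heuristic are our simplifications of the runner's behavior, not its actual implementation:

```python
import os

def resolve_path(path):
    """Map a canonical dbfs:/ path to a local directory when not running
    on Databricks; on a cluster (or for non-dbfs paths) return it as-is."""
    on_databricks = "DATABRICKS_RUNTIME_VERSION" in os.environ
    if on_databricks or not path.startswith("dbfs:/"):
        return path
    mount = os.environ.get("LOCAL_DBFS_MOUNT", "./ddl_output")
    return os.path.join(mount, path[len("dbfs:/"):])
```

This keeps `dbfs:/...` as the single canonical form in configuration while local runs transparently write under the mount directory.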
Examples
# run with default local mapping (output under ./ddl_output by default)
make translate
# override where dbfs:/ maps to locally
export LOCAL_DBFS_MOUNT=/tmp/my_local_dbfs
make translate
# inspect latest run outputs
ls -la src/artifact_translation_package/out_sql_examples_*/
ls -la <that-run-folder>/evaluation_results/