Add OSI ↔ Databricks Unity Catalog Metric View converter by STHITAPRAJNAS · Pull Request #120 · open-semantic-interchange/OSI

STHITAPRAJNAS · 2026-05-08T22:41:23Z

Summary

Adds a bidirectional converter between OSI YAML semantic models and
Databricks Unity Catalog Metric View
YAML, under converters/databricks/. Pure offline conversion, Python +
PyYAML, structured like the existing Snowflake converter but with both
directions and round-trip coverage.

DATABRICKS is already in the vendors enum and the dialect enum, so
this PR adds converter code only; no spec change.

Mapping

| OSI | Databricks UC Metric View (`version: 1.1`) |
| --- | --- |
| primary fact dataset | `source` |
| other reachable datasets | `joins[]` (`source` + `sql_on`) |
| `dataset.fields[]` | `dimensions[]` |
| `metrics[]` | `measures[]` |
| `expression.dialects[DATABRICKS]` (else `ANSI_SQL`) | `expr` |
| `relationships[].from_columns` / `to_columns` | `sql_on` boolean |
| `custom_extensions[DATABRICKS]` (`filter`, `primary_dataset`, `raw_joins`) | top-level `filter`, source selection, restored joins |
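
To illustrate the `from_columns` / `to_columns` row, here is a minimal sketch of how a relationship could be turned into a `sql_on` boolean, including composite keys. The function name `build_sql_on` and the choice to join column pairs with `AND` are assumptions for illustration, not the converter's actual code.

```python
def build_sql_on(from_dataset, from_columns, to_dataset, to_columns):
    """Build an equi-join condition from paired column lists (hypothetical helper)."""
    if not from_columns or len(from_columns) != len(to_columns):
        raise ValueError("from_columns and to_columns must be non-empty and the same length")
    # One equality per column pair; composite keys are combined with AND (assumption).
    pairs = zip(from_columns, to_columns)
    return " AND ".join(
        f"{from_dataset}.{f} = {to_dataset}.{t}" for f, t in pairs
    )


print(build_sql_on("store_sales", ["ss_customer_sk"], "customer", ["c_customer_sk"]))
```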

Primary fact dataset is selected in priority order:

1. `custom_extensions[vendor_name=DATABRICKS]` with `{"primary_dataset": "..."}`
2. The dataset most often on the from side of relationships
3. The first dataset declared
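
The priority order above can be sketched roughly as follows. This is an illustration only: `select_primary_dataset` is a hypothetical name, the hint is passed in directly rather than read from `custom_extensions`, and the relationship key `from` is an assumption about the model shape.

```python
from collections import Counter


def select_primary_dataset(dataset_names, relationships, hint=None):
    """Pick the primary fact dataset: hint, then highest from-count, then first declared."""
    if not dataset_names:
        raise ValueError("model declares no datasets")
    # 1. Explicit hint (would come from custom_extensions[DATABRICKS]).
    if hint in dataset_names:
        return hint
    # 2. Dataset that appears most often on the "from" side of relationships.
    counts = Counter(
        rel["from"] for rel in relationships if rel.get("from") in dataset_names
    )
    if counts:
        return counts.most_common(1)[0][0]
    # 3. Fall back to declaration order.
    return dataset_names[0]
```

With datasets `["customer", "store_sales", "item"]` and two relationships whose `from` is `store_sales`, this sketch returns `store_sales`; with no relationships it returns `customer`.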

Datasets unreachable from the primary via the relationship graph emit a
warning and are excluded.
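
A minimal sketch of the reachability check, assuming relationships are treated as undirected edges between datasets (the PR does not state the direction rule, so that is an assumption, as are the function name and the `from` / `to` keys):

```python
from collections import deque


def reachable_datasets(primary, relationships):
    """Return the set of dataset names reachable from `primary` (breadth-first walk)."""
    adjacency = {}
    for rel in relationships:
        # Assumption: either side of a relationship can be joined from the other.
        adjacency.setdefault(rel["from"], set()).add(rel["to"])
        adjacency.setdefault(rel["to"], set()).add(rel["from"])
    seen = {primary}
    queue = deque([primary])
    while queue:
        current = queue.popleft()
        for neighbour in sorted(adjacency.get(current, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


rels = [{"from": "store_sales", "to": "customer"}, {"from": "web_returns", "to": "reason"}]
all_names = ["store_sales", "customer", "web_returns", "reason"]
reachable = reachable_datasets("store_sales", rels)
unreachable = [name for name in all_names if name not in reachable]
print(unreachable)  # datasets that would be warned about and excluded
```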

Dimension expression qualification. Bare single-identifier
expressions (e.g. expression: customer_id on customer) are
auto-qualified as customer.customer_id so they resolve unambiguously
after joins. Multi-token expressions (operators, function calls,
multi-column references) are emitted verbatim — string-prepending only
qualifies the first identifier and would silently break the rest. For
computed fields on a non-primary dataset, supply a DATABRICKS dialect
entry that's already table-qualified; the converter prefers it and
silences the warning.
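
The qualification rule can be sketched as below. The regular expression and the name `qualify_expression` are illustrative assumptions; the real converter may classify expressions differently.

```python
import re

# A "bare" expression is a single unquoted identifier: no dots, operators, or calls.
_BARE_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualify_expression(expression, dataset_name):
    """Prefix bare single-identifier expressions with the dataset name; leave the rest verbatim."""
    stripped = expression.strip()
    if _BARE_IDENTIFIER.fullmatch(stripped):
        return f"{dataset_name}.{stripped}"
    # Multi-token expressions are not rewritten: prepending would only
    # qualify the first identifier and leave the others ambiguous.
    return expression


print(qualify_expression("customer_id", "customer"))
print(qualify_expression("c_first_name || ' ' || c_last_name", "customer"))
```

The first call yields `customer.customer_id`; the second returns the concatenation unchanged, which is the case where a table-qualified DATABRICKS dialect entry is expected from the author.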

Round-trip behaviour

Round-trip OSI → UC → OSI keeps datasets, relationships, metrics, and
expressions. UC-only fields (filter, the metric view version,
unparseable sql_on clauses) are recorded in
custom_extensions[DATABRICKS] on import and re-emitted on export, so
UC → OSI → UC also survives.

Documented lossy paths:

- OSI `dimension.is_time` has no dedicated UC counterpart (see #113, "Add datatype field to Field and Metric; reframe is_time as role marker"; the field classifier is isolated, so the migration to `datatype` is a one-function change).
- `ai_context` on relationships is dropped on export with a warning.
- Non-DATABRICKS / non-ANSI_SQL dialect entries are skipped with a warning.

Coordination with other open PRs

- #93 (Add Apache Spark converter module): different scope. That PR generates generic PySpark code with a DATABRICKS dialect flag, not a Unity Catalog metric view. No code overlap; only the DATABRICKS dialect string is shared.
- #113 (Add datatype field to Field and Metric; reframe is_time as role marker): when it lands, the field classifier becomes a one-function update.

Test plan

- `python3 -m pytest converters/databricks/tests/ -q` → 47 passed
- `python3 -m pyflakes converters/databricks/src/ converters/databricks/tests/` → clean
- `python3 -m ruff check converters/databricks/src/ converters/databricks/tests/` → clean
- `python3 validation/validate.py examples/tpcds_semantic_model.yaml` → passes (TPC-DS is the round-trip fixture)
- TPC-DS converts end-to-end; round-trip preserves dataset names, relationship `from_columns`/`to_columns`, and metric expressions

Bidirectional converter between OSI YAML semantic models and Databricks Unity Catalog Metric View YAML, structured like the existing Snowflake converter (Python, PyYAML-only) but with both directions and round-trip tests from day one.

OSI -> Databricks:
- Selects a primary fact dataset (custom_extensions[DATABRICKS] hint, then highest from-count, then first declared) for the metric view source.
- Walks the relationship graph from the primary; emits each remaining reachable dataset as a UC join with sql_on built from from_columns and to_columns, supporting composite keys.
- Maps OSI fields to UC dimensions (qualifying bare column expressions with the dataset name; disambiguating duplicate names) and OSI metrics to UC measures.
- Prefers the DATABRICKS dialect with ANSI_SQL fallback; warns and skips fields with neither.
- Warns about unreachable datasets and dropped fields; preserves the filter clause from a DATABRICKS custom_extension.

Databricks -> OSI:
- Splits the UC source and joins back into separate OSI datasets, attributing dimensions by parsing table.column expressions.
- Rebuilds OSI relationships from sql_on / using clauses; preserves unparseable joins and filter in custom_extensions[DATABRICKS] for round-trip fidelity.

Tests cover OSI -> UC, UC -> OSI, round-trip, and the TPC-DS example fixture (43 tests, pyflakes + ruff clean). ROADMAP and .gitignore updated.

Two correctness issues found while reviewing the converter against the real Databricks Unity Catalog metric view spec and the TPC-DS example:

1. Multi-token expressions on non-primary datasets were being mangled. The previous logic prepended `<dataset>.` to the whole expression string, which only qualified the first identifier. A computed field like `c_first_name || ' ' || c_last_name` on `customer` became `customer.c_first_name || ' ' || c_last_name`: the second column reference was unqualified and ambiguous after joins, which UC would reject at query time. Now only bare single-identifier expressions are auto-qualified; multi-token expressions are emitted verbatim and the user is warned to supply a DATABRICKS dialect with explicit table qualifiers when a computed field lives on a join.
2. The metric view version was hardcoded to `0.1`, but Microsoft Learn and current Databricks docs use `1.1` for new deployments. Bumped to `1.1`. Both versions are still accepted by Databricks, but emitting the current one matches what UC produces today.

Also restore raw_joins from custom_extensions[DATABRICKS] on export so unparseable join clauses recorded by the inverse converter survive a UC -> OSI -> UC round trip.

README updated; tests cover both fixes plus the warning-suppression path when a DATABRICKS dialect is supplied. 47 passed (was 43), pyflakes + ruff clean, validator still passes on the TPC-DS example.

Add Ontology & Semantic Interoperability as a top-level current effort with links to discussions open-semantic-interchange#22, open-semantic-interchange#101, open-semantic-interchange#108, open-semantic-interchange#68 and PRs open-semantic-interchange#124, open-semantic-interchange#125. Update existing sections with recently opened discussions, PRs, and converters: metric trees (open-semantic-interchange#40), primary key semantics (open-semantic-interchange#15, open-semantic-interchange#119), reusable datasets (open-semantic-interchange#103, open-semantic-interchange#109), datatype/is_time reframe (PR open-semantic-interchange#113), spatial dimension types (open-semantic-interchange#114), default_aggregation (open-semantic-interchange#115), positive direction (open-semantic-interchange#41), physical metadata (open-semantic-interchange#110), and new converter PRs for Salesforce (open-semantic-interchange#118), dbt (open-semantic-interchange#116), and Databricks (open-semantic-interchange#120). Also adds CONTRIBUTING.md (open-semantic-interchange#122) and working groups page (open-semantic-interchange#123) to Developer Experience, and lists merged converters as existing artifacts. Made-with: Cursor

khush-bhatia · 2026-05-22T15:51:10Z

Hey @STHITAPRAJNAS, thanks for the contribution. We are in the process of setting up the CLA hooks to prepare for donation. If you would like to merge this PR before the hooks are set up, could you please join this Slack workspace to discuss the steps for the CLA? This is standard process for any ASF project.

jbonofre

The PR is solid. I left a couple of comments worth clarifying.

Thanks!

jbonofre · 2026-06-02T15:02:53Z

+    return metric
+
+
+def _parse_join_clause(join, primary_name, joined_name):


I think this conflates the join alias with the dataset name.

joined_name is derived via _name_from_source(j_source), so it is the last segment of the table path.

But sql_on clauses can reference the join's name (alias) rather than the source table.
Today the parser requires l_table == primary_name and r_table == joined_name. If an author writes joins: [{name: c, source: main.sales.customers, sql_on: "orders.customer_id = c.id"}], the parser returns None and the relationship is lost (raw clause is preserved, but the OSI graph is incomplete).

You should consider matching against both joined_name and the join's name field.
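
A rough sketch of what that matching could look like, assuming a single `table.column = table.column` equality. The function `parse_simple_sql_on`, its `join_alias` parameter, and the returned dict shape are hypothetical and are not the PR's `_parse_join_clause`.

```python
import re

# Matches exactly one equality between two table-qualified columns.
_EQUALITY = re.compile(r"^\s*(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)\s*$")


def parse_simple_sql_on(sql_on, primary_name, joined_name, join_alias=None):
    """Parse `primary.col = joined.col`, accepting the join's alias as the right-hand table."""
    match = _EQUALITY.match(sql_on)
    if match is None:
        return None
    l_table, l_col, r_table, r_col = match.groups()
    # Accept either the name derived from the source path or the join's `name` field.
    accepted = {joined_name}
    if join_alias:
        accepted.add(join_alias)
    if l_table == primary_name and r_table in accepted:
        return {
            "from": primary_name,
            "from_columns": [l_col],
            "to": joined_name,
            "to_columns": [r_col],
        }
    return None


print(parse_simple_sql_on("orders.customer_id = c.id", "orders", "customers", join_alias="c"))
```

In this sketch the alias `c` resolves to the dataset `customers`, so the relationship is kept instead of falling back to the raw clause.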

jbonofre · 2026-06-02T15:18:01Z

+    return result
+
+
+def _combined_description(node):


_combined_description only honours string ai_context. The schema's oneOf also allows an object with instructions and synonyms. This means a model carrying ai_context: {instructions: "..."} will silently lose that text, with no warning.

I believe we should either flatten instructions into comment, or warn like the relationship ai_context path does.
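
For the flattening option, a minimal sketch could look like this. `ai_context_text` is a hypothetical helper, and the "Synonyms: ..." formatting is an assumption rather than anything the schema or the PR prescribes.

```python
def ai_context_text(ai_context):
    """Flatten a string or object-form ai_context into one comment string, or None."""
    if isinstance(ai_context, str):
        return ai_context.strip() or None
    if isinstance(ai_context, dict):
        parts = []
        instructions = ai_context.get("instructions")
        if isinstance(instructions, str) and instructions.strip():
            parts.append(instructions.strip())
        synonyms = ai_context.get("synonyms")
        if isinstance(synonyms, list) and synonyms:
            # Assumed formatting for synonyms inside a free-text comment.
            parts.append("Synonyms: " + ", ".join(str(s) for s in synonyms))
        return " ".join(parts) or None
    # None or an unexpected type: nothing to carry over.
    return None


print(ai_context_text({"instructions": "Use for revenue questions.", "synonyms": ["sales", "turnover"]}))
```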

jbonofre · 2026-06-02T15:21:37Z

+    if raw_joins:
+        preserved["raw_joins"] = raw_joins
+
+    if len(preserved) <= 1:  # Only primary_dataset means nothing extra to keep


I'm confused here:

preserved is initialized with primary_dataset.

So len(preserved) <= 1 always means "exactly 1", and preserved if preserved else None always returns preserved. So the if is a no-op. Should we remove it?

STHITAPRAJNAS added 2 commits May 8, 2026 22:14

jklahr mentioned this pull request May 18, 2026

Add ontology section and update roadmap with latest discussions #126

Merged

jbonofre self-requested a review June 2, 2026 14:55

jbonofre reviewed Jun 2, 2026

STHITAPRAJNAS wants to merge 2 commits into open-semantic-interchange:main from STHITAPRAJNAS:databricks-converter
