Add OSI ↔ Databricks Unity Catalog Metric View converter#120

Open
STHITAPRAJNAS wants to merge 2 commits into
open-semantic-interchange:mainfrom
STHITAPRAJNAS:databricks-converter

@STHITAPRAJNAS

Summary

Adds a bidirectional converter between OSI YAML semantic models and
Databricks Unity Catalog Metric View
YAML, under converters/databricks/. Pure offline conversion, Python +
PyYAML, structured like the existing Snowflake converter but with both
directions and round-trip coverage.

DATABRICKS is already in the vendors enum and the dialect enum, so
this PR adds converter code only; no spec change.

Mapping

| OSI | Databricks UC Metric View (version: 1.1) |
| --- | --- |
| primary fact dataset | source |
| other reachable datasets | joins[] (source + sql_on) |
| dataset.fields[] | dimensions[] |
| metrics[] | measures[] |
| expression.dialects[DATABRICKS] (else ANSI_SQL) | expr |
| relationships[].from_columns / to_columns | sql_on boolean |
| custom_extensions[DATABRICKS] (filter, primary_dataset, raw_joins) | top-level filter, source selection, restored joins |

Primary fact dataset is selected in priority order:

  1. custom_extensions[vendor_name=DATABRICKS] with {"primary_dataset": "..."}
  2. The dataset most often on the from side of relationships
  3. The first dataset declared

Datasets unreachable from the primary via the relationship graph emit a
warning and are excluded.
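
The selection order and the reachability check above could be sketched roughly as follows. This is not the converter's actual code: the function names, the `from`/`to` keys on each relationship, the tie-break by declaration order, and the choice to treat relationships as undirected edges are all assumptions made for illustration. The hint is passed in directly rather than read from custom_extensions.

```python
from collections import Counter, deque


def select_primary(datasets, relationships, hint=None):
    """Pick a primary dataset: explicit hint, then highest from-count, then first declared."""
    names = [d["name"] for d in datasets]
    if hint in names:
        return hint
    counts = Counter(r["from"] for r in relationships if r["from"] in names)
    if counts:
        best = max(counts.values())
        # Assumed tie-break: the earliest declared dataset wins.
        return next(n for n in names if counts.get(n) == best)
    return names[0]


def reachable_from(primary, relationships):
    """Breadth-first walk of the relationship graph (edges assumed undirected)."""
    neighbours = {}
    for r in relationships:
        neighbours.setdefault(r["from"], set()).add(r["to"])
        neighbours.setdefault(r["to"], set()).add(r["from"])
    seen, queue = {primary}, deque([primary])
    while queue:
        for nxt in neighbours.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Any dataset whose name is missing from the set returned by `reachable_from` would be the one that gets a warning and is left out of the metric view.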

Dimension expression qualification. Bare single-identifier
expressions (e.g. expression: customer_id on customer) are
auto-qualified as customer.customer_id so they resolve unambiguously
after joins. Multi-token expressions (operators, function calls,
multi-column references) are emitted verbatim — string-prepending only
qualifies the first identifier and would silently break the rest. For
computed fields on a non-primary dataset, supply a DATABRICKS dialect
entry that's already table-qualified; the converter prefers it and
silences the warning.
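
A minimal sketch of the qualification rule described above, assuming a bare identifier means letters, digits, and underscores only. The function name and the returned flag are hypothetical, not the converter's API.

```python
import re

# Assumption: a "bare single identifier" is one unquoted name with no dots,
# operators, or whitespace inside it.
_BARE_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualify_expression(expr, dataset_name):
    """Return (expression, was_qualified).

    Only a bare identifier is prefixed with the dataset name. Anything else
    is returned unchanged, because prefixing the string would qualify only
    its first identifier.
    """
    stripped = expr.strip()
    if _BARE_IDENTIFIER.fullmatch(stripped):
        return f"{dataset_name}.{stripped}", True
    return expr, False
```

A caller could use the flag to decide whether to warn about a computed field on a joined dataset. One limitation of a purely lexical check like this: a bare keyword such as `CURRENT_DATE` would also be qualified, so a real implementation may need an exclusion list.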

Round-trip behaviour

Round-trip OSI → UC → OSI keeps datasets, relationships, metrics, and
expressions. UC-only fields (filter, the metric view version,
unparseable sql_on clauses) are recorded in
custom_extensions[DATABRICKS] on import and re-emitted on export, so
UC → OSI → UC also survives.

Documented lossy paths:

Coordination with other open PRs

Test plan

  • python3 -m pytest converters/databricks/tests/ -q → 47 passed
  • python3 -m pyflakes converters/databricks/src/ converters/databricks/tests/ → clean
  • python3 -m ruff check converters/databricks/src/ converters/databricks/tests/ → clean
  • python3 validation/validate.py examples/tpcds_semantic_model.yaml → passes (TPC-DS is the round-trip fixture)
  • TPC-DS converts end-to-end; round-trip preserves dataset names, relationship from_columns/to_columns, and metric expressions

Bidirectional converter between OSI YAML semantic models and Databricks
Unity Catalog Metric View YAML, structured like the existing Snowflake
converter (Python, PyYAML-only) but with both directions and round-trip
tests from day one.

OSI -> Databricks:
- Selects a primary fact dataset (custom_extensions[DATABRICKS] hint, then
  highest from-count, then first declared) for the metric view source.
- Walks the relationship graph from the primary; emits each remaining
  reachable dataset as a UC join with sql_on built from from_columns and
  to_columns, supporting composite keys.
- Maps OSI fields to UC dimensions (qualifying bare column expressions
  with the dataset name; disambiguating duplicate names) and OSI metrics
  to UC measures.
- Prefers the DATABRICKS dialect with ANSI_SQL fallback; warns and skips
  fields with neither.
- Warns about unreachable datasets and dropped fields; preserves the
  filter clause from a DATABRICKS custom_extension.
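
For illustration, building sql_on from from_columns and to_columns with composite-key support might look like the sketch below. The function name and the choice of dataset names as qualifiers are assumptions, not taken from the converter.

```python
def build_sql_on(from_dataset, from_columns, to_dataset, to_columns):
    """Join column pairs into one boolean clause, one equality per key column."""
    if not from_columns or len(from_columns) != len(to_columns):
        raise ValueError("from_columns and to_columns must be non-empty and equal in length")
    return " AND ".join(
        f"{from_dataset}.{f} = {to_dataset}.{t}"
        for f, t in zip(from_columns, to_columns)
    )
```

For example, `build_sql_on("store_sales", ["ss_item_sk", "ss_ticket_number"], "store_returns", ["sr_item_sk", "sr_ticket_number"])` yields two equalities joined by `AND`.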

Databricks -> OSI:
- Splits the UC source and joins back into separate OSI datasets,
  attributing dimensions by parsing table.column expressions.
- Rebuilds OSI relationships from sql_on / using clauses; preserves
  unparseable joins and filter in custom_extensions[DATABRICKS] for
  round-trip fidelity.

Tests cover OSI -> UC, UC -> OSI, round-trip, and the TPC-DS example
fixture (43 tests, pyflakes + ruff clean). ROADMAP and .gitignore
updated.
Two correctness issues found while reviewing the converter against the
real Databricks Unity Catalog metric view spec and the TPC-DS example:

1. Multi-token expressions on non-primary datasets were being mangled.
   The previous logic prepended `<dataset>.` to the whole expression
   string, which only qualified the first identifier. A computed field
   like `c_first_name || ' ' || c_last_name` on `customer` became
   `customer.c_first_name || ' ' || c_last_name` — the second column
   reference was unqualified and ambiguous after joins, which UC would
   reject at query time. Now only bare single-identifier expressions
   are auto-qualified; multi-token expressions are emitted verbatim and
   the user is warned to supply a DATABRICKS dialect with explicit
   table qualifiers when a computed field lives on a join.

2. The metric view version was hardcoded to `0.1`, but Microsoft Learn
   and current Databricks docs use `1.1` for new deployments. Bumped to
   `1.1`. Both versions are still accepted by Databricks but emitting
   the current one matches what UC produces today.

Also restore raw_joins from custom_extensions[DATABRICKS] on export so
unparseable join clauses recorded by the inverse converter survive a
UC -> OSI -> UC round trip. README updated; tests cover both fixes
plus the warning-suppression path when a DATABRICKS dialect is supplied.

47 passed (was 43), pyflakes + ruff clean, validator still passes on
the TPC-DS example.
jklahr pushed a commit to jklahr/jklahr-osi that referenced this pull request May 18, 2026
Add Ontology & Semantic Interoperability as a top-level current effort
with links to discussions open-semantic-interchange#22, open-semantic-interchange#101, open-semantic-interchange#108, open-semantic-interchange#68 and PRs open-semantic-interchange#124, open-semantic-interchange#125.

Update existing sections with recently opened discussions, PRs, and
converters: metric trees (open-semantic-interchange#40), primary key semantics (open-semantic-interchange#15, open-semantic-interchange#119),
reusable datasets (open-semantic-interchange#103, open-semantic-interchange#109), datatype/is_time reframe (PR open-semantic-interchange#113),
spatial dimension types (open-semantic-interchange#114), default_aggregation (open-semantic-interchange#115), positive
direction (open-semantic-interchange#41), physical metadata (open-semantic-interchange#110), and new converter PRs for
Salesforce (open-semantic-interchange#118), dbt (open-semantic-interchange#116), and Databricks (open-semantic-interchange#120). Also adds
CONTRIBUTING.md (open-semantic-interchange#122) and working groups page (open-semantic-interchange#123) to Developer
Experience, and lists merged converters as existing artifacts.

Made-with: Cursor
jklahr pushed a commit that referenced this pull request May 19, 2026
@khush-bhatia
Member

Hey @STHITAPRAJNAS, thanks for the contribution. We are in the process of setting up the CLA hooks to prepare for the donation. If you would like to merge this PR before the hooks are set up, could you please join this Slack workspace to discuss the steps for the CLA? This is standard process for any ASF project.

@jbonofre jbonofre self-requested a review June 2, 2026 14:55
Collaborator

@jbonofre jbonofre left a comment

The PR is solid. I left a couple of comments that are worth clarifying.

Thanks!

return metric


def _parse_join_clause(join, primary_name, joined_name):

I think this conflates the join alias with the dataset name.

joined_name is derived via _name_from_source(j_source), so it is the last segment of the table path.

But sql_on clauses can reference the join's name (its alias) rather than the source table.
Today the parser requires l_table == primary_name and r_table == joined_name. If an author writes joins: [{name: c, source: main.sales.customers, sql_on: "orders.customer_id = c.id"}], the parser returns None and the relationship is lost (the raw clause is preserved, but the OSI graph is incomplete).

Consider matching against both joined_name and the join's name field.
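
A rough sketch of what that matching could look like, under stated assumptions: `parse_sql_on` is a hypothetical stand-in rather than the PR's `_parse_join_clause`, it takes the alias as an explicit argument, and it only handles plain `table.column = table.column` equalities joined by AND with the primary side written first.

```python
import re

_EQUALITY = re.compile(r"^\s*(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)\s*$")


def parse_sql_on(sql_on, primary_name, joined_name, alias=None):
    """Return (from_columns, to_columns), or None if the clause cannot be parsed.

    The right-hand table may be referenced either by the name derived from
    the source path or by the join's own name (alias).
    """
    right_names = {joined_name}
    if alias:
        right_names.add(alias)
    from_columns, to_columns = [], []
    for part in re.split(r"\s+AND\s+", sql_on.strip(), flags=re.IGNORECASE):
        match = _EQUALITY.match(part)
        if match is None:
            return None
        l_table, l_col, r_table, r_col = match.groups()
        if l_table != primary_name or r_table not in right_names:
            return None
        from_columns.append(l_col)
        to_columns.append(r_col)
    return from_columns, to_columns
```

With the example above, `parse_sql_on("orders.customer_id = c.id", "orders", "customers", alias="c")` returns the column pair instead of None, while the same call without `alias` still returns None.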

return result


def _combined_description(node):

_combined_description only honours a string ai_context, but the schema's oneOf also allows an object with instructions and synonyms. A model carrying ai_context: {instructions: "..."} will therefore silently lose that text, with no warning.

I believe we should either flatten instructions into comment, or warn as the relationship ai_context path does.
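
One possible shape for this, combining both options, is sketched below. It is not the PR's `_combined_description`: the blank-line separator, the warning text, and the decision to warn about synonyms instead of flattening them are assumptions.

```python
import warnings


def combined_description(node):
    """Merge description and ai_context into a single comment string.

    Accepts ai_context as a string or as an object with `instructions`
    and `synonyms`; warns when synonyms have to be dropped.
    """
    parts = []
    if node.get("description"):
        parts.append(node["description"])
    ai_context = node.get("ai_context")
    if isinstance(ai_context, str):
        if ai_context:
            parts.append(ai_context)
    elif isinstance(ai_context, dict):
        if ai_context.get("instructions"):
            parts.append(ai_context["instructions"])
        if ai_context.get("synonyms"):
            # Hypothetical warning text; the comment field has no slot for synonyms.
            warnings.warn("ai_context.synonyms dropped: no target in the metric view comment")
    return "\n\n".join(parts) or None
```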

    if raw_joins:
        preserved["raw_joins"] = raw_joins

    if len(preserved) <= 1:  # Only primary_dataset means nothing extra to keep

I'm confused here:

preserved is initialized with primary_dataset.

So len(preserved) <= 1 always means "exactly 1", and preserved if preserved else None always returns preserved. The if is therefore a no-op. Should we remove it?
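
If the branch is removed, the helper could reduce to something like the sketch below. The function name and parameters are hypothetical, since the surrounding code is not shown in this thread. It assumes the intent is to always return the dict.

```python
def build_preserved(primary_dataset, filter_clause=None, raw_joins=None):
    """Collect the UC-only values to stash in custom_extensions[DATABRICKS]."""
    preserved = {"primary_dataset": primary_dataset}
    if filter_clause:
        preserved["filter"] = filter_clause
    if raw_joins:
        preserved["raw_joins"] = raw_joins
    # preserved always holds primary_dataset, so it is never empty and
    # needs no length check before being returned.
    return preserved
```

If the intent was instead to skip the extension when only primary_dataset is present, the check would need to return None in that case, which is a behaviour change rather than a cleanup.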


3 participants