Add OSI ↔ Databricks Unity Catalog Metric View converter #120
STHITAPRAJNAS wants to merge 2 commits
Bidirectional converter between OSI YAML semantic models and Databricks Unity Catalog Metric View YAML, structured like the existing Snowflake converter (Python, PyYAML-only) but with both directions and round-trip tests from day one.

OSI -> Databricks:
- Selects a primary fact dataset (custom_extensions[DATABRICKS] hint, then highest from-count, then first declared) for the metric view source.
- Walks the relationship graph from the primary; emits each remaining reachable dataset as a UC join with sql_on built from from_columns and to_columns, supporting composite keys.
- Maps OSI fields to UC dimensions (qualifying bare column expressions with the dataset name; disambiguating duplicate names) and OSI metrics to UC measures.
- Prefers the DATABRICKS dialect with ANSI_SQL fallback; warns and skips fields with neither.
- Warns about unreachable datasets and dropped fields; preserves the filter clause from a DATABRICKS custom_extension.

Databricks -> OSI:
- Splits the UC source and joins back into separate OSI datasets, attributing dimensions by parsing table.column expressions.
- Rebuilds OSI relationships from sql_on / using clauses; preserves unparseable joins and filter in custom_extensions[DATABRICKS] for round-trip fidelity.

Tests cover OSI -> UC, UC -> OSI, round-trip, and the TPC-DS example fixture (43 tests, pyflakes + ruff clean). ROADMAP and .gitignore updated.
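As an illustration of the `sql_on` construction described in this commit message, here is a minimal sketch of pairing `from_columns` with `to_columns`, composite keys included. The function name `build_sql_on` and its signature are hypothetical and are not taken from the converter's source.

```python
def build_sql_on(from_dataset, to_dataset, from_columns, to_columns):
    """Join paired key columns into one boolean ON expression.

    Hypothetical helper: the real converter may name and shape this differently.
    """
    if not from_columns or len(from_columns) != len(to_columns):
        raise ValueError(
            "from_columns and to_columns must be non-empty and the same length"
        )
    # One equality per column pair; composite keys are AND-ed together.
    pairs = [
        f"{from_dataset}.{f} = {to_dataset}.{t}"
        for f, t in zip(from_columns, to_columns)
    ]
    return " AND ".join(pairs)


single = build_sql_on("orders", "customer", ["customer_id"], ["id"])
composite = build_sql_on(
    "store_sales",
    "store_returns",
    ["ss_item_sk", "ss_ticket_number"],
    ["sr_item_sk", "sr_ticket_number"],
)
```

Rejecting mismatched column lists up front is a design choice of this sketch; a converter could instead warn and skip the relationship.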
Two correctness issues found while reviewing the converter against the real Databricks Unity Catalog metric view spec and the TPC-DS example:

1. Multi-token expressions on non-primary datasets were being mangled. The previous logic prepended `<dataset>.` to the whole expression string, which only qualified the first identifier. A computed field like `c_first_name || ' ' || c_last_name` on `customer` became `customer.c_first_name || ' ' || c_last_name`: the second column reference was unqualified and ambiguous after joins, which UC would reject at query time. Now only bare single-identifier expressions are auto-qualified; multi-token expressions are emitted verbatim and the user is warned to supply a DATABRICKS dialect with explicit table qualifiers when a computed field lives on a join.
2. The metric view version was hardcoded to `0.1`, but Microsoft Learn and current Databricks docs use `1.1` for new deployments. Bumped to `1.1`. Both versions are still accepted by Databricks but emitting the current one matches what UC produces today.

Also restore raw_joins from custom_extensions[DATABRICKS] on export so unparseable join clauses recorded by the inverse converter survive a UC -> OSI -> UC round trip.

README updated; tests cover both fixes plus the warning-suppression path when a DATABRICKS dialect is supplied. 47 passed (was 43), pyflakes + ruff clean, validator still passes on the TPC-DS example.
Add Ontology & Semantic Interoperability as a top-level current effort with links to discussions open-semantic-interchange#22, open-semantic-interchange#101, open-semantic-interchange#108, open-semantic-interchange#68 and PRs open-semantic-interchange#124, open-semantic-interchange#125. Update existing sections with recently opened discussions, PRs, and converters: metric trees (open-semantic-interchange#40), primary key semantics (open-semantic-interchange#15, open-semantic-interchange#119), reusable datasets (open-semantic-interchange#103, open-semantic-interchange#109), datatype/is_time reframe (PR open-semantic-interchange#113), spatial dimension types (open-semantic-interchange#114), default_aggregation (open-semantic-interchange#115), positive direction (open-semantic-interchange#41), physical metadata (open-semantic-interchange#110), and new converter PRs for Salesforce (open-semantic-interchange#118), dbt (open-semantic-interchange#116), and Databricks (open-semantic-interchange#120). Also adds CONTRIBUTING.md (open-semantic-interchange#122) and working groups page (open-semantic-interchange#123) to Developer Experience, and lists merged converters as existing artifacts. Made-with: Cursor
Hey @STHITAPRAJNAS, thanks for the contribution. We are in the process of setting up the CLA hooks to prepare for donation. If you would like to merge this PR before the hooks are set up, could you please join this Slack workspace to discuss the steps for the CLA? This is standard process for any ASF project.
jbonofre left a comment
The PR is solid. I left a couple of comments worth clarifying.
Thanks!
    return metric


def _parse_join_clause(join, primary_name, joined_name):
I think this conflates the join alias with the dataset name.
joined_name is derived via _name_from_source(j_source), so it is the last segment of the table path.
But sql_on clauses can reference the join's name (alias) rather than the source table.
Today the parser requires l_table == primary_name and r_table == joined_name. If an author writes joins: [{name: c, source: main.sales.customers, sql_on: "orders.customer_id = c.id"}], the parser returns None and the relationship is lost (raw clause is preserved, but the OSI graph is incomplete).
You should consider matching against both joined_name and the join's name field.
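One way the suggestion could look, as a rough sketch only: the parser accepts a set of names for the joined side (the source-derived name plus the join's `name` alias) instead of a single `joined_name`. The function `parse_join_clause` below is a hypothetical stand-in, not the PR's `_parse_join_clause`, and it only handles simple `table.column = table.column` equalities combined with `AND`.

```python
import re

# Matches exactly one "table.column = table.column" equality.
_EQ = re.compile(r"^\s*(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)\s*$")


def parse_join_clause(sql_on, primary_name, joined_names):
    """Return (from_columns, to_columns), or None if the clause is not parseable.

    joined_names is a set holding both the source-derived dataset name and
    the join alias, so either spelling in sql_on is accepted.
    """
    from_cols, to_cols = [], []
    for part in re.split(r"\s+AND\s+", sql_on, flags=re.IGNORECASE):
        match = _EQ.match(part)
        if not match:
            return None
        l_table, l_col, r_table, r_col = match.groups()
        if l_table == primary_name and r_table in joined_names:
            from_cols.append(l_col)
            to_cols.append(r_col)
        elif r_table == primary_name and l_table in joined_names:
            # Accepting the reversed operand order is an assumption of this sketch.
            from_cols.append(r_col)
            to_cols.append(l_col)
        else:
            return None
    return from_cols, to_cols


join = {
    "name": "c",
    "source": "main.sales.customers",
    "sql_on": "orders.customer_id = c.id",
}
source_name = join["source"].split(".")[-1]  # "customers"
alias_aware = parse_join_clause(join["sql_on"], "orders", {source_name, join["name"]})
source_only = parse_join_clause(join["sql_on"], "orders", {source_name})
```

With only the source-derived name the clause is rejected, which is the lost-relationship case described above; adding the alias to the accepted set recovers it.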
    return result


def _combined_description(node):
_combined_description only honours string ai_context. The schema's oneOf also allows an object with instructions and synonyms. This means a model carrying ai_context: {instructions: "..."} will silently lose that text.
I believe we should either flatten instructions into comment, or warn like the relationship ai_context path does.
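A possible shape for the first option, sketched under assumptions: `ai_context` is either a string or a mapping with `instructions` and `synonyms`, and the description and instructions are joined with a blank line. The helper names, the joining rule, and the choice to warn about `synonyms` are all hypothetical, not the PR's actual behaviour.

```python
def ai_context_text(ai_context):
    """Return (text, warnings) for a string or object-form ai_context."""
    if ai_context is None:
        return "", []
    if isinstance(ai_context, str):
        return ai_context, []
    if isinstance(ai_context, dict):
        warnings = []
        text = ai_context.get("instructions") or ""
        if ai_context.get("synonyms"):
            # Assumption: synonyms have no place in a UC comment, so say so.
            warnings.append("ai_context.synonyms was dropped")
        return text, warnings
    return "", [f"unsupported ai_context type: {type(ai_context).__name__}"]


def combined_description(node):
    """Merge description and ai_context into one comment, collecting warnings."""
    text, warnings = ai_context_text(node.get("ai_context"))
    parts = [p for p in (node.get("description"), text) if p]
    return "\n\n".join(parts), warnings


flattened = combined_description(
    {"description": "Customer name", "ai_context": {"instructions": "Use for greetings"}}
)
```

Returning the warnings alongside the text keeps the helper free of side effects; the caller decides how to surface them.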
    if raw_joins:
        preserved["raw_joins"] = raw_joins

    if len(preserved) <= 1:  # Only primary_dataset means nothing extra to keep
I'm confused here:
preserved is initialized with primary_dataset.
So len(preserved) <= 1 always means "exactly 1", and preserved if preserved else None always returns preserved. So the if is a no-op. Should we remove it?
Summary
Adds a bidirectional converter between OSI YAML semantic models and Databricks Unity Catalog Metric View YAML, under `converters/databricks/`. Pure offline conversion, Python + PyYAML, structured like the existing Snowflake converter but with both directions and round-trip coverage.

`DATABRICKS` is already in the `vendors` enum and the dialect enum, so this PR adds converter code only; no spec change.
Mapping
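| OSI | Databricks UC Metric View |
| --- | --- |
| OSI semantic model | metric view (`version: 1.1`) |
| primary fact dataset | `source` |
| remaining reachable datasets | `joins[]` (`source` + `sql_on`) |
| `dataset.fields[]` | `dimensions[]` |
| `metrics[]` | `measures[]` |
| `expression.dialects[DATABRICKS]` (else `ANSI_SQL`) | `expr` |
| `relationships[].from_columns` / `to_columns` | `sql_on` boolean |
| `custom_extensions[DATABRICKS]` (`filter`, `primary_dataset`, `raw_joins`) | filter, source selection, restored joins |

Primary fact dataset is selected in priority order:

1. `custom_extensions[vendor_name=DATABRICKS]` with `{"primary_dataset": "..."}`
2. The dataset that appears most often on the `from` side of relationships
3. The first declared dataset

Datasets unreachable from the primary via the relationship graph emit a warning and are excluded.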
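The priority order and the reachability check can be sketched roughly as follows. This is not the converter's code: the function names are hypothetical, the model is assumed to be a plain dict with `datasets`, `relationships` (`from` / `to`), and `custom_extensions` entries whose `data` is either a JSON string or a mapping, and the graph walk treats relationships as undirected.

```python
import json
from collections import Counter, deque


def _databricks_hint(model):
    """Read primary_dataset from a DATABRICKS custom extension, if present."""
    for ext in model.get("custom_extensions", []):
        if ext.get("vendor_name") == "DATABRICKS":
            data = ext.get("data") or {}
            if isinstance(data, str):  # assumption: data may be a JSON string
                data = json.loads(data)
            return data.get("primary_dataset")
    return None


def select_primary(model):
    names = [d["name"] for d in model["datasets"]]
    hint = _databricks_hint(model)
    if hint in names:
        return hint
    counts = Counter(r["from"] for r in model.get("relationships", []))
    if counts:
        best = max(counts.values())
        # Ties fall back to declaration order.
        return next((n for n in names if counts.get(n, 0) == best), names[0])
    return names[0]


def reachable_from(model, primary):
    """Breadth-first walk over relationships, treated as undirected here."""
    neighbours = {}
    for r in model.get("relationships", []):
        neighbours.setdefault(r["from"], set()).add(r["to"])
        neighbours.setdefault(r["to"], set()).add(r["from"])
    seen, queue = {primary}, deque([primary])
    while queue:
        for nxt in neighbours.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


model = {
    "datasets": [{"name": n} for n in ("customer", "store_sales", "date_dim", "promotion")],
    "relationships": [
        {"from": "store_sales", "to": "customer"},
        {"from": "store_sales", "to": "date_dim"},
    ],
}
primary = select_primary(model)
reachable = reachable_from(model, primary)
unreachable = [d["name"] for d in model["datasets"] if d["name"] not in reachable]
```

In this toy model `promotion` has no relationship to the primary, so it would be the dataset that triggers the warning and is left out of the metric view.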
Dimension expression qualification. Bare single-identifier expressions (e.g. `expression: customer_id` on `customer`) are auto-qualified as `customer.customer_id` so they resolve unambiguously after joins. Multi-token expressions (operators, function calls, multi-column references) are emitted verbatim, because string-prepending only qualifies the first identifier and would silently break the rest. For computed fields on a non-primary dataset, supply a `DATABRICKS` dialect entry that's already table-qualified; the converter prefers it and silences the warning.
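A minimal sketch of that rule, assuming a bare identifier is a single run of letters, digits, and underscores. The function `qualify_dimension`, its `is_primary` flag, and the warning text are hypothetical; note that in this simplified version an already dotted reference counts as multi-token.

```python
import re

# A lone column name: letters, digits, underscores, not starting with a digit.
_BARE_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualify_dimension(expression, dataset_name, is_primary=False):
    """Return (expr, warning). Only bare identifiers get a dataset prefix."""
    expr = expression.strip()
    if _BARE_IDENTIFIER.fullmatch(expr):
        return f"{dataset_name}.{expr}", None
    warning = None
    if not is_primary:
        # Prepending here would only qualify the first identifier, so leave
        # the expression alone and ask for an explicit DATABRICKS dialect.
        warning = (
            f"multi-token expression on joined dataset '{dataset_name}' "
            "emitted verbatim; supply a table-qualified DATABRICKS dialect"
        )
    return expr, warning


bare = qualify_dimension("customer_id", "customer")
computed = qualify_dimension("c_first_name || ' ' || c_last_name", "customer")
```

The computed field comes back unchanged with a warning attached, rather than as `customer.c_first_name || ' ' || c_last_name`.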
Round-trip behaviour
Round-trip OSI → UC → OSI keeps datasets, relationships, metrics, and expressions. UC-only fields (`filter`, the metric view `version`, unparseable `sql_on` clauses) are recorded in `custom_extensions[DATABRICKS]` on import and re-emitted on export, so UC → OSI → UC also survives.
Documented lossy paths:
- `dimension.is_time` has no dedicated UC counterpart (see "Add datatype field to Field and Metric; reframe is_time as role marker" #113; the field classifier is isolated so the migration to `datatype` is a one-function change).
- `ai_context` on relationships is dropped on export with a warning.
- Non-`DATABRICKS` / non-`ANSI_SQL` dialect entries are skipped with a warning.
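The dialect preference behind the last bullet could be sketched like this, assuming an OSI expression is a mapping with a `dialects` list of `{dialect, expression}` entries. The function name `pick_expression` and the warning wording are hypothetical.

```python
def pick_expression(expression):
    """Prefer DATABRICKS, fall back to ANSI_SQL, else return (None, warning)."""
    by_dialect = {
        entry["dialect"]: entry["expression"]
        for entry in expression.get("dialects", [])
    }
    for dialect in ("DATABRICKS", "ANSI_SQL"):
        if dialect in by_dialect:
            return by_dialect[dialect], None
    return None, "no DATABRICKS or ANSI_SQL dialect; entry skipped"


preferred = pick_expression(
    {
        "dialects": [
            {"dialect": "ANSI_SQL", "expression": "SUM(amount)"},
            {"dialect": "DATABRICKS", "expression": "SUM(store_sales.amount)"},
        ]
    }
)
skipped = pick_expression(
    {"dialects": [{"dialect": "SNOWFLAKE", "expression": "SUM(amount)"}]}
)
```

Here the `DATABRICKS` entry wins regardless of list order, and an expression with only a `SNOWFLAKE` entry yields no output plus a warning.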
Coordination with other open PRs
- … generic PySpark code with a `DATABRICKS` dialect flag, not a Unity Catalog metric view. No code overlap; only the `DATABRICKS` dialect string is shared.
- #113 (`datatype` field, reframe `is_time`): when it lands, the field classifier becomes a one-function update.
Test plan
- `python3 -m pytest converters/databricks/tests/ -q` → 47 passed
- `python3 -m pyflakes converters/databricks/src/ converters/databricks/tests/` → clean
- `python3 -m ruff check converters/databricks/src/ converters/databricks/tests/` → clean
- `python3 validation/validate.py examples/tpcds_semantic_model.yaml` → passes (TPC-DS is the round-trip fixture)
- … `from_columns` / `to_columns`, and metric expressions