-
Notifications
You must be signed in to change notification settings - Fork 82
Add datatype field to Field and Metric; reframe is_time as role marker #113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
jonmmease
wants to merge
1
commit into
open-semantic-interchange:main
Choose a base branch
from
jonmmease:add-datatype-to-spec
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| __pycache__/ | ||
| *.py[cod] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -36,6 +36,22 @@ Supported SQL and expression language dialects for metrics and field definitions | |
| | `TABLEAU` | Tableau calculations | | ||
| | `DATABRICKS` | Databricks SQL | | ||
|
|
||
| ### Datatypes | ||
|
|
||
| Logical data types for fields and metrics. | ||
|
|
||
| | Datatype | Description | | ||
| |----------|-------------| | ||
| | `string` | Variable-length Unicode character data. | | ||
| | `integer` | Signed integer with no scale. | | ||
| | `number` | Real number (floating-point or decimal) with unspecified precision. | | ||
| | `boolean` | Logical two-valued truth type. | | ||
| | `date` | Calendar date with no time-of-day component. | | ||
| | `time` | Time-of-day with no date component. | | ||
| | `timestamp` | Instant-in-time without timezone offset (naive / local). | | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe the default |
||
| | `timestamp_tz` | Instant-in-time with timezone offset (zoned). | | ||
| | `other` | Any data type not covered above; use `custom_extensions` for vendor-specific refinement. | | ||
|
|
||
| ### Vendors | ||
|
|
||
| Supported vendors for custom extensions and integrations. | ||
|
|
@@ -202,6 +218,7 @@ Fields represent row-level attributes that can be used for grouping, filtering, | |
| | `dimension` | object | No | Dimension metadata (e.g., `is_time` flag) | | ||
| | `label` | string | No | Label for categorization | | ||
| | `description` | string | No | Human-readable description | | ||
| | `datatype` | string (enum) | No | Logical data type for this field. See [Datatypes](#datatypes). | | ||
| | `ai_context` | string/object | No | Additional context for AI tools (e.g., synonyms) | | ||
| | `custom_extensions` | array | No | Vendor-specific attributes | | ||
|
|
||
|
|
@@ -228,7 +245,7 @@ expression: | |
|
|
||
| | Field | Type | Description | | ||
| |-------|------|-------------| | ||
| | `is_time` | boolean | Indicates if this is a time-based dimension for temporal filtering | | ||
| | `is_time` | boolean | Temporal-role marker. When `true`, consumers that distinguish time dimensions (e.g. for time-series analysis or temporal filtering) should treat this field as a time dimension. This is a *role* flag, independent of the field's data type. See [Datatype and `is_time`: type vs. role](#datatype-and-is_time-type-vs-role). | | ||
|
|
||
| ### Examples | ||
|
|
||
|
|
@@ -268,6 +285,7 @@ expression: | |
| dialects: | ||
| - dialect: ANSI_SQL | ||
| expression: order_date | ||
| datatype: date | ||
| dimension: | ||
| is_time: true | ||
| description: Date when order was placed | ||
|
|
@@ -290,6 +308,33 @@ expression: | |
| description: Normalized email address | ||
| ``` | ||
|
|
||
| ### Datatype and `is_time`: type vs. role | ||
|
|
||
| `datatype` and `dimension.is_time` are independent properties that answer different questions: | ||
|
|
||
| - **`datatype`** describes the *data type* of the field (e.g. `date`, `integer`, `string`, `timestamp_tz`): what kind of values the field holds. | ||
| - **`dimension.is_time`** is a *temporal-role marker*: whether the field should be treated as a time dimension for time-series analysis or temporal filtering, regardless of its data type. | ||
|
|
||
| **Default for `is_time`.** When `is_time` is not set explicitly, it defaults to `true` if `datatype` is one of `date`, `time`, `timestamp`, `timestamp_tz`, and `false` otherwise. Explicit `is_time` always wins. Set `is_time: false` on a temporal-typed column (e.g. an audit `created_at` you don't want on the time axis) to opt out of the default. | ||
|
|
||
| Common combinations: | ||
|
|
||
| | Column example | `datatype` | `is_time` | Effective role | Why | | ||
| |---|---|---|---|---| | ||
| | `d_date` (calendar date) | `date` | omitted | time dimension | Temporal `datatype`; `is_time` defaults to `true`. | | ||
| | `order_timestamp` | `timestamp_tz` | omitted | time dimension | Same. | | ||
| | `created_at` (audit timestamp) | `timestamp` | `false` | regular dimension | Explicit opt-out of the temporal default. | | ||
| | `d_year` (integer year grain) | `integer` | `true` | time dimension | Non-temporal `datatype`; `is_time: true` makes the role explicit. | | ||
| | `d_quarter_name` (e.g. `"Q1"`) | `string` | `true` | time dimension | String-valued temporal grain. | | ||
| | `customer_id` | `integer` | omitted | regular dimension | Non-temporal `datatype`; `is_time` defaults to `false`. | | ||
|
|
||
| > **Precedent.** This type/role separation mirrors [Snowflake Semantic Views' YAML authoring form](https://docs.snowflake.com/en/user-guide/views-semantic/semantic-view-yaml-spec), which has a structural `time_dimensions:` collection whose entries can carry any `data_type`. The published example annotates `order_year` with `data_type: NUMBER`. LookML supports a similar split via its [`dimension_group`](https://cloud.google.com/looker/docs/reference/param-field-dimension-group), whose `datatype` enum covers `date`, `datetime`, `timestamp`, plus the integer-encoded forms `epoch` and `yyyymmdd`. | ||
|
|
||
| **Consumer guidance.** | ||
|
|
||
| - For *data-type* questions (casting, serialization, downstream type inference): prefer `datatype` when present. If only `is_time: true` is set, do not infer a specific scalar type from it. | ||
| - For *role* questions (classifying time dimensions in a query UI, generating time-series output sections, choosing time-aware aggregations): treat the field as a time dimension when `is_time` resolves to `true`, whether explicitly set or defaulted from a temporal `datatype`. | ||
|
|
||
| --- | ||
|
|
||
| ## Metrics | ||
|
|
@@ -303,6 +348,7 @@ Quantitative measures defined on business data, representing key calculations li | |
| | `name` | string | Yes | Unique identifier for the metric | | ||
| | `expression` | object | Yes | Expression definition with dialect support | | ||
| | `description` | string | No | Human-readable description of what the metric measures | | ||
| | `datatype` | string (enum) | No | Logical data type for this metric. See [Datatypes](#datatypes). | | ||
| | `ai_context` | string/object | No | Additional context for AI tools (e.g., synonyms) | | ||
| | `custom_extensions` | array | No | Vendor-specific attributes | | ||
|
|
||
|
|
@@ -324,9 +370,11 @@ expression: | |
| ```yaml | ||
| - name: total_revenue | ||
| expression: | ||
| - dialect: ANSI_SQL | ||
| expression: SUM(orders.amount) | ||
| dialects: | ||
| - dialect: ANSI_SQL | ||
| expression: SUM(orders.amount) | ||
| description: Total revenue across all orders | ||
| datatype: number | ||
| ai_context: | ||
| synonyms: | ||
| - "total sales" | ||
|
|
@@ -338,9 +386,11 @@ expression: | |
| ```yaml | ||
| - name: avg_orders | ||
| expression: | ||
| - dialect: ANSI_SQL | ||
| expression: SUM(orders.amount) / COUNT(DISTINCT customers.id) | ||
| dialects: | ||
| - dialect: ANSI_SQL | ||
| expression: SUM(orders.amount) / COUNT(DISTINCT customers.id) | ||
| description: Average orders | ||
| datatype: number | ||
| ai_context: | ||
| synonyms: | ||
| - "Order Average by customer" | ||
|
|
@@ -446,6 +496,7 @@ semantic_model: | |
| dialects: | ||
| - dialect: ANSI_SQL | ||
| expression: order_date | ||
| datatype: date | ||
| dimension: | ||
| is_time: true | ||
| description: Order date | ||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One concern here is that using a generic number type isn’t really a valid interoperability strategy. Different data warehouses interpret and implement it differently, so it’s not interchangeable in practice.
For example, you can’t take a model built against data in Snowflake and expect it to behave identically in Databricks just because both use a number type. As we’ve already discussed in other issues/threads, precision and scale vary across systems, and that directly affects results.
Without explicitly specifying those parameters, this approach is likely to introduce subtle bugs. If strict type checking is enforced and there’s no automatic conversion, things will simply break. Even with implicit or explicit casting at the physical layer, you can still end up with precision loss, which is not acceptable for many use cases.
So I think we need a different approach here. Either we make the type definition explicit (e.g., always specifying precision and scale for numeric types), or we introduce some abstraction that preserves correctness across backends instead of relying on a loosely defined number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback @KSDaemon. The frame of reference I'm coming from is that these types need to be precise enough to generate SQL on top of (e.g. a BI tool building on top of the semantic definitions). In my experience, the SQL required for integers vs float/decimal can differ for warehouses that use truncation semantics for integers. I have not come across cases where I've needed to generate different SQL depending on whether the underlying type is a float or decimal.
But it is worth taking a step back to ask what else folks would rely on these logical data types for. Do you have any particular use cases in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following up. Are there cases beyond the generation of SQL for semantic queries that you have in mind to support here? I'm happy to add decimal as data type, but I'd like to understand a motivating scenario for how the spec would be used in such a way that
numberanddecimalwould be treated differently.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using JSON Schema data types (https://json-schema.org/understanding-json-schema/reference/type) as a minimum would be a good idea.