Databricks Vibe Modelling Agent

Generate production-grade enterprise data models from natural language

Describe your business. Get a data model. Vibe it until it's perfect.

Concepts · Getting Started · Widget Reference · Vibing Workflow · Action Catalog · Troubleshooting

Documentation
What Is Vibe Modelling?
Concepts
Getting Started
- Quick Start Recipes
Operations
The Vibing Workflow
Widget Reference
- Core Widgets (01–11)
- Convention Widgets (12–24)
Auto-Generated Next Vibes
The Complete Action Catalog
Pipeline Stages
Output Artifacts
Quality Rules & Enforcement
LLM Architecture
Metric Views
Troubleshooting
Glossary

📚 Documentation

Document	Description
docs/	Documentation index — whitepaper, design guide, integration guide
docs/design-guide.md	Technical design reference
docs/integration-guide.md	UI/consumer integration protocol
docs/whitepaper.md	Philosophy and complete rules catalog
runner/readme.md	Pipeline orchestrator guide
tests/readme.md	Test suite reference

What Is Vibe Modelling?

Vibe Modelling is a Databricks-native, LLM-powered approach to generating enterprise data models from natural language. Instead of manually drawing ER diagrams, writing DDL, or importing pre-built industry templates, you describe your business in plain English and the agent builds a complete, production-grade data model — domains, tables, columns, foreign keys, tags, sample data, and documentation — end to end.

The name "Vibe" reflects the core workflow:

Generate a base model → review → vibe it with natural language → repeat → deploy

Each iteration produces a new version. The agent carries forward your context so nothing is lost between runs. You are never locked into a static template — the model evolves with your business.

📖 Concepts

Industry Data Models vs. Business Data Models

What is an Industry Data Model?

An industry data model is a generic, one-size-fits-all template designed for an entire vertical — retail, banking, healthcare, telecoms, etc. Organizations like the TM Forum (telecoms), ARTS (retail), ACORD (insurance), and HL7 (healthcare) publish canonical schemas that attempt to cover every conceivable entity and relationship across the industry.

The problem: When you adopt an industry model, you inherit everything — including the 60–80% of tables that your business will never use. You then spend months trimming, renaming, and reshaping the model to fit your actual processes.

What is a Business Data Model?

A business data model is tailored, contextualized, and specific to YOUR organization. It reflects your actual business processes, your product lines, your org structure, your regulatory environment, your terminology, and your governance requirements.

Vibe Modelling generates business data models. The LLM understands your industry deeply (via the complexity tier system), but the model it produces is shaped entirely by your context.

Aspect	Industry Data Model	Business Data Model (Vibe)
Scope	Entire industry vertical	Your specific business
Customization	Post-delivery (manual pruning)	Built-in (LLM-driven from your context)
Relevance	20–40% directly applicable	90–100% directly applicable
Time to production	Months of adaptation work	Hours with iterative vibing
Naming	Committee-standard naming	Your business terminology and conventions
Evolution	New version = re-adoption	Vibe the next version from the previous one
Cost	License fees + adaptation labor	Compute cost of LLM generation

The Vibe Philosophy

1. GENERATE  →  Describe your business, get a base model
2. VIBE IT   →  Review output, provide natural-language refinements
3. REPEAT    →  Each iteration = new version; agent auto-suggests next vibes
4. DEPLOY    →  Physical Unity Catalog schemas, tables, FKs, tags, sample data

The Four-Level Hierarchy: Divisions → Domains → Products → Attributes

Every Vibe model follows a strict four-level hierarchy:

graph TD
    B["🏢 BUSINESS<br/><i>e.g., Contoso Manufacturing</i>"]
    B --> D1["⚙️ Operations Division"]
    B --> D2["💼 Business Division"]
    B --> D3["🏛️ Corporate Division"]
    
    D1 --> DOM1["📦 logistics"]
    D1 --> DOM2["🏭 production"]
    D1 --> DOM3["📋 inventory"]
    
    D2 --> DOM4["👤 customer"]
    D2 --> DOM5["💰 sales"]
    D2 --> DOM6["🧾 billing"]
    
    D3 --> DOM7["👥 hr"]
    D3 --> DOM8["📊 finance"]
    
    DOM5 --> P1["🗂️ order"]
    DOM5 --> P2["🗂️ order_item"]
    DOM5 --> P3["🗂️ quote"]
    
    P1 --> A1["🔑 order_id <i>(PK, BIGINT)</i>"]
    P1 --> A2["📅 order_date <i>(DATE)</i>"]
    P1 --> A3["🔗 customer_id <i>(FK → customer.customer)</i>"]
    P1 --> A4["💲 total_amount <i>(DECIMAL)</i>"]
    P1 --> A5["📌 status <i>(STRING)</i>"]

Level 1: Divisions

Divisions are the top-level organizational grouping. Every domain belongs to exactly one division.

Division	Purpose	Typical Domains
Operations	Core operational backbone — mechanisms, infrastructure, processes	Logistics, production, inventory, service delivery, supply chain, quality control
Business	Revenue-generating and customer-facing functions	Customer/party, billing/revenue, product catalog, sales, subscriptions
Corporate	Supporting functions for governance (NOT directly revenue-generating)	HR, finance, legal/compliance, marketing, procurement, IT

Division Balance Rules:

Operations + Business MUST be >= 80% of all domains

Corporate is capped at <= 20% of total domains

No Corporate domain allowed until Operations AND Business each have >= 2 domains
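
The three balance rules can be checked mechanically. The following is a minimal sketch, not the agent's own implementation; the function name `check_division_balance` and the `{domain: division}` input shape are assumptions made for illustration.

```python
from collections import Counter

def check_division_balance(domain_divisions):
    """Return a list of rule violations for a {domain: division} mapping (hypothetical shape)."""
    counts = Counter(domain_divisions.values())
    total = sum(counts.values())
    ops, biz, corp = counts["Operations"], counts["Business"], counts["Corporate"]
    violations = []
    if total == 0:
        return violations
    # Integer arithmetic avoids float rounding at the 80% / 20% boundaries.
    if (ops + biz) * 100 < 80 * total:
        violations.append("Operations + Business below 80% of domains")
    if corp * 100 > 20 * total:
        violations.append("Corporate above 20% of domains")
    if corp > 0 and (ops < 2 or biz < 2):
        violations.append("Corporate present before Operations and Business each have 2 domains")
    return violations

# Illustrative model: 4 Operations, 4 Business, 2 Corporate domains.
example = {
    "logistics": "Operations", "production": "Operations",
    "inventory": "Operations", "quality": "Operations",
    "customer": "Business", "sales": "Business",
    "billing": "Business", "subscription": "Business",
    "hr": "Corporate", "finance": "Corporate",
}
print(check_division_balance(example))
```

With exactly 80% Operations + Business and 20% Corporate, the example sits on the boundary and passes all three rules.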

Level 2: Domains

A domain is a logical grouping of related data products (tables). When deployed, each domain maps 1:1 to a Unity Catalog schema.

Named in snake_case, exactly one word (e.g., customer, fulfillment, logistics)
Count depends on model scope (MVM vs ECM) and industry complexity tier
The shared domain is reserved — the pipeline auto-creates it during SSOT consolidation; never create it manually

Level 3: Products (Tables)

A product is a data table within a domain — a first-class business entity with its own identity, lifecycle, and attributes.

Every product gets a PK column: <product_name>_<pk_suffix> (default: <product_name>_id)
Classified as CORE (entities stakeholders query directly) or HELPER (supporting entities)
Tagged by data type: master_data | transactional_data | reference_data | association_table

The First-Class Entity Test: A product must have its own identity, its own lifecycle, and at least 5 unique business attributes. Anything less should become an attribute on another table or be merged into one.

Level 4: Attributes (Columns)

Each attribute has:

Property	Example	Purpose
`attribute`	`customer_email`	Column name (snake_case)
`type`	`STRING`	Apache Spark SQL data type
`tags`	`restricted,pii_email`	Classification + PII tags
`value_regex`	`^[a-zA-Z0-9._%+-]+@`	Validation pattern or enum
`business_glossary_term`	`Customer Email Address`	Human-readable business name
`description`	`Primary contact email for the customer`	What this attribute represents
`reference`	`GDPR Article 6`	Regulatory or standard reference
`foreign_key_to`	`customer.customer.customer_id`	FK target (`domain.product.pk`), or empty
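
To make the attribute properties concrete, here is a hedged sketch of one attribute as a Python record. The field names mirror the property table above; the `Attribute` class and its `fk_target` helper are hypothetical and are not part of the agent's API.

```python
from dataclasses import dataclass

@dataclass
class Attribute:
    """One column definition; fields follow the property table above."""
    attribute: str
    type: str
    tags: str = ""
    value_regex: str = ""
    business_glossary_term: str = ""
    description: str = ""
    reference: str = ""
    foreign_key_to: str = ""

    def fk_target(self):
        """Split 'domain.product.pk' into its parts, or return None when empty."""
        if not self.foreign_key_to:
            return None
        domain, product, pk = self.foreign_key_to.split(".")
        return domain, product, pk

# The customer_id FK from the hierarchy diagram, expressed as a record.
customer_ref = Attribute(
    attribute="customer_id",
    type="BIGINT",
    foreign_key_to="customer.customer.customer_id",
)
print(customer_ref.fk_target())
```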

DAG Enforcement — No Circular Dependencies

FK relationships MUST form a Directed Acyclic Graph (DAG):

✅ VALID:   order → customer → address
❌ INVALID: order → customer → address → order  (cycle!)

Why it matters: Circular dependencies make it impossible to establish a clean data-loading order, break ETL pipelines, and indicate modeling errors.

How the agent enforces it:

Detect — Python DFS (Depth-First Search) cycle detection during QA
Break — LLM Cycle Break specialist determines which FK to remove (always the weakest link)
Verify — Re-run DFS to confirm graph is now a valid DAG
Iterate — Up to 5 rounds to resolve all cycles, including residual ones

Allowed exception: Self-referencing hierarchical FKs (e.g., parent_category_id → category.category_id) are permitted — they represent tree structures, not multi-node cycles.
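
The detect step can be illustrated with a standard three-color depth-first search. This is a sketch under assumptions: the `find_cycle` name and the `{table: [referenced tables]}` input shape are invented for the example, and the agent's actual detector may differ. Self-references are skipped, matching the allowed exception above.

```python
def find_cycle(fk_edges):
    """Return one FK cycle as a list of tables, or None if the graph is a DAG.

    fk_edges maps each table to the tables its FK columns point at.
    Self-references (table -> same table) are ignored.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    nodes = set(fk_edges) | {t for targets in fk_edges.values() for t in targets}
    color = {n: WHITE for n in nodes}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for target in fk_edges.get(node, ()):
            if target == node:
                continue  # self-referencing hierarchy FK is permitted
            if color[target] == GRAY:
                # Back edge: the cycle is the path from target to here, closed.
                return path[path.index(target):] + [target]
            if color[target] == WHITE:
                found = visit(target)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(nodes):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None

valid = {"order": ["customer"], "customer": ["address"]}
invalid = {"order": ["customer"], "customer": ["address"], "address": ["order"]}
tree = {"category": ["category"]}
print(find_cycle(valid), find_cycle(invalid), find_cycle(tree))
```

Re-running the same function after an FK is removed corresponds to the verify step.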

Single Source of Truth (SSOT)

Each core business concept has exactly ONE authoritative domain and ONE authoritative product (table) that owns it. No concept is duplicated across domains.

What SSOT prevents:

Two customer tables in different domains
A product table in both sales and inventory
An invoice table in both billing and finance

How SSOT is enforced:

Phase	Mechanism
Generation	LLM places each entity in its authoritative domain — the domain that CREATES/OWNS that data
QA Deduplication	Global pass detects same-name products + synonym pairs (e.g., `customer` vs `client`) with 60%+ attribute overlap
Consolidation	Overlapping products merged into `shared` domain with discriminator column
Cross-Domain References	Other domains use FK columns to reference the authoritative table — no duplication

The "Where Do I Go?" Test: For any entity, ask: "If a business user needs this information, is there exactly ONE place to get it?" If the answer is ambiguous, there is an SSOT violation.
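
The QA deduplication pass can be pictured as a pairwise comparison across domains. The sketch below makes several assumptions that the source does not confirm: overlap is measured as Jaccard similarity of attribute names, the 60% threshold applies to both same-name and synonym pairs, and the function names and input shapes are hypothetical.

```python
from itertools import combinations

def attribute_overlap(attrs_a, attrs_b):
    """Jaccard overlap of two attribute-name collections (assumed metric)."""
    a, b = set(attrs_a), set(attrs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def find_ssot_candidates(products, synonyms=frozenset(), threshold=0.6):
    """Return cross-domain product pairs that look like the same concept.

    products: {(domain, product): [attribute names]}
    synonyms: set of frozenset({name_a, name_b}) pairs
    """
    candidates = []
    for (key_a, attrs_a), (key_b, attrs_b) in combinations(sorted(products.items()), 2):
        (domain_a, name_a), (domain_b, name_b) = key_a, key_b
        if domain_a == domain_b:
            continue  # SSOT violations are cross-domain duplicates
        related = name_a == name_b or frozenset((name_a, name_b)) in synonyms
        if related and attribute_overlap(attrs_a, attrs_b) >= threshold:
            candidates.append((key_a, key_b))
    return candidates

products = {
    ("sales", "customer"): ["customer_id", "name", "email", "phone", "segment"],
    ("support", "client"): ["customer_id", "name", "email", "phone", "tier"],
    ("billing", "invoice"): ["invoice_id", "customer_id", "amount", "due_date", "status"],
}
synonyms = {frozenset(("customer", "client"))}
print(find_ssot_candidates(products, synonyms))
```

Here `customer` and `client` share 4 of 6 distinct attribute names (about 67%), so the pair is flagged for consolidation.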

Model Scopes: MVM vs. ECM

Aspect	MVM (Minimum Viable Model)	ECM (Expanded Coverage Model)
Size	30–50% of ECM table count	Full coverage
Attribute depth	SAME as ECM — full production-grade (same min/max per tier)	Full production-grade
Domains	Essential business functions only	All functions including corporate back-office
Ideal for	SMBs, rapid deployments, POCs, dev/test	Fortune 100, multinational enterprises
Lightness	Fewer domains & tables (NEVER thinner attributes)	Maximum breadth

MVM is NOT a skeleton or demo toy. It is a production-ready subset where every delivered table is fully-featured.

Subdomains

Every domain is organized into subdomains — semantic groupings of related products within a domain. Subdomains provide an additional organizational layer between domains and products.

Subdomain Rules

Rule	Description
Count	Minimum and maximum subdomains per domain are defined by the complexity tier. Never exactly 1 subdomain per domain.
Naming	EXACTLY 2 words per subdomain name. 1 word = rejected. 3+ words = rejected.
Min Products	Every subdomain must contain at least the minimum products per subdomain defined by the tier: ECM tier_1–tier_4: 3, ECM tier_5: 2; MVM tier_1–tier_3: 3, MVM tier_4–tier_5: 2.
No Overlap	No two subdomains within the same domain may share any word in their names.
Business Terms	Use business terminology, NOT technical terms.
Balanced	Products distributed as evenly as possible across subdomains.
No Placeholders	NEVER use: "Sub Domain1", "Category 1", "Group A", "N/A", "Other", "General", "Miscellaneous", "Uncategorized".
No Drift	Each subdomain belongs to exactly one parent domain.
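
Several of these rules are purely lexical and can be sketched as a validator. This is an illustration only: the function name is hypothetical, the subdomain names in the usage lines are invented, and splitting on whitespace or underscores is an assumption about how words are delimited.

```python
import re

# Placeholder names listed in the No Placeholders rule, lowercased.
PLACEHOLDERS = {
    "sub domain1", "category 1", "group a", "n/a",
    "other", "general", "miscellaneous", "uncategorized",
}

def validate_subdomain_names(names):
    """Check the count, naming, overlap, and placeholder rules for one domain."""
    errors = []
    if len(names) == 1:
        errors.append("exactly 1 subdomain per domain is not allowed")
    owner_of_word = {}
    for name in names:
        normalized = name.strip().lower()
        words = [w for w in re.split(r"[\s_]+", normalized) if w]
        if normalized in PLACEHOLDERS:
            errors.append(f"{name!r}: placeholder name")
        if len(words) != 2:
            errors.append(f"{name!r}: expected exactly 2 words, got {len(words)}")
        for word in words:
            other = owner_of_word.setdefault(word, name)
            if other != name:
                errors.append(f"{name!r}: shares the word {word!r} with {other!r}")
    return errors

print(validate_subdomain_names(["Party Identity", "Customer Engagement"]))
print(validate_subdomain_names(["Order Capture", "Order Fulfillment"]))
```

The tier-dependent count and minimum-products rules need the sizing tables and are left out of this sketch.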

Example: In the party domain:

identity: individual, organization, party_identification, kyc_verification
engagement: party_interaction, consent_record, loyalty_enrollment

Industry Complexity Tiers

The agent auto-classifies your business into one of five tiers, based on 7 scoring dimensions (each scored 0 or 1):

Regulatory density — 3+ distinct regulatory bodies imposing data/reporting requirements
Party complexity — 3+ distinct party types (customers, suppliers, partners, employees, etc.)
Product hierarchy depth — 50+ product variants with complex bundling/pricing/lifecycle
Infrastructure management — Owns physical or digital infrastructure requiring asset tracking
Industry canonical model — 200+ entity types defined by an industry standards body
Transaction complexity — 10+ distinct transaction types with multi-step lifecycles
Operational system landscape — 5+ major systems of record across business functions

Rule: If the business falls between two tiers, classify UP (prefer the more complex tier).

Tier	Label	Hallmarks	ECM Domains	ECM Products/Domain	Attrs/Product	Subdomains/Domain
`tier_1`	Ultra-Complex	5+ regulatory bodies, multi-entity structures (banking, insurance, pharma)	15–22	14–28	15–50	3–6
`tier_2`	Complex	2–4 regulatory bodies, multi-channel (telecoms, energy, healthcare)	12–18	14–26	12–50	2–5
`tier_3`	Moderate	1–2 regulatory bodies, 3+ business functions (manufacturing, retail)	10–15	12–24	10–45	2–5
`tier_4`	Standard	Light regulation, regional complexity (logistics, agriculture)	8–12	10–20	10–40	2–4
`tier_5`	Simple	Minimal regulation, service-based (consulting, SaaS, media)	5–8	8–18	8–35	2–4

MVM Tier Sizing (attribute depth is the SAME as ECM):

Tier	MVM Domains	MVM Products/Domain	Subdomains/Domain
`tier_1`	9–14	8–16	2–4
`tier_2`	8–12	8–14	2–4
`tier_3`	6–10	7–13	2–4
`tier_4`	5–8	6–11	2–3
`tier_5`	3–6	5–10	2–3

MVM sizes are derived automatically so that the total table count (domains × products per domain) lands at roughly 30–50% of the ECM count for each tier. Attribute depth (min/max attributes per product) is the same for both MVM and ECM within each tier.
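
As a rough check of the 30–50% figure, the two sizing tables can be multiplied out. The sketch below compares total table counts (domains × products per domain) at the low and high ends of each range; comparing range endpoints this way is one reasonable reading of the tables, not a documented formula.

```python
# (domains range, products-per-domain range), copied from the tier tables above.
ECM = {
    "tier_1": ((15, 22), (14, 28)),
    "tier_2": ((12, 18), (14, 26)),
    "tier_3": ((10, 15), (12, 24)),
    "tier_4": ((8, 12), (10, 20)),
    "tier_5": ((5, 8), (8, 18)),
}
MVM = {
    "tier_1": ((9, 14), (8, 16)),
    "tier_2": ((8, 12), (8, 14)),
    "tier_3": ((6, 10), (7, 13)),
    "tier_4": ((5, 8), (6, 11)),
    "tier_5": ((3, 6), (5, 10)),
}

def table_range(sizing):
    """Smallest and largest total table count implied by a sizing row."""
    (dom_lo, dom_hi), (prod_lo, prod_hi) = sizing
    return dom_lo * prod_lo, dom_hi * prod_hi

def mvm_to_ecm_ratios():
    """MVM/ECM table-count ratio at the low and high end of each tier."""
    ratios = {}
    for tier in ECM:
        ecm_lo, ecm_hi = table_range(ECM[tier])
        mvm_lo, mvm_hi = table_range(MVM[tier])
        ratios[tier] = (mvm_lo / ecm_lo, mvm_hi / ecm_hi)
    return ratios

for tier, (low, high) in mvm_to_ecm_ratios().items():
    print(f"{tier}: {low:.0%} at the low end, {high:.0%} at the high end")
```

For tier_1, for example, ECM spans 210 to 616 tables and MVM spans 72 to 224; every tier's ratio falls between 30% and 50%.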

🚀 Getting Started

Quick Start Recipes

Recipe 1: Generate Your First Model (Minimal Input)

Widget	Value
01. Business	`My Company Name`
02. Description	`A brief description of what your company does`
03. Operation	`new base model`
05. Model Scope	`Minimum Viable Model - MVM`
09. Installation Catalog	`my_catalog` (or leave blank for logical-only)
Everything else	Defaults

Run the notebook. Done.

Recipe 2: Generate a Rich Model (Recommended)

Widget	Value
01. Business	`Contoso Manufacturing`
02. Description	`A multinational aluminum smelting company operating across 12 countries with 15,000 employees`
03. Operation	`new base model`
05. Model Scope	`Expanded Coverage Model - ECM`
06. Business Domains	`production, quality, supply, customer, sales, logistics, billing`
07. Included Org Divisions	`Operations, Business and Corporate`
09. Installation Catalog	`contoso_dev`
10. Sample Records	`10`

Recipe 3: Vibe an Existing Model

Widget	Value
01. Business	`Contoso Manufacturing`
02. Description	(same as before)
03. Operation	`vibe modeling of version`
04. Version	`1`
08. Model Vibes	`Add a warranty domain. Remove the corporate_strategy domain. Run quality checks.`

Recipe 4: Use Auto-Generated Next Vibes

After any pipeline run, find next_vibes.txt in the vibes/ output folder. Copy the suggested vibes into widget 08.

Widget	Value
01. Business	`Contoso Manufacturing`
02. Description	(same as before)
03. Operation	`vibe modeling of version`
04. Version	The version produced by the previous run (previous value + 1)
08. Model Vibes	(paste recommended vibes from next_vibes.txt)

Recipe 5: Deploy to a New Catalog

Widget	Value
03. Operation	`install model`
09. Installation Catalog	`prod_catalog`
11. Model JSON File	`/Volumes/catalog/schema/volume/vibes/contoso/v1_ecm/model.json`
10. Sample Records	`0` (or a number for test data)

⚙️ Operations

The 03. Operation widget selects which pipeline mode to run:

Operation	Purpose	Key Requirements
`new base model`	Generate a brand-new data model from scratch	Business name + description (via widgets)
`vibe modeling of version`	Apply natural-language instructions to refine an existing version	Version + vibes (widget 08)
`shrink ecm`	Convert an ECM to a leaner MVM	Version + deployment catalog
`enlarge mvm`	Expand an MVM into a comprehensive ECM	Version + deployment catalog
`install model`	Deploy a logical model into physical Unity Catalog objects	Model JSON file (widget 11) + deployment catalog
`uninstall model version`	Remove a version's physical artifacts from the catalog	Business name + version + catalog
`generate sample data`	Generate synthetic sample records for an existing deployed model	Model JSON file (widget 11) + deployment catalog
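
The Key Requirements column lends itself to a pre-flight check before running the notebook. The following is a sketch only: the short input keys (`business`, `vibes`, `catalog`, and so on) and the `missing_inputs` function are hypothetical names, not the notebook's widget identifiers.

```python
# Required inputs per operation, transcribed from the Key Requirements column.
# The short keys are illustrative stand-ins for the corresponding widgets.
REQUIRED_INPUTS = {
    "new base model": ["business", "description"],
    "vibe modeling of version": ["version", "vibes"],
    "shrink ecm": ["version", "catalog"],
    "enlarge mvm": ["version", "catalog"],
    "install model": ["model_json", "catalog"],
    "uninstall model version": ["business", "version", "catalog"],
    "generate sample data": ["model_json", "catalog"],
}

def missing_inputs(operation, widget_values):
    """Return the required inputs that are absent or blank for an operation."""
    required = REQUIRED_INPUTS[operation]
    return [key for key in required if not str(widget_values.get(key, "")).strip()]

print(missing_inputs("install model", {"catalog": "prod_catalog"}))
print(missing_inputs("vibe modeling of version",
                     {"version": "1", "vibes": "Add a warranty domain."}))
```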

🎵 The Vibing Workflow

This is the core power of Vibe Modelling — iterative refinement via natural language:

graph LR
    A["🆕 New Base Model<br/>Version 1"] --> B["🔍 Review Output"]
    B --> C["📝 Write Vibes<br/><i>or use auto-generated next_vibes.txt</i>"]
    C --> D["🎵 Vibe Modeling<br/>Version 2"]
    D --> E["🔍 Review"]
    E --> F["🎵 Vibe v3"]
    F --> G["..."]
    G --> H["✅ Deploy"]

What Vibes Can Do

Vibes are free-form natural language. The agent interprets them and translates them into specific actions:

"Add a compliance domain with regulatory_filing and audit_trail tables"
"Remove the HR domain, we don't need it"
"Merge the customer_support domain into the customer domain"
"Add a source_system column to every table"
"Run a full quality check and fix any issues found"
"The order table should have a shipping_address_id FK to the address table"
"Rename all tables starting with dim_ to remove that prefix"
"Normalize the order domain to 3NF"
"Mark all email columns as PII"
"Generate an ontology (RDF/RDFS) for the model"
"Keep only billing-related tables, drop everything else"
"Make this model ECM — I need the large version"

Auto-Generated Next Vibes

After every pipeline run, the agent produces next_vibes.txt and current_vibes.txt in the vibes/ folder, containing:

Your current business context (preserved as-is)
Recommended vibes for the next iteration (based on QA findings)
Model health metadata: confidence score, warning counts, issue breakdown
Version history and progression tracking

To use them: copy the recommended vibes into widget 08. Model Vibes, set the operation to vibe modeling of version, set widget 04. Version to the version you just generated, and run.

🎛️ Widget Reference

The notebook exposes 28 configurable widgets. Below is the complete reference.

Core Configuration Widgets (01–11)

#	Widget	Type	Mandatory	Default	Description
01	Business (name)	Text	Yes	—	Your business/organization name
02	Description	Text	Recommended	—	What your business does (richer = better model)
03	Operation	Dropdown	Yes	`new base model`	Pipeline operation to run
04	Version	Dropdown	Conditional¹	—	Model version number (1–100)
05	Model Scope	Dropdown	Yes	`Minimum Viable Model - MVM`	MVM (lean) or ECM (comprehensive)
06	Business Domains	Text	No	—	Comma-separated seed domains
07	Included Org Divisions	Dropdown	Yes	`Operations and Business`	Which divisions to include
08	Model Vibes	Text	Conditional²	—	Natural language instructions — inline (max 2,000 chars) or file path to a `.txt` on a UC Volume
09	Installation Catalog	Text	Conditional³	—	Unity Catalog for physical deployment
10	Sample Records	Dropdown	No	`0`	Synthetic records per table (0 = none)
11	Model JSON File	Text	Conditional⁴	—	Path to a previously generated `model.json` for re-install or continuation

¹ Required for all operations except new base model (auto-assigned).
² Required for vibe modeling of version.
³ Required for install, uninstall, generate sample data, shrink, and enlarge. Optional for new base model and vibe modeling.
⁴ Required for install model and generate sample data, when re-deploying a previously generated model.

Detailed Widget Descriptions

01. Business (name)

The name of your business. Used as the top-level identifier across the entire model. Case-insensitive matching (stored via LOWER()).

Sample values: Contoso Inc, Acme Healthcare, Global Telecom Corp, NextGen Retail

02. Description

A rich description of what your business does. Include industry, size, geography, key products/services. The LLM uses this to determine the industry complexity tier and tailor the model.

Sample values: A multinational aluminum smelting and manufacturing company operating across 12 countries, A digital-first healthcare provider specializing in telemedicine

03. Operation

See the Operations section for full details on each mode.

04. Version

Options: Empty, or 1–100

For new base model, leave empty — auto-assigned as 1 (or auto-incremented). For vibe modeling of version, this is the version you are modifying; output creates version N+1.

05. Model Scope

Options: Minimum Viable Model - MVM | Expanded Coverage Model - ECM

MVM = lean core (fewer domains/tables, same attribute depth). ECM = comprehensive Fortune 100 coverage.

06. Business Domains

Comma-separated list of specific domains you want. If blank, the LLM auto-generates the optimal set for your industry.

Sample values: customer, sales, billing, inventory, logistics | clinical, pharmacy, research

07. Included Org Divisions

Options: Operations | Operations and Business | Operations, Business and Corporate

Controls which organizational divisions contribute domains to the model.

08. Model Vibes

Supports two input modes:

Inline text — type your vibe instructions directly into the widget (max 2,000 characters). Best for short, targeted changes.
File path — provide a path to a .txt file on a Unity Catalog Volume (e.g., /Volumes/catalog/schema/vol/my_vibes.txt). Best for longer, multi-paragraph instructions that exceed the inline limit.
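
A consumer that pre-processes widget 08 might distinguish the two modes as sketched below. How the agent itself decides between inline text and a file path is not documented here, so the detection heuristic (a value under the volume root that ends in `.txt`), the `resolve_vibes` name, and the `volume_root` parameter are all assumptions. Reading the path with ordinary file APIs also assumes the Volume is mounted on the compute running the code.

```python
from pathlib import Path

MAX_INLINE_CHARS = 2000  # inline limit stated for widget 08

def resolve_vibes(widget_value, volume_root="/Volumes/"):
    """Return vibe text from an inline value or from a .txt file path.

    The path-detection rule is an illustrative assumption.
    """
    value = widget_value.strip()
    if value.startswith(volume_root) and value.endswith(".txt"):
        # File mode: the whole file content becomes the vibe instructions.
        return Path(value).read_text(encoding="utf-8")
    if len(value) > MAX_INLINE_CHARS:
        raise ValueError(
            f"Inline vibes are limited to {MAX_INLINE_CHARS} characters; "
            "use a .txt file on a Unity Catalog Volume instead"
        )
    return value

print(resolve_vibes("  Add a warranty domain.  "))
```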

See The Vibing Workflow for examples of what vibes can do.

09. Installation Catalog

The Unity Catalog where physical schemas, tables, FK constraints, tags, and sample data will be created. Must already exist. You need CREATE SCHEMA privileges. If blank for new base model/vibe modeling, only the logical model (JSON artifacts) is produced.

Sample values: dev_catalog, prod_data_models, contoso_lakehouse

10. Sample Records

Options: 0, 5, 10, 15, 20, 25, 50, 100

0 = skip sample data generation. 10 = good default for review. 50–100 = for load testing / demos. Sample data respects FK relationships (child records reference valid parent IDs).

11. Model JSON File

Path to a previously generated model.json file on a Unity Catalog Volume. Required for install model and generate sample data; otherwise optional, for continuing from a prior run. After every run, the agent generates next_vibes.txt in the vibes/ folder; review it for recommended next vibe instructions.

Sample values: /Volumes/my_catalog/my_schema/vol/vibes/contoso/v1_ecm/model.json

Model Convention Widgets (12–24)

#	Widget	Type	Default	Options / Format
12	Naming Convention	Dropdown	`snake_case`	`snake_case`, `camelCase`, `PascalCase`, `SCREAMING_CASE`
13	Primary Key Suffix	Dropdown	`_id`	`_id`, `_key`, `_pk`, `id`, `key`
15	Schema Prefix	Text	(empty)	e.g., `stg_`, `raw_`, `dw_`
16	Tag Prefix	Text	`dbx_`	e.g., `dbx_`, `vibe_`, `mdl_`
17	Table ID Type	Dropdown	`BIGINT`	`BIGINT`, `INT`, `LONG`, `STRING`
18	Boolean Format	Dropdown	`Boolean (True/False)`	`Boolean (True/False)`, `Int (0/1)`, `String (Y/N)`
19	Date Format	Dropdown	`yyyy-MM-dd`	`yyyy-MM-dd`, `dd/MM/yyyy`, `MM/dd/yyyy`, `yyyy/MM/dd`, `dd-MM-yyyy`
20	Timestamp Format	Dropdown	`yyyy-MM-dd'T'HH:mm:ss.SSSXXX`	4 ISO/standard options
21	Classification Levels	Text	`restricted=restricted, confidential=confidential, internal=Internal, public=public`	Comma-separated `key=Label` pairs
22	Housekeeping Columns	Dropdown	`No`	`No`, `Yes` — adds `created_by`, `created_at`, `updated_by`, `updated_at`
23	History Tracking Columns	Dropdown	`No`	`No`, `Yes` — adds `valid_from`, `valid_to`, `is_current` (SCD Type 2)
09a	Cataloging Style	Dropdown	`One Catalog`	`One Catalog`, `Catalog per Division`, `Catalog per Domain`
09b	Catalog Prefix	Text	(empty)	e.g., `dev_`, `prod_`
09c	Catalog Suffix	Text	(empty)	e.g., `_lakehouse`, `_dw`
15a	Schema Suffix	Text	(empty)	e.g., `_db`, `_schema`
16a	Tag Suffix	Text	(empty)	e.g., `_tag`
24	Vibe Session ID	Text	(empty)	UUID for external UI progress tracking

Widget #14 does not exist (numbering gap between 13 and 15).

📄 Auto-Generated Next Vibes

After every run, the agent produces next_vibes.txt in the vibes/ output folder with recommended next vibe instructions. Copy these into widget 08. Model Vibes for the next iteration.

The model.json file includes session metadata:

{
  "_next_vibe_metadata": {
    "generated_from_version": "v1_mvm",
    "model_version": "1",
    "status": "needs_work",
    "confidence_score": 78,
    "summary": "Model has 3 unlinked columns and 1 siloed table",
    "issues_addressed": ["..."],
    "issues_not_addressed": ["..."],
    "data_modeler_notes": "Recommend adding a warehouse domain",
    "model_stats_at_generation": { "domains": 8, "products": 47, "attributes": 523 },
    "issue_counts": { "error": 0, "warning": 3, "info": 5 },
    "version_history": ["..."]
  }
}
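
A reviewer can pull the health summary out of this metadata with a few lines of Python. This sketch assumes `_next_vibe_metadata` sits at the top level of model.json, as in the snippet above; the function name is hypothetical and missing keys fall back to placeholders.

```python
import json

def summarize_next_vibe_metadata(model_json_text):
    """Build a one-line health summary from the _next_vibe_metadata block."""
    meta = json.loads(model_json_text).get("_next_vibe_metadata", {})
    counts = meta.get("issue_counts", {})
    return (
        f"{meta.get('generated_from_version', '?')}: "
        f"status={meta.get('status', '?')}, "
        f"confidence={meta.get('confidence_score', '?')}, "
        f"errors={counts.get('error', 0)}, "
        f"warnings={counts.get('warning', 0)}"
    )

sample = json.dumps({
    "_next_vibe_metadata": {
        "generated_from_version": "v1_mvm",
        "status": "needs_work",
        "confidence_score": 78,
        "issue_counts": {"error": 0, "warning": 3, "info": 5},
    }
})
print(summarize_next_vibe_metadata(sample))
```

In practice the text would come from reading the model.json file in the run's output folder.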

🎯 The Complete Action Catalog

Vibes are translated into 190+ specific actions organized into categories. You never need to name these actions — just describe what you want in natural language and the agent maps your intent.

Entity Management (1–25)

#	Action	What It Does
1	`drop`	Remove a domain, product, attribute, tag, or link
2	`create`	Add a new domain, product, or attribute
3	`rename`	Change the name of an entity
4	`alter_description`	Modify the description
5	`change_type`	Change an attribute's data type
6	`merge`	Combine two entities into one
7	`split`	Divide one entity into multiple
8	`move_product`	Move a product to another domain
9	`move_attribute`	Move an attribute to another product
10	`delete_attribute`	Remove an attribute
11	`create_link`	Create a foreign key relationship
12	`drop_link`	Remove a foreign key relationship
13–17	Prefix/suffix ops	Add, remove, or change name prefixes/suffixes
18–24	Metadata ops	Update glossary terms, references, regex, tags
25	`modify`	Regenerate an entity with specific guidance

Quality Check & Analysis (26–38)

#	Action	What It Does
26	`run_quality_checks`	Compound: runs detect_duplicates + dedupe_attributes + detect_cycles + detect_siloed + fix_fk_anomalies
26b	`run_product_domain_fit`	LLM audit: are products in the right domains? Relocates misplaced ones.
27	`detect_duplicates`	Find SSOT violations (semantic duplicate products across domains)
28	`fix_duplicates`	Auto-merge/remove detected duplicates
29	`dedupe_attributes`	Remove duplicate columns within products
30	`detect_cycles`	Find circular FK dependencies
31	`break_cycles`	Auto-fix cycles by removing weakest FK links
32	`detect_siloed`	Find completely disconnected tables (no FKs in any direction)
33	`fix_siloed`	Connect disconnected tables via linking attempts
34	`review_links`	Audit all FK relationships for anomalies
35	`fix_fk_anomalies`	Repair broken, mismatched, or orphaned FK references
36	`fix_ambiguous_fks`	Resolve FKs that match multiple target tables
37	`merge_small_tables`	Analyze and consolidate tables with < 5 attributes
38	`identify_core_products`	Identify business-critical products for protection
95	`model_checkup`	Mega-compound: runs static analysis + auto-queues ALL appropriate fixes

FK & Linking Operations (39–46, 133–138)

#	Action	What It Does
39	`run_linking`	Compound: in-domain + cross-domain + M:N detection
40	`run_in_domain_linking`	Link FKs within each domain
41	`run_cross_domain_linking`	Link FKs across different domains
42	`detect_many_to_many`	Find potential M:N relationships
43	`create_junction_tables`	Create bridge/association tables for M:N
45	`redirect_fk`	Redirect FK to a different target table
46	`find_unlinked_columns`	Find *_id columns without FK relationships
133	`remove_product_prefix`	Remove redundant table-name prefix from columns
134	`fix_fk_column_naming`	Fix FK columns that don't end with the target PK name
135	`connect_table`	Connect a specific disconnected table
136	`link_specific_columns`	Link an explicit list of unlinked _id columns
138	`find_missing_fk_links`	Comprehensive: LLM classifies each unlinked column as LINK, CREATE, DROP, or KEEP_AS_IS

Bulk Operations (62–64, 110–114)

#	Action	What It Does
62	`bulk_rename_products`	Rename multiple products by pattern
63	`bulk_drop_products`	Drop products matching a pattern (e.g., `stg_*`)
64	`bulk_move_products`	Move products matching a pattern to another domain
110	`bulk_change_type`	Change data type for attributes by pattern
111	`bulk_set_nullable`	Set nullable for attributes by pattern
113	`bulk_remove_attributes`	Remove attributes matching a pattern from all tables

Tag & Classification Operations (54–61, 98–106, 156–163)

#	Action	What It Does
54–57	Tag add/remove	Add or remove tags from products or domains
58	`conditional_tag`	Tag entities matching a condition
59–60	Bulk tagging	Tag by name pattern or by column presence
98–101	PII/sensitive marking	Mark as PII, sensitive, encrypted, deprecated
103	`set_table_type`	Classify as dimension, fact, lookup, bridge, staging, archive
156	`classify_table_tier`	Medallion architecture: bronze / silver / gold
161	`set_data_owner`	Assign data owner/steward
162	`set_update_frequency`	Document expected freshness (real_time, daily, monthly, etc.)
163	`map_to_source_system`	Map table/column to source system (SAP, Salesforce, etc.)

Template Column Actions (148–155)

#	Action	Columns Added
148	`add_scd_columns`	`effective_from`, `effective_to`, `is_current`, `row_hash`
149	`add_audit_columns`	`created_at`, `updated_at`, `created_by`, `updated_by`
150	`add_soft_delete_columns`	`is_deleted`, `deleted_at`, `deleted_by`
151	`add_temporal_columns`	`valid_from`, `valid_to`, `system_from`, `system_to`
152	`add_versioning_columns`	`version_number`, `version_valid_from`, `version_valid_to`, `is_latest_version`
153	`add_multitenancy_columns`	`tenant_id`
154	`add_lineage_columns`	`source_system`, `source_table`, `ingestion_timestamp`, `etl_job_id`
155	`add_gdpr_columns`	`consent_status`, `consent_date`, `data_subject_request_id`, `right_to_erasure_date`

Structural Transformations (164–191)

#	Action	What It Does
164	`normalize_to_3nf`	Apply Third Normal Form normalization (LLM-powered)
165	`denormalize_for_analytics`	Create wide/denormalized tables for BI
184	`promote_to_table`	Extract an attribute into its own lookup/reference table + FK
185	`inline_table`	Merge a child table into its parent (denormalize)
186	`swap_domains`	Atomically swap two domain names
187	`impact_analysis`	Show what would break if a table/domain were dropped (read-only)
190	`enlarge_model`	Wholesale expansion from MVM to ECM scope
191	`shrink_model`	Wholesale reduction from ECM to MVM scope
178	`VIBE_PRUNE_PROMPT`	LLM-powered: keep only tables related to a focus area
179	`drop_domains_except`	Drop all domains except a specified keep-list

Artifact Generation (139–147)

#	Action	Output
139	`generate_readme`	README documentation
140	`generate_data_model_json`	Complete model as JSON
141	`generate_ontology`	RDF/RDFS ontology (Turtle format)
142	`generate_dbml`	DBML schema for visualization tools
143	`generate_release_notes`	Changelog/release notes for the version
144	`generate_excel`	Excel workbook export
145	`generate_data_dictionary`	Comprehensive data dictionary
146	`export_model_report`	Full model documentation report
147	`generate_test_cases`	Data quality test case specifications

Metric View Operations (126–134)

#	Action	What It Does
126	`run_metric_modeling`	Generate Databricks metric views (dimensions + measures)
129	`add_metric_measure`	Add a specific KPI measure to metric views
130	`remove_metric_measure`	Remove a measure
131	`add_metric_dimension`	Add a grouping dimension
132	`remove_metric_dimension`	Remove a dimension
133	`alter_metric_filter`	Set/update metric view filter logic
134	`drop_metric_view`	Remove an entire metric view

🔄 Pipeline Stages

When you run the agent, it executes these progress stages in order (the # column is for reference only — no numeric stage IDs are emitted in progress events):

#	Stage	Duration	What Happens
1	Setup and Configuration	2–10s	Validates inputs, creates metamodel tables
2	Interpreting Instructions	10–30s	(Vibe mode only) Parses vibes into structured action plan
3	Collecting Business Context	10–30s	LLM enriches your description across business/industry dimensions
4	Designing Domains	15–60s	Generates domains following division model + SSOT
5	Creating Data Products	1–10m	Products per domain with architect review
6	Enriching Data Products with Attributes	5–40m	All columns for every product
7	Cross-Domain Linking	1–5m	In-domain → global sweep → pairwise comparison
8	Quality Assurance	30s–3m	9 sub-checks (naming, PK/types, overlaps, topology, auto-remediation)
9	Applying Naming Conventions	10–30s	Final naming consistency pass
10	Model Finalization	10–30s	Finalize logical model snapshot before physical deployment
11	Subdomain Allocation	10–30s	Allocates products into semantic subdomains
19	Generating Metric View Artifacts	10–30s	Exports metric view definitions (legacy stage ID)
17	Generating Artifacts	30s–2m	README, Excel/CSV, model JSON, data dictionary, model report
12	Physical Schema Construction	1–10m	Creates UC schemas + Delta tables (if catalog set)
13	Applying Foreign Keys	30s–2m	FK constraints on physical tables
14	Applying Tags	2–15m	Classification, PII, data type tags on schemas/tables/columns
15	Applying Metric Views	30s–5m	Databricks metric views for KPI tracking
16	Generating Sample Data	1–15m	Synthetic records respecting FK relationships
18	Consolidation and Cleanup	10–30s	Consolidates/merges metadata and cleanup

📦 Output Artifacts

Logical Artifacts (Always Generated)

File	Description
`model.json`	Complete model export (primary output)
`readme.md`	Human-readable model documentation
`vibes/current_vibes.txt`	Vibe instructions used in this run
`vibes/next_vibes.txt`	Recommended next vibe instructions
`domains.json`	All domain definitions
`products.json`	All product (table) definitions
`attributes.json`	All attribute (column) definitions
`docs/.xlsx` / `docs/.csv`	Excel/CSV export of the entire model (CSV fallback if openpyxl unavailable)
`readme.md` (parent folder)	Model overview document comparing MVM and ECM scopes
`schemas/*.sql`	SQL DDL files (per-domain schemas, cross-domain FKs, catalogs)
`diagram/_dbml_.txt`	DBML schema diagram
`ontology/_rdf_.ttl`	RDF/Turtle ontology representation
`docs/releasenotes.txt`	Auto-generated release notes

Conditional Artifacts (Generated on Demand via Queued Operations)

File	Description
`docs/_data_dictionary_.txt`	Column-level reference guide
`docs/_model_report_.txt`	Full documentation report
`docs/_test_cases_.txt`	Generated test cases

Physical Deployment (When Catalog Is Set)

Object	Example
Schemas	`catalog.customer`, `catalog.sales`, `catalog.logistics`
Tables	Delta tables with all columns and correct Spark SQL types
FK Constraints	Physical foreign key constraints between tables
Tags	Unity Catalog tags on schemas, tables, and columns
Metric Views	Databricks metric views for KPI calculation
Sample Data	Synthetic records with valid FK references
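
The sample data row above, read together with the Sample Data rule (G13) in the quality rules, implies sequential primary keys starting at 10001 and foreign keys drawn from the parent table's key range. The sketch below illustrates that arithmetic only. `generate_rows`, its parameters, and the `fk_targets` shape are hypothetical names for this example, not part of the agent's API.

```python
import random

PK_START = 10001  # first primary key value, per the Sample Data rule (G13)


def generate_rows(pk_column, record_count, fk_targets=None, seed=0):
    """Return exactly `record_count` rows with sequential PKs and in-range FKs.

    `fk_targets` (hypothetical shape) maps an FK column name to the record
    count N of the parent table, so FK values fall in [10001, 10001 + N - 1].
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    fk_targets = fk_targets or {}
    rows = []
    for offset in range(record_count):
        row = {pk_column: PK_START + offset}
        for fk_column, parent_count in fk_targets.items():
            # randint is inclusive on both ends, matching the closed interval
            row[fk_column] = rng.randint(PK_START, PK_START + parent_count - 1)
        rows.append(row)
    return rows


customers = generate_rows("customer_id", 5)
orders = generate_rows("order_id", 20, fk_targets={"customer_id": 5})
```

Because every `customer_id` in `orders` falls inside the key range of `customers`, the FK constraints applied during physical deployment would hold for data generated this way.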

🛡️ Quality Rules & Enforcement

The agent enforces a comprehensive rule system during generation and QA:

Rule Group	Key Rules
Naming (G01)	snake_case default; domains: 1 word, singular, max 20 chars; products: 1–3 words, max 30 chars, no domain prefix; attributes: max 50 chars, no product prefix (except PK); FKs must end with target PK name; preserve unit qualifiers
Semantic Dedup (G02)	First-Class Entity Test (5 criteria); SSOT violation detection; 60%+ overlap → merge to shared; 90%+ overlap same domain → remove; attribute dedup >80% confidence; semantic distinction rules (method vs channel, target vs actual, lifecycle timestamps)
FK Integrity (G03)	FK target must exist; no bidirectional FKs; DAG required (no cycles); FK type compatibility; system identifiers exempted; hierarchical self-refs exempt
PK Rules (G04)	Every product has exactly 1 PK; PK = {product}_{suffix}; PK type = configured type (default BIGINT); PK exempt from prefix stripping
Normalization (G05)	3NF enforcement; orphaned FK detection; denormalized attribute detection with >95% confidence; point-in-time snapshots exempt; measurement attributes exempt; geographic coordinates on physical entities exempt
Division Balance (G06)	Ops + Business ≥ 80% of domains; Corporate ≤ 20%; no Corporate until Ops AND Business each have ≥ 2; cross-division relocation forbidden; forbidden generic domain names; Org Chart Test; Fragmentation Test (30%+ overlap → merge)
Data Types (G07)	Spark SQL types only; no ARRAY/STRUCT/MAP; no calculated metrics/aggregates
Tags (G08)	Classification in tags only; PII = RESTRICTED + specific PII tag (pii_email, pii_phone, pii_financial, pii_health, pii_identifier, pii_address, pii_biometric, pii_name, pii_dob); empty tags for regular data; custom user tags always applied
Graph Topology (G09)	DAG enforcement; DFS cycle detection; hierarchical self-ref exemption; zero siloed tables; each domain ≥2 cross-domain connections; parent-child FKs never broken in cycle resolution
Honesty Check (G11)	0–100 scale; below 55 = permanently discarded; 55–70 = borderline retry; ≥90 = accepted; contradiction penalty applied post-processing
Product Design (G12)	M:N requires 3 indicators with ≥2 of 3 strong; association ratio ECM ≤15%, MVM ≤5%; core products 1–3 per domain; forbidden product suffixes (_analysis, _analytics, _report, _prediction); Silver Layer only (no analytics products)
Sample Data (G13)	Exact record count; sequential PK from 10001; FK random [10001, 10001+N-1]; regex compliance; 3-letter country codes; no Lorem Ipsum; realistic business data
Vibe Constraints (G14)	Dedup overlap thresholds; max relocation percentage per pass; normalization confidence 95%; mutation budgets by mode (surgical, holistic, generative); domain hard ceiling factor 1.5x
Physical Schema Deployment (G15)	Schema and table creation factories; attribute dict factory; product dict factory; FK dependency ordering; consolidation and cleanup
Subdomains	Exactly 2-word names; min products per subdomain per tier; no overlapping words; balanced distribution; no placeholder names
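
The FK Integrity (G03) and Graph Topology (G09) rules above require the foreign key graph to be a DAG, detected via DFS, with hierarchical self-references exempt. A minimal sketch of such a check follows. It is an illustration of the technique, not the agent's implementation, and `find_fk_cycle` and the `(child_table, parent_table)` edge format are assumptions made for this example.

```python
def find_fk_cycle(fk_edges):
    """Return one cycle as a list of table names, or None if the graph is a DAG.

    `fk_edges` is an iterable of (child_table, parent_table) pairs.
    Self-references are skipped, mirroring the hierarchical self-ref exemption.
    """
    graph = {}
    for child, parent in fk_edges:
        graph.setdefault(child, [])
        graph.setdefault(parent, [])
        if child != parent:
            graph[child].append(parent)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current DFS path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GREY:
                # Back edge: nxt is already on the current path, so we have a cycle
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


edges = [("order", "customer"), ("order_line", "order"), ("employee", "employee")]
clean = find_fk_cycle(edges)                              # no cycle expected
broken = find_fk_cycle(edges + [("customer", "order")])   # bidirectional FK
```

A bidirectional FK pair is simply a cycle of length two, so the same check covers the "no bidirectional FKs" rule.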

🤖 LLM Architecture

The agent uses a multi-model ensemble with automatic demotion and recovery:

Order	Role	Purpose	Endpoint	Input Tokens	Output Tokens
10	Thinker (large)	Complex reasoning, architecture reviews, QA decisions	`databricks-claude-opus-4-6`	200,000	128,000
20	Worker (large)	High-volume generation: products, attributes, FKs, dedup	`databricks-claude-sonnet-4-6`	200,000	64,000
30	Thinker (large)	Fallback thinker	`databricks-claude-opus-4-5`	200,000	64,000
40	Worker (large)	Fallback worker	`databricks-claude-sonnet-4-5`	200,000	64,000
50	Worker (small)	Simpler tasks: domain generation, tag classification	`databricks-gpt-oss-120b`	131,072	25,000
60	Worker (tiny)	Sample data generation	`databricks-gpt-oss-20b`	131,072	25,000

Automatic model demotion: After 3 cumulative failures on a model, the priority order is adjusted: the failing model is pushed down and the next model takes its place. After 5 consecutive successes, the original order is restored. After 3 consecutive timeouts, a model is marked broken and skipped for the rest of the session.
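
The demotion and recovery behavior can be pictured as a small state machine. The sketch below is one possible reading of the thresholds stated above, not the agent's actual code. The `ModelRouter` class and its method names are hypothetical, and it makes three labeled assumptions: a demoted model moves down by one position, a timeout also counts as a failure, and the success streak is counted across all models.

```python
class ModelRouter:
    """Illustrative priority list with demotion, recovery, and broken marking."""

    FAILURE_LIMIT = 3   # cumulative failures before a model is demoted
    SUCCESS_LIMIT = 5   # consecutive successes before the order is restored
    TIMEOUT_LIMIT = 3   # consecutive timeouts before a model is marked broken

    def __init__(self, endpoints):
        self.original = list(endpoints)
        self.order = list(endpoints)
        self.failures = {e: 0 for e in endpoints}   # cumulative per model
        self.timeouts = {e: 0 for e in endpoints}   # consecutive per model
        self.broken = set()
        self.success_streak = 0

    def active(self):
        """Models still eligible this session, in current priority order."""
        return [e for e in self.order if e not in self.broken]

    def record_success(self, endpoint):
        self.timeouts[endpoint] = 0
        self.success_streak += 1
        if self.success_streak >= self.SUCCESS_LIMIT:
            # Recovery: restore the original order and clear failure counts
            self.order = list(self.original)
            self.failures = {e: 0 for e in self.original}
            self.success_streak = 0

    def record_failure(self, endpoint, timed_out=False):
        self.success_streak = 0
        self.failures[endpoint] += 1  # assumption: timeouts count as failures
        if timed_out:
            self.timeouts[endpoint] += 1
            if self.timeouts[endpoint] >= self.TIMEOUT_LIMIT:
                self.broken.add(endpoint)
        else:
            self.timeouts[endpoint] = 0
        if self.failures[endpoint] >= self.FAILURE_LIMIT:
            self._demote(endpoint)
            self.failures[endpoint] = 0

    def _demote(self, endpoint):
        # Assumption: "pushed down" means swapping with the next model
        i = self.order.index(endpoint)
        if i + 1 < len(self.order):
            self.order[i], self.order[i + 1] = self.order[i + 1], self.order[i]


router = ModelRouter(["databricks-claude-opus-4-6", "databricks-claude-opus-4-5"])
for _ in range(3):
    router.record_failure("databricks-claude-opus-4-6")
demoted_order = list(router.order)
```

After the three failures, `demoted_order` lists the fallback thinker first. Five subsequent successes would restore the original order.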

Prompt architecture: 49 specialized prompts, each mapped to a specific model role and temperature setting. Thinker prompts use temperature 0 for deterministic reasoning. Worker prompts use temperature 0–0.3 for controlled generation. Sample generation uses temperature 0.5 for creative variety.

📊 Metric Views

The agent generates Databricks metric views — reusable KPI definitions that sit on top of your data tables:

Component	Description	Example
Dimensions	Grouping columns	`region`, `product_category`, `fiscal_quarter`
Measures	Single-aggregate expressions	`SUM(revenue)`, `COUNT(DISTINCT customer_id)`, `AVG(order_value)`
Filters	Row-level predicates	`WHERE status = 'completed'`

Metric views are auto-generated per domain, focusing on KPIs that would appear in executive dashboards and quarterly business reviews. Each measure uses a single aggregate function (nested aggregates like AVG(SUM(...)) are not supported by metric view YAML and are automatically prevented).
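
To make the single-aggregate constraint concrete, the sketch below checks measure expressions for nested aggregates before they would be written into a metric view definition. It is a simplified illustration, not the agent's validator. `has_nested_aggregate` and the aggregate list are assumptions for this example, and the scanner ignores string literals and window functions.

```python
import re

# Assumed set of aggregate function names for this illustration
AGGREGATES = ("SUM", "COUNT", "AVG", "MIN", "MAX")
_AGG_CALL = re.compile(r"\b(" + "|".join(AGGREGATES) + r")\s*\(", re.IGNORECASE)


def has_nested_aggregate(expr):
    """Return True if an aggregate call appears inside another aggregate call."""
    stack = []  # one bool per open parenthesis: True if opened by an aggregate
    i = 0
    while i < len(expr):
        match = _AGG_CALL.match(expr, i)
        if match:
            if any(stack):
                return True  # already inside an aggregate's parentheses
            stack.append(True)
            i = match.end()
            continue
        ch = expr[i]
        if ch == "(":
            stack.append(False)
        elif ch == ")" and stack:
            stack.pop()
        i += 1
    return False


measures = {
    "total_revenue": "SUM(revenue)",
    "unique_customers": "COUNT(DISTINCT customer_id)",
    "avg_of_sums": "AVG(SUM(revenue))",
}
rejected = [name for name, expr in measures.items() if has_nested_aggregate(expr)]
```

Here only `avg_of_sums` is rejected. Ratios of two separate aggregates, such as `SUM(a) / COUNT(b)`, pass because neither call sits inside the other.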

🔧 Troubleshooting

Symptom	Likely Cause	Resolution
"No deployment catalog specified — skipping physical model deployment"	Widget 09 is empty	Set deployment catalog to deploy physical tables
Model has too few domains	MVM scope + tier_5 classification	Switch to ECM, or provide seed domains in widget 06
Model has irrelevant domains	LLM inferred from description	Vibe: `"Remove the X domain"`
FK pointing to wrong table	LLM linked incorrectly	Vibe: `"Redirect order.warehouse_id FK to logistics.warehouse"`
Circular dependency warning	DAG violation	Agent auto-fixes during QA; if persistent: `"Break cycles"`
SSOT violation (duplicate entities)	Same concept in multiple domains	Vibe: `"Run quality checks"` or `"Fix duplicates"`
Pipeline crashed mid-run	LLM timeout or transient error	Re-run with same inputs — agent cleans up incomplete versions
Model JSON file parse error	Invalid JSON (smart quotes, trailing commas)	Validate JSON; use straight quotes only
"Version X exists but is incomplete"	Previous run failed	Agent auto-detects — just re-run
Model too large / too many tables	ECM + high complexity tier	Use `shrink ecm`, or vibe: `"Keep only core business tables"`
Convention changes not applied	Convention widgets not set	Update the model convention widgets (12–24) with your desired values before running
Metric views have `COUNT(1)` instead of real KPIs	Nested aggregates were auto-replaced	The LLM prompt is designed to prevent this; if it still occurs, re-run metrics: `"Regenerate metrics"`
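
For the "Model JSON file parse error" row above, a quick pre-flight check can locate smart quotes and syntax errors such as trailing commas before the file is handed to the agent. This is a minimal sketch using only the Python standard library. `check_model_json` is a hypothetical helper, not something shipped with the agent.

```python
import json

# Typographic quotes that editors often substitute for straight quotes
SMART_QUOTES = ("\u201c", "\u201d", "\u2018", "\u2019")


def check_model_json(text):
    """Return a list of human-readable problems found in `text`."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in SMART_QUOTES:
            if ch in line:
                problems.append(
                    f"line {lineno}: smart quote {ch!r} found; use straight quotes"
                )
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # JSONDecodeError carries the position of the first syntax error
        problems.append(f"line {exc.lineno}, column {exc.colno}: {exc.msg}")
    return problems


ok = check_model_json('{"domains": []}')
trailing = check_model_json('{"domains": [1, 2,]}')
curly = check_model_json('{\u201cdomains\u201d: []}')
```

Note that smart quotes inside a string value are technically valid JSON, so the helper may flag them even when parsing succeeds. Treat those entries as warnings.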

📘 Glossary

Term	Definition
Attribute	A column in a product (table) — has name, type, tags, description
Business Data Model	A data model tailored to a specific organization, generated by Vibe Modelling
Corporate Division	Supporting functions (HR, Finance, Legal) that enable but don't directly generate revenue
DAG	Directed Acyclic Graph — the required topology for FK relationships (no cycles)
Division	Top-level organizational grouping: Operations, Business, or Corporate
Domain	A logical grouping of related tables, deployed as a Unity Catalog schema
ECM	Expanded Coverage Model — comprehensive enterprise-grade model scope
FK	Foreign Key — a column referencing another table's primary key
Honesty Check	LLM self-assessment score (0–100) to ensure output quality
Industry Data Model	A generic, vendor-published schema template for an entire industry vertical
Junction Table	An association table resolving M:N relationships between two entities
Metric View	A Databricks KPI definition with dimensions, measures, and filters
MVM	Minimum Viable Model — lean, production-ready scope (30–50% of ECM)
PK	Primary Key — the unique identifier column for a table
Product	A data table within a domain — a first-class business entity
SSOT	Single Source of Truth — each business concept has one and only one authoritative owner
Tier	Industry complexity classification (tier_1 = Ultra-Complex to tier_5 = Simple)
Vibe	A natural language instruction provided to the agent to modify the model
Vibe Modelling	The iterative process of generating and refining data models using natural language

Built on Databricks Serverless Compute with Unity Catalog governance
