Describe your business. Get a data model. Vibe it until it's perfect.
- Documentation
- What Is Vibe Modelling?
- Concepts
- Getting Started
- Operations
- The Vibing Workflow
- Widget Reference
- Auto-Generated Next Vibes
- The Complete Action Catalog
- Pipeline Stages
- Output Artifacts
- Quality Rules & Enforcement
- LLM Architecture
- Metric Views
- Troubleshooting
- Glossary
| Document | Description |
|---|---|
| docs/ | Documentation index — whitepaper, design guide, integration guide |
| docs/design-guide.md | Technical design reference |
| docs/integration-guide.md | UI/consumer integration protocol |
| docs/whitepaper.md | Philosophy and complete rules catalog |
| runner/readme.md | Pipeline orchestrator guide |
| tests/readme.md | Test suite reference |
Vibe Modelling is a Databricks-native, LLM-powered approach to generating enterprise data models from natural language. Instead of manually drawing ER diagrams, writing DDL, or importing pre-built industry templates, you describe your business in plain English and the agent builds a complete, production-grade data model — domains, tables, columns, foreign keys, tags, sample data, and documentation — end to end.
The name "Vibe" reflects the core workflow:
Generate a base model → review → vibe it with natural language → repeat → deploy
Each iteration produces a new version. The agent carries forward your context so nothing is lost between runs. You are never locked into a static template — the model evolves with your business.
What is an Industry Data Model?
An industry data model is a generic, one-size-fits-all template designed for an entire vertical — retail, banking, healthcare, telecoms, etc. Organizations like the TM Forum (telecoms), ARTS (retail), ACORD (insurance), and HL7 (healthcare) publish canonical schemas that attempt to cover every conceivable entity and relationship across the industry.
The problem: When you adopt an industry model, you inherit everything — including the 60–80% of tables that your business will never use. You then spend months trimming, renaming, and reshaping the model to fit your actual processes.
What is a Business Data Model?
A business data model is tailored, contextualized, and specific to YOUR organization. It reflects your actual business processes, your product lines, your org structure, your regulatory environment, your terminology, and your governance requirements.
Vibe Modelling generates business data models. The LLM understands your industry deeply (via the complexity tier system), but the model it produces is shaped entirely by your context.
| Aspect | Industry Data Model | Business Data Model (Vibe) |
|---|---|---|
| Scope | Entire industry vertical | Your specific business |
| Customization | Post-delivery (manual pruning) | Built-in (LLM-driven from your context) |
| Relevance | 20–40% directly applicable | 90–100% directly applicable |
| Time to production | Months of adaptation work | Hours with iterative vibing |
| Naming | Committee-standard naming | Your business terminology and conventions |
| Evolution | New version = re-adoption | Vibe the next version from the previous one |
| Cost | License fees + adaptation labor | Compute cost of LLM generation |
1. GENERATE → Describe your business, get a base model
2. VIBE IT → Review output, provide natural-language refinements
3. REPEAT → Each iteration = new version; agent auto-suggests next vibes
4. DEPLOY → Physical Unity Catalog schemas, tables, FKs, tags, sample data
Every Vibe model follows a strict four-level hierarchy:
```mermaid
graph TD
    B["🏢 BUSINESS<br/><i>e.g., Contoso Manufacturing</i>"]
    B --> D1["⚙️ Operations Division"]
    B --> D2["💼 Business Division"]
    B --> D3["🏛️ Corporate Division"]
    D1 --> DOM1["📦 logistics"]
    D1 --> DOM2["🏭 production"]
    D1 --> DOM3["📋 inventory"]
    D2 --> DOM4["👤 customer"]
    D2 --> DOM5["💰 sales"]
    D2 --> DOM6["🧾 billing"]
    D3 --> DOM7["👥 hr"]
    D3 --> DOM8["📊 finance"]
    DOM5 --> P1["🗂️ order"]
    DOM5 --> P2["🗂️ order_item"]
    DOM5 --> P3["🗂️ quote"]
    P1 --> A1["🔑 order_id <i>(PK, BIGINT)</i>"]
    P1 --> A2["📅 order_date <i>(DATE)</i>"]
    P1 --> A3["🔗 customer_id <i>(FK → customer.customer)</i>"]
    P1 --> A4["💲 total_amount <i>(DECIMAL)</i>"]
    P1 --> A5["📌 status <i>(STRING)</i>"]
```
Divisions are the top-level organizational grouping. Every domain belongs to exactly one division.
| Division | Purpose | Typical Domains |
|---|---|---|
| Operations | Core operational backbone — mechanisms, infrastructure, processes | Logistics, production, inventory, service delivery, supply chain, quality control |
| Business | Revenue-generating and customer-facing functions | Customer/party, billing/revenue, product catalog, sales, subscriptions |
| Corporate | Supporting functions for governance (NOT directly revenue-generating) | HR, finance, legal/compliance, marketing, procurement, IT |
Division Balance Rules:
- Operations + Business MUST be >= 80% of all domains
- Corporate is capped at <= 20% of total domains
- No Corporate domain allowed until Operations AND Business each have >= 2 domains
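The three balance rules above are mechanical, so they can be checked directly. A minimal sketch (illustrative only, not the agent's actual enforcement code) that validates a domain-to-division assignment:

```python
# Hypothetical sketch: validate the division balance rules for a
# {domain: division} mapping. Not the pipeline's actual implementation.
from collections import Counter

def check_division_balance(domain_divisions: dict[str, str]) -> list[str]:
    """Return a list of rule violations (empty list = all rules pass)."""
    counts = Counter(domain_divisions.values())
    total = sum(counts.values())
    ops, biz, corp = counts["Operations"], counts["Business"], counts["Corporate"]
    violations = []
    if (ops + biz) / total < 0.80:
        violations.append("Operations + Business below 80% of domains")
    if corp / total > 0.20:
        violations.append("Corporate exceeds 20% of domains")
    if corp > 0 and (ops < 2 or biz < 2):
        violations.append("Corporate present before Operations and Business each have 2 domains")
    return violations

model = {
    "logistics": "Operations", "production": "Operations",
    "customer": "Business", "sales": "Business", "hr": "Corporate",
}
print(check_division_balance(model))  # [] — all three rules pass
```

With five domains (2 Operations, 2 Business, 1 Corporate), Operations + Business is exactly 80% and Corporate exactly 20%, so this is the smallest mix that admits a Corporate domain.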
A domain is a logical grouping of related data products (tables). When deployed, each domain maps 1:1 to a Unity Catalog schema.
- Named in `snake_case`, exactly one word (e.g., `customer`, `fulfillment`, `logistics`)
- Count depends on model scope (MVM vs ECM) and industry complexity tier
- The `shared` domain is reserved — the pipeline auto-creates it during SSOT consolidation; never create it manually
A product is a data table within a domain — a first-class business entity with its own identity, lifecycle, and attributes.
- Every product gets a PK column: `<product_name>_<pk_suffix>` (default: `<product_name>_id`)
- Classified as CORE (entities stakeholders query directly) or HELPER (supporting entities)
- Tagged by data type: `master_data` | `transactional_data` | `reference_data` | `association_table`
The First-Class Entity Test: A product must have its own identity, its own lifecycle, and at least 5 unique business attributes. Anything less should be an attribute on another table or merged.
Each attribute has:
| Property | Example | Purpose |
|---|---|---|
| `attribute` | `customer_email` | Column name (snake_case) |
| `type` | `STRING` | Apache Spark SQL data type |
| `tags` | `restricted`, `pii_email` | Classification + PII tags |
| `value_regex` | `^[a-zA-Z0-9._%+-]+@` | Validation pattern or enum |
| `business_glossary_term` | Customer Email Address | Human-readable business name |
| `description` | Primary contact email for the customer | What this attribute represents |
| `reference` | GDPR Article 6 | Regulatory or standard reference |
| `foreign_key_to` | `customer.customer.customer_id` | FK target (domain.product.pk), or empty |
FK relationships MUST form a Directed Acyclic Graph (DAG):
✅ VALID: order → customer → address
❌ INVALID: order → customer → address → order (cycle!)
Why it matters: Circular dependencies prevent clean data loading order, break ETL pipelines, and indicate modeling errors.
How the agent enforces it:
- Detect — Python DFS (Depth-First Search) cycle detection during QA
- Break — LLM Cycle Break specialist determines which FK to remove (always the weakest link)
- Verify — Re-run DFS to confirm graph is now a valid DAG
- Iterate — Up to 5 rounds to resolve all cycles, including residual ones
Allowed exception: Self-referencing hierarchical FKs (e.g., `parent_category_id → category.category_id`) are permitted — they represent tree structures, not multi-node cycles.
Each core business concept has exactly ONE authoritative domain and ONE authoritative product (table) that owns it. No concept is duplicated across domains.
What SSOT prevents:
- Two `customer` tables in different domains
- A `product` table in both `sales` and `inventory`
- An `invoice` table in both `billing` and `finance`
How SSOT is enforced:
| Phase | Mechanism |
|---|---|
| Generation | LLM places each entity in its authoritative domain — the domain that CREATES/OWNS that data |
| QA Deduplication | Global pass detects same-name products + synonym pairs (e.g., customer vs client) with 60%+ attribute overlap |
| Consolidation | Overlapping products merged into shared domain with discriminator column |
| Cross-Domain References | Other domains use FK columns to reference the authoritative table — no duplication |
The "Where Do I Go?" Test: For any entity, ask: "If a business user needs this information, is there exactly ONE place to get it?" If the answer is ambiguous, there is an SSOT violation.
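The deduplication pass described above (same-name or synonym products with 60%+ attribute overlap) can be approximated in a few lines. A hedged sketch, not the pipeline's actual code; the synonym list and overlap metric are illustrative assumptions:

```python
# Illustrative SSOT duplicate check: flag product pairs that share a name
# (or a known synonym) AND overlap on 60%+ of their attributes.
SYNONYMS = {("customer", "client"), ("vendor", "supplier")}  # example pairs

def attribute_overlap(a: set[str], b: set[str]) -> float:
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def find_ssot_violations(products: dict[str, set[str]]) -> list[tuple[str, str]]:
    """products maps 'domain.product' -> set of attribute names."""
    names = sorted(products)
    hits = []
    for i, p1 in enumerate(names):
        for p2 in names[i + 1:]:
            n1, n2 = p1.split(".")[1], p2.split(".")[1]
            same_name = n1 == n2
            synonym = (n1, n2) in SYNONYMS or (n2, n1) in SYNONYMS
            if (same_name or synonym) and attribute_overlap(products[p1], products[p2]) >= 0.60:
                hits.append((p1, p2))
    return hits

model = {
    "sales.customer": {"customer_id", "name", "email", "segment"},
    "billing.client": {"client_id", "name", "email", "segment"},
    "sales.order":    {"order_id", "order_date"},
}
print(find_ssot_violations(model))  # [('billing.client', 'sales.customer')]
```

Here `customer` and `client` are synonyms sharing 3 of 4 attributes (75% overlap), so the pair is flagged for consolidation into the `shared` domain.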
| | MVM (Minimum Viable Model) | ECM (Expanded Coverage Model) |
|---|---|---|
| Size | 30–50% of ECM table count | Full coverage |
| Attribute depth | SAME as ECM — full production-grade (same min/max per tier) | Full production-grade |
| Domains | Essential business functions only | All functions including corporate back-office |
| Ideal for | SMBs, rapid deployments, POCs, dev/test | Fortune 100, multinational enterprises |
| Lightness | Fewer domains & tables (NEVER thinner attributes) | Maximum breadth |
MVM is NOT a skeleton or demo toy. It is a production-ready subset where every delivered table is fully-featured.
Every domain is organized into subdomains — semantic groupings of related products within a domain. Subdomains provide an additional organizational layer between domains and products.
| Rule | Description |
|---|---|
| Count | Minimum and maximum subdomains per domain are defined by the complexity tier. Never exactly 1 subdomain per domain. |
| Naming | EXACTLY 2 words per subdomain name. 1 word = rejected. 3+ words = rejected. |
| Min Products | Every subdomain must contain at least the minimum products per subdomain defined by the tier: ECM tier_1–tier_4: 3, ECM tier_5: 2; MVM tier_1–tier_3: 3, MVM tier_4–tier_5: 2. |
| No Overlap | No two subdomains within the same domain may share any word in their names. |
| Business Terms | Use business terminology, NOT technical terms. |
| Balanced | Products distributed as evenly as possible across subdomains. |
| No Placeholders | NEVER use: "Sub Domain1", "Category 1", "Group A", "N/A", "Other", "General", "Miscellaneous", "Uncategorized". |
| No Drift | Each subdomain belongs to exactly one parent domain. |
Example: In the `party` domain:
- `identity`: individual, organization, party_identification, kyc_verification
- `engagement`: party_interaction, consent_record, loyalty_enrollment
The agent auto-classifies your business into one of five tiers:
Classification is based on 7 scoring dimensions (scored 0 or 1 each):
- Regulatory density — 3+ distinct regulatory bodies imposing data/reporting requirements
- Party complexity — 3+ distinct party types (customers, suppliers, partners, employees, etc.)
- Product hierarchy depth — 50+ product variants with complex bundling/pricing/lifecycle
- Infrastructure management — Owns physical or digital infrastructure requiring asset tracking
- Industry canonical model — 200+ entity types defined by an industry standards body
- Transaction complexity — 10+ distinct transaction types with multi-step lifecycles
- Operational system landscape — 5+ major systems of record across business functions
Rule: If the business falls between two tiers, classify UP (prefer the more complex tier).
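Conceptually, the seven binary dimensions sum to a score of 0–7 that maps to a tier. The thresholds below are assumptions for illustration only — the actual score-to-tier mapping is internal to the agent — but the sketch shows the "classify up" idea: a borderline score takes the more complex (lower-numbered) tier.

```python
# Hypothetical tier classifier. The 7 dimensions come from the list above;
# the numeric thresholds are ASSUMED for illustration, not documented values.
DIMENSIONS = [
    "regulatory_density", "party_complexity", "product_hierarchy_depth",
    "infrastructure_management", "industry_canonical_model",
    "transaction_complexity", "operational_system_landscape",
]

def classify_tier(scores: dict[str, int]) -> str:
    total = sum(scores.get(d, 0) for d in DIMENSIONS)  # 0..7
    # Boundaries favor the more complex tier ("classify up").
    if total >= 6:
        return "tier_1"
    if total >= 5:
        return "tier_2"
    if total >= 3:
        return "tier_3"
    if total >= 2:
        return "tier_4"
    return "tier_5"

manufacturing = {"party_complexity": 1, "product_hierarchy_depth": 1,
                 "transaction_complexity": 1}
print(classify_tier(manufacturing))  # tier_3 (score 3)
```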
| Tier | Label | Hallmarks | ECM Domains | ECM Products/Domain | Attrs/Product | Subdomains/Domain |
|---|---|---|---|---|---|---|
| `tier_1` | Ultra-Complex | 5+ regulatory bodies, multi-entity structures (banking, insurance, pharma) | 15–22 | 14–28 | 15–50 | 3–6 |
| `tier_2` | Complex | 2–4 regulatory bodies, multi-channel (telecoms, energy, healthcare) | 12–18 | 14–26 | 12–50 | 2–5 |
| `tier_3` | Moderate | 1–2 regulatory bodies, 3+ business functions (manufacturing, retail) | 10–15 | 12–24 | 10–45 | 2–5 |
| `tier_4` | Standard | Light regulation, regional complexity (logistics, agriculture) | 8–12 | 10–20 | 10–40 | 2–4 |
| `tier_5` | Simple | Minimal regulation, service-based (consulting, SaaS, media) | 5–8 | 8–18 | 8–35 | 2–4 |
MVM Tier Sizing (attribute depth is the SAME as ECM):
| Tier | MVM Domains | MVM Products/Domain | Subdomains/Domain |
|---|---|---|---|
| `tier_1` | 9–14 | 8–16 | 2–4 |
| `tier_2` | 8–12 | 8–14 | 2–4 |
| `tier_3` | 6–10 | 7–13 | 2–4 |
| `tier_4` | 5–8 | 6–11 | 2–3 |
| `tier_5` | 3–6 | 5–10 | 2–3 |
MVM counts are automatically derived as ~30–50% of the ECM counts for each tier. Attribute depth (min/max attributes per product) is the same for both MVM and ECM within each tier.
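A quick arithmetic check of the ~30–50% claim for tier_3, using the total table count (domains × products per domain) from the two tables above. Illustrative calculation only:

```python
# Tier_3 total-table-count ranges, from the ECM and MVM sizing tables:
ecm_tables_min, ecm_tables_max = 10 * 12, 15 * 24   # 120 .. 360 tables
mvm_tables_min, mvm_tables_max = 6 * 7, 10 * 13     #  42 .. 130 tables

# The MVM/ECM ratio stays inside the documented ~30-50% band:
print(round(mvm_tables_min / ecm_tables_min, 2))    # 0.35
print(round(mvm_tables_max / ecm_tables_max, 2))    # 0.36
```

Note the ratio holds for table counts, not domain counts (6–10 MVM domains is well over half of 10–15 ECM domains); the leanness comes from having both fewer domains and fewer products per domain.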
Recipe 1: Generate Your First Model (Minimal Input)
| Widget | Value |
|---|---|
| 01. Business | My Company Name |
| 02. Description | A brief description of what your company does |
| 03. Operation | new base model |
| 05. Model Scope | Minimum Viable Model - MVM |
| 09. Installation Catalog | my_catalog (or leave blank for logical-only) |
| Everything else | Defaults |
Run the notebook. Done.
Recipe 2: Generate a Rich Model (Recommended)
| Widget | Value |
|---|---|
| 01. Business | Contoso Manufacturing |
| 02. Description | A multinational aluminum smelting company operating across 12 countries with 15,000 employees |
| 03. Operation | new base model |
| 05. Model Scope | Expanded Coverage Model - ECM |
| 06. Business Domains | production, quality, supply, customer, sales, logistics, billing |
| 07. Org Divisions | Operations, Business and Corporate |
| 09. Installation Catalog | contoso_dev |
| 10. Sample Records | 10 |
Recipe 3: Vibe an Existing Model
| Widget | Value |
|---|---|
| 01. Business | Contoso Manufacturing |
| 02. Description | (same as before) |
| 03. Operation | vibe modeling of version |
| 04. Version | 1 |
| 08. Model Vibes | Add a warranty domain. Remove the corporate_strategy domain. Run quality checks. |
Recipe 4: Use Auto-Generated Next Vibes
After any pipeline run, find next_vibes.txt in the vibes/ output folder. Copy the suggested vibes into widget 08.
| Widget | Value |
|---|---|
| 01. Business | Contoso Manufacturing |
| 02. Description | (same as before) |
| 03. Operation | vibe modeling of version |
| 04. Version | Next version number (previous + 1) |
| 08. Model Vibes | (paste recommended vibes from next_vibes.txt) |
Recipe 5: Deploy to a New Catalog
| Widget | Value |
|---|---|
| 03. Operation | install model |
| 09. Installation Catalog | prod_catalog |
| 11. Model JSON File | /Volumes/catalog/schema/volume/vibes/contoso/v1_ecm/model.json |
| 10. Sample Records | 0 (or a number for test data) |
The 03. Operation widget selects which pipeline mode to run:
| Operation | Purpose | Key Requirements |
|---|---|---|
| `new base model` | Generate a brand-new data model from scratch | Business name + description (via widgets) |
| `vibe modeling of version` | Apply natural-language instructions to refine an existing version | Version + vibes (widget 08) |
| `shrink ecm` | Convert an ECM to a leaner MVM | Version + deployment catalog |
| `enlarge mvm` | Expand an MVM into a comprehensive ECM | Version + deployment catalog |
| `install model` | Deploy a logical model into physical Unity Catalog objects | Model JSON file (widget 11) + deployment catalog |
| `uninstall model version` | Remove a version's physical artifacts from the catalog | Business name + version + catalog |
| `generate sample data` | Generate synthetic sample records for an existing deployed model | Model JSON file (widget 11) + deployment catalog |
This is the core power of Vibe Modelling — iterative refinement via natural language:
```mermaid
graph LR
    A["🆕 New Base Model<br/>Version 1"] --> B["🔍 Review Output"]
    B --> C["📝 Write Vibes<br/><i>or use auto-generated next_vibes.txt</i>"]
    C --> D["🎵 Vibe Modeling<br/>Version 2"]
    D --> E["🔍 Review"]
    E --> F["🎵 Vibe v3"]
    F --> G["..."]
    G --> H["✅ Deploy"]
```
Vibes are free-form natural language. The agent interprets them and translates them into specific actions:
"Add a compliance domain with regulatory_filing and audit_trail tables"
"Remove the HR domain, we don't need it"
"Merge the customer_support domain into the customer domain"
"Add a source_system column to every table"
"Run a full quality check and fix any issues found"
"The order table should have a shipping_address_id FK to the address table"
"Rename all tables starting with dim_ to remove that prefix"
"Normalize the order domain to 3NF"
"Mark all email columns as PII"
"Generate an ontology (RDF/RDFS) for the model"
"Keep only billing-related tables, drop everything else"
"Make this model ECM — I need the large version"
After every pipeline run, the agent produces next_vibes.txt and current_vibes.txt in the vibes/ folder, containing:
- Your current business context (preserved as-is)
- Recommended vibes for the next iteration (based on QA findings)
- Model health metadata: confidence score, warning counts, issue breakdown
- Version history and progression tracking
To use them: copy the recommended vibes into widget 08. Model Vibes, set operation to vibe modeling of version, set the next version, and run.
The notebook exposes 28 configurable widgets. Below is the complete reference.
| # | Widget | Type | Mandatory | Default | Description |
|---|---|---|---|---|---|
| 01 | Business (name) | Text | Yes | — | Your business/organization name |
| 02 | Description | Text | Recommended | — | What your business does (richer = better model) |
| 03 | Operation | Dropdown | Yes | `new base model` | Pipeline operation to run |
| 04 | Version | Dropdown | Conditional1 | — | Model version number (1–100) |
| 05 | Model Scope | Dropdown | Yes | `Minimum Viable Model - MVM` | MVM (lean) or ECM (comprehensive) |
| 06 | Business Domains | Text | No | — | Comma-separated seed domains |
| 07 | Included Org Divisions | Dropdown | Yes | `Operations and Business` | Which divisions to include |
| 08 | Model Vibes | Text | Conditional2 | — | Natural language instructions — inline (max 2,000 chars) or file path to a .txt on a UC Volume |
| 09 | Installation Catalog | Text | Conditional3 | — | Unity Catalog for physical deployment |
| 10 | Sample Records | Dropdown | No | `0` | Synthetic records per table (0 = none) |
| 11 | Model JSON File | Text | Conditional4 | — | Path to a previously generated model.json for re-install or continuation |
1 Required for all operations except new base model (auto-assigned).
2 Required for vibe modeling of version.
3 Required for install, uninstall, generate sample data, shrink, enlarge. Optional for new base model and vibe modeling.
4 Required for install model, generate sample data when re-deploying a previously generated model.
Detailed Widget Descriptions
**01. Business (name)**

The name of your business. Used as the top-level identifier across the entire model. Case-insensitive matching (stored via LOWER()).

Sample values: Contoso Inc, Acme Healthcare, Global Telecom Corp, NextGen Retail

**02. Description**

A rich description of what your business does. Include industry, size, geography, key products/services. The LLM uses this to determine the industry complexity tier and tailor the model.

Sample values: A multinational aluminum smelting and manufacturing company operating across 12 countries, A digital-first healthcare provider specializing in telemedicine

**03. Operation**

Options: new base model | vibe modeling of version | shrink ecm | enlarge mvm | install model | uninstall model version | generate sample data

See the Operations section for full details on each mode.

**04. Version**

Options: Empty, or 1–100

For new base model, leave empty — auto-assigned as 1 (or auto-incremented). For vibe modeling of version, this is the version you are modifying; output creates version N+1.

**05. Model Scope**

Options: Minimum Viable Model - MVM | Expanded Coverage Model - ECM

MVM = lean core (fewer domains/tables, same attribute depth). ECM = comprehensive Fortune 100 coverage.

**06. Business Domains**

Comma-separated list of specific domains you want. If blank, the LLM auto-generates the optimal set for your industry.

Sample values: customer, sales, billing, inventory, logistics | clinical, pharmacy, research

**07. Included Org Divisions**

Options: Operations | Operations and Business | Operations, Business and Corporate

Controls which organizational divisions contribute domains to the model.

**08. Model Vibes**

Supports two input modes:

- Inline text — type your vibe instructions directly into the widget (max 2,000 characters). Best for short, targeted changes.
- File path — provide a path to a `.txt` file on a Unity Catalog Volume (e.g., `/Volumes/catalog/schema/vol/my_vibes.txt`). Best for longer, multi-paragraph instructions that exceed the inline limit.

See The Vibing Workflow for examples of what vibes can do.

**09. Installation Catalog**

The Unity Catalog where physical schemas, tables, FK constraints, tags, and sample data will be created. Must already exist. You need CREATE SCHEMA privileges. If blank for new base model/vibe modeling, only the logical model (JSON artifacts) is produced.

Sample values: dev_catalog, prod_data_models, contoso_lakehouse

**10. Sample Records**

Options: 0, 5, 10, 15, 20, 25, 50, 100

0 = skip sample data generation. 10 = good default for review. 50–100 = for load testing / demos. Sample data respects FK relationships (child records reference valid parent IDs).

**11. Model JSON File**

(Optional) Path to a previously generated model.json file on a Unity Catalog Volume. Used when re-installing or continuing from a prior run. After every run, the agent generates next_vibes.txt in the vibes/ folder — review it for recommended next vibe instructions.

Sample values: /Volumes/my_catalog/my_schema/vol/vibes/contoso/v1_ecm/model.json
| # | Widget | Type | Default | Options / Format |
|---|---|---|---|---|
| 12 | Naming Convention | Dropdown | `snake_case` | snake_case, camelCase, PascalCase, SCREAMING_CASE |
| 13 | Primary Key Suffix | Dropdown | `_id` | _id, _key, _pk, id, key |
| 15 | Schema Prefix | Text | (empty) | e.g., stg_, raw_, dw_ |
| 16 | Tag Prefix | Text | `dbx_` | e.g., dbx_, vibe_, mdl_ |
| 17 | Table ID Type | Dropdown | `BIGINT` | BIGINT, INT, LONG, STRING |
| 18 | Boolean Format | Dropdown | `Boolean (True/False)` | Boolean (True/False), Int (0/1), String (Y/N) |
| 19 | Date Format | Dropdown | `yyyy-MM-dd` | yyyy-MM-dd, dd/MM/yyyy, MM/dd/yyyy, yyyy/MM/dd, dd-MM-yyyy |
| 20 | Timestamp Format | Dropdown | `yyyy-MM-dd'T'HH:mm:ss.SSSXXX` | 4 ISO/standard options |
| 21 | Classification Levels | Text | `restricted=restricted, confidential=confidential, internal=Internal, public=public` | Comma-separated key=Label pairs |
| 22 | Housekeeping Columns | Dropdown | `No` | No, Yes — adds created_by, created_at, updated_by, updated_at |
| 23 | History Tracking Columns | Dropdown | `No` | No, Yes — adds valid_from, valid_to, is_current (SCD Type 2) |
| 09a | Cataloging Style | Dropdown | `One Catalog` | One Catalog, Catalog per Division, Catalog per Domain |
| 09b | Catalog Prefix | Text | (empty) | e.g., dev_, prod_ |
| 09c | Catalog Suffix | Text | (empty) | e.g., _lakehouse, _dw |
| 15a | Schema Suffix | Text | (empty) | e.g., _db, _schema |
| 16a | Tag Suffix | Text | (empty) | e.g., _tag |
| 24 | Vibe Session ID | Text | (empty) | UUID for external UI progress tracking |
Widget #14 does not exist (numbering gap between 13 and 15).
After every run, the agent produces next_vibes.txt in the vibes/ output folder with recommended next vibe instructions. Copy these into widget 08. Model Vibes for the next iteration.
The model.json file includes session metadata:
```json
{
  "_next_vibe_metadata": {
    "generated_from_version": "v1_mvm",
    "model_version": "1",
    "status": "needs_work",
    "confidence_score": 78,
    "summary": "Model has 3 unlinked columns and 1 siloed table",
    "issues_addressed": ["..."],
    "issues_not_addressed": ["..."],
    "data_modeler_notes": "Recommend adding a warehouse domain",
    "model_stats_at_generation": { "domains": 8, "products": 47, "attributes": 523 },
    "issue_counts": { "error": 0, "warning": 3, "info": 5 },
    "version_history": ["..."]
  }
}
```
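This metadata can be read back programmatically, e.g. for an external UI or a CI gate. A minimal sketch; the field names come from the example above, but the readiness rule (status check plus zero errors) is an assumption for illustration:

```python
import json

# Parse the _next_vibe_metadata block (structure as in the example above).
raw = '''{"_next_vibe_metadata": {"status": "needs_work",
          "confidence_score": 78,
          "issue_counts": {"error": 0, "warning": 3, "info": 5}}}'''
meta = json.loads(raw)["_next_vibe_metadata"]

# Hypothetical deployment gate: no errors AND the agent did not flag more work.
ready = meta["issue_counts"]["error"] == 0 and meta["status"] != "needs_work"
print(meta["confidence_score"], ready)  # 78 False
```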
Vibes are translated into 190+ specific actions organized into categories. You never need to name these actions — just describe what you want in natural language and the agent maps your intent.
Entity Management (1–25)
| # | Action | What It Does |
|---|---|---|
| 1 | `drop` | Remove a domain, product, attribute, tag, or link |
| 2 | `create` | Add a new domain, product, or attribute |
| 3 | `rename` | Change the name of an entity |
| 4 | `alter_description` | Modify the description |
| 5 | `change_type` | Change an attribute's data type |
| 6 | `merge` | Combine two entities into one |
| 7 | `split` | Divide one entity into multiple |
| 8 | `move_product` | Move a product to another domain |
| 9 | `move_attribute` | Move an attribute to another product |
| 10 | `delete_attribute` | Remove an attribute |
| 11 | `create_link` | Create a foreign key relationship |
| 12 | `drop_link` | Remove a foreign key relationship |
| 13–17 | Prefix/suffix ops | Add, remove, or change name prefixes/suffixes |
| 18–24 | Metadata ops | Update glossary terms, references, regex, tags |
| 25 | `modify` | Regenerate an entity with specific guidance |
Quality Check & Analysis (26–38)
| # | Action | What It Does |
|---|---|---|
| 26 | `run_quality_checks` | Compound: runs detect_duplicates + dedupe_attributes + detect_cycles + detect_siloed + fix_fk_anomalies |
| 26b | `run_product_domain_fit` | LLM audit: are products in the right domains? Relocates misplaced ones. |
| 27 | `detect_duplicates` | Find SSOT violations (semantic duplicate products across domains) |
| 28 | `fix_duplicates` | Auto-merge/remove detected duplicates |
| 29 | `dedupe_attributes` | Remove duplicate columns within products |
| 30 | `detect_cycles` | Find circular FK dependencies |
| 31 | `break_cycles` | Auto-fix cycles by removing weakest FK links |
| 32 | `detect_siloed` | Find completely disconnected tables (no FKs in any direction) |
| 33 | `fix_siloed` | Connect disconnected tables via linking attempts |
| 34 | `review_links` | Audit all FK relationships for anomalies |
| 35 | `fix_fk_anomalies` | Repair broken, mismatched, or orphaned FK references |
| 36 | `fix_ambiguous_fks` | Resolve FKs that match multiple target tables |
| 37 | `merge_small_tables` | Analyze and consolidate tables with < 5 attributes |
| 38 | `identify_core_products` | Identify business-critical products for protection |
| 95 | `model_checkup` | Mega-compound: runs static analysis + auto-queues ALL appropriate fixes |
FK & Linking Operations (39–46, 133–138)
| # | Action | What It Does |
|---|---|---|
| 39 | `run_linking` | Compound: in-domain + cross-domain + M:N detection |
| 40 | `run_in_domain_linking` | Link FKs within each domain |
| 41 | `run_cross_domain_linking` | Link FKs across different domains |
| 42 | `detect_many_to_many` | Find potential M:N relationships |
| 43 | `create_junction_tables` | Create bridge/association tables for M:N |
| 45 | `redirect_fk` | Redirect FK to a different target table |
| 46 | `find_unlinked_columns` | Find `*_id` columns without FK relationships |
| 133 | `remove_product_prefix` | Remove redundant table-name prefix from columns |
| 134 | `fix_fk_column_naming` | Fix FK columns that don't end with the target PK name |
| 135 | `connect_table` | Connect a specific disconnected table |
| 136 | `link_specific_columns` | Link an explicit list of unlinked `_id` columns |
| 138 | `find_missing_fk_links` | Comprehensive: LLM classifies each unlinked column as LINK, CREATE, DROP, or KEEP_AS_IS |
Bulk Operations (62–64, 110–114)
| # | Action | What It Does |
|---|---|---|
| 62 | `bulk_rename_products` | Rename multiple products by pattern |
| 63 | `bulk_drop_products` | Drop products matching a pattern (e.g., `stg_*`) |
| 64 | `bulk_move_products` | Move products matching a pattern to another domain |
| 110 | `bulk_change_type` | Change data type for attributes by pattern |
| 111 | `bulk_set_nullable` | Set nullable for attributes by pattern |
| 113 | `bulk_remove_attributes` | Remove attributes matching a pattern from all tables |
Tag & Classification Operations (54–61, 98–106, 156–163)
| # | Action | What It Does |
|---|---|---|
| 54–57 | Tag add/remove | Add or remove tags from products or domains |
| 58 | `conditional_tag` | Tag entities matching a condition |
| 59–60 | Bulk tagging | Tag by name pattern or by column presence |
| 98–101 | PII/sensitive marking | Mark as PII, sensitive, encrypted, deprecated |
| 103 | `set_table_type` | Classify as dimension, fact, lookup, bridge, staging, archive |
| 156 | `classify_table_tier` | Medallion architecture: bronze / silver / gold |
| 161 | `set_data_owner` | Assign data owner/steward |
| 162 | `set_update_frequency` | Document expected freshness (real_time, daily, monthly, etc.) |
| 163 | `map_to_source_system` | Map table/column to source system (SAP, Salesforce, etc.) |
Template Column Actions (148–155)
| # | Action | Columns Added |
|---|---|---|
| 148 | `add_scd_columns` | effective_from, effective_to, is_current, row_hash |
| 149 | `add_audit_columns` | created_at, updated_at, created_by, updated_by |
| 150 | `add_soft_delete_columns` | is_deleted, deleted_at, deleted_by |
| 151 | `add_temporal_columns` | valid_from, valid_to, system_from, system_to |
| 152 | `add_versioning_columns` | version_number, version_valid_from, version_valid_to, is_latest_version |
| 153 | `add_multitenancy_columns` | tenant_id |
| 154 | `add_lineage_columns` | source_system, source_table, ingestion_timestamp, etl_job_id |
| 155 | `add_gdpr_columns` | consent_status, consent_date, data_subject_request_id, right_to_erasure_date |
Structural Transformations (164–191)
| # | Action | What It Does |
|---|---|---|
| 164 | `normalize_to_3nf` | Apply Third Normal Form normalization (LLM-powered) |
| 165 | `denormalize_for_analytics` | Create wide/denormalized tables for BI |
| 184 | `promote_to_table` | Extract an attribute into its own lookup/reference table + FK |
| 185 | `inline_table` | Merge a child table into its parent (denormalize) |
| 186 | `swap_domains` | Atomically swap two domain names |
| 187 | `impact_analysis` | Show what would break if a table/domain were dropped (read-only) |
| 190 | `enlarge_model` | Wholesale expansion from MVM to ECM scope |
| 191 | `shrink_model` | Wholesale reduction from ECM to MVM scope |
| 178 | `VIBE_PRUNE_PROMPT` | LLM-powered: keep only tables related to a focus area |
| 179 | `drop_domains_except` | Drop all domains except a specified keep-list |
Artifact Generation (139–147)
| # | Action | Output |
|---|---|---|
| 139 | `generate_readme` | README documentation |
| 140 | `generate_data_model_json` | Complete model as JSON |
| 141 | `generate_ontology` | RDF/RDFS ontology (Turtle format) |
| 142 | `generate_dbml` | DBML schema for visualization tools |
| 143 | `generate_release_notes` | Changelog/release notes for the version |
| 144 | `generate_excel` | Excel workbook export |
| 145 | `generate_data_dictionary` | Comprehensive data dictionary |
| 146 | `export_model_report` | Full model documentation report |
| 147 | `generate_test_cases` | Data quality test case specifications |
Metric View Operations (126–134)
| # | Action | What It Does |
|---|---|---|
| 126 | `run_metric_modeling` | Generate Databricks metric views (dimensions + measures) |
| 129 | `add_metric_measure` | Add a specific KPI measure to metric views |
| 130 | `remove_metric_measure` | Remove a measure |
| 131 | `add_metric_dimension` | Add a grouping dimension |
| 132 | `remove_metric_dimension` | Remove a dimension |
| 133 | `alter_metric_filter` | Set/update metric view filter logic |
| 134 | `drop_metric_view` | Remove an entire metric view |
When you run the agent, it executes these progress stages in order (the # column is for reference only — no numeric stage IDs are emitted in progress events):
| # | Stage | Duration | What Happens |
|---|---|---|---|
| 1 | Setup and Configuration | 2–10s | Validates inputs, creates metamodel tables |
| 2 | Interpreting Instructions | 10–30s | (Vibe mode only) Parses vibes into structured action plan |
| 3 | Collecting Business Context | 10–30s | LLM enriches your description across business/industry dimensions |
| 4 | Designing Domains | 15–60s | Generates domains following division model + SSOT |
| 5 | Creating Data Products | 1–10m | Products per domain with architect review |
| 6 | Enriching Data Products with Attributes | 5–40m | All columns for every product |
| 7 | Cross-Domain Linking | 1–5m | In-domain → global sweep → pairwise comparison |
| 8 | Quality Assurance | 30s–3m | 9 sub-checks (naming, PK/types, overlaps, topology, auto-remediation) |
| 9 | Applying Naming Conventions | 10–30s | Final naming consistency pass |
| 10 | Model Finalization | 10–30s | Finalize logical model snapshot before physical deployment |
| 11 | Subdomain Allocation | 10–30s | Allocates products into semantic subdomains |
| 19 | Generating Metric View Artifacts | 10–30s | Exports metric view definitions (legacy stage ID) |
| 17 | Generating Artifacts | 30s–2m | README, Excel/CSV, model JSON, data dictionary, model report |
| 12 | Physical Schema Construction | 1–10m | Creates UC schemas + Delta tables (if catalog set) |
| 13 | Applying Foreign Keys | 30s–2m | FK constraints on physical tables |
| 14 | Applying Tags | 2–15m | Classification, PII, data type tags on schemas/tables/columns |
| 15 | Applying Metric Views | 30s–5m | Databricks metric views for KPI tracking |
| 16 | Generating Sample Data | 1–15m | Synthetic records respecting FK relationships |
| 18 | Consolidation and Cleanup | 10–30s | Consolidates/merges metadata and cleanup |
| File | Description |
|---|---|
| model.json | Complete model export (primary output) |
| readme.md | Human-readable model documentation |
| vibes/current_vibes.txt | Vibe instructions used in this run |
| vibes/next_vibes.txt | Recommended next vibe instructions |
| domains.json | All domain definitions |
| products.json | All product (table) definitions |
| attributes.json | All attribute (column) definitions |
| docs/*.xlsx / docs/*.csv | Excel/CSV export of the entire model (CSV fallback if openpyxl unavailable) |
| readme.md (parent folder) | Model overview document comparing MVM and ECM scopes |
| schemas/*.sql | SQL DDL files (per-domain schemas, cross-domain FKs, catalogs) |
| diagram/*_dbml_*.txt | DBML schema diagram |
| ontology/*_rdf_*.ttl | RDF/Turtle ontology representation |
| docs/releasenotes.txt | Auto-generated release notes |
| File | Description |
|---|---|
| docs/*_data_dictionary_*.txt | Column-level reference guide |
| docs/*_model_report_*.txt | Full documentation report |
| docs/*_test_cases_*.txt | Generated test cases |
| Object | Example |
|---|---|
| Schemas | catalog.customer, catalog.sales, catalog.logistics |
| Tables | Delta tables with all columns and correct Spark SQL types |
| FK Constraints | Physical foreign key constraints between tables |
| Tags | Unity Catalog tags on schemas, tables, and columns |
| Metric Views | Databricks metric views for KPI calculation |
| Sample Data | Synthetic records with valid FK references |
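As a rough sketch of what the deployment stage emits for each product, the helper below assembles Unity Catalog DDL strings; the function name, signature, and example names are hypothetical (not the agent's actual code), and in practice each statement would be executed via Spark SQL on serverless compute:

```python
# Hypothetical sketch of per-product deployment DDL against Unity Catalog.
# Helper name, signature, and example identifiers are illustrative only.

def deployment_ddl(catalog: str, schema: str, table: str,
                   columns: dict[str, str], pk: str) -> list[str]:
    """Build CREATE statements for one product (table) in a domain schema."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return [
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}",
        f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table} "
        f"({cols}, CONSTRAINT {table}_pk PRIMARY KEY ({pk}))",
    ]

stmts = deployment_ddl("main", "customer", "customer",
                       {"customer_id": "BIGINT", "customer_name": "STRING"},
                       pk="customer_id")
for stmt in stmts:
    print(stmt)
```

Cross-domain FK constraints would be applied afterwards in dependency order, once all referenced tables exist.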
The agent enforces a comprehensive rule system during generation and QA:
| Rule Group | Key Rules |
|---|---|
| Naming (G01) | snake_case default; domains: 1 word, singular, max 20 chars; products: 1–3 words, max 30 chars, no domain prefix; attributes: max 50 chars, no product prefix (except PK); FKs must end with target PK name; preserve unit qualifiers |
| Semantic Dedup (G02) | First-Class Entity Test (5 criteria); SSOT violation detection; 60%+ overlap → merge to shared; 90%+ overlap same domain → remove; attribute dedup >80% confidence; semantic distinction rules (method vs channel, target vs actual, lifecycle timestamps) |
| FK Integrity (G03) | FK target must exist; no bidirectional FKs; DAG required (no cycles); FK type compatibility; system identifiers exempted; hierarchical self-refs exempt |
| PK Rules (G04) | Every product has exactly 1 PK; PK = {product}_{suffix}; PK type = configured type (default BIGINT); PK exempt from prefix stripping |
| Normalization (G05) | 3NF enforcement; orphaned FK detection; denormalized attribute detection with >95% confidence; point-in-time snapshots exempt; measurement attributes exempt; geographic coordinates on physical entities exempt |
| Division Balance (G06) | Ops + Business ≥ 80% of domains; Corporate ≤ 20%; no Corporate until Ops AND Business each have ≥ 2; cross-division relocation forbidden; forbidden generic domain names; Org Chart Test; Fragmentation Test (30%+ overlap → merge) |
| Data Types (G07) | Spark SQL types only; no ARRAY/STRUCT/MAP; no calculated metrics/aggregates |
| Tags (G08) | Classification in tags only; PII = RESTRICTED + specific PII tag (pii_email, pii_phone, pii_financial, pii_health, pii_identifier, pii_address, pii_biometric, pii_name, pii_dob); empty tags for regular data; custom user tags always applied |
| Graph Topology (G09) | DAG enforcement; DFS cycle detection; hierarchical self-ref exemption; zero siloed tables; each domain ≥2 cross-domain connections; parent-child FKs never broken in cycle resolution |
| Honesty Check (G11) | 0–100 scale; below 55 = permanently discarded; 55–70 = borderline retry; ≥90 = accepted; contradiction penalty applied post-processing |
| Product Design (G12) | M:N requires 3 indicators with ≥2 of 3 strong; association ratio ECM ≤15%, MVM ≤5%; core products 1–3 per domain; forbidden product suffixes (_analysis, _analytics, _report, _prediction); Silver Layer only (no analytics products) |
| Sample Data (G13) | Exact record count; sequential PK from 10001; FK random [10001, 10001+N-1]; regex compliance; 3-letter country codes; no Lorem Ipsum; realistic business data |
| Vibe Constraints (G14) | Dedup overlap thresholds; max relocation percentage per pass; normalization confidence 95%; mutation budgets by mode (surgical, holistic, generative); domain hard ceiling factor 1.5x |
| Physical Schema Deployment (G15) | Schema and table creation factories; attribute dict factory; product dict factory; FK dependency ordering; consolidation and cleanup |
| Subdomains | Exactly 2-word names; min products per subdomain per tier; no overlapping words; balanced distribution; no placeholder names |
The agent uses a multi-model ensemble with automatic demotion and recovery:
| Order | Role | Purpose | Endpoint | Input Tokens | Output Tokens |
|---|---|---|---|---|---|
| 10 | Thinker (large) | Complex reasoning, architecture reviews, QA decisions | databricks-claude-opus-4-6 | 200,000 | 128,000 |
| 20 | Worker (large) | High-volume generation: products, attributes, FKs, dedup | databricks-claude-sonnet-4-6 | 200,000 | 64,000 |
| 30 | Thinker (large) | Fallback thinker | databricks-claude-opus-4-5 | 200,000 | 64,000 |
| 40 | Worker (large) | Fallback worker | databricks-claude-sonnet-4-5 | 200,000 | 64,000 |
| 50 | Worker (small) | Simpler tasks: domain generation, tag classification | databricks-gpt-oss-120b | 131,072 | 25,000 |
| 60 | Worker (tiny) | Sample data generation | databricks-gpt-oss-20b | 131,072 | 25,000 |
Automatic model demotion: After 3 cumulative failures on a model, the entire priority order is demoted — the failing model is pushed down and the next model takes its place. After 5 consecutive successes, the original order is restored. After 3 consecutive timeouts, a model is marked broken and skipped for the rest of the session.
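The thresholds above can be sketched as a small state machine. This is an illustrative reconstruction, not the agent's code: the class and method names are invented, and "pushed down" is interpreted here as swapping the failing model with its immediate successor in the priority order:

```python
# Illustrative sketch of the demotion policy described above. Class and
# method names are invented; thresholds mirror the prose: 3 cumulative
# failures demote, 5 consecutive successes restore, 3 consecutive
# timeouts mark a model broken for the session.

class ModelRoster:
    def __init__(self, endpoints: list[str]):
        self.original = list(endpoints)       # pristine priority order
        self.order = list(endpoints)          # current priority order
        self.failures = {e: 0 for e in endpoints}
        self.timeouts = {e: 0 for e in endpoints}
        self.successes = 0                    # consecutive successes
        self.broken: set[str] = set()

    def active(self) -> str:
        """Highest-priority model that is not marked broken."""
        return next(e for e in self.order if e not in self.broken)

    def record_failure(self, endpoint: str, timeout: bool = False) -> None:
        self.successes = 0
        self.failures[endpoint] += 1
        self.timeouts[endpoint] = self.timeouts[endpoint] + 1 if timeout else 0
        if self.timeouts[endpoint] >= 3:          # 3 consecutive timeouts
            self.broken.add(endpoint)             # skip for rest of session
        elif self.failures[endpoint] >= 3:        # 3 cumulative failures
            i = self.order.index(endpoint)        # demote: next model
            if i + 1 < len(self.order):           # takes its place
                self.order[i], self.order[i + 1] = self.order[i + 1], self.order[i]
            self.failures[endpoint] = 0

    def record_success(self, endpoint: str) -> None:
        self.timeouts[endpoint] = 0
        self.successes += 1
        if self.successes >= 5:                   # 5 consecutive successes
            self.order = list(self.original)      # restore original order
```

For example, three failures on the lead thinker would promote the fallback thinker until five consecutive successes restore the original order.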
Prompt architecture: 49 specialized prompts, each mapped to a specific model role and temperature setting. Thinker prompts use temperature 0 for deterministic reasoning. Worker prompts use temperature 0–0.3 for controlled generation. Sample generation uses temperature 0.5 for creative variety.
The agent generates Databricks metric views — reusable KPI definitions that sit on top of your data tables:
| Component | Description | Example |
|---|---|---|
| Dimensions | Grouping columns | region, product_category, fiscal_quarter |
| Measures | Single-aggregate expressions | SUM(revenue), COUNT(DISTINCT customer_id), AVG(order_value) |
| Filters | Row-level predicates | WHERE status = 'completed' |
Metric views are auto-generated per domain, focusing on KPIs that would appear in executive dashboards and quarterly business reviews. Each measure uses a single aggregate function (nested aggregates like AVG(SUM(...)) are not supported by metric view YAML and are automatically prevented).
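As a rough illustration of that guard, a validator could reject nested aggregates before emitting metric view YAML. The sketch below is an assumption about how such a check might look, not the agent's actual code; it only detects nesting (one aggregate inside another), which is the unsupported case named above:

```python
# Hypothetical pre-emission check: flag measure expressions that nest one
# aggregate inside another, e.g. AVG(SUM(revenue)). Not the agent's code.
import re

AGG_OPEN = re.compile(r"(SUM|AVG|COUNT|MIN|MAX)\s*\(")

def has_nested_aggregate(expr: str) -> bool:
    """True if an aggregate call appears inside another aggregate's parens."""
    upper = expr.upper()
    agg_depths: list[int] = []   # paren depths at which an aggregate opened
    depth = 0
    i = 0
    while i < len(upper):
        m = AGG_OPEN.match(upper, i)
        # Only treat it as an aggregate if not part of a longer identifier.
        if m and (i == 0 or (not upper[i - 1].isalnum() and upper[i - 1] != "_")):
            if agg_depths:
                return True      # already inside an aggregate -> nested
            depth += 1
            agg_depths.append(depth)
            i = m.end()
            continue
        ch = upper[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            if agg_depths and agg_depths[-1] == depth:
                agg_depths.pop()
            depth -= 1
        i += 1
    return False
```

Under this sketch, SUM(revenue) and COUNT(DISTINCT customer_id) pass, while AVG(SUM(revenue)) is rejected and would be regenerated as a single-aggregate measure.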
| Symptom | Likely Cause | Resolution |
|---|---|---|
| "No deployment catalog specified — skipping physical model deployment" | Widget 09 is empty | Set deployment catalog to deploy physical tables |
| Model has too few domains | MVM scope + tier_5 classification | Switch to ECM, or provide seed domains in widget 06 |
| Model has irrelevant domains | LLM inferred from description | Vibe: "Remove the X domain" |
| FK pointing to wrong table | LLM linked incorrectly | Vibe: "Redirect order.warehouse_id FK to logistics.warehouse" |
| Circular dependency warning | DAG violation | Agent auto-fixes during QA; if persistent: "Break cycles" |
| SSOT violation (duplicate entities) | Same concept in multiple domains | Vibe: "Run quality checks" or "Fix duplicates" |
| Pipeline crashed mid-run | LLM timeout or transient error | Re-run with same inputs — agent cleans up incomplete versions |
| Model JSON file parse error | Invalid JSON (smart quotes, trailing commas) | Validate JSON; use straight quotes only |
| "Version X exists but is incomplete" | Previous run failed | Agent auto-detects — just re-run |
| Model too large / too many tables | ECM + high complexity tier | Use shrink ecm, or vibe: "Keep only core business tables" |
| Convention changes not applied | Convention widgets not set | Update the model convention widgets (12–24) with your desired values before running |
| Metric views have COUNT(1) instead of real KPIs | Nested aggregates were auto-replaced | The LLM prompt normally prevents this; if it recurs, re-run metrics with the vibe: "Regenerate metrics" |
| Term | Definition |
|---|---|
| Attribute | A column in a product (table) — has name, type, tags, description |
| Business Data Model | A data model tailored to a specific organization, generated by Vibe Modelling |
| Corporate Division | Supporting functions (HR, Finance, Legal) that enable but don't directly generate revenue |
| DAG | Directed Acyclic Graph — the required topology for FK relationships (no cycles) |
| Division | Top-level organizational grouping: Operations, Business, or Corporate |
| Domain | A logical grouping of related tables, deployed as a Unity Catalog schema |
| ECM | Expanded Coverage Model — comprehensive enterprise-grade model scope |
| FK | Foreign Key — a column referencing another table's primary key |
| Honesty Check | LLM self-assessment score (0–100%) to ensure output quality |
| Industry Data Model | A generic, vendor-published schema template for an entire industry vertical |
| Junction Table | An association table resolving M:N relationships between two entities |
| Metric View | A Databricks KPI definition with dimensions, measures, and filters |
| MVM | Minimum Viable Model — lean, production-ready scope (30–50% of ECM) |
| PK | Primary Key — the unique identifier column for a table |
| Product | A data table within a domain — a first-class business entity |
| SSOT | Single Source of Truth — each business concept has one and only one authoritative owner |
| Tier | Industry complexity classification (tier_1 = Ultra-Complex to tier_5 = Simple) |
| Vibe | A natural language instruction provided to the agent to modify the model |
| Vibe Modelling | The iterative process of generating and refining data models using natural language |
Built on Databricks Serverless Compute with Unity Catalog governance