amralieg/vibe-modelling-agent

Databricks Vibe Modelling Agent

Generate production-grade enterprise data models from natural language

Databricks · Serverless · Unity Catalog · LLM Powered

Describe your business. Get a data model. Vibe it until it's perfect.


Concepts · Getting Started · Widget Reference · Vibing Workflow · Action Catalog · Troubleshooting



📚 Documentation

| Document | Description |
|---|---|
| docs/ | Documentation index: whitepaper, design guide, integration guide |
| docs/design-guide.md | Technical design reference |
| docs/integration-guide.md | UI/consumer integration protocol |
| docs/whitepaper.md | Philosophy and complete rules catalog |
| runner/readme.md | Pipeline orchestrator guide |
| tests/readme.md | Test suite reference |

What Is Vibe Modelling?

Vibe Modelling is a Databricks-native, LLM-powered approach to generating enterprise data models from natural language. Instead of manually drawing ER diagrams, writing DDL, or importing pre-built industry templates, you describe your business in plain English and the agent builds a complete, production-grade data model — domains, tables, columns, foreign keys, tags, sample data, and documentation — end to end.

The name "Vibe" reflects the core workflow:

Generate a base model → review → vibe it with natural language → repeat → deploy

Each iteration produces a new version. The agent carries forward your context so nothing is lost between runs. You are never locked into a static template — the model evolves with your business.


📖 Concepts

Industry Data Models vs. Business Data Models

What is an Industry Data Model?

An industry data model is a generic, one-size-fits-all template designed for an entire vertical — retail, banking, healthcare, telecoms, etc. Organizations like the TM Forum (telecoms), ARTS (retail), ACORD (insurance), and HL7 (healthcare) publish canonical schemas that attempt to cover every conceivable entity and relationship across the industry.

The problem: When you adopt an industry model, you inherit everything — including the 60–80% of tables that your business will never use. You then spend months trimming, renaming, and reshaping the model to fit your actual processes.

What is a Business Data Model?

A business data model is tailored, contextualized, and specific to YOUR organization. It reflects your actual business processes, your product lines, your org structure, your regulatory environment, your terminology, and your governance requirements.

Vibe Modelling generates business data models. The LLM understands your industry deeply (via the complexity tier system), but the model it produces is shaped entirely by your context.

| Aspect | Industry Data Model | Business Data Model (Vibe) |
|---|---|---|
| Scope | Entire industry vertical | Your specific business |
| Customization | Post-delivery (manual pruning) | Built-in (LLM-driven from your context) |
| Relevance | 20–40% directly applicable | 90–100% directly applicable |
| Time to production | Months of adaptation work | Hours with iterative vibing |
| Naming | Committee-standard naming | Your business terminology and conventions |
| Evolution | New version = re-adoption | Vibe the next version from the previous one |
| Cost | License fees + adaptation labor | Compute cost of LLM generation |

The Vibe Philosophy

1. GENERATE  →  Describe your business, get a base model
2. VIBE IT   →  Review output, provide natural-language refinements
3. REPEAT    →  Each iteration = new version; agent auto-suggests next vibes
4. DEPLOY    →  Physical Unity Catalog schemas, tables, FKs, tags, sample data

The Four-Level Hierarchy: Divisions → Domains → Products → Attributes

Every Vibe model follows a strict four-level hierarchy:

graph TD
    B["🏢 BUSINESS<br/><i>e.g., Contoso Manufacturing</i>"]
    B --> D1["⚙️ Operations Division"]
    B --> D2["💼 Business Division"]
    B --> D3["🏛️ Corporate Division"]
    
    D1 --> DOM1["📦 logistics"]
    D1 --> DOM2["🏭 production"]
    D1 --> DOM3["📋 inventory"]
    
    D2 --> DOM4["👤 customer"]
    D2 --> DOM5["💰 sales"]
    D2 --> DOM6["🧾 billing"]
    
    D3 --> DOM7["👥 hr"]
    D3 --> DOM8["📊 finance"]
    
    DOM5 --> P1["🗂️ order"]
    DOM5 --> P2["🗂️ order_item"]
    DOM5 --> P3["🗂️ quote"]
    
    P1 --> A1["🔑 order_id <i>(PK, BIGINT)</i>"]
    P1 --> A2["📅 order_date <i>(DATE)</i>"]
    P1 --> A3["🔗 customer_id <i>(FK → customer.customer)</i>"]
    P1 --> A4["💲 total_amount <i>(DECIMAL)</i>"]
    P1 --> A5["📌 status <i>(STRING)</i>"]

Level 1: Divisions

Divisions are the top-level organizational grouping. Every domain belongs to exactly one division.

| Division | Purpose | Typical Domains |
|---|---|---|
| Operations | Core operational backbone: mechanisms, infrastructure, processes | Logistics, production, inventory, service delivery, supply chain, quality control |
| Business | Revenue-generating and customer-facing functions | Customer/party, billing/revenue, product catalog, sales, subscriptions |
| Corporate | Supporting functions for governance (NOT directly revenue-generating) | HR, finance, legal/compliance, marketing, procurement, IT |

Division Balance Rules:

  • Operations + Business MUST be >= 80% of all domains
  • Corporate is capped at <= 20% of total domains
  • No Corporate domain allowed until Operations AND Business each have >= 2 domains
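The three balance rules above can be checked mechanically. The following is a minimal sketch, not the agent's own implementation: the function name `check_division_balance` and the input shape (a dict from domain name to division name) are assumptions made for illustration.

```python
from collections import Counter


def check_division_balance(domain_divisions):
    """Return a list of violated balance rules (empty list = balanced).

    domain_divisions maps a domain name to "Operations", "Business" or
    "Corporate". This input shape is hypothetical.
    """
    counts = Counter(domain_divisions.values())
    ops, biz, corp = counts["Operations"], counts["Business"], counts["Corporate"]
    total = ops + biz + corp
    violations = []
    if total == 0:
        return violations
    # Integer arithmetic avoids float rounding at the exact 80% / 20% boundary.
    if (ops + biz) * 5 < total * 4:
        violations.append("Operations + Business must be >= 80% of all domains")
    if corp * 5 > total:
        violations.append("Corporate must be <= 20% of all domains")
    if corp > 0 and (ops < 2 or biz < 2):
        violations.append("Corporate needs >= 2 Operations and >= 2 Business domains first")
    return violations
```

With only three divisions, the first two rules are two views of the same constraint, so they fail together; the third rule can fail on its own when the model is small.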

Level 2: Domains

A domain is a logical grouping of related data products (tables). When deployed, each domain maps 1:1 to a Unity Catalog schema.

  • Named in snake_case, exactly one word (e.g., customer, fulfillment, logistics)
  • Count depends on model scope (MVM vs ECM) and industry complexity tier
  • The shared domain is reserved — the pipeline auto-creates it during SSOT consolidation; never create it manually

Level 3: Products (Tables)

A product is a data table within a domain — a first-class business entity with its own identity, lifecycle, and attributes.

  • Every product gets a PK column: <product_name>_<pk_suffix> (default: <product_name>_id)
  • Classified as CORE (entities stakeholders query directly) or HELPER (supporting entities)
  • Tagged by data type: master_data | transactional_data | reference_data | association_table

The First-Class Entity Test: A product must have its own identity, its own lifecycle, and at least 5 unique business attributes. Anything less should be an attribute on another table or merged.

Level 4: Attributes (Columns)

Each attribute has:

| Property | Example | Purpose |
|---|---|---|
| attribute | customer_email | Column name (snake_case) |
| type | STRING | Apache Spark SQL data type |
| tags | restricted,pii_email | Classification + PII tags |
| value_regex | `^[a-zA-Z0-9._%+-]+@` | Validation pattern or enum |
| business_glossary_term | Customer Email Address | Human-readable business name |
| description | Primary contact email for the customer | What this attribute represents |
| reference | GDPR Article 6 | Regulatory or standard reference |
| foreign_key_to | customer.customer.customer_id | FK target (domain.product.pk), or empty |
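The `foreign_key_to` property uses a three-part `domain.product.pk` form, or is empty when the column is not a foreign key. A small sketch of how a consumer of the model might parse it follows; `FkTarget` and `parse_foreign_key_to` are hypothetical names, not part of the agent.

```python
from typing import NamedTuple, Optional


class FkTarget(NamedTuple):
    domain: str
    product: str
    pk_column: str


def parse_foreign_key_to(value: str) -> Optional[FkTarget]:
    """Split 'domain.product.pk' into its parts; an empty value means no FK."""
    value = value.strip()
    if not value:
        return None
    parts = value.split(".")
    # Exactly three non-empty parts are expected by the documented format.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected domain.product.pk, got {value!r}")
    return FkTarget(*parts)
```

For the example row above, `parse_foreign_key_to("customer.customer.customer_id")` yields domain `customer`, product `customer`, and PK column `customer_id`.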

DAG Enforcement — No Circular Dependencies

FK relationships MUST form a Directed Acyclic Graph (DAG):

✅ VALID:   order → customer → address
❌ INVALID: order → customer → address → order  (cycle!)

Why it matters: Circular dependencies prevent clean data loading order, break ETL pipelines, and indicate modeling errors.

How the agent enforces it:

  1. Detect — Python DFS (Depth-First Search) cycle detection during QA
  2. Break — LLM Cycle Break specialist determines which FK to remove (always the weakest link)
  3. Verify — Re-run DFS to confirm graph is now a valid DAG
  4. Iterate — Up to 5 rounds to resolve all cycles, including residual ones

Allowed exception: Self-referencing hierarchical FKs (e.g., parent_category_id → category.category_id) are permitted — they represent tree structures, not multi-node cycles.
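The detect step can be illustrated with a standard three-color depth-first search. This is a sketch under assumptions, not the agent's code: the function name `find_cycle` and the edge-list input are made up for the example, and the LLM-driven "break" step is not modelled. Self-referencing FKs are skipped, matching the allowed exception.

```python
def find_cycle(fk_edges):
    """Return one FK cycle as a list of tables, or None if the graph is a DAG.

    fk_edges is an iterable of (child_table, parent_table) pairs
    (hypothetical input shape). Self-references are ignored.
    """
    graph = {}
    for child, parent in fk_edges:
        graph.setdefault(child, set())
        graph.setdefault(parent, set())
        if child != parent:  # hierarchical self-FKs are permitted
            graph[child].add(parent)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in sorted(graph[node]):
            if color[nxt] == GREY:
                # Back edge: nxt is already on the current path, so we have a cycle.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

Re-running `find_cycle` after removing an FK corresponds to the verify step; repeating until it returns `None` (or a round limit is reached) mirrors the iterate step. The recursive form is fine for model-sized graphs; a very deep FK chain would call for an explicit stack.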


Single Source of Truth (SSOT)

Each core business concept has exactly ONE authoritative domain and ONE authoritative product (table) that owns it. No concept is duplicated across domains.

What SSOT prevents:

  • Two customer tables in different domains
  • A product table in both sales and inventory
  • An invoice table in both billing and finance

How SSOT is enforced:

| Phase | Mechanism |
|---|---|
| Generation | LLM places each entity in its authoritative domain, i.e. the domain that CREATES/OWNS that data |
| QA Deduplication | Global pass detects same-name products + synonym pairs (e.g., customer vs client) with 60%+ attribute overlap |
| Consolidation | Overlapping products merged into shared domain with discriminator column |
| Cross-Domain References | Other domains use FK columns to reference the authoritative table, with no duplication |

The "Where Do I Go?" Test: For any entity, ask: "If a business user needs this information, is there exactly ONE place to get it?" If the answer is ambiguous, there is an SSOT violation.
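To make the "60%+ attribute overlap" threshold concrete, here is a minimal sketch. The document does not say how overlap is measured, so this example assumes Jaccard similarity over attribute names (shared names divided by all distinct names); both function names are hypothetical.

```python
def attribute_overlap(attrs_a, attrs_b):
    """Jaccard similarity of two attribute-name collections (assumed metric)."""
    a, b = set(attrs_a), set(attrs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_ssot_candidate(attrs_a, attrs_b, threshold=0.60):
    """True when two products overlap enough to be flagged for consolidation."""
    return attribute_overlap(attrs_a, attrs_b) >= threshold
```

A different definition (for example, shared names divided by the smaller table's attribute count) would flag more pairs at the same threshold, so the choice of metric matters as much as the 60% figure.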


Model Scopes: MVM vs. ECM

| Aspect | MVM (Minimum Viable Model) | ECM (Expanded Coverage Model) |
|---|---|---|
| Size | 30–50% of ECM table count | Full coverage |
| Attribute depth | SAME as ECM: full production-grade (same min/max per tier) | Full production-grade |
| Domains | Essential business functions only | All functions including corporate back-office |
| Ideal for | SMBs, rapid deployments, POCs, dev/test | Fortune 100, multinational enterprises |
| Lightness | Fewer domains & tables (NEVER thinner attributes) | Maximum breadth |

MVM is NOT a skeleton or demo toy. It is a production-ready subset where every delivered table is fully-featured.

Subdomains

Every domain is organized into subdomains — semantic groupings of related products within a domain. Subdomains provide an additional organizational layer between domains and products.

Subdomain Rules

| Rule | Description |
|---|---|
| Count | Minimum and maximum subdomains per domain are defined by the complexity tier. Never exactly 1 subdomain per domain. |
| Naming | EXACTLY 2 words per subdomain name. 1 word = rejected. 3+ words = rejected. |
| Min Products | Every subdomain must contain at least the minimum products per subdomain defined by the tier: ECM tier_1–tier_4: 3, ECM tier_5: 2; MVM tier_1–tier_3: 3, MVM tier_4–tier_5: 2. |
| No Overlap | No two subdomains within the same domain may share any word in their names. |
| Business Terms | Use business terminology, NOT technical terms. |
| Balanced | Products distributed as evenly as possible across subdomains. |
| No Placeholders | NEVER use: "Sub Domain1", "Category 1", "Group A", "N/A", "Other", "General", "Miscellaneous", "Uncategorized". |
| No Drift | Each subdomain belongs to exactly one parent domain. |
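Several of the naming rules above (Count, Naming, No Overlap, No Placeholders) are purely syntactic and can be sketched as a validator. This is an illustration only: `validate_subdomains` is a hypothetical helper, and splitting words on whitespace or underscores is an assumption about how names are tokenized.

```python
import re

PLACEHOLDERS = {"sub domain1", "category 1", "group a", "n/a", "other",
                "general", "miscellaneous", "uncategorized"}


def validate_subdomains(names):
    """Return a list of rule violations for one domain's subdomain names."""
    problems = []
    if len(names) == 1:
        problems.append("a domain must not have exactly 1 subdomain")
    first_owner = {}  # word -> first subdomain name that used it
    for name in names:
        words = [w for w in re.split(r"[\s_]+", name.strip().lower()) if w]
        if " ".join(words) in PLACEHOLDERS:
            problems.append(f"{name!r}: placeholder name")
        if len(words) != 2:
            problems.append(f"{name!r}: must be exactly 2 words, found {len(words)}")
        for word in words:
            owner = first_owner.setdefault(word, name)
            if owner != name:
                problems.append(f"{name!r}: shares word {word!r} with {owner!r}")
    return problems
```

The remaining rules (Min Products, Balanced, Business Terms, No Drift) need the product assignments or a judgment call, so they are left out of this sketch.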

Example: In the party domain:

  • identity: individual, organization, party_identification, kyc_verification
  • engagement: party_interaction, consent_record, loyalty_enrollment

Industry Complexity Tiers

The agent auto-classifies your business into one of five tiers, based on 7 scoring dimensions (each scored 0 or 1):

  1. Regulatory density — 3+ distinct regulatory bodies imposing data/reporting requirements
  2. Party complexity — 3+ distinct party types (customers, suppliers, partners, employees, etc.)
  3. Product hierarchy depth — 50+ product variants with complex bundling/pricing/lifecycle
  4. Infrastructure management — Owns physical or digital infrastructure requiring asset tracking
  5. Industry canonical model — 200+ entity types defined by an industry standards body
  6. Transaction complexity — 10+ distinct transaction types with multi-step lifecycles
  7. Operational system landscape — 5+ major systems of record across business functions

Rule: If the business falls between two tiers, classify UP (prefer the more complex tier).

| Tier | Label | Hallmarks | ECM Domains | ECM Products/Domain | Attrs/Product | Subdomains/Domain |
|---|---|---|---|---|---|---|
| tier_1 | Ultra-Complex | 5+ regulatory bodies, multi-entity structures (banking, insurance, pharma) | 15–22 | 14–28 | 15–50 | 3–6 |
| tier_2 | Complex | 2–4 regulatory bodies, multi-channel (telecoms, energy, healthcare) | 12–18 | 14–26 | 12–50 | 2–5 |
| tier_3 | Moderate | 1–2 regulatory bodies, 3+ business functions (manufacturing, retail) | 10–15 | 12–24 | 10–45 | 2–5 |
| tier_4 | Standard | Light regulation, regional complexity (logistics, agriculture) | 8–12 | 10–20 | 10–40 | 2–4 |
| tier_5 | Simple | Minimal regulation, service-based (consulting, SaaS, media) | 5–8 | 8–18 | 8–35 | 2–4 |

MVM Tier Sizing (attribute depth is the SAME as ECM):

| Tier | MVM Domains | MVM Products/Domain | Subdomains/Domain |
|---|---|---|---|
| tier_1 | 9–14 | 8–16 | 2–4 |
| tier_2 | 8–12 | 8–14 | 2–4 |
| tier_3 | 6–10 | 7–13 | 2–4 |
| tier_4 | 5–8 | 6–11 | 2–3 |
| tier_5 | 3–6 | 5–10 | 2–3 |

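MVM sizing is automatically derived so that the total table count (domains × products per domain) comes to roughly 30–50% of the ECM table count for each tier; taken individually, the MVM domain and products-per-domain ranges are roughly 55–75% of their ECM counterparts. Attribute depth (min/max attributes per product) is the same for both MVM and ECM within each tier.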
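The relationship between the two sizing tables can be checked with a little arithmetic. The sketch below copies the ranges from the tables above and assumes that "table count" means domains multiplied by products per domain, comparing the low ends and the high ends of each range; the names `ECM`, `MVM`, and `table_count_ratio` are illustrative only.

```python
# (min, max) ranges copied from the ECM and MVM sizing tables.
ECM = {
    "tier_1": {"domains": (15, 22), "products_per_domain": (14, 28)},
    "tier_2": {"domains": (12, 18), "products_per_domain": (14, 26)},
    "tier_3": {"domains": (10, 15), "products_per_domain": (12, 24)},
    "tier_4": {"domains": (8, 12), "products_per_domain": (10, 20)},
    "tier_5": {"domains": (5, 8), "products_per_domain": (8, 18)},
}
MVM = {
    "tier_1": {"domains": (9, 14), "products_per_domain": (8, 16)},
    "tier_2": {"domains": (8, 12), "products_per_domain": (8, 14)},
    "tier_3": {"domains": (6, 10), "products_per_domain": (7, 13)},
    "tier_4": {"domains": (5, 8), "products_per_domain": (6, 11)},
    "tier_5": {"domains": (3, 6), "products_per_domain": (5, 10)},
}


def table_count_ratio(tier):
    """Return (low-end ratio, high-end ratio) of MVM to ECM total tables."""
    ratios = []
    for i in (0, 1):  # 0 = minimum of each range, 1 = maximum
        ecm_tables = ECM[tier]["domains"][i] * ECM[tier]["products_per_domain"][i]
        mvm_tables = MVM[tier]["domains"][i] * MVM[tier]["products_per_domain"][i]
        ratios.append(mvm_tables / ecm_tables)
    return tuple(ratios)


for tier in ECM:
    low, high = table_count_ratio(tier)
    print(f"{tier}: {low:.0%} to {high:.0%} of ECM table count")
```

For tier_1, for instance, the low end is 9 × 8 = 72 MVM tables against 15 × 14 = 210 ECM tables, which is about 34%.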


🚀 Getting Started

Quick Start Recipes

Recipe 1: Generate Your First Model (Minimal Input)

| Widget | Value |
|---|---|
| 01. Business | My Company Name |
| 02. Description | A brief description of what your company does |
| 03. Operation | new base model |
| 05. Model Scope | Minimum Viable Model - MVM |
| 09. Installation Catalog | my_catalog (or leave blank for logical-only) |
| Everything else | Defaults |

Run the notebook. Done.

Recipe 2: Generate a Rich Model (Recommended)

| Widget | Value |
|---|---|
| 01. Business | Contoso Manufacturing |
| 02. Description | A multinational aluminum smelting company operating across 12 countries with 15,000 employees |
| 03. Operation | new base model |
| 05. Model Scope | Expanded Coverage Model - ECM |
| 06. Business Domains | production, quality, supply, customer, sales, logistics, billing |
| 07. Org Divisions | Operations, Business and Corporate |
| 09. Installation Catalog | contoso_dev |
| 10. Sample Records | 10 |

Recipe 3: Vibe an Existing Model

| Widget | Value |
|---|---|
| 01. Business | Contoso Manufacturing |
| 02. Description | (same as before) |
| 03. Operation | vibe modeling of version |
| 04. Version | 1 |
| 08. Model Vibes | Add a warranty domain. Remove the corporate_strategy domain. Run quality checks. |

Recipe 4: Use Auto-Generated Next Vibes

After any pipeline run, find next_vibes.txt in the vibes/ output folder. Copy the suggested vibes into widget 08.

| Widget | Value |
|---|---|
| 01. Business | Contoso Manufacturing |
| 02. Description | (same as before) |
| 03. Operation | vibe modeling of version |
| 04. Version | Next version number (previous + 1) |
| 08. Model Vibes | (paste recommended vibes from next_vibes.txt) |

Recipe 5: Deploy to a New Catalog

| Widget | Value |
|---|---|
| 03. Operation | install model |
| 09. Installation Catalog | prod_catalog |
| 11. Model JSON File | /Volumes/catalog/schema/volume/vibes/contoso/v1_ecm/model.json |
| 10. Sample Records | 0 (or a number for test data) |

⚙️ Operations

The 03. Operation widget selects which pipeline mode to run:

| Operation | Purpose | Key Requirements |
|---|---|---|
| new base model | Generate a brand-new data model from scratch | Business name + description (via widgets) |
| vibe modeling of version | Apply natural-language instructions to refine an existing version | Version + vibes (widget 08) |
| shrink ecm | Convert an ECM to a leaner MVM | Version + deployment catalog |
| enlarge mvm | Expand an MVM into a comprehensive ECM | Version + deployment catalog |
| install model | Deploy a logical model into physical Unity Catalog objects | Model JSON file (widget 11) + deployment catalog |
| uninstall model version | Remove a version's physical artifacts from the catalog | Business name + version + catalog |
| generate sample data | Generate synthetic sample records for an existing deployed model | Model JSON file (widget 11) + deployment catalog |
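The Key Requirements column above can be encoded as a lookup, which is handy for pre-flight checks in a wrapper or UI before the notebook is launched. This is a sketch, not the notebook's own validation: the dictionary keys such as `model_vibes` and `installation_catalog` are hypothetical names for the widget values.

```python
# Required inputs per operation, transcribed from the Key Requirements column.
REQUIRED_WIDGETS = {
    "new base model": ["business", "description"],
    "vibe modeling of version": ["version", "model_vibes"],
    "shrink ecm": ["version", "installation_catalog"],
    "enlarge mvm": ["version", "installation_catalog"],
    "install model": ["model_json_file", "installation_catalog"],
    "uninstall model version": ["business", "version", "installation_catalog"],
    "generate sample data": ["model_json_file", "installation_catalog"],
}


def missing_widgets(operation, values):
    """Return the required inputs that are absent or blank for an operation."""
    if operation not in REQUIRED_WIDGETS:
        raise ValueError(f"unknown operation: {operation!r}")
    return [
        key for key in REQUIRED_WIDGETS[operation]
        if not str(values.get(key) or "").strip()
    ]
```

For example, `missing_widgets("install model", {"installation_catalog": "prod_catalog"})` reports that the model JSON path still needs to be supplied.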

🎵 The Vibing Workflow

This is the core power of Vibe Modelling — iterative refinement via natural language:

graph LR
    A["🆕 New Base Model<br/>Version 1"] --> B["🔍 Review Output"]
    B --> C["📝 Write Vibes<br/><i>or use auto-generated next_vibes.txt</i>"]
    C --> D["🎵 Vibe Modeling<br/>Version 2"]
    D --> E["🔍 Review"]
    E --> F["🎵 Vibe v3"]
    F --> G["..."]
    G --> H["✅ Deploy"]

What Vibes Can Do

Vibes are free-form natural language. The agent interprets them and translates them into specific actions:

"Add a compliance domain with regulatory_filing and audit_trail tables"
"Remove the HR domain, we don't need it"
"Merge the customer_support domain into the customer domain"
"Add a source_system column to every table"
"Run a full quality check and fix any issues found"
"The order table should have a shipping_address_id FK to the address table"
"Rename all tables starting with dim_ to remove that prefix"
"Normalize the order domain to 3NF"
"Mark all email columns as PII"
"Generate an ontology (RDF/RDFS) for the model"
"Keep only billing-related tables, drop everything else"
"Make this model ECM — I need the large version"

Auto-Generated Next Vibes

After every pipeline run, the agent produces next_vibes.txt and current_vibes.txt in the vibes/ folder, containing:

  • Your current business context (preserved as-is)
  • Recommended vibes for the next iteration (based on QA findings)
  • Model health metadata: confidence score, warning counts, issue breakdown
  • Version history and progression tracking

To use them: copy the recommended vibes into widget 08. Model Vibes, set operation to vibe modeling of version, set the next version, and run.


🎛️ Widget Reference

The notebook exposes 28 configurable widgets. Below is the complete reference.

Core Configuration Widgets (01–11)

| # | Widget | Type | Mandatory | Default | Description |
|---|---|---|---|---|---|
| 01 | Business (name) | Text | Yes | | Your business/organization name |
| 02 | Description | Text | Recommended | | What your business does (richer = better model) |
| 03 | Operation | Dropdown | Yes | new base model | Pipeline operation to run |
| 04 | Version | Dropdown | Conditional¹ | | Model version number (1–100) |
| 05 | Model Scope | Dropdown | Yes | Minimum Viable Model - MVM | MVM (lean) or ECM (comprehensive) |
| 06 | Business Domains | Text | No | | Comma-separated seed domains |
| 07 | Included Org Divisions | Dropdown | Yes | Operations and Business | Which divisions to include |
| 08 | Model Vibes | Text | Conditional² | | Natural language instructions: inline (max 2,000 chars) or file path to a .txt on a UC Volume |
| 09 | Installation Catalog | Text | Conditional³ | | Unity Catalog for physical deployment |
| 10 | Sample Records | Dropdown | No | 0 | Synthetic records per table (0 = none) |
| 11 | Model JSON File | Text | Conditional⁴ | | Path to a previously generated model.json for re-install or continuation |

¹ Required for all operations except new base model (auto-assigned).
² Required for vibe modeling of version.
³ Required for install, uninstall, generate sample data, shrink, enlarge. Optional for new base model and vibe modeling.
⁴ Required for install model and generate sample data when re-deploying a previously generated model.

Detailed Widget Descriptions

01. Business (name)

The name of your business. Used as the top-level identifier across the entire model. Case-insensitive matching (stored via LOWER()).

Sample values: Contoso Inc, Acme Healthcare, Global Telecom Corp, NextGen Retail

02. Description

A rich description of what your business does. Include industry, size, geography, key products/services. The LLM uses this to determine the industry complexity tier and tailor the model.

Sample values: A multinational aluminum smelting and manufacturing company operating across 12 countries, A digital-first healthcare provider specializing in telemedicine

03. Operation

Options: new base model | vibe modeling of version | shrink ecm | enlarge mvm | install model | uninstall model version | generate sample data

See the Operations section for full details on each mode.

04. Version

Options: Empty, or 1–100

For new base model, leave empty — auto-assigned as 1 (or auto-incremented). For vibe modeling of version, this is the version you are modifying; output creates version N+1.

05. Model Scope

Options: Minimum Viable Model - MVM | Expanded Coverage Model - ECM

MVM = lean core (fewer domains/tables, same attribute depth). ECM = comprehensive Fortune 100 coverage.

06. Business Domains

Comma-separated list of specific domains you want. If blank, the LLM auto-generates the optimal set for your industry.

Sample values: customer, sales, billing, inventory, logistics | clinical, pharmacy, research

07. Included Org Divisions

Options: Operations | Operations and Business | Operations, Business and Corporate

Controls which organizational divisions contribute domains to the model.

08. Model Vibes

Supports two input modes:

  • Inline text — type your vibe instructions directly into the widget (max 2,000 characters). Best for short, targeted changes.
  • File path — provide a path to a .txt file on a Unity Catalog Volume (e.g., /Volumes/catalog/schema/vol/my_vibes.txt). Best for longer, multi-paragraph instructions that exceed the inline limit.
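A caller that prepares the widget value may want to tell the two modes apart and enforce the inline limit up front. The sketch below is an assumption about how that could be done, not a description of the notebook's logic: treating any single-line value that starts with `/Volumes/` and ends in `.txt` as a file path is a heuristic chosen for this example.

```python
MAX_INLINE_CHARS = 2000  # inline limit stated for widget 08


def classify_vibes_input(value):
    """Return "file" or "inline" for a Model Vibes widget value.

    Raises ValueError when inline text exceeds the documented limit.
    """
    text = value.strip()
    looks_like_path = (
        text.startswith("/Volumes/")
        and text.lower().endswith(".txt")
        and "\n" not in text
    )
    if looks_like_path:
        return "file"
    if len(text) > MAX_INLINE_CHARS:
        raise ValueError(
            f"inline vibes are {len(text)} chars; limit is {MAX_INLINE_CHARS}. "
            "Save them to a .txt file on a UC Volume and pass the path instead."
        )
    return "inline"
```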

See The Vibing Workflow for examples of what vibes can do.

09. Installation Catalog

The Unity Catalog where physical schemas, tables, FK constraints, tags, and sample data will be created. Must already exist. You need CREATE SCHEMA privileges. If blank for new base model/vibe modeling, only the logical model (JSON artifacts) is produced.

Sample values: dev_catalog, prod_data_models, contoso_lakehouse

10. Sample Records

Options: 0, 5, 10, 15, 20, 25, 50, 100

0 = skip sample data generation. 10 = good default for review. 50–100 = for load testing / demos. Sample data respects FK relationships (child records reference valid parent IDs).

11. Model JSON File

(Optional) Path to a previously generated model.json file on a Unity Catalog Volume. Used when re-installing or continuing from a prior run. After every run, the agent generates next_vibes.txt in the vibes/ folder — review it for recommended next vibe instructions.

Sample values: /Volumes/my_catalog/my_schema/vol/vibes/contoso/v1_ecm/model.json

Model Convention Widgets (12–24)

| # | Widget | Type | Default | Options / Format |
|---|---|---|---|---|
| 12 | Naming Convention | Dropdown | snake_case | snake_case, camelCase, PascalCase, SCREAMING_CASE |
| 13 | Primary Key Suffix | Dropdown | _id | _id, _key, _pk, id, key |
| 15 | Schema Prefix | Text | (empty) | e.g., stg_, raw_, dw_ |
| 16 | Tag Prefix | Text | dbx_ | e.g., dbx_, vibe_, mdl_ |
| 17 | Table ID Type | Dropdown | BIGINT | BIGINT, INT, LONG, STRING |
| 18 | Boolean Format | Dropdown | Boolean (True/False) | Boolean (True/False), Int (0/1), String (Y/N) |
| 19 | Date Format | Dropdown | yyyy-MM-dd | yyyy-MM-dd, dd/MM/yyyy, MM/dd/yyyy, yyyy/MM/dd, dd-MM-yyyy |
| 20 | Timestamp Format | Dropdown | yyyy-MM-dd'T'HH:mm:ss.SSSXXX | 4 ISO/standard options |
| 21 | Classification Levels | Text | restricted=restricted, confidential=confidential, internal=Internal, public=public | Comma-separated key=Label pairs |
| 22 | Housekeeping Columns | Dropdown | No | No, Yes (adds created_by, created_at, updated_by, updated_at) |
| 23 | History Tracking Columns | Dropdown | No | No, Yes (adds valid_from, valid_to, is_current; SCD Type 2) |
| 09a | Cataloging Style | Dropdown | One Catalog | One Catalog, Catalog per Division, Catalog per Domain |
| 09b | Catalog Prefix | Text | (empty) | e.g., dev_, prod_ |
| 09c | Catalog Suffix | Text | (empty) | e.g., _lakehouse, _dw |
| 15a | Schema Suffix | Text | (empty) | e.g., _db, _schema |
| 16a | Tag Suffix | Text | (empty) | e.g., _tag |
| 24 | Vibe Session ID | Text | (empty) | UUID for external UI progress tracking |

Widget #14 does not exist (numbering gap between 13 and 15).


📄 Auto-Generated Next Vibes

After every run, the agent produces next_vibes.txt in the vibes/ output folder with recommended next vibe instructions. Copy these into widget 08. Model Vibes for the next iteration.

The model.json file includes session metadata:

{
  "_next_vibe_metadata": {
    "generated_from_version": "v1_mvm",
    "model_version": "1",
    "status": "needs_work",
    "confidence_score": 78,
    "summary": "Model has 3 unlinked columns and 1 siloed table",
    "issues_addressed": ["..."],
    "issues_not_addressed": ["..."],
    "data_modeler_notes": "Recommend adding a warehouse domain",
    "model_stats_at_generation": { "domains": 8, "products": 47, "attributes": 523 },
    "issue_counts": { "error": 0, "warning": 3, "info": 5 },
    "version_history": ["..."]
  }
}
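A consumer can read this block to decide whether another iteration is worth running. The following is a minimal sketch that only uses fields shown in the example above; the function name `summarize_next_vibe_metadata` is hypothetical, and no meaning is assumed for the score beyond what the field names suggest.

```python
import json


def summarize_next_vibe_metadata(model_json_text):
    """Build a one-line health summary from a model.json document (as text)."""
    meta = json.loads(model_json_text).get("_next_vibe_metadata", {})
    counts = meta.get("issue_counts", {})
    return (
        f"version {meta.get('model_version', '?')}: "
        f"{meta.get('status', 'unknown')}, "
        f"confidence {meta.get('confidence_score', '?')}, "
        f"{counts.get('error', 0)} errors / {counts.get('warning', 0)} warnings"
    )
```

Applied to the example metadata above, this yields `version 1: needs_work, confidence 78, 0 errors / 3 warnings`. In a notebook, the text would typically come from reading the model.json path on the Volume with a regular file open.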

🎯 The Complete Action Catalog

Vibes are translated into 190+ specific actions organized into categories. You never need to name these actions — just describe what you want in natural language and the agent maps your intent.

Entity Management (1–25)

| # | Action | What It Does |
|---|---|---|
| 1 | drop | Remove a domain, product, attribute, tag, or link |
| 2 | create | Add a new domain, product, or attribute |
| 3 | rename | Change the name of an entity |
| 4 | alter_description | Modify the description |
| 5 | change_type | Change an attribute's data type |
| 6 | merge | Combine two entities into one |
| 7 | split | Divide one entity into multiple |
| 8 | move_product | Move a product to another domain |
| 9 | move_attribute | Move an attribute to another product |
| 10 | delete_attribute | Remove an attribute |
| 11 | create_link | Create a foreign key relationship |
| 12 | drop_link | Remove a foreign key relationship |
| 13–17 | Prefix/suffix ops | Add, remove, or change name prefixes/suffixes |
| 18–24 | Metadata ops | Update glossary terms, references, regex, tags |
| 25 | modify | Regenerate an entity with specific guidance |

Quality Check & Analysis (26–38)

| # | Action | What It Does |
|---|---|---|
| 26 | run_quality_checks | Compound: runs detect_duplicates + dedupe_attributes + detect_cycles + detect_siloed + fix_fk_anomalies |
| 26b | run_product_domain_fit | LLM audit: are products in the right domains? Relocates misplaced ones. |
| 27 | detect_duplicates | Find SSOT violations (semantic duplicate products across domains) |
| 28 | fix_duplicates | Auto-merge/remove detected duplicates |
| 29 | dedupe_attributes | Remove duplicate columns within products |
| 30 | detect_cycles | Find circular FK dependencies |
| 31 | break_cycles | Auto-fix cycles by removing weakest FK links |
| 32 | detect_siloed | Find completely disconnected tables (no FKs in any direction) |
| 33 | fix_siloed | Connect disconnected tables via linking attempts |
| 34 | review_links | Audit all FK relationships for anomalies |
| 35 | fix_fk_anomalies | Repair broken, mismatched, or orphaned FK references |
| 36 | fix_ambiguous_fks | Resolve FKs that match multiple target tables |
| 37 | merge_small_tables | Analyze and consolidate tables with < 5 attributes |
| 38 | identify_core_products | Identify business-critical products for protection |
| 95 | model_checkup | Mega-compound: runs static analysis + auto-queues ALL appropriate fixes |

FK & Linking Operations (39–46, 133–138)

| # | Action | What It Does |
|---|---|---|
| 39 | run_linking | Compound: in-domain + cross-domain + M:N detection |
| 40 | run_in_domain_linking | Link FKs within each domain |
| 41 | run_cross_domain_linking | Link FKs across different domains |
| 42 | detect_many_to_many | Find potential M:N relationships |
| 43 | create_junction_tables | Create bridge/association tables for M:N |
| 45 | redirect_fk | Redirect FK to a different target table |
| 46 | find_unlinked_columns | Find *_id columns without FK relationships |
| 133 | remove_product_prefix | Remove redundant table-name prefix from columns |
| 134 | fix_fk_column_naming | Fix FK columns that don't end with the target PK name |
| 135 | connect_table | Connect a specific disconnected table |
| 136 | link_specific_columns | Link an explicit list of unlinked _id columns |
| 138 | find_missing_fk_links | Comprehensive: LLM classifies each unlinked column as LINK, CREATE, DROP, or KEEP_AS_IS |

Bulk Operations (62–64, 110–114)

| # | Action | What It Does |
|---|---|---|
| 62 | bulk_rename_products | Rename multiple products by pattern |
| 63 | bulk_drop_products | Drop products matching a pattern (e.g., stg_*) |
| 64 | bulk_move_products | Move products matching a pattern to another domain |
| 110 | bulk_change_type | Change data type for attributes by pattern |
| 111 | bulk_set_nullable | Set nullable for attributes by pattern |
| 113 | bulk_remove_attributes | Remove attributes matching a pattern from all tables |

Tag & Classification Operations (54–61, 98–106, 156–163)

| # | Action | What It Does |
|---|---|---|
| 54–57 | Tag add/remove | Add or remove tags from products or domains |
| 58 | conditional_tag | Tag entities matching a condition |
| 59–60 | Bulk tagging | Tag by name pattern or by column presence |
| 98–101 | PII/sensitive marking | Mark as PII, sensitive, encrypted, deprecated |
| 103 | set_table_type | Classify as dimension, fact, lookup, bridge, staging, archive |
| 156 | classify_table_tier | Medallion architecture: bronze / silver / gold |
| 161 | set_data_owner | Assign data owner/steward |
| 162 | set_update_frequency | Document expected freshness (real_time, daily, monthly, etc.) |
| 163 | map_to_source_system | Map table/column to source system (SAP, Salesforce, etc.) |

Template Column Actions (148–155)

| # | Action | Columns Added |
|---|---|---|
| 148 | add_scd_columns | effective_from, effective_to, is_current, row_hash |
| 149 | add_audit_columns | created_at, updated_at, created_by, updated_by |
| 150 | add_soft_delete_columns | is_deleted, deleted_at, deleted_by |
| 151 | add_temporal_columns | valid_from, valid_to, system_from, system_to |
| 152 | add_versioning_columns | version_number, version_valid_from, version_valid_to, is_latest_version |
| 153 | add_multitenancy_columns | tenant_id |
| 154 | add_lineage_columns | source_system, source_table, ingestion_timestamp, etl_job_id |
| 155 | add_gdpr_columns | consent_status, consent_date, data_subject_request_id, right_to_erasure_date |

Structural Transformations (164–191)

| # | Action | What It Does |
|---|---|---|
| 164 | normalize_to_3nf | Apply Third Normal Form normalization (LLM-powered) |
| 165 | denormalize_for_analytics | Create wide/denormalized tables for BI |
| 184 | promote_to_table | Extract an attribute into its own lookup/reference table + FK |
| 185 | inline_table | Merge a child table into its parent (denormalize) |
| 186 | swap_domains | Atomically swap two domain names |
| 187 | impact_analysis | Show what would break if a table/domain were dropped (read-only) |
| 190 | enlarge_model | Wholesale expansion from MVM to ECM scope |
| 191 | shrink_model | Wholesale reduction from ECM to MVM scope |
| 178 | VIBE_PRUNE_PROMPT | LLM-powered: keep only tables related to a focus area |
| 179 | drop_domains_except | Drop all domains except a specified keep-list |

Artifact Generation (139–147)

| # | Action | Output |
|---|---|---|
| 139 | generate_readme | README documentation |
| 140 | generate_data_model_json | Complete model as JSON |
| 141 | generate_ontology | RDF/RDFS ontology (Turtle format) |
| 142 | generate_dbml | DBML schema for visualization tools |
| 143 | generate_release_notes | Changelog/release notes for the version |
| 144 | generate_excel | Excel workbook export |
| 145 | generate_data_dictionary | Comprehensive data dictionary |
| 146 | export_model_report | Full model documentation report |
| 147 | generate_test_cases | Data quality test case specifications |

Metric View Operations (126–134)

| # | Action | What It Does |
|---|---|---|
| 126 | run_metric_modeling | Generate Databricks metric views (dimensions + measures) |
| 129 | add_metric_measure | Add a specific KPI measure to metric views |
| 130 | remove_metric_measure | Remove a measure |
| 131 | add_metric_dimension | Add a grouping dimension |
| 132 | remove_metric_dimension | Remove a dimension |
| 133 | alter_metric_filter | Set/update metric view filter logic |
| 134 | drop_metric_view | Remove an entire metric view |

🔄 Pipeline Stages

When you run the agent, it executes these progress stages in order (the # column is for reference only — no numeric stage IDs are emitted in progress events):

| # | Stage | Duration | What Happens |
|---|---|---|---|
| 1 | Setup and Configuration | 2–10s | Validates inputs, creates metamodel tables |
| 2 | Interpreting Instructions | 10–30s | (Vibe mode only) Parses vibes into structured action plan |
| 3 | Collecting Business Context | 10–30s | LLM enriches your description across business/industry dimensions |
| 4 | Designing Domains | 15–60s | Generates domains following division model + SSOT |
| 5 | Creating Data Products | 1–10m | Products per domain with architect review |
| 6 | Enriching Data Products with Attributes | 5–40m | All columns for every product |
| 7 | Cross-Domain Linking | 1–5m | In-domain → global sweep → pairwise comparison |
| 8 | Quality Assurance | 30s–3m | 9 sub-checks (naming, PK/types, overlaps, topology, auto-remediation) |
| 9 | Applying Naming Conventions | 10–30s | Final naming consistency pass |
| 10 | Model Finalization | 10–30s | Finalize logical model snapshot before physical deployment |
| 11 | Subdomain Allocation | 10–30s | Allocates products into semantic subdomains |
| 19 | Generating Metric View Artifacts | 10–30s | Exports metric view definitions (legacy stage ID) |
| 17 | Generating Artifacts | 30s–2m | README, Excel/CSV, model JSON, data dictionary, model report |
| 12 | Physical Schema Construction | 1–10m | Creates UC schemas + Delta tables (if catalog set) |
| 13 | Applying Foreign Keys | 30s–2m | FK constraints on physical tables |
| 14 | Applying Tags | 2–15m | Classification, PII, data type tags on schemas/tables/columns |
| 15 | Applying Metric Views | 30s–5m | Databricks metric views for KPI tracking |
| 16 | Generating Sample Data | 1–15m | Synthetic records respecting FK relationships |
| 18 | Consolidation and Cleanup | 10–30s | Consolidates/merges metadata and cleanup |

📦 Output Artifacts

Logical Artifacts (Always Generated)

| File | Description |
|---|---|
| `model.json` | Complete model export (primary output) |
| `readme.md` | Human-readable model documentation |
| `vibes/current_vibes.txt` | Vibe instructions used in this run |
| `vibes/next_vibes.txt` | Recommended next vibe instructions |
| `domains.json` | All domain definitions |
| `products.json` | All product (table) definitions |
| `attributes.json` | All attribute (column) definitions |
| `docs/*.xlsx` / `docs/*.csv` | Excel/CSV export of the entire model (CSV fallback if openpyxl unavailable) |
| `readme.md` (parent folder) | Model overview document comparing MVM and ECM scopes |
| `schemas/*.sql` | SQL DDL files (per-domain schemas, cross-domain FKs, catalogs) |
| `diagram/*_dbml_*.txt` | DBML schema diagram |
| `ontology/*_rdf_*.ttl` | RDF/Turtle ontology representation |
| `docs/releasenotes.txt` | Auto-generated release notes |
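
After a run, it can be handy to confirm that the always-generated files are present. The following is a minimal sketch that assumes the listed paths are relative to a single output folder for the model version; where that folder lives is not specified here, so `output_dir` is a placeholder and `missing_artifacts` is a hypothetical helper, not an agent API.

```python
from pathlib import Path

# Fixed file names from the artifact table, relative to the version output folder.
EXPECTED_FILES = [
    "model.json",
    "readme.md",
    "domains.json",
    "products.json",
    "attributes.json",
    "vibes/current_vibes.txt",
    "vibes/next_vibes.txt",
    "docs/releasenotes.txt",
]

# Artifacts whose names contain a model- or version-specific part.
EXPECTED_PATTERNS = ["schemas/*.sql", "diagram/*_dbml_*.txt", "ontology/*_rdf_*.ttl"]

# The tabular export is Excel when openpyxl is available, CSV otherwise.
EXPORT_PATTERNS = ["docs/*.xlsx", "docs/*.csv"]


def missing_artifacts(output_dir):
    """Return the expected files or patterns that have no match under output_dir."""
    root = Path(output_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [pattern for pattern in EXPECTED_PATTERNS if not any(root.glob(pattern))]
    if not any(any(root.glob(pattern)) for pattern in EXPORT_PATTERNS):
        missing.append(" or ".join(EXPORT_PATTERNS))
    return missing


if __name__ == "__main__":
    # "./output" is a placeholder path for illustration only.
    for item in missing_artifacts("./output"):
        print("missing:", item)
```

The parent-folder `readme.md` is left out of the check because it sits outside the version folder in this assumed layout.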

Conditional Artifacts (Generated on Demand via Queued Operations)

| File | Description |
|---|---|
| `docs/*_data_dictionary_*.txt` | Column-level reference guide |
| `docs/*_model_report_*.txt` | Full documentation report |
| `docs/*_test_cases_*.txt` | Generated test cases |

Physical Deployment (When Catalog Is Set)

| Object | Example |
|---|---|
| Schemas | `catalog.customer`, `catalog.sales`, `catalog.logistics` |
| Tables | Delta tables with all columns and correct Spark SQL types |
| FK Constraints | Physical foreign key constraints between tables |
| Tags | Unity Catalog tags on schemas, tables, and columns |
| Metric Views | Databricks metric views for KPI calculation |
| Sample Data | Synthetic records with valid FK references |
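
"Valid FK references" in sample data follows from the Sample Data rule (G13) in the quality rules: primary keys are sequential from 10001, and foreign key values are drawn at random from [10001, 10001+N-1]. The sketch below illustrates that arithmetic only. It assumes N is the record count of the referenced (parent) table, and both function names are hypothetical.

```python
import random

PK_START = 10001  # rule G13: sequential PKs start at 10001


def generate_primary_keys(record_count):
    """Sequential PK values: 10001, 10002, ..., 10001 + record_count - 1."""
    return list(range(PK_START, PK_START + record_count))


def generate_foreign_keys(child_count, parent_count, seed=None):
    """Random FK values in [10001, 10001 + parent_count - 1], inclusive.

    Assumption: the upper bound uses the parent table's record count, so every
    FK value matches an existing parent PK.
    """
    rng = random.Random(seed)
    upper = PK_START + parent_count - 1
    return [rng.randint(PK_START, upper) for _ in range(child_count)]


if __name__ == "__main__":
    customer_ids = generate_primary_keys(5)
    order_customer_ids = generate_foreign_keys(child_count=8, parent_count=5, seed=42)
    # Every generated FK points at an existing parent row.
    print(set(order_customer_ids) <= set(customer_ids))
```

Because both ranges start at the same constant, no lookup against the parent table is needed to keep references valid; only the parent's record count has to be known.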

🛡️ Quality Rules & Enforcement

The agent enforces a comprehensive rule system during generation and QA:

| Rule Group | Key Rules |
|---|---|
| Naming (G01) | snake_case default; domains: 1 word, singular, max 20 chars; products: 1–3 words, max 30 chars, no domain prefix; attributes: max 50 chars, no product prefix (except PK); FKs must end with target PK name; preserve unit qualifiers |
| Semantic Dedup (G02) | First-Class Entity Test (5 criteria); SSOT violation detection; 60%+ overlap → merge to shared; 90%+ overlap same domain → remove; attribute dedup >80% confidence; semantic distinction rules (method vs channel, target vs actual, lifecycle timestamps) |
| FK Integrity (G03) | FK target must exist; no bidirectional FKs; DAG required (no cycles); FK type compatibility; system identifiers exempted; hierarchical self-refs exempt |
| PK Rules (G04) | Every product has exactly 1 PK; PK = `{product}_{suffix}`; PK type = configured type (default BIGINT); PK exempt from prefix stripping |
| Normalization (G05) | 3NF enforcement; orphaned FK detection; denormalized attribute detection with >95% confidence; point-in-time snapshots exempt; measurement attributes exempt; geographic coordinates on physical entities exempt |
| Division Balance (G06) | Ops + Business ≥ 80% of domains; Corporate ≤ 20%; no Corporate until Ops AND Business each have ≥ 2; cross-division relocation forbidden; forbidden generic domain names; Org Chart Test; Fragmentation Test (30%+ overlap → merge) |
| Data Types (G07) | Spark SQL types only; no ARRAY/STRUCT/MAP; no calculated metrics/aggregates |
| Tags (G08) | Classification in tags only; PII = RESTRICTED + specific PII tag (`pii_email`, `pii_phone`, `pii_financial`, `pii_health`, `pii_identifier`, `pii_address`, `pii_biometric`, `pii_name`, `pii_dob`); empty tags for regular data; custom user tags always applied |
| Graph Topology (G09) | DAG enforcement; DFS cycle detection; hierarchical self-ref exemption; zero siloed tables; each domain ≥2 cross-domain connections; parent-child FKs never broken in cycle resolution |
| Honesty Check (G11) | 0–100 scale; below 55 = permanently discarded; 55–70 = borderline retry; ≥90 = accepted; contradiction penalty applied post-processing |
| Product Design (G12) | M:N requires 3 indicators with ≥2 of 3 strong; association ratio ECM ≤15%, MVM ≤5%; core products 1–3 per domain; forbidden product suffixes (`_analysis`, `_analytics`, `_report`, `_prediction`); Silver Layer only (no analytics products) |
| Sample Data (G13) | Exact record count; sequential PK from 10001; FK random [10001, 10001+N-1]; regex compliance; 3-letter country codes; no Lorem Ipsum; realistic business data |
| Vibe Constraints (G14) | Dedup overlap thresholds; max relocation percentage per pass; normalization confidence 95%; mutation budgets by mode (surgical, holistic, generative); domain hard ceiling factor 1.5x |
| Physical Schema Deployment (G15) | Schema and table creation factories; attribute dict factory; product dict factory; FK dependency ordering; consolidation and cleanup |
| Subdomains | Exactly 2-word names; min products per subdomain per tier; no overlapping words; balanced distribution; no placeholder names |
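
The DAG requirement in FK Integrity (G03) and Graph Topology (G09) relies on DFS cycle detection. A minimal sketch of that technique is shown below; it is an illustration under stated assumptions, not the agent's own implementation. The function name `find_fk_cycle` and the edge representation `(child_table, parent_table)` are hypothetical.

```python
def find_fk_cycle(fk_edges):
    """Return one FK cycle as a list of table names, or None if the graph is a DAG.

    fk_edges: iterable of (child_table, parent_table) pairs, one per foreign key.
    Self-references are skipped, mirroring the hierarchical self-ref exemption.
    """
    graph = {}
    for child, parent in fk_edges:
        graph.setdefault(child, set())
        graph.setdefault(parent, set())
        if child != parent:
            graph[child].add(parent)

    # Classic three-state DFS: unvisited, on the current path, fully explored.
    UNVISITED, ON_PATH, DONE = 0, 1, 2
    state = {table: UNVISITED for table in graph}
    path = []

    def visit(table):
        state[table] = ON_PATH
        path.append(table)
        for target in sorted(graph[table]):
            if state[target] == ON_PATH:
                # Back edge: the cycle is the path from target to here, closed.
                return path[path.index(target):] + [target]
            if state[target] == UNVISITED:
                cycle = visit(target)
                if cycle:
                    return cycle
        path.pop()
        state[table] = DONE
        return None

    for table in sorted(graph):
        if state[table] == UNVISITED:
            cycle = visit(table)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    edges = [("order", "customer"), ("order_line", "order"), ("employee", "employee")]
    print(find_fk_cycle(edges))                            # None: valid DAG
    print(find_fk_cycle(edges + [("customer", "order")]))  # a 2-table cycle
```

A bidirectional FK pair is simply a cycle of length two, so the same check covers the "no bidirectional FKs" rule. Which edge to drop once a cycle is found is a separate decision; the rules only state that parent-child FKs are never the ones broken.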

🤖 LLM Architecture

The agent uses a multi-model ensemble with automatic demotion and recovery:

| Order | Role | Purpose | Endpoint | Input Tokens | Output Tokens |
|---|---|---|---|---|---|
| 10 | Thinker (large) | Complex reasoning, architecture reviews, QA decisions | `databricks-claude-opus-4-6` | 200,000 | 128,000 |
| 20 | Worker (large) | High-volume generation: products, attributes, FKs, dedup | `databricks-claude-sonnet-4-6` | 200,000 | 64,000 |
| 30 | Thinker (large) | Fallback thinker | `databricks-claude-opus-4-5` | 200,000 | 64,000 |
| 40 | Worker (large) | Fallback worker | `databricks-claude-sonnet-4-5` | 200,000 | 64,000 |
| 50 | Worker (small) | Simpler tasks: domain generation, tag classification | `databricks-gpt-oss-120b` | 131,072 | 25,000 |
| 60 | Worker (tiny) | Sample data generation | `databricks-gpt-oss-20b` | 131,072 | 25,000 |

Automatic model demotion: After 3 cumulative failures on a model, the priority order is adjusted: the failing model is pushed down and the next model takes its place. After 5 consecutive successes, the original order is restored. After 3 consecutive timeouts, a model is marked broken and skipped for the rest of the session.
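
The demotion thresholds can be pictured as a small state machine. The sketch below is one possible reading, not the agent's actual code. Its assumptions: "pushed down" means swapping with the next model in the order, the failure count resets after a demotion, the success streak is counted across all calls, and timeouts are tracked separately from other failures. The class name `ModelRouter` is hypothetical.

```python
class ModelRouter:
    """Tracks endpoint priority using the thresholds described above."""

    FAILURES_TO_DEMOTE = 3    # cumulative failures on one model
    SUCCESSES_TO_RESTORE = 5  # consecutive successes
    TIMEOUTS_TO_BREAK = 3     # consecutive timeouts on one model

    def __init__(self, endpoints):
        self.original = list(endpoints)
        self.order = list(endpoints)
        self.failures = {e: 0 for e in endpoints}
        self.timeouts = {e: 0 for e in endpoints}
        self.broken = set()
        self.success_streak = 0

    def active(self):
        """Endpoints in current priority order, skipping broken ones."""
        return [e for e in self.order if e not in self.broken]

    def record_success(self, endpoint):
        self.timeouts[endpoint] = 0
        self.success_streak += 1
        if self.success_streak >= self.SUCCESSES_TO_RESTORE:
            # Restore the configured order and start counting afresh.
            self.order = list(self.original)
            self.failures = {e: 0 for e in self.original}
            self.success_streak = 0

    def record_failure(self, endpoint):
        self.success_streak = 0
        self.timeouts[endpoint] = 0
        self.failures[endpoint] += 1
        if self.failures[endpoint] >= self.FAILURES_TO_DEMOTE:
            self._demote(endpoint)
            self.failures[endpoint] = 0

    def record_timeout(self, endpoint):
        self.success_streak = 0
        self.timeouts[endpoint] += 1
        if self.timeouts[endpoint] >= self.TIMEOUTS_TO_BREAK:
            self.broken.add(endpoint)  # skipped for the rest of the session

    def _demote(self, endpoint):
        i = self.order.index(endpoint)
        if i + 1 < len(self.order):
            self.order[i], self.order[i + 1] = self.order[i + 1], self.order[i]


if __name__ == "__main__":
    router = ModelRouter(["databricks-claude-opus-4-6", "databricks-claude-opus-4-5"])
    for _ in range(3):
        router.record_failure("databricks-claude-opus-4-6")
    print(router.active()[0])  # the fallback thinker now comes first
```

Keeping the original order in a separate list makes restoration a plain copy, so repeated demotions never accumulate drift in the configured priorities.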

Prompt architecture: 49 specialized prompts, each mapped to a specific model role and temperature setting. Thinker prompts use temperature 0 for deterministic reasoning. Worker prompts use temperature 0–0.3 for controlled generation. Sample generation uses temperature 0.5 for creative variety.


📊 Metric Views

The agent generates Databricks metric views — reusable KPI definitions that sit on top of your data tables:

| Component | Description | Example |
|---|---|---|
| Dimensions | Grouping columns | `region`, `product_category`, `fiscal_quarter` |
| Measures | Single-aggregate expressions | `SUM(revenue)`, `COUNT(DISTINCT customer_id)`, `AVG(order_value)` |
| Filters | Row-level predicates | `WHERE status = 'completed'` |

Metric views are auto-generated per domain, focusing on KPIs that would appear in executive dashboards and quarterly business reviews. Each measure uses a single aggregate function (nested aggregates like AVG(SUM(...)) are not supported by metric view YAML and are automatically prevented).
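
To make the single-aggregate constraint concrete, here is a small sketch of how a measure expression could be screened for nested aggregates before it is written into a metric view. This is not the agent's validator. The aggregate list is an assumed subset, string literals containing parentheses are not handled, and `has_nested_aggregate` is a hypothetical name.

```python
import re

# Assumed subset of aggregate functions; extend as needed.
AGGREGATES = ("SUM", "COUNT", "AVG", "MIN", "MAX")
_AGG_CALL = re.compile(r"\b(?:" + "|".join(AGGREGATES) + r")\s*\(", re.IGNORECASE)


def has_nested_aggregate(expr):
    """Return True if an aggregate call appears inside another aggregate call."""
    open_aggregates = []  # paren depths at which an aggregate call is still open
    depth = 0
    i = 0
    while i < len(expr):
        match = _AGG_CALL.match(expr, i)
        if match:
            if open_aggregates:
                return True  # an aggregate opened while another is still open
            depth += 1
            open_aggregates.append(depth)
            i = match.end()
            continue
        char = expr[i]
        if char == "(":
            depth += 1
        elif char == ")":
            if open_aggregates and open_aggregates[-1] == depth:
                open_aggregates.pop()
            depth -= 1
        i += 1
    return False


if __name__ == "__main__":
    for measure in ("SUM(revenue)", "COUNT(DISTINCT customer_id)", "AVG(SUM(revenue))"):
        print(measure, "->", "nested" if has_nested_aggregate(measure) else "ok")
```

Expressions that combine sibling aggregates, such as `SUM(revenue) / COUNT(order_id)`, pass this check because neither call is inside the other. Whether metric views accept such ratios is a separate question this sketch does not answer.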


🔧 Troubleshooting

| Symptom | Likely Cause | Resolution |
|---|---|---|
| "No deployment catalog specified — skipping physical model deployment" | Widget 09 is empty | Set deployment catalog to deploy physical tables |
| Model has too few domains | MVM scope + tier_5 classification | Switch to ECM, or provide seed domains in widget 06 |
| Model has irrelevant domains | LLM inferred from description | Vibe: "Remove the X domain" |
| FK pointing to wrong table | LLM linked incorrectly | Vibe: "Redirect order.warehouse_id FK to logistics.warehouse" |
| Circular dependency warning | DAG violation | Agent auto-fixes during QA; if persistent: "Break cycles" |
| SSOT violation (duplicate entities) | Same concept in multiple domains | Vibe: "Run quality checks" or "Fix duplicates" |
| Pipeline crashed mid-run | LLM timeout or transient error | Re-run with same inputs; the agent cleans up incomplete versions |
| Model JSON file parse error | Invalid JSON (smart quotes, trailing commas) | Validate JSON; use straight quotes only |
| "Version X exists but is incomplete" | Previous run failed | Agent auto-detects; just re-run |
| Model too large / too many tables | ECM + high complexity tier | Use shrink ecm, or vibe: "Keep only core business tables" |
| Convention changes not applied | Convention widgets not set | Update the model convention widgets (12–24) with your desired values before running |
| Metric views have COUNT(1) instead of real KPIs | Nested aggregates were auto-replaced | LLM prompt prevents this; re-run metrics: "Regenerate metrics" |
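
For the "Model JSON file parse error" row, a quick local check can pinpoint the problem before re-running. The sketch below uses only the Python standard library. The function name `diagnose_model_json` is hypothetical, and the smart-quote and trailing-comma checks are heuristics that run only after parsing has already failed.

```python
import json
import re

SMART_QUOTES = "\u201c\u201d\u2018\u2019"  # curly double and single quotes


def diagnose_model_json(text):
    """Return a list of human-readable problems; empty if text is valid JSON."""
    problems = []
    try:
        json.loads(text)
        return problems
    except json.JSONDecodeError as err:
        problems.append(f"line {err.lineno}, column {err.colno}: {err.msg}")

    # Heuristic hints for the two causes named in the troubleshooting table.
    for line_no, line in enumerate(text.splitlines(), start=1):
        if any(quote in line for quote in SMART_QUOTES):
            problems.append(f"line {line_no}: smart quote found; use straight quotes")
    if re.search(r",\s*[}\]]", text):
        problems.append("trailing comma before a closing brace or bracket")
    return problems


if __name__ == "__main__":
    sample = '{\u201cname\u201d: "retail", "domains": ["sales", "customer",]}'
    for problem in diagnose_model_json(sample):
        print(problem)
```

Smart quotes inside an otherwise valid string value are legal JSON, which is why the hints are only produced when `json.loads` rejects the input.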

📘 Glossary

| Term | Definition |
|---|---|
| Attribute | A column in a product (table); has name, type, tags, description |
| Business Data Model | A data model tailored to a specific organization, generated by Vibe Modelling |
| Corporate Division | Supporting functions (HR, Finance, Legal) that enable but don't directly generate revenue |
| DAG | Directed Acyclic Graph: the required topology for FK relationships (no cycles) |
| Division | Top-level organizational grouping: Operations, Business, or Corporate |
| Domain | A logical grouping of related tables, deployed as a Unity Catalog schema |
| ECM | Expanded Coverage Model: comprehensive enterprise-grade model scope |
| FK | Foreign Key: a column referencing another table's primary key |
| Honesty Check | LLM self-assessment score (0–100%) to ensure output quality |
| Industry Data Model | A generic, vendor-published schema template for an entire industry vertical |
| Junction Table | An association table resolving M:N relationships between two entities |
| Metric View | A Databricks KPI definition with dimensions, measures, and filters |
| MVM | Minimum Viable Model: lean, production-ready scope (30–50% of ECM) |
| PK | Primary Key: the unique identifier column for a table |
| Product | A data table within a domain; a first-class business entity |
| SSOT | Single Source of Truth: each business concept has one and only one authoritative owner |
| Tier | Industry complexity classification (tier_1 = Ultra-Complex to tier_5 = Simple) |
| Vibe | A natural language instruction provided to the agent to modify the model |
| Vibe Modelling | The iterative process of generating and refining data models using natural language |

Built on Databricks Serverless Compute with Unity Catalog governance
