Data modeling is the discipline of translating business processes, operational constraints, and analytical needs into durable data structures. For a senior data engineer, it is not a documentation exercise. It is a control surface for correctness, performance, governance, cost, and long-term change velocity.
Well-modeled systems make downstream work easier: analytics becomes interpretable, machine learning features become reproducible, and operational services can evolve without every schema change turning into a breaking event. Poor modeling creates the opposite outcome: brittle joins, duplicate logic, silent metric drift, and expensive rework.
Data modeling defines:
- what entities exist
- how those entities relate to each other
- what business events should be recorded
- what grain and history should be preserved
- how the model should behave under growth, change, and audit requirements
At enterprise scale, modeling decisions directly affect:
- query performance
- storage footprint
- semantic consistency across teams
- data quality and testability
- compliance and traceability
- feature reproducibility for ML
- the effort required to onboard new consumers
The correct question is rarely "What is the ideal schema?" The better question is "What model best serves the operating and analytical reality of this business with acceptable complexity?"
These are separate layers. Teams that skip that separation usually mix business semantics with engine-specific implementation too early.
| Model Type | Purpose | Audience | Typical Content | Common Mistake |
|---|---|---|---|---|
| Conceptual | Define core business concepts and boundaries | Business stakeholders, architects | Customer, Order, Product, Subscription | Treating conceptual entities as tables too early |
| Logical | Define attributes, relationships, keys, and business rules | Data engineers, analytics engineers, architects | Entity relationships, cardinality, optionality, domains | Ignoring grain and temporal rules |
| Physical | Implement for a specific engine and workload | Data engineers, DBAs, platform engineers | Partitioning, clustering, indexes, file layout, compression | Copying logical design without tuning for access patterns |
For a commerce platform:
- Conceptual model:
Customer,Order,Payment,Shipment - Logical model:
Orderhas manyOrderLine;Paymentbelongs to oneOrder; anOrdermay have multiple status changes over time - Physical model:
- OLTP database with normalized
orders,order_lines,payments - warehouse fact table
fact_orders - dimensions
dim_customer,dim_product,dim_date
- OLTP database with normalized
If you jump directly into physical tables, you usually encode the first use case instead of the stable business model.
Operational and analytical systems should not be modeled the same way by default.
| Characteristic | OLTP Modeling | OLAP Modeling |
|---|---|---|
| Primary goal | Correct transactional updates | Fast reads and aggregations |
| Typical workload | Point lookups, inserts, updates, deletes | Scans, joins, aggregates, time-series analysis |
| Data shape | Normalized | Denormalized or dimensional |
| Latency target | Milliseconds for transactions | Seconds to minutes for broad queries |
| History handling | Often current-state oriented | Usually historical and audit-friendly |
| Typical engines | Postgres, MySQL, SQL Server | Snowflake, BigQuery, Redshift, Databricks, DuckDB |
- user-facing applications
- order management
- payment processing
- inventory reservation
- systems that require transactional integrity and row-level updates
- executive reporting
- dashboard workloads with repeated multi-table joins
- feature generation over large historical windows
- self-service analytics
- recurring reporting
- BI tools
- experimentation analysis
- ML training data creation
- historical trend analysis
- write-heavy transactional services
- systems requiring complex per-row updates with low latency
Common mistake: forcing a normalized application schema into the warehouse and asking analysts to rebuild business semantics in every query.
Keys determine both integrity and usability. Weak key design creates duplicate facts, broken joins, and unstable lineage.
Primary keys uniquely identify a row within a table. In analytical platforms, they are often logical rather than physically enforced, so teams must validate them with tests.
Foreign keys express relationships between entities. Even if the warehouse does not enforce them, the model should declare them semantically and validate them in tests or contracts.
Surrogate keys are synthetic identifiers, typically integers, hashes, or UUIDs, created independently of business meaning.
Use surrogate keys when:
- natural keys are large or unstable
- multiple source systems collide on identifiers
- dimensions need versioned rows
- you need warehouse-friendly joins
Natural keys come from the business domain, such as email, order_number, or country_code.
Use natural keys when:
- the identifier is truly stable
- the business already reasons in that identifier
- cross-system portability matters
Do not confuse familiarity with stability. Email addresses, SKU names, and human-readable codes often change more than teams expect.
| Key Type | Pros | Cons | Typical Use |
|---|---|---|---|
| Natural key | Business-readable, easy for debugging | May change, may not be globally unique | Stable reference data |
| Surrogate integer | Efficient joins, compact storage | Requires mapping logic | Warehouse dimensions |
| UUID | Easy distributed generation | Large, less efficient for some workloads | Event IDs, distributed systems |
| Hash key | Useful for dedupe and integration | Collision risk, opaque to users | Data Vault hubs, lakehouse merge keys |
- using mutable business fields as primary keys
- joining facts to dimensions on descriptive text
- treating composite natural keys casually and then dropping part of the grain later
- assuming uniqueness without validating it continuously
Normalization reduces redundancy and update anomalies. Denormalization reduces join cost and improves read performance. Neither is universally better.
| Topic | Normalized Model | Denormalized Model |
|---|---|---|
| Read performance | Lower for wide analytical queries | Higher for common analytics queries |
| Write integrity | Stronger | Weaker unless carefully managed |
| Redundancy | Lower | Higher |
| Ease of change | Better for transactional domains | Better for stable reporting domains |
| Ease for analysts | Lower | Higher |
| Typical use | OLTP systems | Data marts, serving layers |
- transactional systems
- master data domains
- domains with frequent updates
- integration layers where business rules are still evolving
- BI-serving layers with repeated joins
- high-concurrency dashboard queries
- teams with many self-service analytics users
- star schemas
- aggregate marts
- feature tables
- reporting tables tuned for repeated access patterns
- volatile operational data with frequent corrections
- domains with weak source governance
- early integration stages where semantics are not stabilized
- denormalizing too early before the grain is understood
- keeping every historical attribute in one giant "gold" table
- normalizing a warehouse so aggressively that every dashboard requires ten joins
Dimensional modeling is still the default recommendation for business analytics and BI because it optimizes for understandable queries, stable metrics, and predictable performance.
Facts represent measurable business events at a specific grain.
Examples:
- one order line per row
- one click event per row
- one daily account balance per row
Good fact tables define:
- exact grain
- additive or semi-additive measures
- event timestamp
- foreign keys to dimensions
Dimensions describe the context around facts.
Examples:
- customer
- product
- merchant
- campaign
- date
Dimensions should carry descriptive attributes used for filtering, grouping, and slicing metrics.
A star schema has a central fact table with directly connected denormalized dimensions.
dim_customer dim_product dim_date
\ | /
fact_order_line
|
dim_channel
- simple for analysts
- good BI performance
- clear metric ownership
- works well with semantic layers
- some redundancy in dimensions
- less ideal for highly volatile source integration
- dashboarding
- recurring KPI reporting
- experimentation analysis
- finance and commercial reporting
- mixing multiple grains into one fact table
- using dimensions as dumping grounds for unrelated attributes
- embedding non-conformed business logic inconsistently across marts
A snowflake schema normalizes one or more dimensions into additional related tables.
Use it when:
- dimensions are very large
- attribute hierarchies are reused heavily
- you have governance reasons to centralize dimension substructures
Do not choose snowflake by default for BI. It usually increases query complexity faster than it increases value.
Historical behavior is a modeling choice, not an accident.
| SCD Type | Behavior | Use Case | Main Risk |
|---|---|---|---|
| Type 0 | Keep original value forever | Immutable attributes | Limited correction flexibility |
| Type 1 | Overwrite current value | Typos, non-historical cleanup | Loses history |
| Type 2 | New row per change with effective dates | Customer tier, product category changes | Larger dimensions and more complex joins |
| Type 3 | Add previous value columns | Limited before/after analysis | Poor scalability for many changes |
| Type 4 | Current table plus history table | Large history management separation | Dual-table complexity |
| Type 6 | Hybrid of 1, 2, and 3 | Complex reporting with current and historical views | High implementation complexity |
Bad model:
fact_orderscontains one row per order- revenue, product category, coupon code, and shipment carrier are all mixed together
- analysts later try to answer product-level and shipment-level questions from the same table
Better model:
fact_order_lineat one row per purchased itemdim_customerdim_productdim_orderdim_date
This allows revenue by product, by channel, by customer cohort, and by fulfillment path without re-deriving grain later.
Data Vault is an integration-oriented modeling approach optimized for auditability, historical traceability, and parallel ingestion from many source systems.
Core structures:
Hub: unique business keysLink: relationships between business keysSatellite: descriptive context and change history
Hub_Customer(customer_bk, customer_hk, load_ts, record_source)
Hub_Order(order_bk, order_hk, load_ts, record_source)
Link_Customer_Order(customer_hk, order_hk, link_hk, load_ts)
Sat_Customer_Profile(customer_hk, customer_name, segment, effective_ts)
Sat_Order_Status(order_hk, status, amount, effective_ts)
- many upstream systems with changing schemas
- strict lineage and audit requirements
- enterprise integration programs
- long-lived platforms where source systems will continue changing
- small teams that need dashboards quickly
- simple single-source domains
- self-service analytics as the primary outcome
- resilient to source change
- good historical traceability
- supports incremental delivery across domains
- separates business keys from descriptive context cleanly
- harder for analysts to consume directly
- more tables and more joins
- requires disciplined automation
- can create architectural ceremony if applied to small problems
- exposing raw vault structures directly to BI users
- implementing Data Vault without automation standards
- assuming Data Vault replaces dimensional marts
In practice, many enterprise teams use Data Vault for integration and lineage, then publish dimensional or domain-serving marts for consumption.
Different consumers need different shapes of truth.
| Consumer | Model Preference | Why |
|---|---|---|
| Operational application | Normalized OLTP | Transaction correctness and efficient updates |
| BI dashboards | Dimensional marts | Predictable joins and business-friendly semantics |
| Ad hoc analytics | Wide marts plus reusable conformed dimensions | Speed with enough semantic control |
| Machine learning training | Event-level tables, snapshot tables, feature views | Point-in-time correctness and reproducibility |
| Online inference | Narrow, low-latency feature serving model | Fast lookup and freshness guarantees |
Prefer:
- clear grain
- conformed dimensions
- explicit metric definitions
- marts aligned to business processes, not source applications
Prefer:
- event timestamps preserved at source grain
- point-in-time safe joins
- leakage prevention
- stable identifiers across training and serving
- features separated from labels and outcomes
Common ML mistake: training on a denormalized "latest state" table that accidentally includes future information.
Prefer:
- explicit transactional boundaries
- normalized structures
- referential integrity
- status transitions modeled as events where audit matters
Lakehouse environments change the physical implementation, not the need for good modeling.
What changes in a lakehouse:
- storage is typically file-based object storage
- tables may be backed by Delta Lake, Iceberg, or Hudi
- batch and streaming often converge onto the same table abstractions
- schema evolution is easier technically, which can tempt teams into weak governance
- bronze for landed raw data
- silver for cleaned, conformed, business-stable entities
- gold for marts, aggregates, and high-value serving tables
| Topic | Warehouse-Centric Modeling | Lakehouse-Centric Modeling |
|---|---|---|
| Storage abstraction | Managed warehouse tables | Open table formats on object storage |
| Change handling | Often warehouse-native DML | Merge/upsert plus incremental file compaction |
| Performance tuning | Distribution, sort keys, clustering | Partitioning, clustering, file size, compaction |
| Governance risk | More centralized controls | Easier sprawl if contracts are weak |
| Typical advantage | Simplicity for BI | Flexibility across BI, DS, and ML workloads |
- mixed analytics, data science, and ML workloads
- open storage format strategy
- large-scale raw and curated data retention
- multi-engine access requirements
- small analytics footprint where a managed warehouse is simpler
- organizations without strong metadata and governance discipline
Schema evolution should be intentional, versioned, and observable.
- define canonical entity grain before adding columns
- separate ingestion schemas from serving schemas
- version contracts for event payloads and external tables
- prefer additive changes over breaking changes
- make nullability a deliberate decision, not a default
- capture effective dates and ingestion timestamps
- document business definitions alongside physical schema
| Change Type | Recommendation |
|---|---|
| Add nullable column | Usually safe if semantics are clear |
| Rename column | Prefer additive migration with deprecation window |
| Change meaning of an existing field | Create a new field; do not silently repurpose |
| Split one table into many | Publish compatibility views during migration |
| Change event key or grain | Treat as a contract change and version it explicitly |
- allowing raw event producers to redefine field semantics without warning
- merging two source attributes into one warehouse column because they "look similar"
- overusing generic
metadataJSON instead of modeling governed columns
Partitioning and clustering are physical design choices. They should follow workload evidence, not vendor marketing.
Use partitioning when:
- queries prune large date or domain ranges
- ingestion naturally arrives in partitionable windows
- maintenance operations benefit from scoped rewrites
Avoid over-partitioning. Tiny partitions create metadata overhead and poor scan efficiency.
Good partition candidates:
- event date
- ingestion date
- region for very large datasets
Bad partition candidates:
- high-cardinality user IDs
- volatile fields
- fields rarely used for pruning
Clustering improves locality for frequently filtered or joined columns.
Use clustering when:
- partitions are still large
- query predicates are consistent
- engine supports effective clustering maintenance
- partitioning by timestamp at hour-level when query volumes do not justify it
- never compacting files in lakehouse tables
- choosing partition columns before understanding consumption patterns
Modeling decisions either surface or hide quality problems.
Examples:
- no declared grain leads to duplicate facts
- weak keys cause join amplification
- overloaded dimension tables create inconsistent definitions
- denormalized wide tables hide source-of-truth ownership
- late-arriving data without effective timestamps breaks historical accuracy
| Modeling Concern | Quality Check |
|---|---|
| Primary key uniqueness | Unique and not-null tests |
| Foreign key integrity | Relationship tests |
| Fact grain | Duplicate detection on grain-defining columns |
| SCD behavior | Non-overlapping effective date tests |
| Event ordering | Late-arrival and watermark monitoring |
| Schema stability | Contract validation and drift alerts |
Good models make quality checks obvious. If the testing strategy is hard to define, the model is usually underspecified.
| Topic | Dimensional Modeling | Data Vault |
|---|---|---|
| Primary goal | Business consumption and analytics | Enterprise integration and history |
| Consumer friendliness | High | Low to medium |
| Change tolerance | Moderate | High |
| Time to first dashboard | Fast | Slower |
| Auditability | Good if designed well | Excellent |
| Number of tables | Lower | Higher |
| Best fit | BI, semantic layers, reporting marts | Multi-source ingestion, lineage-heavy programs |
Choose dimensional modeling when the main problem is consistent business reporting.
Choose Data Vault when the main problem is integrating many changing source systems with traceable history.
In many enterprises, the right answer is both:
- Data Vault or equivalent in the integration layer
- dimensional marts in the presentation layer
| Topic | Normalized | Denormalized |
|---|---|---|
| Best for | Operational correctness | Analytical usability |
| Redundancy | Low | Higher |
| Write complexity | Lower anomaly risk | Higher anomaly risk |
| Read complexity | Higher | Lower |
| Change management | Stronger in transactional domains | Stronger in reporting domains |
Choose normalized structures for transactional domains and canonical integration entities.
Choose denormalized structures when the same joins are executed repeatedly by analysts, dashboards, or serving layers.
| Topic | Warehouse Modeling | Lakehouse Modeling |
|---|---|---|
| Main bias | Managed analytical simplicity | Flexible multi-engine platform |
| Governance posture | Often centralized | Must be designed explicitly |
| Change handling | More warehouse-managed | More platform-engineering responsibility |
| ML alignment | Good, but sometimes secondary | Usually better aligned with DS and ML workloads |
If the organization is analytics-first and wants a tightly managed platform, warehouse-centric modeling is often the better operating model.
If the organization needs shared access across analytics, data science, ML, and streaming use cases, lakehouse modeling often provides more leverage, but only if metadata and table-management discipline are strong.
- one giant "gold" table containing every attribute anyone ever asked for
- fact tables with mixed grains
- dimensions built from whichever source system was easiest to access
- silently changing field meaning without versioning
- exposing raw bronze data as if it were curated truth
- overusing JSON blobs instead of governing important columns
- using SCD Type 2 everywhere without a business reason
- optimizing storage layout before defining semantic ownership
Use this sequence when deciding how to model a domain:
- Define the business process or entity boundary.
- Declare the grain explicitly.
- Identify stable keys and change behavior.
- Separate canonical integration models from serving models.
- Choose the consumer shape:
- normalized for operations
- dimensional for BI
- event or feature oriented for ML
- Choose the physical design:
- partitioning
- clustering
- compaction
- incremental strategy
- Attach quality tests and ownership to the model.
- Document the contract and the reasons behind the design.
- Model business processes, not source tables.
- Preserve event time and business grain early.
- Favor simple marts over overly abstract "enterprise" schemas that no one uses.
- Do not ask analysts to reverse-engineer operational semantics from application tables.
- Do not confuse lakehouse flexibility with permission to skip modeling discipline.
Good data modeling is less about formal diagrams and more about durable decisions:
- what happened
- at what grain
- for which business entity
- with what history
- under which constraints
Dimensional modeling remains the most practical default for analytics and BI. Normalized models remain the right default for operational systems. Data Vault is valuable for integration-heavy, high-change enterprises when used with discipline and automation. Lakehouse platforms expand implementation options, but they do not remove the need for clear keys, grain, historical strategy, and governed schemas.
If a model is easy to query, hard to misuse, resilient to source change, and explicit about time and grain, it is usually on the right track.