VCD parser memory inefficiency for large simulations #46

@snirqm

Description

Problem

The current TraceVcd parser expands VCD data into a dense representation where every signal stores a value at every timestamp. This becomes a significant memory bottleneck for large simulations.

Example

For a simulation with:

  • 1,000 signals
  • 50,000 timestamps
  • ~1% signal change rate (typical for most hardware signals)

Current behavior: Allocates ~50 million entries in memory

Actual data: Only ~500K value changes exist in the VCD file

This is a 100x memory overhead for storing redundant unchanged values.
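The overhead figures above follow from a quick back-of-the-envelope calculation (variable names here are purely illustrative):

```python
signals = 1_000
timestamps = 50_000
change_rate = 0.01  # ~1% of signals change per timestamp

dense_entries = signals * timestamps               # every signal at every timestamp
sparse_entries = int(dense_entries * change_rate)  # only actual value changes
overhead = dense_entries // sparse_entries

print(dense_entries)   # 50,000,000 entries in the dense representation
print(sparse_entries)  # ~500,000 changes actually present in the VCD
print(overhead)        # 100x redundancy
```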

Root Cause

In vcd.py, the parser copies all previous values at each timestamp:

elif first_char == '#':
    time = int(tokens[i][1:])
    # copy old values - THIS IS THE PROBLEM
    for id in self.all_ids:
        self.data[id].append(self.data[id][-1])

Impact

  • Unable to load large VCD files that would otherwise fit in memory
  • Slow loading times due to excessive memory allocation
  • Poor cache locality from oversized data structures

Proposed Solution

Implement an alternative sparse VCD parser that:

  1. Stores only actual value changes per signal: {signal: ([change_indices], [values])}
  2. Uses binary search for O(log n) value lookup at any timestamp
  3. Maintains full API compatibility with existing TraceVcd
  4. Is opt-in via a sparse=True parameter to avoid breaking changes
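A minimal sketch of the proposed per-signal sparse storage and binary-search lookup (the `SparseTrace`, `append_change`, and `value_at` names are hypothetical, not part of the existing TraceVcd API):

```python
from bisect import bisect_right

class SparseTrace:
    """Sparse per-signal storage: parallel lists of change times and values."""

    def __init__(self):
        self.times = []   # timestamps at which the signal changed, ascending
        self.values = []  # value taken at the corresponding timestamp

    def append_change(self, time, value):
        # Store a new entry only when the value actually differs,
        # instead of copying every previous value at every timestamp.
        if not self.values or self.values[-1] != value:
            self.times.append(time)
            self.values.append(value)

    def value_at(self, time):
        # Binary search for the last change at or before `time`: O(log n).
        i = bisect_right(self.times, time) - 1
        if i < 0:
            raise KeyError(f"no value recorded at or before t={time}")
        return self.values[i]
```

Usage: after `trace.append_change(0, "0")` and `trace.append_change(10, "1")`, a lookup like `trace.value_at(5)` returns `"0"` because the signal last changed at t=0; memory grows with the number of changes, not the number of timestamps.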

Expected Benefits

Metric               | Dense (current)          | Sparse (proposed)
---------------------|--------------------------|------------------
Memory               | O(signals × timestamps)  | O(total changes)
Typical compression  | 1x                       | 50-200x
Access time          | O(1)                     | O(log n)

Environment

  • WAL version: 0.8.6
  • Python: 3.12
  • OS: macOS/Linux
