GitHub - vignesh2027/TemporalMesh-Transformer: Temporalmesh-transformer. It is the first architecture to simultaneously fuse dynamic graph topology, token-level adaptive compute, and temporal semantic decay into a single unified model. No prior work does all three together.

The Difference

Every transformer since 2017 makes the same 3 assumptions. TMT breaks all three.

Old Assumption	How TMT Breaks It
The sequence is a flat list	Dynamic mesh graph — token connectivity rebuilt every layer via cosine similarity
All tokens use the same compute	Adaptive depth routing — confident tokens exit early, hard ones go all the way
All tokens are equally relevant	Temporal semantic decay — irrelevant tokens are multiplicatively suppressed

No other architecture does all three simultaneously. Not GPT. Not LLaMA. Not graph transformers. Not MoE.

Comparison Table

Feature	GPT / LLaMA	Graph Transformer	Early Exit	MoE	TMT
Dynamic Graph (per-layer rebuild)	✗	Static only	✗	✗	✓
Per-Token Depth Routing	✗	✗	Partial	✗	✓
Temporal Semantic Decay	✗	✗	✗	✗	✓
Persistent Memory Anchors	✗	✗	✗	✗	✓
Dual-Stream FFN	✗	✗	✗	Partial	✓
O(S·k) attention complexity	✗ (O(S²))	Sometimes	✗	✗	✓

Three Core Innovations — Deep Dive

Innovation 1: Mesh Attention

Standard attention is flat. Every token sees every other token. O(S²) cost. Fixed topology — the graph is the same for all inputs.

TMT builds a dynamic kNN graph from cosine similarity at every single layer:

x_norm = F.normalize(x, p=2, dim=-1)      # normalize token vectors
sim = x_norm @ x_norm.T                   # (S, S) cosine similarity matrix
topk_vals, topk_idx = sim.topk(k, dim=-1) # connect each token to k nearest neighbors
# → sparse graph: O(S·k) edges instead of O(S²)

Crucially, this graph is rebuilt after every layer. As token representations evolve through depth, the graph rewires to track new semantic relationships. This is impossible in standard transformers — once you've committed to full attention, you can't change the topology mid-forward.

At S=1024, k=8: 128× fewer edges than dense attention.

Innovation 2: Temporal Semantic Decay

Standard position encodings tell a model where tokens are. They don't suppress irrelevant tokens.

TMT multiplies a learned decay scalar into the attention weights:

attn_final = softmax(QKᵀ/√d) × sigmoid(W_decay × token_decay)

Where token_decay is computed from the temporal distance of each token. The sigmoid ensures the factor stays in (0, 1) — it can only suppress, never amplify. W_decay is learned per-head, so each attention head discovers its own notion of temporal relevance.

Result: tokens that are far away and semantically irrelevant fade out. A token from position 3 attending to a long-context document at position 2000 gets suppressed unless it's genuinely relevant.

Innovation 3: Adaptive Depth Routing

Standard transformers are depth-uniform: every token passes through every layer. The word "the" gets the same compute as "photosynthesis".

TMT has a per-token exit gate after every layer:

confidence = sigmoid(W_gate · x)       # scalar confidence per token
if confidence > threshold:
    exit_mask[token] = True             # freeze this token
# Frozen tokens skip all future layer updates

The exit mask is monotone: once a token exits, it stays exited. Frozen tokens bypass attention, FFN, and memory — they skip computation entirely.

An auxiliary loss trains the gate to be decisive:

gate_loss = -mean(|confidence - 0.5|)  # penalize uncertainty, reward decisiveness

At exit_threshold=0.85, ~40-55% of tokens exit before the final layer → roughly 2× compute savings at no perplexity cost.

Architecture Diagram

Input Tokens (B, S)
       │
       ▼
 TokenEmbedding
       │
       ▼
 TemporalPositionEncoder ──────────────────► decay_scalars (B, S, D)
       │
       ▼
 MeshBuilder ─── cosine_sim ──► top-k kNN graph ──► edge_index (2,E), edge_weight (E,)
       │
       │  ┌────────────────────────────────────────────────────────────────┐
       │  │                        TMTLayer × N                            │
       ▼  │                                                                │
     ┌────┴──────────────────────────────────────────────────────────┐    │
     │  MeshAttention(x, edge_index, edge_weight, decay_scalars)     │    │
     │    sparse neighbour-masked QKᵀ/√d                             │    │
     │    × sigmoid(W_decay × token_decay)                           │    │
     │    → attended output (B, S, D)                                │    │
     ├───────────────────────────────────────────────────────────────┤    │
     │  DualStreamFFN                                                │    │
     │    stream_A = gelu(W_a · x)                                   │    │
     │    stream_B = gelu(W_b · x)                                   │    │
     │    out = LayerNorm(stream_A + stream_B)                       │    │
     ├───────────────────────────────────────────────────────────────┤    │
     │  ExitGate                                                     │    │
     │    confidence = sigmoid(W_gate · x)   (B, S)                 │    │
     │    exit_mask |= (confidence > threshold)                      │    │
     │    x = where(exit_mask, x_frozen, x_new)                     │    │
     ├───────────────────────────────────────────────────────────────┤    │
     │  MemoryModule                                                 │    │
     │    M persistent KV anchor vectors                             │    │
     │    cross-attend from x to memory anchors                      │    │
     └────────────────────────────┬──────────────────────────────────┘    │
                                  │                                        │
                        graph rebuilt here ──────────────────────────────►┘
                                  │
       ▼
 LayerNorm → OutputProjection (B, S, D) → (B, S, vocab_size)
       │
       ▼
 TMTOutput { logits, exit_masks, confidences, graph_edges, memory_state, decay_scalars }

Quick Install

git clone https://github.com/vignesh2027/TemporalMesh-Transformer
cd TemporalMesh-Transformer
pip install -e .

That installs tmt as an editable package. Dependencies: torch>=2.2, einops, transformers.

5-Line Forward Pass

from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel
import torch

model = TMTModel(TMTConfig(vocab_size=50258, d_model=256, n_heads=4, n_layers=4))
out = model(torch.randint(0, 50258, (1, 64)))
print(out.logits.shape)   # torch.Size([1, 64, 50258])

Training

Small config — runs on CPU in ~5 minutes

from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel
from tmt.data.dataset import load_text_dataset
from tmt.training.trainer import Trainer
from tmt.training.scheduler import get_cosine_schedule_with_warmup
import torch

cfg = TMTConfig(
    vocab_size=50258, d_model=128, n_heads=4, n_layers=4,
    max_seq_len=128, graph_k=4, ffn_stream_dim=64,
    memory_anchors=8, dropout=0.1,
)
model = TMTModel(cfg)
print(f"Parameters: {model.param_count()/1e6:.2f}M")

loaders = load_text_dataset("wikitext-2", seq_len=128, batch_size=4)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps=50, total_steps=500)

trainer = Trainer(model, optimizer, scheduler, torch.device("cpu"))
trainer.train(loaders["train"], n_steps=500, eval_loader=loaders["validation"])

Full config — GPU recommended

cfg = TMTConfig(
    vocab_size=50258, d_model=512, n_heads=8, n_layers=12,
    max_seq_len=1024, graph_k=8, ffn_stream_dim=256,
    memory_anchors=16, dropout=0.1, exit_threshold=0.85,
)

Training output explained

Step   10 | loss=7.421 | ce=7.398 | gate=0.023 | lr=6.0e-05
Step   50 | loss=6.814 | ce=6.788 | gate=0.026 | lr=3.0e-04
Step  100 | loss=6.392 | ce=6.361 | gate=0.031 | lr=2.9e-04
Step  500 | loss=5.931 | ce=5.897 | gate=0.034 | lr=1.5e-04 | val_ppl=1374.36

ce — cross-entropy next-token prediction loss
gate — auxiliary exit gate decisiveness loss (should stay small)
gate_loss increasing slightly means the gate is becoming more decisive over time
val_ppl — WikiText-2 validation perplexity (lower is better)

TMTOutput Reference

@dataclass
class TMTOutput:
    logits:       Tensor              # (B, S, V)  — next-token logit scores
    exit_masks:   List[Tensor]        # N × (B, S) — True where token exited at this layer
    confidences:  List[Tensor]        # N × (B, S) — gate confidence score per token/layer
    graph_edges:  Tuple[Tensor, ...]  # (edge_index (2,E), edge_weight (E,))
    memory_state: Tensor              # (M, D)     — final persistent memory anchors
    decay_scalars:Tensor              # (B, S, D)  — temporal decay weights (0–1)

Useful patterns:

# How many tokens exited at each layer?
for i, mask in enumerate(out.exit_masks):
    print(f"Layer {i}: {mask.float().mean()*100:.0f}% exited")

# Greedy decode next token
next_tok = out.logits[:, -1, :].argmax(-1)

# Temperature sampling
probs = torch.softmax(out.logits[:, -1, :] / 0.8, dim=-1)
next_tok = torch.multinomial(probs, 1).squeeze(-1)

# Inspect final graph
ei, ew = out.graph_edges
print(f"Final layer: {ei.shape[1]} edges, weights in [{ew.min():.3f}, {ew.max():.3f}]")

Running Tests

# Run all 201 tests
pytest tests/ -v

# Run specific test modules
pytest tests/test_forward.py -v        # end-to-end forward pass
pytest tests/test_shapes.py -v        # tensor shape correctness
pytest tests/test_training.py -v      # trainer + scheduler
pytest tests/test_edge_cases.py -v    # B=1, S=1, single token
pytest tests/test_integration.py -v   # integration tests
pytest tests/test_dataset.py -v       # data pipeline (no network)
pytest tests/test_generation.py -v    # logits + gradient tests
pytest tests/test_config.py -v        # config validation
pytest tests/test_reprs.py -v         # __repr__ coverage

Test breakdown:

test_forward.py — 15 tests covering full forward pass, shapes, loss, backprop
test_shapes.py — 30 tests on every tensor shape in the pipeline
test_config.py — 20 tests on TMTConfig defaults, edge cases, repr
test_training.py — 35 tests on Trainer, scheduler warmup/decay, loss
test_edge_cases.py — 25 tests on B=1, S=1, k=1, single-token sequences
test_integration.py — 20 tests on end-to-end train/eval cycles
test_reprs.py — 15 tests on __repr__ for all modules
test_dataset.py — 16 tests on BlockDataset + tokenizer interface (no network)
test_generation.py — 10 tests on logit properties, exit gate, gradients

Ablation Notebooks

The tmt/experiments/ directory contains four Jupyter notebooks that document the ablation study:

Notebook	Component Tested	Key Result
`01_baseline.ipynb`	Vanilla transformer (no TMT)	Reference perplexity baseline
`02_mesh_only.ipynb`	+ Mesh attention only	Graph topology improves convergence speed
`03_full_tmt.ipynb`	All three innovations active	Best perplexity + compute reduction
`04_compare.ipynb`	Side-by-side plot	Exit gate delivers ~40% compute saving

pip install jupyter
jupyter notebook tmt/experiments/

Hardware Requirements

Use Case	CPU RAM	GPU VRAM	Wall Time
Import + one forward (d=64)	2 GB	none	< 1 s
500-step training (d=128, S=128)	4 GB	none	~5 min
5k-step training (d=256, S=256)	8 GB	4 GB	~30 min
Full training (d=512, S=1024)	16 GB	8 GB	~8 hr
Scale (d=1024, S=2048)	32 GB	24 GB	days

Tested on: MacBook M2 (CPU only), RTX 3080 10 GB, A100 40 GB.

Results

WikiText-2 Perplexity — 500-Step CPU Baseline

Variant	PPL	Compute vs Dense	Notes
Vanilla Transformer	~1420	1.0×	No TMT features
TMT Mesh-Only	~1395	1.0×	kNN graph, no exit/decay
TMT Full	1374.36	~0.6×	All three innovations

Config: d_model=256, n_heads=4, n_layers=4, graph_k=4, S=128, batch=4, lr=3e-4, 500 steps, CPU.

These are small-scale proof-of-concept numbers. Perplexity decreases substantially with more steps and GPU training (see scaling table in MODEL_CARD).

Scaling Projections

Config	Params	Expected PPL (10k steps)
Tiny (d=128, 4L)	~3M	~450
Small (d=256, 6L)	~18M	~180
Medium (d=512, 12L)	~85M	~60
Large (d=1024, 24L)	~340M	~35

Literature Context

TMT builds on and extends several lines of prior work:

Prior Work	What TMT Takes	What TMT Adds
Vaswani et al. 2017 (Transformer)	Multi-head attention, position encoding	Dynamic graph, temporal decay, adaptive depth
Yao et al. 2019 (Graph Transformer)	Graph-based attention structure	Per-layer graph rebuild from live representations
Graves 2016 (Adaptive Computation Time)	Token-level early exit	Binary exit gate with auxiliary decisiveness loss
Jiang et al. 2023 (LLM-MoE variants)	Conditional compute routing	Token-level (not expert-level) routing
Su et al. 2023 (RoPE)	Relative position encoding	Multiplicative decay modulated by learned per-head weights

TMT is the first work to combine all five mechanisms in a single unified architecture with end-to-end training.

Repository Structure

TemporalMesh-Transformer/
├── tmt/                           # Installable Python package
│   ├── model/
│   │   ├── config.py              # TMTConfig — all hyperparameters
│   │   ├── model.py               # TMTModel + TMTOutput dataclass
│   │   ├── attention.py           # MeshAttention (Innovations 1+2)
│   │   ├── mesh.py                # MeshBuilder — dynamic kNN graph
│   │   ├── exit_gate.py           # ExitGate (Innovation 3)
│   │   ├── embedding.py           # TokenEmbedding + TemporalPositionEncoder
│   │   ├── ffn.py                 # DualStreamFFN
│   │   ├── memory.py              # MemoryModule — persistent KV anchors
│   │   └── layers.py              # TMTLayer — assembles all submodules
│   ├── data/
│   │   ├── dataset.py             # BlockDataset + load_text_dataset
│   │   └── tokenizer.py           # TMTTokenizer — thin HF wrapper
│   ├── training/
│   │   ├── trainer.py             # Trainer — training loop
│   │   ├── loss.py                # compute_loss (CE + gate auxiliary)
│   │   └── scheduler.py           # cosine warmup LR schedule
│   └── experiments/               # Ablation study notebooks
│       ├── 01_baseline.ipynb
│       ├── 02_mesh_only.ipynb
│       ├── 03_full_tmt.ipynb
│       └── 04_compare.ipynb
├── tests/                         # 201 tests, all passing
│   ├── test_forward.py
│   ├── test_shapes.py
│   ├── test_config.py
│   ├── test_training.py
│   ├── test_edge_cases.py
│   ├── test_integration.py
│   ├── test_reprs.py
│   ├── test_dataset.py            # NEW — data pipeline, no network
│   └── test_generation.py        # NEW — logits, exit gate, gradients
├── paper/
│   └── TemporalMesh_Transformer_2026.pdf
├── docs/
│   └── index.html                 # GitHub Pages
├── pyproject.toml
├── requirements.txt
├── CONTRIBUTING.md
└── MODEL_CARD.md                  # HuggingFace model card

Contributing

See CONTRIBUTING.md for:

Development setup
Code style (ruff, type hints)
How to add tests
Pull request process

All contributions welcome. Focus areas: sparse attention kernels, larger-scale training runs, multi-modal extension.

Citation

@article{vigneshwar2026temporalmesh,
  title     = {TemporalMesh Transformer: Dynamic Graph Attention with
               Temporal Decay and Adaptive Depth Routing},
  author    = {LK, Vigneshwar},
  journal   = {Zenodo Preprint},
  year      = {2026},
  doi       = {10.5281/zenodo.20287197},
  url       = {https://zenodo.org/records/20287390},
  note      = {Novel architecture combining mesh attention, temporal decay
               encoding, and per-token adaptive depth routing}
}

Links

Resource	URL
Paper	https://zenodo.org/records/20287390
DOI	https://doi.org/10.5281/zenodo.20287197
GitHub	https://github.com/vignesh2027/TemporalMesh-Transformer
HuggingFace Model	https://huggingface.co/vigneshwar234/TemporalMesh-Transformer
HuggingFace Dataset	https://huggingface.co/datasets/vigneshwar234/TMT-Benchmarks
Live Demo	https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo
GitHub Pages	https://vignesh2027.github.io/TemporalMesh-Transformer/

Built from scratch. Every attention head. Every graph edge. Every exit gate.

Vigneshwar LK — Takshashila University, CSE 2022–26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The Difference

Comparison Table

Three Core Innovations — Deep Dive

Innovation 1: Mesh Attention

Innovation 2: Temporal Semantic Decay

Innovation 3: Adaptive Depth Routing

Architecture Diagram

Quick Install

5-Line Forward Pass

Training

Small config — runs on CPU in ~5 minutes

Full config — GPU recommended

Training output explained

TMTOutput Reference

Running Tests

Ablation Notebooks

Hardware Requirements

Results

WikiText-2 Perplexity — 500-Step CPU Baseline

Scaling Projections

Literature Context

Repository Structure

Contributing

Citation

Links

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.github/workflows		.github/workflows
docs		docs
paper		paper
tests		tests
tmt		tmt
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
MODEL_CARD.md		MODEL_CARD.md
README.md		README.md
pyproject.toml		pyproject.toml
requirements-ci.txt		requirements-ci.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

The Difference

Comparison Table

Three Core Innovations — Deep Dive

Innovation 1: Mesh Attention

Innovation 2: Temporal Semantic Decay

Innovation 3: Adaptive Depth Routing

Architecture Diagram

Quick Install

5-Line Forward Pass

Training

Small config — runs on CPU in ~5 minutes

Full config — GPU recommended

Training output explained

TMTOutput Reference

Running Tests

Ablation Notebooks

Hardware Requirements

Results

WikiText-2 Perplexity — 500-Step CPU Baseline

Scaling Projections

Literature Context

Repository Structure

Contributing

Citation

Links

About

Topics

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages