This is the reference architecture file for the bare-metal side of TensorOS. It describes the boot path, memory layout, and major runtime subsystems as they exist in the current tree.
BIOS → multiboot_stub.asm (Multiboot1) → boot.asm → entry64.asm → kernel_main() → AI Shell
- Multiboot1 header validated (magic 0x1BADB002)
- multiboot_stub.asm: serial checkpoints, CPUID checks, builds 16 GB identity map with 2 MB huge pages (PML4 → PDPT → PD at 0x10000, 18 pages)
- Loads kernel64.bin at 0x200000, jumps to 32-bit boot.asm
- boot.asm: GDT loaded, PAE + long mode (IA32_EFER.LME) enabled
- Jump to 64-bit entry64.asm → kernel_main
kernel_main() then initializes the runtime subsystems:

- tensor_mm_init(): physical bitmap, tensor heap, model cache, slab allocator
- smp_init(): LAPIC enable, AP bootstrap via INIT-SIPI-SIPI (up to 64 CPUs)
- tensor_sched_init(): priority queues, GPU device state
- gpu_init() / tpu_init(): PCI bus scan, capability detection
- git_init(): kernel-level git object store
- tensorfs_init(): AI-aware virtual filesystem
- sandbox_init(): security subsystem
- tensor_ipc_init(): IPC channels
- virt_init(): VT-x/AMD-V detection, container support

- tensor_engine_init(): eager ops, compute graph engine
- pseudo_runtime_init(): Pseudocode JIT (lexer, parser ready)
- modelpkg_init(): model registries

- monitor_daemon_main(): background monitoring (future: separate MEU)
- deploy_init() / train_init(): services ready
- aishell_main(): interactive shell starts, banner displayed
Page tables are built by multiboot_stub.asm at physical address 0x10000 (18 pages).
The first 16 GB of physical memory is identity-mapped using 2 MB huge pages.
0x0000_0000_0000_0000 ┌──────────────────────────────┐
│ Kernel code + data │
│ (loaded at 0x200000) │
0x0000_0000_0001_0000 │ Page tables (18 pages) │
│ PML4, PDPT, 16× PD │
├──────────────────────────────┤
│ SMP trampoline (0x8000) │
│ AP stacks (65 KB each) │
├──────────────────────────────┤
│ Identity-mapped first 16 GB │
│ (2 MB huge pages) │
│ ┌────────────────────────┐ │
│ │ Tensor Heap │ │
│ │ (dynamic, 2MB pages) │ │
│ ├────────────────────────┤ │
│ │ Model Cache (LRU, 64) │ │
│ ├────────────────────────┤ │
│ │ JIT Code Pool (2 MB) │ │
│ │ W^X, max 64 buffers │ │
│ ├────────────────────────┤ │
│ │ Git Object Store │ │
│ ├────────────────────────┤ │
│ │ IPC Shared Buffers │ │
│ └────────────────────────┘ │
0x0000_0004_0000_0000 └──────────────────────────────┘ (16 GB)
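
The 18-page figure for the identity map can be checked with a little arithmetic: one PML4 page, one PDPT page, and one page directory per gigabyte (512 entries × 2 MB). The following is a minimal sketch of that calculation and of a 2 MB page-directory entry. The bit positions are the architectural x86-64 ones; the helper names `pt_pages_needed` and `pd_entry` are made up for illustration and are not taken from multiboot_stub.asm.

```c
#include <assert.h>
#include <stdint.h>

#define HUGE_PAGE_SIZE (2ULL << 20)   /* 2 MB */
#define ENTRIES        512ULL         /* entries per 4 KB paging structure */
#define GIB            (1ULL << 30)

/* Architectural x86-64 page-table entry bits. */
#define PTE_PRESENT  (1ULL << 0)
#define PTE_WRITABLE (1ULL << 1)
#define PTE_HUGE     (1ULL << 7)      /* PS bit: this PD entry maps 2 MB */

/* Pages of paging structures needed to identity-map `bytes` using
 * 2 MB pages: one PML4, one PDPT, and one PD per 1 GB mapped.
 * Hypothetical helper, not a function from the tree. */
static uint64_t pt_pages_needed(uint64_t bytes)
{
    uint64_t pd_span = ENTRIES * HUGE_PAGE_SIZE;        /* 1 GB per PD */
    uint64_t n_pd = (bytes + pd_span - 1) / pd_span;    /* round up */
    assert(n_pd <= ENTRIES);                            /* fits one PDPT */
    return 1 + 1 + n_pd;
}

/* PD entry for the i-th 2 MB page of an identity map
 * (physical address == virtual address). */
static uint64_t pd_entry(uint64_t i)
{
    return (i * HUGE_PAGE_SIZE) | PTE_PRESENT | PTE_WRITABLE | PTE_HUGE;
}
```

For 16 GB this gives 1 + 1 + 16 = 18 pages, matching the figure quoted for the table area at 0x10000.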
Actual heap/cache sizes depend on available RAM (detected from Multiboot1 mmap):
- 8 GB config: tensor heap ~4992 MB, model cache ~2976 MB
- 4 GB config: tensor heap ~256 MB, model cache ~512 MB
Tensor heap:
- Bump allocator with free-list fallback
- Coalesces adjacent free blocks
- 2 MB huge-page alignment for GPU DMA
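
As a rough illustration of the bump path with 2 MB alignment, the sketch below rounds the next free address up to a huge-page boundary before handing it out. The struct layout and the names `bump_heap_t`, `align_up`, and `bump_alloc` are assumptions made for this example. The free-list fallback and block coalescing are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define HUGE_PAGE (2ULL << 20)   /* 2 MB */

/* Hypothetical heap descriptor: [base, end) is the region, next is the
 * bump pointer. Not the layout used by tensor_mm_init(). */
typedef struct {
    uint64_t base;
    uint64_t end;
    uint64_t next;
} bump_heap_t;

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/* Bump-allocate `size` bytes at a 2 MB boundary. Returns the physical
 * address, or 0 when the region is exhausted (a real allocator would
 * fall back to its free list at that point). */
static uint64_t bump_alloc(bump_heap_t *h, uint64_t size)
{
    uint64_t start = align_up(h->next, HUGE_PAGE);
    if (size == 0 || start > h->end || size > h->end - start)
        return 0;
    h->next = start + size;
    return start;
}
```

Aligning every allocation wastes the tail of the last huge page, which is the usual trade-off for keeping DMA buffers on page boundaries.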
Model cache:
- LRU eviction with 64 entry slots
- Each entry: name hash → physical address + reference count
- Cache hit = instant model load; cache miss = load from TensorFS
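
One way such a cache could be organized is sketched below: a fixed array of 64 slots, a logical clock for recency, and eviction of the least recently used entry. Several details are assumptions made for the example and are not stated above: a hash of 0 marks an empty slot, entries with a nonzero reference count are never evicted, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 64

typedef struct {
    uint64_t name_hash;   /* 0 = empty slot (assumption) */
    uint64_t phys_addr;   /* where the model weights live */
    uint32_t refcount;    /* users currently holding the model */
    uint64_t last_used;   /* logical timestamp for LRU ordering */
} cache_entry_t;

typedef struct {
    cache_entry_t slot[CACHE_SLOTS];
    uint64_t tick;        /* incremented on every hit or insert */
} model_cache_t;

/* Hit: bump the refcount and recency, return the entry.
 * Miss: return NULL; the caller loads from TensorFS and inserts. */
static cache_entry_t *cache_lookup(model_cache_t *c, uint64_t hash)
{
    assert(hash != 0);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (c->slot[i].name_hash == hash) {
            c->slot[i].refcount++;
            c->slot[i].last_used = ++c->tick;
            return &c->slot[i];
        }
    }
    return NULL;
}

/* Insert into a free slot, else evict the least recently used
 * unreferenced entry. Returns NULL if every slot is still in use. */
static cache_entry_t *cache_insert(model_cache_t *c, uint64_t hash,
                                   uint64_t phys)
{
    cache_entry_t *victim = NULL;
    assert(hash != 0);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        cache_entry_t *e = &c->slot[i];
        if (e->name_hash == 0) { victim = e; break; }
        if (e->refcount == 0 &&
            (victim == NULL || e->last_used < victim->last_used))
            victim = e;
    }
    if (victim == NULL)
        return NULL;
    victim->name_hash = hash;
    victim->phys_addr = phys;
    victim->refcount  = 1;
    victim->last_used = ++c->tick;
    return victim;
}
```

A linear scan is cheap at 64 entries, so no hash table or linked LRU list is needed in this sketch.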
Slab allocator: 8 size classes (16, 32, 64, 128, 256, 512, 1024, 2048 bytes), used for kernel structures (MEU descriptors, scheduler queues, git objects).
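
Selecting a size class amounts to picking the smallest class that fits the request. A minimal sketch follows. The function name `slab_class_for` and the choice to return -1 for oversized requests are assumptions; the text does not say how requests above 2048 bytes are handled.

```c
#include <assert.h>
#include <stddef.h>

#define SLAB_CLASSES 8

/* The eight size classes listed above, in bytes. */
static const size_t slab_sizes[] = { 16, 32, 64, 128, 256, 512, 1024, 2048 };

static_assert(sizeof(slab_sizes) / sizeof(slab_sizes[0]) == SLAB_CLASSES,
              "expected 8 slab size classes");

/* Index of the smallest class that can hold `size` bytes, or -1 when
 * the request exceeds the largest class (hypothetical convention). */
static int slab_class_for(size_t size)
{
    for (int i = 0; i < SLAB_CLASSES; i++)
        if (size <= slab_sizes[i])
            return i;
    return -1;
}
```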
Priority Queues:
[REALTIME] ──→ Safety-critical inference (medical, autonomous)
[CRITICAL] ──→ Low-latency serving
[HIGH] ──→ Interactive inference
[NORMAL] ──→ Batch inference, training
[LOW] ──→ Background optimization
[IDLE] ──→ Model prefetching, cache warming
- THROUGHPUT: Maximize tensor ops/sec (batch-friendly)
- LATENCY: Minimize time-to-first-token
- EFFICIENCY: Minimize power consumption
- FAIR: Equal GPU time across MEUs
When assigning an MEU to a GPU, the scheduler computes:
score = (available_VRAM × 4)
+ ((100 - utilization%) × 2)
+ ((100 - temperature°C) × 1)
+ (weight_locality_bonus × 8)
Weight locality bonus rewards GPUs that already have the model's weights cached.
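
The scoring formula above translates directly into code. The sketch below is illustrative only: the struct fields, the names `gpu_score` and `pick_gpu`, and the choice to break ties in favor of the lower index are assumptions. The units of available_VRAM are not stated in the text, so the value is treated as an opaque number.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-GPU snapshot used only for this example. */
typedef struct {
    uint32_t available_vram;        /* units not specified in the text */
    uint32_t utilization_pct;       /* 0..100 */
    uint32_t temperature_c;         /* degrees Celsius */
    uint32_t weight_locality_bonus; /* > 0 if the weights are cached */
} gpu_state_t;

/* score = VRAM*4 + (100 - util)*2 + (100 - temp)*1 + locality*8.
 * Signed arithmetic, because a GPU above 100 °C yields a negative
 * temperature term. */
static int64_t gpu_score(const gpu_state_t *g)
{
    assert(g->utilization_pct <= 100);
    return (int64_t)g->available_vram * 4
         + (100 - (int64_t)g->utilization_pct) * 2
         + (100 - (int64_t)g->temperature_c) * 1
         + (int64_t)g->weight_locality_bonus * 8;
}

/* Index of the highest-scoring GPU, or -1 if there are none.
 * Ties go to the lower index. */
static int pick_gpu(const gpu_state_t *gpus, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (best < 0 || gpu_score(&gpus[i]) > gpu_score(&gpus[best]))
            best = i;
    return best;
}
```

Because locality carries the largest weight, a GPU with slightly less free VRAM can still win when it already holds the model's weights.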
If multiple MEUs request the same operation (e.g., matmul with same shapes), the scheduler coalesces them into a single batched GPU dispatch for throughput.
The intermediate representation used by the Pseudocode JIT and tensor engine:
| Opcode | Description |
|---|---|
| TIR_LOAD | Load tensor from memory |
| TIR_STORE | Store tensor to memory |
| TIR_MATMUL | Matrix multiplication |
| TIR_ADD / MUL / DIV / SUB | Elementwise arithmetic |
| TIR_RELU / GELU / SILU / TANH | Activations |
| TIR_SOFTMAX | Softmax normalization |
| TIR_LAYERNORM | Layer normalization |
| TIR_ATTENTION | Fused multi-head attention |
| TIR_CONV2D | 2D convolution |
| TIR_POOL | Pooling (max/avg) |
| TIR_EMBEDDING | Embedding lookup |
| TIR_TRANSPOSE | Tensor transpose |
| TIR_RESHAPE | Tensor reshape |
| TIR_CONCAT | Tensor concatenation |
| TIR_SPLIT | Tensor split |
| TIR_REDUCE | Reduction (sum/mean/max) |
| TIR_CAST | Dtype conversion |
| TIR_ALLOC / FREE | Memory management |
| TIR_BRANCH / CALL / RET | Control flow |
- Op Fusion: MATMUL → ADD (bias) → RELU fused into single kernel
- Precision Auto-Downgrade: FP32 → FP16 for compute, FP32 for accumulation
- Dead Code Elimination: Remove unused tensor computations
- Memory Planning: Static allocation of tensor buffers, reuse across ops
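
To make the op-fusion item concrete, the sketch below shows what a fused MATMUL → ADD (bias) → RELU kernel computes: the bias seeds the accumulator and the activation is applied before the single store, so no intermediate tensor is written. This is a plain scalar reference written for illustration. The function name and row-major layout are assumptions, and it is not the code the optimizer emits.

```c
#include <assert.h>
#include <stddef.h>

/* out[m x n] = relu(a[m x k] * b[k x n] + bias[n]), row-major.
 * Each output element is produced in one pass instead of three
 * separate ops with two temporary tensors in between. */
static void fused_matmul_bias_relu(const float *a, const float *b,
                                   const float *bias, float *out,
                                   size_t m, size_t k, size_t n)
{
    assert(a && b && bias && out);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = bias[j];                 /* ADD folded in */
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            out[i * n + j] = acc > 0.0f ? acc : 0.0f;   /* RELU */
        }
    }
}
```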
Three sandbox policies:
| Policy | Tensor Ops | GPU | Network | Filesystem | IPC |
|---|---|---|---|---|---|
| STRICT | Allowed | No | No | No | No |
| STANDARD | Allowed | Allowed | Read-only | Read-only | Allowed |
| PERMISSIVE | Allowed | Allowed | Allowed | Read/Write | Allowed |
- Every MEU has a permission bitmask (11 flags)
- Audit ring buffer logs all security-relevant operations
- Deterministic mode: fixed random seeds, no timing side channels
- Resource accounting: per-MEU memory and compute time tracking
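
A permission bitmask check of this kind might look like the sketch below. The flag names and bit positions are hypothetical: the text says there are 11 flags but does not list them, so only the seven needed to express the policy table are defined here. The policy masks follow the table row by row.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags covering the columns of the policy table. */
enum {
    PERM_TENSOR    = 1u << 0,
    PERM_GPU       = 1u << 1,
    PERM_NET_READ  = 1u << 2,
    PERM_NET_WRITE = 1u << 3,
    PERM_FS_READ   = 1u << 4,
    PERM_FS_WRITE  = 1u << 5,
    PERM_IPC       = 1u << 6
};

typedef enum {
    POLICY_STRICT,
    POLICY_STANDARD,
    POLICY_PERMISSIVE
} sandbox_policy_t;

/* Default permission mask for each policy, per the table. */
static uint32_t policy_mask(sandbox_policy_t p)
{
    switch (p) {
    case POLICY_STRICT:
        return PERM_TENSOR;
    case POLICY_STANDARD:
        return PERM_TENSOR | PERM_GPU | PERM_NET_READ |
               PERM_FS_READ | PERM_IPC;
    case POLICY_PERMISSIVE:
        return PERM_TENSOR | PERM_GPU | PERM_NET_READ | PERM_NET_WRITE |
               PERM_FS_READ | PERM_FS_WRITE | PERM_IPC;
    }
    assert(!"unknown sandbox policy");
    return 0;
}

/* An operation is allowed only if every required bit is set. */
static bool meu_allowed(uint32_t mask, uint32_t required)
{
    return (mask & required) == required;
}
```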
- BSP enables LAPIC, copies trampoline to physical 0x8000
- Sends INIT-SIPI-SIPI to all APs (up to MAX_CPUS=64)
- Each AP: enters real mode at 0x8000, transitions to protected → long mode, gets a 65 KB stack, increments smp.ap_started, enters idle loop
BSP AP 0 AP 1 AP 2
│ │ │ │
├─ smp_dispatch(fn) ──►│ wake via IPI 0xFE │ │
│ ├─ execute fn(arg) │ │
├─ smp_dispatch(fn) ──►│ ├─ execute fn(arg) │
├─ smp_dispatch(fn) ──►│ │ ├─ execute fn(arg)
│ (BSP does own share)│ │ │
├─ smp_wait_all() ─────┤─── barrier ─────────┤─── barrier ─────────┤
│ │ │ │
When ncpu > 1 && out_dim >= 64, GEMV rows are partitioned across CPUs:
- CPU c processes rows [c * rows_per_cpu, (c+1) * rows_per_cpu)
- BSP executes its share directly, APs receive work via smp_dispatch()
- All CPUs join via smp_wait_all() before the result is used
- Supports both Q4_0 and Q8_0 fused AVX2 GEMV paths
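
The row split can be sketched as a small helper that returns the half-open range for a given CPU. How rows_per_cpu is derived and how a remainder is handled are not stated above. This sketch assumes a ceiling division with the last range clamped to out_dim, and it gives all rows to CPU 0 when the parallel condition does not hold. The names `row_range_t` and `gemv_rows_for_cpu` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t begin;   /* first row, inclusive */
    uint32_t end;     /* last row, exclusive */
} row_range_t;

/* Rows of a GEMV with out_dim output rows assigned to CPU c of ncpu. */
static row_range_t gemv_rows_for_cpu(uint32_t out_dim, uint32_t ncpu,
                                     uint32_t c)
{
    row_range_t r = { 0, 0 };
    assert(ncpu > 0 && c < ncpu);

    /* Serial path: small outputs or a single CPU run entirely on CPU 0. */
    if (!(ncpu > 1 && out_dim >= 64)) {
        if (c == 0)
            r.end = out_dim;
        return r;
    }

    /* Assumption: ceiling division, so the last CPUs may get fewer
     * rows (or none) when out_dim is not a multiple of ncpu. */
    uint32_t rows_per_cpu = (out_dim + ncpu - 1) / ncpu;
    uint64_t b = (uint64_t)c * rows_per_cpu;
    uint64_t e = b + rows_per_cpu;
    r.begin = (uint32_t)(b < out_dim ? b : out_dim);
    r.end   = (uint32_t)(e < out_dim ? e : out_dim);
    return r;
}
```

With out_dim = 3072 and four CPUs, each CPU receives 768 contiguous rows, so every core streams a disjoint slice of the weight matrix.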
- 2 MB executable code pool with W^X protection
- Max 64 concurrent JIT buffers
- Full x86_64 instruction encoder: REX, ModR/M, SIB, SSE2, AVX2 opcodes
- Register allocator using System V calling convention
JIT kernels are compiled lazily on the first inference call. The kernel cache holds up to 32 entries.
| Kernel | Op | Vector Size | Used For |
|---|---|---|---|
| vadd | a[i] + b[i] | dim (3072) | Residual connections |
| dot | Σ a[i]×b[i] | head_dim (96) | Attention score computation |
| axpy | a[i] + α×b[i] | head_dim (96) | Attention value accumulation |
| fused_silu_mul | silu(a[i])×b[i] | ff_dim (8192) | FFN gate ⊙ up projection |
| rope | Rotary encoding | head_dim (96) | Position encoding for Q/K |
| rmsnorm | RMS normalize | dim (3072) | Layer normalization |
Softmax is not JIT-compiled because sequence length varies per token.