Deal with non-numeric values such as NaN, Inf, and Null at the simulation output stage, where data goes from memory to Parquet files #3

Description

@svdhoog
  1. The Native Simulation Output Stage (Strategic / Long-Term)

As the simulation is updated to stream Parquet files directly, non-numeric values must be treated at the exact moment the data is generated in memory, before it hits the disk.

Why at this stage: Arrow and Parquet handle missing and special numeric values natively. Nulls are tracked separately from the values (a validity bitmap in Arrow memory, definition levels in the Parquet file), and infinity and NaN are ordinary IEEE 754 floating-point values. Storing these flags as raw text strings (e.g., writing the literal string "NaN" or "inf") degrades the schema, forces the whole column into a resource-intensive string type, and breaks vectorized processing.

How to treat them:

    True Empty/Missing States: Map empty or missing simulation metrics directly to PyArrow / Parquet null entries. Arrow tracks nulls in a validity bitmap and Parquet encodes them compactly with definition levels, so missing data takes very little disk space and needs no string parsing when loaded.

    Mathematical Boundaries: Represent actual simulation limits (e.g., division by zero in liquidity ratios) using native IEEE 754 values via float('inf'), float('-inf'), or float('nan'). Written via Arrow, these are stored as ordinary values in a float column, so the column schema never shifts to a string type.
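The two rules above could be applied at the point where a metric is computed, roughly as in the following sketch. The function `liquidity_ratio` and its parameters are hypothetical names chosen for illustration; the actual simulation code may structure this differently. Note that plain Python raises `ZeroDivisionError` on float division by zero, so the boundary cases are handled explicitly.

```python
import math
from typing import Optional


def liquidity_ratio(
    liquid_assets: Optional[float], liabilities: Optional[float]
) -> Optional[float]:
    """Hypothetical helper: return a ratio, None, or an IEEE 754 special value."""
    # Missing input: the metric does not exist, so emit a null (None).
    if liquid_assets is None or liabilities is None:
        return None
    if liabilities == 0:
        # 0 / 0 is undefined: NaN.
        if liquid_assets == 0:
            return math.nan
        # x / 0 with x != 0: signed infinity matching the numerator.
        return math.copysign(math.inf, liquid_assets)
    return liquid_assets / liabilities


# Example inputs covering the regular, missing, and boundary cases.
samples = [(10.0, 4.0), (None, 4.0), (5.0, 0.0), (-5.0, 0.0), (0.0, 0.0)]
ratios = [liquidity_ratio(a, b) for a, b in samples]
```

A list such as `ratios` can be handed to `pyarrow.array` as is: the `None` entry becomes a null and the remaining entries stay in a float column.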
