Commit 9dbb855 (parent: 694451e)

Release v0.2.2: Add batch_size, update metadata & docs

File tree

9 files changed: +114, -117 lines changed

Cargo.lock

Lines changed: 1 addition & 1 deletion (generated file; diff not rendered)

Cargo.toml

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
 [package]
 name = "phaeton"
-version = "0.1.2"
+version = "0.2.2"
 edition = "2021"
-authors = ["Your Name <zahraandzakiits@gmail.com>"]
+authors = ["Zahraan Dzakii Tsaqiif <zahraandzakiits@gmail.com>"]
 description = "A high-performance Python library for preprocessing and sanitizing raw data streams, accelerated by Rust."
 license = "MIT"

README.md

Lines changed: 64 additions & 86 deletions
@@ -5,114 +5,89 @@
 [![Rust](https://img.shields.io/badge/built%20with-Rust-orange)](https://www.rust-lang.org/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
-> ⚠️ **Project Status:** Phaeton is currently in **Experimental Beta (v0.2.0)**.
+> ⚠️ **Project Status:** Phaeton is currently in **Experimental Beta (v0.2.1)**.
 > The core streaming engine is functional, but the library is currently under limited maintenance due to the author's personal schedule.
 
 
 **Phaeton** is a specialized, Rust-powered preprocessing engine designed to sanitize raw data streams before they reach your analytical environment.
 
-It acts as the strictly typed **"Gatekeeper"** of your data pipeline. Unlike traditional DataFrame libraries that load entire datasets into RAM, Phaeton employs a **zero-copy streaming architecture**. It processes data chunk-by-chunk, filtering noise, fixing encodings, and standardizing formats, ensuring **O(1) memory complexity**.
+It acts as the strictly typed **"Gatekeeper"** of your data pipeline. Unlike traditional DataFrame libraries that attempt to load entire datasets into RAM, Phaeton employs a **zero-copy streaming architecture**. It processes data chunk-by-chunk, filtering noise, fixing encodings, and standardizing formats, ensuring **O(1) memory complexity** relative to file size.
 
-This allows you to process massive datasets (GBs/TBs) on standard hardware without memory spikes, delivering clean, high-quality data to downstream tools like Pandas, Polars, or ML models.
+This allows you to process massive datasets on standard hardware without memory spikes, delivering clean, high-quality data to downstream tools like Pandas, Polars, or ML models.
 
 > **The Philosophy:** Don't waste memory loading garbage. Clean the stream first, then analyze the gold.
 
 ---
 
-## 🚀 Key Features
+## Key Features
 
-* **Streaming Architecture:** Processes files chunk-by-chunk. Memory usage remains flat and low regardless of file size.
-* **Parallel Execution:** Utilizes all CPU cores via Rayon (Rust) for heavy lifting (Regex, Fuzzy Matching).
-* **Strict Quarantine:** Bad data isn't just dropped; it's quarantined into a separate file with a generated `_phaeton_reason` column for auditing.
-* **Smart Casting:** Automatically handles messy currency formats (e.g., `"$ 5.000,00"` -> `5000.0` float) without manual string parsing.
-* **Zero-Copy Logic:** Built on Rust's `Cow<str>` to minimize memory allocation during processing.
+* **Streaming Architecture:** Processes files chunk-by-chunk. Memory usage remains stable regardless of whether the file is 100MB or 100GB.
+* **Parallel Execution:** Utilizes all CPU cores via **Rust Rayon** to handle heavy lifting (Regex, Fuzzy Matching) without blocking Python.
+* **Strict Quarantine:** Bad data isn't just dropped silently; it's quarantined into a separate file with a generated `_phaeton_reason` column for auditing.
+* **Smart Casting:** Automatically handles messy formats (e.g., `"Rp 5.250.000,00"` -> `5250000` int) without complex manual parsing.
+* **Configurable Engine:** Full control over `batch_size` and worker threads to tune performance for low-memory devices or high-end servers.
 
 ---
 
-## 📦 Installation
+## Performance Benchmark
 
-```bash
-pip install phaeton
-```
+Phaeton is optimized for "Dirty Data" scenarios involving heavy string parsing, regex filtering, and fuzzy matching.
 
-## ⚡ Key Features
 
-**1. The Scenario**
+**Test Scenario:**
+We generated a **Chaos Dataset** containing **1 Million Rows** of mixed dirty data:
+* **Operations:** Trim whitespace, Currency scrubbing (`$ 50.000,00` -> `50000`), Type casting, Fuzzy Alignment (Typo correction for City names), and Regex Filtering.
+* **Hardware:** Entry-level Laptop (Intel Core i3-1220P, 16GB RAM).
 
-You have a dirty CSV (`raw_data.csv`) with mixed encodings, typos in city names, and messy currency strings. You want a clean Parquet file for Pandas.
+**Results:**
 
-**2. The Code**
+| OS Environment | Speed (Rows/sec) | Duration (1M Rows) | Throughput |
+| :--- | :--- | :--- | :--- |
+| **Windows 11** | **~820,000 rows/s** | **1.21s** | **~70 MB/s** |
+| **Linux (Arch)** | ~575,000 rows/s | 1.73s | ~49 MB/s |
+
+> *Note: Phaeton maintains a low and predictable memory footprint (~10-20MB overhead) regardless of the input file size due to its streaming nature.*
+
+---
+## Usage Example
 
 ```python
 import phaeton
 
-# 1. Probe the file (Auto-detect encoding, delimiter, headers, etc)
-info = phaeton.probe("raw_data.csv")
-print(f"Detected: {info['encoding']} with delimiter '{info['delimiter']}'")
-
-# 2. Initialize Engine (0 = Use all CPU cores)
-eng = phaeton.Engine(workers=0)
+# 1. Initialize Engine (Auto-detect cores)
+engine = phaeton.Engine()
 
-# 3. Build the Pipeline
+# 2. Define Pipeline
 pipeline = (
-    eng.ingest("raw_data.csv")
-
-    # GATEKEEPING: Fix encoding & standardize headers
-    .decode(encoding=info['encoding'])
-    .headers(style="snake")
-
-    # ELIMINATION: Remove useless rows
-    .prune(col="email")  # Drop rows with empty email
-    .discard(col="status", match="BANNED", mode="exact")
-
-    # TRANSFORMATION: Smart Cleaning
-    # "$ 30.000,00" -> 30000 (Integer)
-    # If it fails (e.g., "Free"), send row to Quarantine
-    .cast("salary", type="int", clean=True, on_error="quarantine")
-
-    # FUZZY FIXING: Fix typos ("Cihcago" -> "Chicago")
-    .fuzzyalign(
-        col="city",
-        ref=["Chicago", "Jakarta", "Shanghai"],
-        threshold=0.85
-    )
-
-    # OUTPUT: Split into Clean Data & Audit Log
-    .quarantine("bad_data_audit.csv")
-    .dump("clean_data.parquet")
+    engine.ingest("dirty_data.csv")
+    .prune(col="email")  # Drop rows if email is empty
+    .discard("status", "BANNED", mode="exact")  # Filter specific values
+    .scrub("username", "trim")  # Clean whitespace
+    .scrub("salary", "currency")  # Parse "Rp 5.000" to number
+    .cast("salary", "int", clean=True)  # Safely cast to Integer
+    .fuzzyalign("city", ref=["Jakarta", "Bandung"], threshold=0.85)  # Fix typos
+    .quarantine("quarantine.csv")  # Save bad data here
+    .dump("clean_data.csv")  # Save good data here
 )
 
-# 4. Execute (Rust takes over)
-stats = eng.exec([pipeline])
-
-print(f"Processed: {stats.processed} rows")
-print(f"Saved: {stats.saved} | Quarantined: {stats.quarantined}")
+# 3. Execute
+stats = engine.exec(pipeline)
+print(f"Processed: {stats.processed}, Saved: {stats.saved}")
 ```
 
-<br>
-
-## 📊 Performance Benchmark
-
-Phaeton is optimized for "Dirty Data" scenarios (String parsing, Regex filtering, Fuzzy matching).
-
-**Test Environment:**
-- **Dataset:** 1 Million Rows (Mixed dirty data: Typos, Currency strings, Encoding issues).
-- **Hardware:** Entry Level Laptop.
-
-**Result:**
-| Metric | Phaeton |
-| :---: | :---: |
-| Speed | ~575,000 rows/sec |
-| Memory Usage | ~50MB (Constant) |
-| Strategy | Parallel Streaming |
+---
 
-<br>
+## Installation
 
-> Note: Phaeton maintains low memory footprint even when processing multi-gigabyte files due to its zero-copy streaming architecture.
+Phaeton provides **Universal Wheels (ABI3)**. No Rust compiler needed.
+```bash
+pip install phaeton
+```
+> **Supported:** Python 3.8+ on Windows, Linux, and macOS (Intel & Apple Silicon).
 
-<br>
+---
 
-## 📚 API Reference
+## API Reference
 
 ### Root Module <br>
 | Method | Description
@@ -155,24 +130,27 @@ Methods to save the final results or handle rejected data.
 
 ---
 
-## 🗺️ Roadmap
+## Roadmap
 
-Phaeton is currently in **Beta (v0.2.0)**. Here is the status of our development pipeline:
+Phaeton is currently in **Beta (v0.2.1)**. Here is the status of our development pipeline:
 
-| Feature | Status | Notes |
+| Feature | Status | Implementation Notes |
 | :--- | :---: | :--- |
-| **Parallel Streaming Engine** | ✅ Ready | Powered by Rayon |
-| **Smart Type Casting** | ✅ Ready | Auto-clean numeric strings |
-| **Quarantine Logic** | ✅ Ready | Audit logs for bad data |
-| **Fuzzy Alignment** | ✅ Ready | Jaro-Winkler / Levenshtein |
-| **SHA-256 Hashing** | 📝 Planned | Security for PII data |
-| **Column Splitting & Combining** | 📝 Planned | - |
-| **Imputation (`.fill()`)** | 📝 Planned | Mean/Median/Mode fill |
-| **Parquet/Arrow Integration** | 📝 Planned | Native output support |
+| **Parallel Streaming Engine** | ✅ Ready | Powered by Rust Rayon (Multi-core) |
+| **Regex & Filter Logic** | ✅ Ready | `keep`, `discard`, `prune` implemented |
+| **Smart Type Casting** | ✅ Ready | Auto-clean numeric strings (`"Rp 5,000"` -> `5000`) |
+| **Fuzzy Alignment** | ✅ Ready | Jaro-Winkler for typo correction |
+| **Quarantine System** | ✅ Ready | Full audit trail for rejected rows |
+| **Basic Text Scrubbing** | ✅ Ready | Trim, HTML strip, Case conversion |
+| **Header Normalization** | 🚧 In Progress | `snake_case`, `camelCase` conversions |
+| **Date Normalization** | 🚧 In Progress | Auto-detect & reformat dates |
+| **Deduplication** | 📝 Planned | Row-level & Column-level dedupe |
+| **Hashing & Anonymization** | 📝 Planned | SHA-256 for PII data |
+| **Parquet/Arrow Support** | 📝 Planned | Native output integration |
 
 ---
 
-## 🤝 Contributing
+## Contributing
 
 This project is built with **Maturin** (PyO3 + Rust). Interested in contributing?

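The "Smart Casting" behavior described in the README diff above (e.g. `"Rp 5.250.000,00"` -> `5250000`, `"$ 50.000,00"` -> `50000`) amounts to currency scrubbing before the cast. A minimal Python stand-in for that idea, assuming European-style separators ('.' for thousands, ',' for decimals) as in the README's examples — this is illustrative only, not Phaeton's actual Rust implementation, which additionally quarantines rows that fail to parse:

```python
import re

def scrub_currency(raw: str) -> int:
    """Parse a messy European-style currency string into an integer.

    Assumes '.' is a thousands separator and ',' a decimal separator,
    matching the README examples ("Rp 5.250.000,00" -> 5250000).
    """
    # Keep only digits, dots, and commas; "Rp ", "$ ", spaces etc. are dropped
    cleaned = re.sub(r"[^0-9.,]", "", raw)
    if not cleaned:
        # Non-numeric values (e.g. "Free") cannot be cast
        raise ValueError(f"no numeric content in {raw!r}")
    # Drop thousands separators, then convert the decimal comma
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return int(float(cleaned))

print(scrub_currency("Rp 5.250.000,00"))  # 5250000
print(scrub_currency("$ 50.000,00"))      # 50000
```

In Phaeton itself, rows where this kind of parse fails are not dropped but routed to the quarantine file with a `_phaeton_reason` column.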
pyproject.toml

Lines changed: 6 additions & 3 deletions
@@ -4,10 +4,10 @@ build-backend = "maturin"
 
 [project]
 name = "phaeton"
-version = "0.2.1"
-description = "High-performance preprocessing and streaming data cleaning, powered by Rust."
+version = "0.2.2"
+description = "A high-performance Python library for preprocessing and sanitizing raw data streams, accelerated by Rust."
 readme = "README.md"
-license = {text = "MIT"}
+license = {file = "MIT"}
 authors = [
     {name = "Zahraan Dzakii Tsaqiif", email = "zahraandzakiits@gmail.com"}
 ]
@@ -25,13 +25,16 @@ classifiers = [
     "Programming Language :: Python :: 3.10",
     "Programming Language :: Python :: 3.11",
     "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Topic :: Software Development :: Libraries :: Python Modules",
     "Topic :: Scientific/Engineering :: Information Analysis",
 ]
 
 [project.urls]
 Homepage = "https://github.com/rannd1nt/phaeton"
 Repository = "https://github.com/rannd1nt/phaeton"
 Issues = "https://github.com/rannd1nt/phaeton/issues"
+Documentation = "https://github.com/rannd1nt/phaeton#readme"
 
 [project.optional-dependencies]
 dev = [

python/phaeton/__init__.py

Lines changed: 3 additions & 3 deletions
@@ -7,16 +7,16 @@
 try:
     from ._phaeton import __version__ as _rust_version
 except ImportError:
-    _rust_version = "0.2.0-alpha"
+    _rust_version = "0.2.2-beta"
 
 def version() -> str:
     """
     Returns the current version of the Phaeton library and the underlying Rust engine.
 
     Returns:
-        str: Version string (e.g., "Phaeton v0.2.0 (Engine: Rust v0.1.1)").
+        str: Version string (e.g., "Phaeton v1.1.0 (Phaeton Rust Core: v1.1.0)").
     """
-    return f"Phaeton v0.2.0 (Engine: Rust v{_rust_version})"
+    return f"Phaeton v0.2.2-beta (Phaeton Rust Core: v{_rust_version})"
 
 def probe(source: str) -> dict:
     """

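The try/except above falls back to a pinned string when the compiled Rust extension cannot be imported. Factored as a small pure function for illustration (`format_version` is a hypothetical name, not part of Phaeton's API):

```python
def format_version(rust_version=None):
    """Mirror of the version-reporting logic in python/phaeton/__init__.py:
    when the extension is absent (rust_version is None), the pinned
    fallback "0.2.2-beta" is reported for the Rust core."""
    _rust_version = rust_version if rust_version is not None else "0.2.2-beta"
    return f"Phaeton v0.2.2-beta (Phaeton Rust Core: v{_rust_version})"

print(format_version())         # ImportError fallback path
print(format_version("0.2.2"))  # normal path, extension present
```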
python/phaeton/engine.py

Lines changed: 6 additions & 2 deletions
@@ -30,15 +30,19 @@ class Engine:
     of one or multiple pipelines simultaneously.
     """
 
-    def __init__(self, workers: int = 0):
+    def __init__(self, workers: int = 0, batch_size: int = 10000):
         """
         Initialize the Engine.
 
         Args:
             workers (int, optional): Number of CPU threads to use.
                 Set to 0 to automatically use all available cores. Defaults to 0.
+            batch_size (int, optional): Number of rows to process in each batch.
+                Defaults to 10000.
+
+
         """
-        self.config = {"workers": workers}
+        self.config = {"workers": workers, "batch_size": batch_size}
 
     def ingest(self, source: str) -> Pipeline:
         """

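The Python-side change is small: `batch_size` is simply recorded in the config dict that later crosses into Rust. A pure-Python stand-in mirroring `Engine.__init__` from the diff above, so the resulting config can be inspected without building the extension:

```python
class Engine:
    """Illustrative stand-in for phaeton.Engine's constructor only;
    the real class (python/phaeton/engine.py) also builds pipelines."""

    def __init__(self, workers: int = 0, batch_size: int = 10000):
        # workers=0 means "auto-detect cores" on the Rust side
        self.config = {"workers": workers, "batch_size": batch_size}

print(Engine().config)                              # defaults
print(Engine(workers=4, batch_size=50_000).config)  # tuned for a larger box
```

Smaller `batch_size` values bound peak memory more tightly on low-RAM machines; larger values amortize per-batch overhead on servers.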
src/engine.rs

Lines changed: 8 additions & 4 deletions
@@ -6,10 +6,11 @@ use std::time::Instant;
 
 pub struct Engine {
     workers: usize,
+    batch_size: usize,
 }
 
 impl Engine {
-    pub fn new(workers: usize) -> Self {
+    pub fn new(workers: usize, batch_size: usize) -> Self {
         let actual_workers = if workers == 0 {
             // Auto-detect CPU cores
             std::thread::available_parallelism()
@@ -25,9 +26,12 @@ impl Engine {
             .build_global()
             .ok(); // Ignore if already built
 
-        Self { workers: actual_workers }
+        Self {
+            workers: actual_workers,
+            batch_size: if batch_size == 0 { 10_000 } else { batch_size }
+        }
     }
-
+
     /// Execute single pipeline (non-parallel)
     pub fn execute_single(&self, payload: HashMap<String, serde_json::Value>) -> Result<HashMap<String, u64>> {
         let start = Instant::now();
@@ -48,7 +52,7 @@ impl Engine {
         let quarantine = payload.get("quarantine")
             .and_then(|v| v.as_str());
 
-        let processor = StreamProcessor::new(source, steps, 0);
+        let processor = StreamProcessor::new(source, steps, 0, self.batch_size);
         let stats = processor.execute(output, quarantine)?;
 
         let mut result = HashMap::new();

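`batch_size` is threaded into `StreamProcessor::new`, which bounds how many rows are held in memory at once. The memory-bounding idea behind that batching can be sketched with a generic Python generator (an illustration of the technique, not Phaeton's code):

```python
import csv
import io

def iter_batches(fh, batch_size=10_000):
    """Yield lists of at most batch_size rows, so peak memory is bounded
    by one batch rather than the whole file."""
    reader = csv.reader(fh)
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # trailing partial batch
        yield batch

sample = io.StringIO("a,1\nb,2\nc,3\nd,4\ne,5\n")
print([len(b) for b in iter_batches(sample, batch_size=2)])  # [2, 2, 1]
```

Each batch can then be handed to a worker pool (Rayon, in Phaeton's case) while the reader moves on, which is how throughput scales with cores while memory stays flat.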
src/lib.rs

Lines changed: 10 additions & 4 deletions
@@ -1,6 +1,6 @@
 use pyo3::prelude::*;
 use std::collections::HashMap;
-use pythonize::depythonize; // <--- import the translator
+use pythonize::depythonize;
 use serde_json::Value;
 
 mod engine;
@@ -26,7 +26,7 @@ fn preview_pipeline(_py: Python, source: String, steps_py: PyObject, n: usize) -
     let steps: Vec<HashMap<String, Value>> = depythonize(steps_py.as_ref(_py))
         .map_err(|e| PyErr::new::<pyo3::exceptions::PyValueError, _>(format!("Invalid steps format: {}", e)))?;
 
-    let processor = StreamProcessor::new(source, steps, n);
+    let processor = StreamProcessor::new(source, steps, n, 10000);
     let preview = processor.peek()
         .map_err(|e| PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(e.to_string()))?;
 
@@ -39,7 +39,7 @@ fn execute_pipeline(_py: Python, payload_py: PyObject) -> PyResult<HashMap<Strin
     let payload: HashMap<String, Value> = depythonize(payload_py.as_ref(_py))
         .map_err(|e| PyErr::new::<pyo3::exceptions::PyValueError, _>(format!("Invalid payload format: {}", e)))?;
 
-    let engine = Engine::new(0);
+    let engine = Engine::new(0, 10000);
     let stats = engine.execute_single(payload)
         .map_err(|e| PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(e.to_string()))?;
 
@@ -62,11 +62,17 @@ fn execute_batch(
     let config: HashMap<String, Value> = depythonize(config_py.as_ref(_py))
         .map_err(|e| PyErr::new::<pyo3::exceptions::PyValueError, _>(format!("Invalid config: {}", e)))?;
 
+    // Get number of workers
     let workers = config.get("workers")
         .and_then(|v| v.as_u64())
         .unwrap_or(0) as usize;
+
+    // Get batch size
+    let batch_size = config.get("batch_size")
+        .and_then(|v| v.as_u64())
+        .unwrap_or(10_000) as usize;
 
-    let engine = Engine::new(workers);
+    let engine = Engine::new(workers, batch_size);
     let results = engine.execute_parallel(payloads)
         .map_err(|e| PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(e.to_string()))?;
 

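In `execute_batch`, missing config keys fall back to the same defaults as the constructor: `workers = 0` (auto-detect) and `batch_size = 10_000`. That `.get` / `.and_then` / `.unwrap_or` chain, expressed in Python for illustration (`extract_config` is a hypothetical helper; the real extraction happens in Rust via `serde_json`):

```python
def extract_config(config):
    """Mirror execute_batch's defaulting: absent or non-integer keys
    fall back to workers=0 and batch_size=10_000."""
    workers = int(config.get("workers", 0))
    batch_size = int(config.get("batch_size", 10_000))
    return workers, batch_size

print(extract_config({}))                                  # (0, 10000)
print(extract_config({"workers": 8, "batch_size": 5000}))  # (8, 5000)
```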