[](https://www.rust-lang.org/)
[](https://opensource.org/licenses/MIT)

> ⚠️ **Project Status:** Phaeton is currently in **Experimental Beta (v0.2.1)**.
> The core streaming engine is functional, but the library is under limited maintenance due to the author's personal schedule.


**Phaeton** is a specialized, Rust-powered preprocessing engine designed to sanitize raw data streams before they reach your analytical environment.

It acts as the strictly typed **"Gatekeeper"** of your data pipeline. Unlike traditional DataFrame libraries that attempt to load entire datasets into RAM, Phaeton employs a **zero-copy streaming architecture**. It processes data chunk by chunk, filtering noise, fixing encodings, and standardizing formats, ensuring **O(1) memory complexity** relative to file size.

This allows you to process massive datasets on standard hardware without memory spikes, delivering clean, high-quality data to downstream tools like Pandas, Polars, or ML models.

> **The Philosophy:** Don't waste memory loading garbage. Clean the stream first, then analyze the gold.

---

## Key Features

* **Streaming Architecture:** Processes files chunk-by-chunk. Memory usage remains stable regardless of whether the file is 100MB or 100GB.
* **Parallel Execution:** Utilizes all CPU cores via **Rust Rayon** to handle heavy lifting (Regex, Fuzzy Matching) without blocking Python.
* **Strict Quarantine:** Bad data isn't just dropped silently; it's quarantined into a separate file with a generated `_phaeton_reason` column for auditing.
* **Smart Casting:** Automatically handles messy formats (e.g., `"Rp 5.250.000,00"` → `5250000` int) without complex manual parsing.
* **Configurable Engine:** Full control over `batch_size` and worker threads to tune performance for low-memory devices or high-end servers.
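To make the smart-casting bullet concrete, here is a rough pure-Python equivalent of the currency scrub (a sketch of the idea, not Phaeton's actual implementation, and scoped to the European `1.234,56` convention shown above):

```python
import re

def scrub_currency(raw):
    """Turn a messy currency string like "Rp 5.250.000,00" into an int.
    Keeps digits and the decimal comma; drops symbols and grouping dots."""
    kept = re.sub(r"[^\d,]", "", raw)   # "Rp 5.250.000,00" -> "5250000,00"
    whole = kept.split(",", 1)[0]       # drop the ",00" decimal tail
    if not whole:
        raise ValueError(f"no digits found in {raw!r}")
    return int(whole)
```

A value that cannot be parsed (e.g. `"Free"`) raises instead of silently producing garbage, which is the behavior the quarantine system builds on.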

---

## Performance Benchmark

Phaeton is optimized for "Dirty Data" scenarios involving heavy string parsing, regex filtering, and fuzzy matching.

**Test Scenario:**
We generated a **Chaos Dataset** containing **1 Million Rows** of mixed dirty data:
* **Operations:** Trim whitespace, currency scrubbing (`$ 50.000,00` → `50000`), type casting, fuzzy alignment (typo correction for city names), and regex filtering.
* **Hardware:** Entry-level laptop (Intel Core i3-1220P, 16GB RAM).

**Results:**

| OS Environment | Speed (Rows/sec) | Duration (1M Rows) | Throughput |
| :--- | :--- | :--- | :--- |
| **Windows 11** | **~820,000 rows/s** | **1.21s** | **~70 MB/s** |
| **Linux (Arch)** | ~575,000 rows/s | 1.73s | ~49 MB/s |

> *Note: Phaeton maintains a low and predictable memory footprint (~10-20MB overhead) regardless of the input file size due to its streaming nature.*

---

## Usage Example

```python
import phaeton

# 1. Initialize Engine (auto-detects cores)
engine = phaeton.Engine()

# 2. Define the pipeline
pipeline = (
    engine.ingest("dirty_data.csv")
    .prune(col="email")                         # Drop rows if email is empty
    .discard("status", "BANNED", mode="exact")  # Filter specific values
    .scrub("username", "trim")                  # Clean whitespace
    .scrub("salary", "currency")                # Parse "Rp 5.000" to a number
    .cast("salary", "int", clean=True)          # Safely cast to integer
    .fuzzyalign("city", ref=["Jakarta", "Bandung"], threshold=0.85)  # Fix typos
    .quarantine("quarantine.csv")               # Save bad data here
    .dump("clean_data.csv")                     # Save good data here
)

# 3. Execute
stats = engine.exec(pipeline)
print(f"Processed: {stats.processed}, Saved: {stats.saved}")
```
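For intuition about the `.fuzzyalign` step above, the same idea can be approximated with the standard library's `difflib` (a pure-Python stand-in for illustration only; Phaeton's own matcher is Jaro-Winkler in Rust, so scores will differ):

```python
import difflib

def fuzzy_align(value, ref, threshold=0.85):
    """Snap a possibly misspelled value to the closest reference entry,
    or return it unchanged if nothing clears the threshold."""
    matches = difflib.get_close_matches(value, ref, n=1, cutoff=threshold)
    return matches[0] if matches else value

print(fuzzy_align("Jakrta", ["Jakarta", "Bandung"]))  # -> Jakarta
```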

---

## Installation

Phaeton provides **Universal Wheels (ABI3)**. No Rust compiler needed.

```bash
pip install phaeton
```

> **Supported:** Python 3.8+ on Windows, Linux, and macOS (Intel & Apple Silicon).


---

## API Reference

### Root Module

| Method | Description |
| :--- | :--- |

Methods to save the final results or handle rejected data.

---

## Roadmap

Phaeton is currently in **Beta (v0.2.1)**. Here is the status of our development pipeline:

| Feature | Status | Implementation Notes |
| :--- | :---: | :--- |
| **Parallel Streaming Engine** | ✅ Ready | Powered by Rust Rayon (multi-core) |
| **Regex & Filter Logic** | ✅ Ready | `keep`, `discard`, `prune` implemented |
| **Smart Type Casting** | ✅ Ready | Auto-clean numeric strings (`"Rp 5,000"` → `5000`) |
| **Fuzzy Alignment** | ✅ Ready | Jaro-Winkler for typo correction |
| **Quarantine System** | ✅ Ready | Full audit trail for rejected rows |
| **Basic Text Scrubbing** | ✅ Ready | Trim, HTML strip, case conversion |
| **Header Normalization** | 🚧 In Progress | `snake_case`, `camelCase` conversions |
| **Date Normalization** | 🚧 In Progress | Auto-detect & reformat dates |
| **Deduplication** | 📝 Planned | Row-level & column-level dedupe |
| **Hashing & Anonymization** | 📝 Planned | SHA-256 for PII data |
| **Parquet/Arrow Support** | 📝 Planned | Native output integration |
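The header-normalization entry above amounts to conversions like the following pure-Python sketch (illustrative only; the feature itself is still in progress and its actual rules may differ):

```python
import re

def to_snake_case(header):
    """Normalize a column header, e.g. "Total Sales (USD)" -> "total_sales_usd"."""
    s = re.sub(r"[^0-9a-zA-Z]+", "_", header)      # non-alphanumerics -> _
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", s)  # split camelCase boundaries
    return s.strip("_").lower()
```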

---

## Contributing

This project is built with **Maturin** (PyO3 + Rust). Interested in contributing?