OpenEngine is an end-to-end autonomous pipeline designed for the ingestion and semantic transformation of unstructured and multimodal data streams—ranging from legacy commercial records (PDF, Tally) to high-frequency industrial telemetry (sensor data). The system employs an Agentic Orchestration Layer that performs Dynamic Tool Selection and Schema Inference to normalize disparate data formats into a unified structural representation. By integrating Multimodal Encoders with Automated Exploratory Data Analysis (Auto-EDA), the framework synthesizes high-dimensional insights, automated visualizations, and structured reports, effectively bridging the gap between raw industrial noise and actionable business intelligence.
- Objective
Build an autonomous system capable of:
- Ingesting heterogeneous data (text, PDFs, images, tabular, sensor streams)
- Understanding data type and structure
- Dynamically selecting processing pipelines
- Generating insights, summaries, and visualizations
The end goal is a "Generalist Data Scientist" agentic system that transforms raw, unformatted, multimodal input into structured, interpretable business logic.
- Agentic Orchestration: LangGraph (preferred for stateful, cyclic flows) or CrewAI.
- Multimodal Ingestion: Unstructured.io (for partitioning PDFs/images), LlamaIndex (for data connectors).
- Data Engineering Backbone: PySpark or Polars (for high-performance data manipulation).
- Vector & Structured Storage: PostgreSQL with pgvector (for hybrid search) and ChromaDB.
- Modality Router — detects file type and routes to correct parser
- Schema Inference Engine — auto-detects column types, data distributions, nulls
- Data Normalization Layer — outputs clean, structured DataFrames or JSON regardless of input
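As a rough illustration of what the Schema Inference Engine might compute, the sketch below infers a type and a null count for each column of a small CSV using only the standard library. The function name `infer_schema`, the type labels, and the list of null tokens are assumptions made for this example and are not part of OpenEngine.

```python
import csv
import io

# Assumed spellings of "missing value"; real exports may use others.
NULL_TOKENS = {"", "na", "n/a", "null", "none", "nan"}


def _kind(value: str) -> str:
    """Classify a single non-null cell as 'int', 'float', or 'string'."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


def infer_schema(csv_text: str) -> dict:
    """Return {column: {"type": ..., "nulls": ...}} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {name: {"kinds": set(), "nulls": 0} for name in reader.fieldnames or []}
    for row in reader:
        for name, raw in row.items():
            if name not in seen:  # extra cells beyond the header are ignored
                continue
            cell = (raw or "").strip()
            if cell.lower() in NULL_TOKENS:
                seen[name]["nulls"] += 1
            else:
                seen[name]["kinds"].add(_kind(cell))

    schema = {}
    for name, info in seen.items():
        kinds = info["kinds"]
        if not kinds:
            col_type = "unknown"   # column contained only nulls
        elif kinds == {"int"}:
            col_type = "int"
        elif kinds <= {"int", "float"}:
            col_type = "float"     # mixed ints and floats widen to float
        else:
            col_type = "string"
        schema[name] = {"type": col_type, "nulls": info["nulls"]}
    return schema
```

A production version would more likely lean on Pandas or Polars type inference and add distribution statistics; this version only shows the shape of the output.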
| Layer | Technology |
|---|---|
| Agent Orchestration | LangGraph |
| Data Framework | LlamaIndex |
| LLM | HuggingFace and Groq |
| PDF Parsing | PyMuPDF, unstructured.io |
| OCR | Tesseract (pytesseract) |
| Data Processing | Pandas, Polars |
| Vector Store | ChromaDB |
| Embeddings | OpenAI text-embedding-3-small |
| ML Analysis | scikit-learn, statsmodels, SHAP |
| Time-series | Prophet, tsfresh |
| Visualization | Plotly |
| API | FastAPI |
| Frontend | Next.js (React) |
| Database | PostgreSQL + SQLAlchemy |
| Containerization | Docker + Docker Compose |
| Infra | Kubernetes |
| Package Manager | pip (Python) |
| What you build | What you learn |
|---|---|
| Modality router + ingestion agents | Agentic orchestration + tool use |
| Schema inference engine | Unstructured data handling |
| RAG pipeline | Retrieval-augmented generation |
| LLM insight generation | Foundation model adaptation / prompting |
| Sensor anomaly detection | ML model training + time-series |
| Full pipeline design | End-to-end data engineering |
- Goal: Build a gateway that identifies "what" the data is without user labels.
- Task: Implement a classifier agent that uses zero-shot classification to route files (e.g., "This is a CSV of sales" vs "This is a blurred image of a receipt").
- Key Concept: Metadata Extraction and File Fingerprinting.
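The file-fingerprinting half of this step can be sketched without any model: inspect the leading bytes of a file and fall back to the extension or a text check. The signature table is deliberately small, and the modality labels and the `detect_modality` name are assumptions for illustration. A zero-shot classifier would still be needed afterwards to decide what the content is about.

```python
import codecs
from pathlib import Path

# Leading "magic" bytes for a few common formats (non-exhaustive).
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "image"),    # PNG
    (b"\xff\xd8\xff", "image"),         # JPEG
    (b"PK\x03\x04", "zip_container"),   # XLSX and DOCX are ZIP archives
]


def detect_modality(head: bytes, filename: str = "") -> str:
    """Guess a coarse modality from the first bytes of a file.

    `head` is the first chunk of the file (a few hundred bytes is enough);
    `filename` is only used as a fallback hint.
    """
    for signature, modality in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return modality

    if Path(filename).suffix.lower() in {".csv", ".tsv"}:
        return "tabular"

    # Treat anything that decodes as UTF-8 as plain text. The incremental
    # decoder tolerates a multi-byte character cut off at the end of `head`.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
    except UnicodeDecodeError:
        return "unknown"
    return "text"
```

For example, `detect_modality(open(path, "rb").read(512), path)` would give the router a label before any parser is chosen.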
- Goal: Convert "Dirty" data into "Clean" relational tables.
- Task: Develop agents that perform Schema Alignment. If a Tally export and an Excel sheet both contain "Price," the agent must map them to a unified `unit_cost` column.
- Key Concept: Ontology Mapping & Data Cleaning Agents.
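A minimal sketch of schema alignment, assuming a hand-written synonym table rather than an agent: source column names are normalized and looked up against known aliases for each canonical column. The alias sets and the `quantity` column are hypothetical; only `unit_cost` comes from the task description above.

```python
import re

# Hypothetical synonym table: canonical column -> known source spellings.
CANONICAL_COLUMNS = {
    "unit_cost": {"price", "rate", "unit price", "unit cost"},
    "quantity": {"qty", "quantity", "units"},
}


def _normalize(name: str) -> str:
    """Lowercase and collapse punctuation so 'Unit_Price ' becomes 'unit price'."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def align_columns(columns):
    """Map each source column name to its canonical name.

    Names with no known alias pass through unchanged so nothing is dropped.
    """
    lookup = {
        alias: canonical
        for canonical, aliases in CANONICAL_COLUMNS.items()
        for alias in aliases
    }
    return {col: lookup.get(_normalize(col), col) for col in columns}
```

In an agentic version, the columns that pass through unchanged are the ones worth handing to an LLM or an embedding-similarity step for a proposed mapping.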
- Goal: Perform the actual "Data Science" work autonomously.
- Task: An agent that writes and executes Python code (sandboxed) to generate descriptive statistics, trend lines, and correlation matrices.
- Key Concept: Automated Code Generation & Execution (ACE).
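One simple way to keep generated code away from the orchestrator's own process is to run it in a separate interpreter with a timeout. The sketch below does only that; it is an assumption about how execution could be wired up, and on its own it is not a security sandbox (the child process can still reach the filesystem and network). Stronger isolation, such as a container, would need to wrap it.

```python
import subprocess
import sys


def run_generated_code(code: str, timeout_s: float = 5.0) -> dict:
    """Run a code string in a separate Python process and capture its output."""
    try:
        proc = subprocess.run(
            # -I starts Python in isolated mode (ignores environment
            # variables and the user site-packages directory).
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
```

Returning `stderr` matters for the agentic loop: a traceback can be fed back to the model so it can repair its own code on the next iteration.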
- Statistical Analysis:
  - Descriptive stats: mean, median, std, skewness, correlation
  - Anomaly detection: IsolationForest, Z-score
  - Trend detection: rolling averages, seasonal decomposition (statsmodels)
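Two of the simpler techniques listed above, Z-score anomaly detection and rolling averages, fit in a few lines of standard-library Python. This is a sketch with assumed function names and an assumed default threshold of 3 standard deviations; IsolationForest and seasonal decomposition would come from scikit-learn and statsmodels instead.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, threshold=3.0):
    """Return the indices of values more than `threshold` std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []           # all values identical: nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def rolling_mean(values, window):
    """Return the trailing moving average; the result has len(values) - window + 1 items."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]
```

One caveat worth knowing: a single large outlier inflates the standard deviation it is measured against, so on short series a Z-score test can miss it. Robust variants based on the median are a common alternative.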
- ML-Based Insights (optional but powerful):
  - Clustering similar records: KMeans, DBSCAN
  - Forecasting (for time-series / sensor data): Prophet or LSTM
  - Feature importance for tabular data: SHAP values
- LLM-Based Insight Generation:
  - Pass analysis results + data summary to LLM
  - Prompt it to generate: executive summary, key findings, risk flags, recommendations
  - Use structured output (JSON) for consistent report formatting
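Structured output is only useful if the reply is checked before it reaches the report. The sketch below parses a model reply as JSON and verifies that the four sections named above are present. The key names are assumptions derived from that list, and no particular LLM client is implied.

```python
import json

# Assumed JSON keys, one per report section named in the prompt.
REQUIRED_SECTIONS = (
    "executive_summary",
    "key_findings",
    "risk_flags",
    "recommendations",
)


def parse_report(llm_output: str) -> dict:
    """Parse the model's reply as JSON and check that every section is present."""
    try:
        report = json.loads(llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(report, dict):
        raise ValueError("reply must be a JSON object")
    missing = [key for key in REQUIRED_SECTIONS if key not in report]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    return report
```

A `ValueError` here is a natural retry signal: the orchestrator can re-prompt with the error message appended instead of passing a malformed report downstream.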
- API & Frontend: FastAPI (async, fast, and easy to document with Swagger). Endpoints:
  - POST /upload: accepts multimodal files
  - GET /status/{job_id}: pipeline progress
  - GET /report/{job_id}: fetch final report
  - GET /visualizations/{job_id}: fetch charts
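These endpoints imply a job lifecycle: an upload creates a job, the pipeline updates it, and the report becomes available once it finishes. The framework-agnostic sketch below shows the state that the FastAPI handlers could read and write. The class names, the status values, and the in-memory dictionary are all assumptions; a real deployment would more plausibly persist jobs in PostgreSQL.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: str
    status: str = "queued"          # assumed states: queued, running, done, failed
    report: Optional[dict] = None


class JobStore:
    """In-memory job registry backing the upload, status, and report endpoints."""

    def __init__(self):
        self._jobs = {}

    def create(self) -> Job:
        """Called by POST /upload: register a new job and return it."""
        job = Job(job_id=uuid.uuid4().hex)
        self._jobs[job.job_id] = job
        return job

    def complete(self, job_id: str, report: dict) -> None:
        """Called by the pipeline when the final report is ready."""
        job = self._get(job_id)
        job.status = "done"
        job.report = report

    def get_status(self, job_id: str) -> dict:
        """Called by GET /status/{job_id}."""
        return {"job_id": job_id, "status": self._get(job_id).status}

    def get_report(self, job_id: str) -> dict:
        """Called by GET /report/{job_id}; fails until the job is done."""
        job = self._get(job_id)
        if job.status != "done":
            raise LookupError(f"report for {job_id} is not ready")
        return job.report

    def _get(self, job_id: str) -> Job:
        if job_id not in self._jobs:
            raise KeyError(f"unknown job_id: {job_id}")
        return self._jobs[job_id]
```

In the HTTP layer, the `KeyError` would typically map to a 404 response and the `LookupError` to a 409 or 202, depending on how clients are expected to poll.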
- Goal: Make the data human-readable.
- Task: Use a specialized agent to select the best visualization (e.g., "Use a Heatmap for sensor fluctuations") and generate a LaTeX or Markdown executive summary.
- Mapping Accuracy: How correctly did the system infer the schema?
- Hallucination Rate: Measuring the fidelity of the "Insight Summary" against the raw numbers.
- System Latency: Optimizing the agentic loop for real-time sensor processing.
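The first two metrics can be made concrete with simple proxies. In the sketch below, mapping accuracy is the fraction of columns mapped to the expected canonical name, and the hallucination check is reduced to one narrow question: what share of the numbers quoted in a summary match no value in the computed statistics? Both definitions and both function names are assumptions for illustration. In particular, the second catches only numeric mismatches, not unsupported qualitative claims.

```python
import re


def mapping_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of expected columns whose predicted mapping matches the ground truth."""
    if not expected:
        return 0.0
    correct = sum(
        1 for column, target in expected.items() if predicted.get(column) == target
    )
    return correct / len(expected)


# Matches plain integers and decimals; dates, percentages written as words,
# and thousands separators are out of scope for this sketch.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def unsupported_number_rate(summary: str, source_values, tolerance: float = 1e-6) -> float:
    """Share of numbers quoted in `summary` that match no value in `source_values`."""
    quoted = [float(token) for token in _NUMBER.findall(summary)]
    if not quoted:
        return 0.0
    unsupported = [
        q for q in quoted
        if not any(abs(q - value) <= tolerance for value in source_values)
    ]
    return len(unsupported) / len(quoted)
```

For instance, a summary that cites three figures of which one does not appear in the analysis output would score one third, which gives a number to track across prompt or model changes.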