🚀 Neural Ninjas - Dynamic ETL Pipeline

AI-Powered Data Processing | Zero Configuration | Production Ready

Upload any JSON/CSV file and watch as our intelligent pipeline automatically infers schemas, detects data types, tracks changes, removes duplicates, and loads everything into MongoDB with full versioning!

🌟 Overview

Neural Ninjas is an intelligent ETL (Extract, Transform, Load) pipeline that eliminates manual schema definition and data processing configuration. Built for OSC Hackathon 2025, this system uses AI-powered type detection and smart algorithms to handle dynamic data from any source.

The Problem We Solve

❌ Manual schema definition is time-consuming
❌ Data types are inconsistent across sources
❌ Duplicate records waste storage
❌ Tracking data changes manually is tedious
❌ Schema evolution breaks existing pipelines

Our Solution

✅ AI-Powered Type Detection - Automatically detects 8 data types
✅ Schema Evolution - Adapts to new fields automatically
✅ Smart Deduplication - Prevents duplicate records
✅ Change Tracking - Monitors key field changes
✅ Data Normalization - Standardizes dates, emails, numbers
✅ Full Versioning - Complete audit trail in MongoDB

✨ Key Features

🤖 AI-Powered Intelligence

1. Intelligent Type Detection

Automatically detects and classifies data into 8 types:

integer - Whole numbers (42, 100, -5)
float - Decimal numbers (42.5, 99.99, 3.14)
string - Text data ("Hello", "World")
email - Email addresses (user@example.com)
date - Multiple formats (YYYY-MM-DD, DD/MM/YYYY, DD-MM-YYYY)
url - Web URLs (https://example.com)
boolean - True/False values (yes, no, 1, 0, true, false)
null - Empty or missing values

2. Schema Evolution

Automatically adapts as new fields appear
Maintains backward compatibility
Stores sample values for reference
Type priority system handles conflicts

3. Data Normalization

Emails → Lowercase (ALICE@TEST.COM → alice@test.com)
Dates → Standardized format (15/02/2023 → 2023-02-15)
Numbers → Proper typing ("42" → 42, "42.5" → 42.5)
Booleans → True/False values ("yes" → True, "1" → True)

🔍 Advanced Tracking

4. Schema Versioning System

Every unique schema saved with version number
Tracks creation and last-used timestamps
Reuses existing schemas when structure matches
Complete schema history in MongoDB

5. Change Detection

Automatically monitors changes in key fields:

price - E-commerce price tracking
discount - Offer monitoring
score - Performance metrics
rating - Review tracking
salary - HR data monitoring

Example:

Existing: {"name": "Alice", "price": 100, "score": 85}
New:      {"name": "Alice", "price": 120, "score": 90}
Detected: price: 100 → 120 (+20), score: 85 → 90 (+5)

6. Smart Deduplication

Checks within batch (in-memory)
Checks against database (existing records)
Uses identifier fields: name, user, email, id
Reports number of duplicates skipped

💎 Production Ready

Batch Processing - Handles large datasets efficiently (configurable batch size)
Error Handling - Robust error management throughout
Comprehensive Logging - All operations logged with timestamps
Beautiful UI - Modern dashboard with real-time statistics
MongoDB Integration - Scalable NoSQL storage with 3 collections
Metadata Tracking - _loaded_at timestamp on all records

📁 Project Structure

neural-ninjas/
├── backend/                      # Python Backend
│   ├── app.py                   # Main Flask application
│   ├── extract.py               # Data extraction module
│   ├── transform.py             # Data transformation & type detection
│   ├── load.py                  # MongoDB loading & versioning
│   ├── config.py                # Configuration settings
│   ├── requirements.txt         # Python dependencies
│   ├── launch.py                # Application launcher
│   ├── run_server.py            # Server runner
│   ├── start_localhost.sh       # Quick start script
│   ├── run_tests.sh             # Test runner
│   ├── test_backend.py          # Backend unit tests
│   ├── test_flask_upload.py     # Integration tests
│   ├── test_categorized.py      # Categorization tests
│   ├── test_data_complete.json  # Test data (full features)
│   ├── test_data_modified.json  # Test data (change detection)
│   ├── sample.json              # Sample JSON file
│   ├── sample.csv               # Sample CSV file
│   ├── sample1.csv              # Additional test data
│   └── *.log                    # Log files
│
├── frontend/                     # HTML/CSS Frontend
│   ├── templates/               # Jinja2 templates
│   │   └── index.html          # Main dashboard UI
│   └── style.css               # Global styles
│
├── .git/                        # Git repository
├── .gitignore                   # Git ignore rules
└── README.md                    # This file

🛠️ Tech Stack

Backend

Flask - Lightweight Python web framework
Python 3.7+ - Core programming language
PyMongo - MongoDB driver for Python
MongoDB - NoSQL database for storage
Regex - Pattern matching for type detection
Batch Processing - Efficient large dataset handling

Frontend

HTML5 - Modern markup
CSS3 - Styling with gradients and animations
Jinja2 - Template engine
JavaScript - Client-side interactivity

Database Collections

The system creates 3 MongoDB collections:

entries - Main data storage
- All processed records
- Includes _loaded_at metadata

schema_versions - Schema history

{
  "_id": ObjectId("..."),
  "version": 1,
  "schema": {
    "name": {"type": "string", "sample_values": ["Alice", "Bob"]},
    "age": {"type": "integer", "sample_values": [25, 30]}
  },
  "created_at": "2025-11-16T10:00:00Z",
  "last_used": "2025-11-16T10:30:00Z",
  "stats": {
    "total_records": 100,
    "total_fields": 5
  }
}

data_changes - Change tracking

{
  "identifier": {"name": "Alice"},
  "field": "price",
  "old_value": 100,
  "new_value": 120,
  "timestamp": "2025-11-16T10:30:00Z",
  "change_type": "update"
}

🚀 Getting Started

Prerequisites

Python 3.7+ installed
MongoDB running locally or connection URI
pip package manager

Installation

Step 1: Clone the Repository

git clone https://github.com/Algoace1403/Neural-Ninjas.git
cd Neural-Ninjas

Step 2: Install Dependencies

cd backend
pip install -r requirements.txt

Dependencies:

flask
pymongo

Step 3: Start MongoDB

Option A - Local MongoDB:

mongod

Option B - MongoDB Compass:

Open MongoDB Compass application
Connect to localhost:27017

Option C - Cloud MongoDB: Edit backend/config.py with your connection string

Step 4: Run the Application

# From backend directory
python app.py

Or use the quick start script:

cd backend
./start_localhost.sh

Step 5: Open Browser

Navigate to: http://127.0.0.1:5000

📊 Usage Guide

Basic Workflow

Upload a File
- Drag & drop or click to browse
- Supports JSON and CSV formats
- No schema definition needed!
Watch AI Process
- Type detection runs automatically
- Schema inferred and displayed
- Statistics updated in real-time
View Results
- See detected field types with color-coded badges
- Review sample values
- Check processing statistics
Check MongoDB
- Open MongoDB Compass
- View entries collection for data
- Check schema_versions for schema history
- See data_changes for change tracking

Testing Features

Test 1: Type Detection

# Upload backend/test_data_complete.json
# Expected: Detects integer, float, email, date, url, boolean types
# Check: MongoDB Compass → schema_versions collection

Test 2: Deduplication

# 1. Upload test_data_complete.json
# 2. Upload the same file again
# Expected: "X duplicates skipped" message

Test 3: Change Detection

# 1. Upload test_data_complete.json
# 2. Upload test_data_modified.json (has changed prices/scores)
# Expected: Changes table shows old vs new values
# Check: MongoDB → data_changes collection

Test 4: Schema Evolution

# 1. Upload sample.json (different structure)
# 2. Upload test_data_complete.json (different fields)
# Expected: Schema version increments (v1 → v2)

🎨 UI Features

Dashboard Components

File Upload Interface
- Drag & drop support
- File type validation
- Upload progress indication
Real-Time Statistics Cards
- Records Inserted
- Total Fields Detected
- Schema Version
- Duplicates Skipped (conditional)
- Changes Detected (conditional)
Schema Table
- Field names
- Color-coded type badges
- Sample values
Change Detection Table
- Field name
- Old value
- New value (highlighted in red)
- Change type

Color-Coded Type Badges

🔵 integer - Blue (#1976d2)
🟣 float - Purple (#7b1fa2)
🟢 string - Green (#388e3c)
🟠 email - Orange (#f57c00)
🩷 date - Pink (#c2185b)
🔷 boolean - Teal (#00796b)
🔵 url - Light Blue (#0277bd)

🔧 Configuration

Edit backend/config.py to customize:

# MongoDB Configuration
MONGO_URI = "mongodb://localhost:27017/"
MONGO_DB = "hackathon_db"
MONGO_COLLECTION = "entries"
MONGO_SCHEMA_COLLECTION = "schema_versions"
MONGO_CHANGES_COLLECTION = "data_changes"

# Processing Configuration
BATCH_SIZE = 1000  # Records per batch

📈 Use Cases

Perfect for:

🛒 E-commerce - Price and product tracking
📊 Data Aggregation - Multi-source data integration
🌐 Web Scraping - Automated data pipelines
📡 API Collection - Third-party API data ingestion
🔄 Data Synchronization - Real-time data sync
💰 Financial Tracking - Stock prices, crypto, forex
⭐ Review Monitoring - Rating and sentiment tracking
📈 Performance Metrics - KPI and analytics tracking

🧪 Testing

Manual Testing Checklist

Basic Upload:

Upload JSON file succeeds
Upload CSV file succeeds
Invalid file shows error message

Type Detection:

Integer fields detected correctly
Float fields detected correctly
Email fields detected correctly
Date fields detected correctly
Boolean fields detected correctly
URL fields detected correctly

Schema Versioning:

First upload creates v1
Same structure reuses v1
New fields create v2
MongoDB has schema_versions collection

Deduplication:

Duplicate records skipped
UI shows duplicate count
Only unique records in database

Change Detection:

Price changes detected
Score changes detected
UI shows change table
MongoDB has data_changes collection

Automated Tests

Run the test suite:

cd backend
./run_tests.sh

Or run individual tests:

python test_backend.py      # Backend unit tests
python test_flask_upload.py # Integration tests
python test_categorized.py  # Categorization tests

🎯 Performance

Batch Processing: 1000 records per batch (configurable)
Type Detection: ~0.1ms per value
Schema Inference: O(n) complexity
Deduplication: O(n) with hash-based lookup
MongoDB Inserts: Bulk operations for efficiency
Memory Usage: Streaming for large files

🐛 Troubleshooting

Port Already in Use

# Kill process on port 5000
lsof -ti:5000 | xargs kill -9

# Or use a different port
# Edit app.py: app.run(port=5001)

MongoDB Connection Error

# Check if MongoDB is running
pgrep mongod

# Start MongoDB
mongod

# Or use MongoDB Compass to start server

Module Not Found

# Install dependencies
cd backend
pip install -r requirements.txt

Permission Denied

# Make scripts executable
chmod +x backend/*.sh
chmod +x backend/*.py

🚀 Future Enhancements

Planned Features

Phase 1 (Next 10-15 hours):

REST API endpoints for programmatic access
Basic ML anomaly detection
Web scraping module (Scrapy/BeautifulSoup)
Alert system (email/SMS notifications)

Phase 2 (Future):

Advanced ML models for missing value prediction
NLP for semantic understanding
Streamlit/Dash advanced dashboard
Task queue (Celery/RabbitMQ)
Microservices architecture
Real-time streaming data support
Data quality scoring
Custom transformation rules UI

📄 License

This project is licensed under the MIT License.

MIT License

Copyright (c) 2025 Neural Ninjas Team

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

👥 Team

Created with ❤️ by Neural Ninjas

OSC Hackathon 2025

⭐ If you found this project helpful, give it a star!

📊 Project Stats

Total Lines of Code:        8,000+
Python Files:               30+
Frontend Templates:         1
MongoDB Collections:        3
Type Detection Patterns:    8
Test Cases:                12
Features Implemented:       8 core features
Development Time:           48 hours

Built with passion during OSC Hackathon 2025

AI-Powered | Zero Configuration | Production Ready

⬆ Back to Top

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
backend		backend
frontend		frontend
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md

Folders and files

Latest commit

History

Repository files navigation

🚀 Neural Ninjas - Dynamic ETL Pipeline

🌟 Overview

The Problem We Solve

Our Solution

✨ Key Features

🤖 AI-Powered Intelligence

1. Intelligent Type Detection

2. Schema Evolution

3. Data Normalization

🔍 Advanced Tracking

4. Schema Versioning System

5. Change Detection

6. Smart Deduplication

💎 Production Ready

📁 Project Structure

🛠️ Tech Stack

Backend

Frontend

Database Collections

🚀 Getting Started

Prerequisites

Installation

Step 1: Clone the Repository

Step 2: Install Dependencies

Step 3: Start MongoDB

Step 4: Run the Application

Step 5: Open Browser

📊 Usage Guide

Basic Workflow

Testing Features

Test 1: Type Detection

Test 2: Deduplication

Test 3: Change Detection

Test 4: Schema Evolution

🎨 UI Features

Dashboard Components

Color-Coded Type Badges

🔧 Configuration

📈 Use Cases

🧪 Testing

Manual Testing Checklist

Automated Tests

🎯 Performance

🐛 Troubleshooting

Port Already in Use

MongoDB Connection Error

Module Not Found

Permission Denied

🚀 Future Enhancements

Planned Features

📄 License

👥 Team

Created with ❤️ by Neural Ninjas

⭐ If you found this project helpful, give it a star!

📊 Project Stats

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages