A complete end-to-end machine learning system that detects AI-generated text versus human-written text. This project combines a Python FastAPI backend with a Next.js frontend, PostgreSQL database, and a trained scikit-learn model using advanced stylistic feature extraction.
- Stylistic Analysis: Advanced ML model using 28 domain-independent stylistic features (word patterns, punctuation, vocabulary diversity, sentence structure, formality markers)
- Accurate Detection: 90.11% test accuracy; 7 of 8 out-of-domain texts classified correctly
- Demo Samples: AI and human-written text samples to test the system
- Prediction History: Store and retrieve past predictions from PostgreSQL
- Modern UI: Beautiful Next.js frontend with Tailwind CSS
- RESTful API: FastAPI with Swagger documentation
- Environment Configuration: Secure .env file support for credentials
- Production Ready: Clean architecture with proper error handling
- Python 3.14
- FastAPI - Web framework
- Scikit-learn - Machine learning (SGDClassifier on scaled stylistic features)
- Pandas - Data processing
- PostgreSQL - Database (optional, with JSON fallback)
- Uvicorn - ASGI server
- Next.js 15 - React framework with App Router
- TypeScript - Type safety
- Tailwind CSS - Styling
- React Hooks - State management
- Dataset: 487K AI vs Human text samples (305K human-written, 181K AI-generated); training uses a balanced 100K subsample
- Algorithm: SGDClassifier with modified Huber loss
- Features: 28 stylistic features (no n-grams) - word length variance, formal words, contractions, first-person pronouns, hedging patterns, etc.
- Accuracy: 90.11% test accuracy
minor/
├── backend/
│   ├── app/
│   │   ├── routes/
│   │   │   ├── predict.py          # POST /predict, GET /history
│   │   │   └── news.py             # GET /news (demo samples)
│   │   ├── services/
│   │   │   ├── model_service.py    # Stylistic feature extraction & inference
│   │   │   ├── news_service.py     # Demo text loader
│   │   │   └── db_service.py       # PostgreSQL operations
│   │   └── schemas/
│   │       └── predict_schema.py
│   ├── model/
│   │   ├── train.py                # Training script (stylistic features)
│   │   ├── model.pkl               # Trained SGDClassifier
│   │   ├── scaler.pkl              # StandardScaler for features
│   │   └── config.pkl              # Model configuration
│   ├── data/
│   │   └── demo.json               # Sample AI & human texts
│   ├── database/
│   │   └── schema.sql              # PostgreSQL schema + seed data
│   ├── venv/                       # Python virtual environment
│   ├── .env                        # Environment variables (not in git)
│   ├── .env.example                # Template for .env
│   ├── .gitignore                  # Git ignore rules
│   ├── main.py                     # FastAPI app entry point
│   └── requirements.txt            # Python dependencies
│
├── frontend/                       # Next.js application
│   ├── src/
│   │   ├── app/
│   │   │   ├── page.tsx            # Main page
│   │   │   ├── layout.tsx          # Root layout
│   │   │   └── globals.css         # Global styles
│   │   ├── components/
│   │   │   ├── Navbar.tsx          # Navigation
│   │   │   ├── PredictForm.tsx     # Input form
│   │   │   ├── NewsCard.tsx        # Text sample display
│   │   │   ├── HistoryPanel.tsx    # Prediction history
│   │   │   ├── AnalysisView.tsx    # Main analysis UI
│   │   │   ├── HomeView.tsx        # Home view
│   │   │   └── Sidebar.tsx         # Navigation sidebar
│   │   └── types/
│   │       └── index.ts            # TypeScript types
│   └── package.json
│
├── README.md                       # This file
└── AI_Human.csv                    # AI vs Human text dataset (487K samples)
- Python 3.8+
- Node.js 18+
- PostgreSQL 12+ (optional)
- npm or yarn
# Navigate to backend directory
cd backend
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
source venv/bin/activate # macOS/Linux
# OR
venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
Create a .env file in the backend/ directory:
cp .env.example .env
Edit .env with your database credentials (optional):
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/ai_text_detection
Note: The API works without PostgreSQL using JSON fallback for history.
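The fallback mentioned in the note can be pictured with a small sketch. It assumes the backend checks DATABASE_URL in the environment and, when it is missing, appends predictions to a local JSON file. The function names and the file path are hypothetical and are not taken from db_service.py.

```python
import json
import os
from pathlib import Path


def storage_backend(env=None):
    """Return "postgres" when DATABASE_URL is set, otherwise "json"."""
    env = os.environ if env is None else env
    return "postgres" if env.get("DATABASE_URL", "").strip() else "json"


def append_history(record, path):
    """Append one prediction record to a JSON file (hypothetical fallback store)."""
    path = Path(path)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append(record)
    path.write_text(json.dumps(history, indent=2))
    return len(history)  # number of records now stored
```

Passing a plain dict as `env` keeps the backend choice easy to test without touching the real environment.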
# Create database
createdb ai_text_detection
# Run schema
psql -d ai_text_detection -f database/schema.sql
cd backend
source venv/bin/activate
python -m uvicorn main:app --reload --port 8000
Backend will be available at: http://localhost:8000
cd frontend
npm install # Only needed first time
npm run dev
Frontend will be available at: http://localhost:3000
- GET / - Health check
- POST /predict - Classify text as AI-generated, human-written, or uncertain
  Request: { "text": "Your text passage here..." }
  Response: { "label": "human-written", "confidence": 0.8532 }
  Label values: "ai-generated", "human-written", or "uncertain" (if confidence < 0.6)
- GET /history - Get past predictions (requires PostgreSQL)
  Response: [ { "id": 1, "input_text": "Text sample...", "label": "ai-generated", "confidence": 0.95, "created_at": "2026-03-27T10:30:00" } ]
- GET /news - Get demo text samples (10 AI, 10 human)
  Response: [ { "id": 1, "content": "Sample text..." } ]
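As an illustration, the /predict endpoint could be called from Python using only the standard library. This is a minimal sketch that assumes the backend is running at http://localhost:8000 and accepts the request body shown above; the helper names are hypothetical.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # backend address from the setup steps


def build_predict_request(text, base_url=API_URL):
    """Build (but do not send) a POST /predict request with a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, base_url=API_URL):
    """Send the request and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(build_predict_request(text, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the backend running, `predict("Some text")` should return a dict with `label` and `confidence` keys, matching the response format documented above.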
- Load AI_Human.csv dataset (487K+ samples)
- Labels: 0 = Human-written, 1 = AI-generated
- Extract 28 stylistic features:
- Word patterns: avg word length, word length variance, long/short word ratios
- Sentence structure: avg sentence length, variance, sentence starter diversity
- Vocabulary: diversity (type-token ratio), hapax legomena ratio
- Formality markers: formal words (particularly, furthermore, etc.), contractions, first-person pronouns
- Casual markers: filler words (like, honestly, basically), hedging words, narrative words
- Punctuation patterns: ratio of commas, quotes, dashes, parentheticals
- Text properties: total length, paragraph structure
- No TF-IDF or n-grams: only domain-independent stylistic features
- Scale features with StandardScaler
- Train SGDClassifier (modified Huber loss, balanced classes)
- Save model.pkl, scaler.pkl, config.pkl
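The scaling and training steps above can be sketched with scikit-learn. This is not the project's train.py: the feature matrix here is synthetic random data standing in for the 28 stylistic features, and the sample sizes and random seeds are arbitrary assumptions. Saving the fitted objects is omitted.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the 28-column stylistic feature matrix.
# The "AI" class is shifted slightly so the toy problem is learnable.
X_human = rng.normal(loc=0.0, scale=1.0, size=(500, 28))
X_ai = rng.normal(loc=0.5, scale=1.0, size=(500, 28))
X = np.vstack([X_human, X_ai])
y = np.array([0] * 500 + [1] * 500)  # 0 = human-written, 1 = AI-generated

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler on training data only, then reuse it for the test data.
scaler = StandardScaler().fit(X_train)
model = SGDClassifier(loss="modified_huber", class_weight="balanced", random_state=0)
model.fit(scaler.transform(X_train), y_train)

accuracy = model.score(scaler.transform(X_test), y_test)
# The modified Huber loss supports predict_proba, giving one value per class.
probabilities = model.predict_proba(scaler.transform(X_test[:1]))
print(f"toy accuracy: {accuracy:.2f}, class probabilities: {probabilities[0]}")
```

The accuracy printed here describes only the synthetic data and says nothing about the real model's performance.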
- User submits text via the frontend
- Frontend sends POST request to /predict
- Backend preprocesses: URL/email removal, whitespace normalization
- Extract 28 stylistic features from the text
- Scale features using saved StandardScaler
- Model predicts: 0 (human-written) or 1 (ai-generated)
- Returns label + confidence score
- (Optional) Saves to PostgreSQL database
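The preprocessing and labeling steps in this flow could look roughly like the sketch below. The regular expressions are illustrative guesses, and defining confidence as the larger of the two class probabilities is an assumption; model_service.py may do either differently. The 0.60 threshold is the value quoted in this README.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
UNCERTAINTY_THRESHOLD = 0.60  # value quoted in this README


def preprocess(text):
    """Remove URLs and email addresses, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def to_label(prob_ai, threshold=UNCERTAINTY_THRESHOLD):
    """Map the probability of the AI class to a (label, confidence) pair."""
    # Assumption: confidence is the probability of the more likely class.
    confidence = max(prob_ai, 1.0 - prob_ai)
    if confidence < threshold:
        return "uncertain", round(confidence, 4)
    label = "ai-generated" if prob_ai >= 0.5 else "human-written"
    return label, round(confidence, 4)
```

For example, `to_label(0.55)` falls below the threshold and yields the "uncertain" label, while `to_label(0.95)` yields "ai-generated".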
Traditional n-gram approaches (TF-IDF) tend to overfit to training topics and generalize poorly to new domains. This model instead uses domain-independent stylistic markers that capture differences between AI and human writing:
- Humans: More contractions, first-person pronouns, casual language, varied sentence structure
- AI: More formal vocabulary, fewer contractions, more consistent sentence patterns, hedging language
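A few of these markers can be measured with plain regular expressions. The sketch below is only an illustration: the word lists are short stand-ins (only "furthermore" and "particularly" come from this README), and the project's actual 28 features are computed in model_service.py, which is not shown here. The two sample texts are shortened from the demo samples in this README.

```python
import re

CONTRACTION_RE = re.compile(r"\b\w+'(?:t|s|re|ve|ll|d|m)\b", re.IGNORECASE)
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
# Illustrative list only; the project's actual formal-word list is not shown.
FORMAL_WORDS = {"furthermore", "particularly", "moreover", "consequently"}


def marker_rates(text):
    """Return per-word rates for three stylistic markers."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    return {
        "contractions": len(CONTRACTION_RE.findall(text)) / total,
        "first_person": sum(w in FIRST_PERSON for w in words) / total,
        "formal_words": sum(w in FORMAL_WORDS for w in words) / total,
    }


casual = ("Honestly today was a total disaster. I woke up late, spilled coffee "
          "all over my shirt, and then realized I had the wrong meeting time.")
formal = ("Artificial intelligence represents a transformative force in "
          "contemporary society, fundamentally reshaping how organizations "
          "approach data analysis and decision-making processes.")

for name, sample in [("casual", casual), ("formal", formal)]:
    print(name, marker_rates(sample))
```

On these two samples the casual text has a nonzero first-person rate and the formal text has none, which is the kind of contrast the classifier relies on.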
- Detect Tab: Paste text and get instant AI vs Human classification
- Demo Samples Tab: Browse AI-generated and human-written samples, click to analyze
- History Tab: View all past predictions (requires PostgreSQL)
- Real-time Feedback: Visual indicators (AI/HUMAN/UNCERTAIN) with confidence bars
- Responsive Design: Works on desktop, tablet, and mobile
- Modern Brand: AuthentiCheck UI with intuitive navigation
- Environment variables stored in .env (git-ignored)
- CORS enabled for development
- Graceful fallback when database is unavailable
- No sensitive data in version control
- Type-safe TypeScript frontend
- Training Data: 100K samples (50K human-written, 50K AI-generated), subsampled from 487K total
- Train/Test Split: 80/20 (80K train, 20K test)
- Accuracy: 90.11% overall
- Precision (Human): 0.89
- Recall (Human): 0.91
- Precision (AI): 0.91
- Recall (AI): 0.89
- Generalization: 7/8 out-of-domain texts correctly classified
- Uncertainty Threshold: 0.60 (predictions below this marked as "uncertain")
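As a consistency check, the reported recall figures can be turned back into accuracy and precision. The sketch assumes the 20K test split is balanced at 10K texts per class, which a stratified split of the 50K/50K subsample would give but which is not stated explicitly above. It also assumes the metrics were computed on raw predictions, before the "uncertain" label is applied.

```python
# Assumption: balanced test split, 10K human-written and 10K AI-generated texts.
n_human = n_ai = 10_000
recall_human, recall_ai = 0.91, 0.89  # values reported above

human_correct = recall_human * n_human   # human texts labelled human
ai_correct = recall_ai * n_ai            # AI texts labelled AI
human_as_ai = n_human - human_correct    # human texts mislabelled as AI
ai_as_human = n_ai - ai_correct          # AI texts mislabelled as human

accuracy = (human_correct + ai_correct) / (n_human + n_ai)
precision_human = human_correct / (human_correct + ai_as_human)
precision_ai = ai_correct / (ai_correct + human_as_ai)

print(f"accuracy={accuracy:.3f}")
print(f"precision_human={precision_human:.3f}, precision_ai={precision_ai:.3f}")
```

Under that assumption the derived values round to 0.90 accuracy, 0.89 human precision, and 0.91 AI precision, in line with the reported figures.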
Try these samples in the detector:
Human-Written Example:
I still remember the day my grandmother handed me her recipe notebook. It was worn, pages yellowed and stained with decades of cooking. Inside were recipes written in her handwriting with little notes in the margins.
AI-Generated Example:
Artificial intelligence represents a transformative force in contemporary society, fundamentally reshaping how organizations approach data analysis and decision-making processes. The integration of machine learning algorithms into business operations has engendered unprecedented efficiencies.
Casual Human Example:
Honestly today was a total disaster. I woke up late, spilled coffee all over my shirt, and then realized I had the wrong meeting time. My boss gave me that look you know the one?
- Hot reload enabled with the --reload flag
- Swagger docs available at: http://localhost:8000/docs
- ReDoc available at: http://localhost:8000/redoc
- Hot reload enabled with npm run dev
- TypeScript strict mode enabled
- Tailwind CSS with automatic optimization
See backend/requirements.txt for complete Python dependencies:
- fastapi
- uvicorn
- scikit-learn
- pandas
- psycopg2-binary
- python-dotenv
Port 8000 already in use:
lsof -i :8000
kill -9 <PID>
Database connection failed:
- Ensure PostgreSQL is running
- Check DATABASE_URL in .env
- The app works without DB (uses JSON fallback)
Frontend can't reach backend:
- Ensure backend is running on port 8000
- Check the NEXT_PUBLIC_API_URL env var in frontend
- Clear browser cache
Model accuracy seems low:
- Check that model.pkl, scaler.pkl, and config.pkl exist in backend/model/
- Re-run python model/train.py to retrain the model
- Note: 90% accuracy is expected for stylistic-only features
This project is open source and available under the MIT License.
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
- Deep learning model (BERT/GPT-based perplexity detection)
- Multi-language support (non-English datasets)
- User authentication and profiles
- Advanced analytics dashboard with trends
- Model versioning and A/B testing
- API rate limiting and authentication
- Caching layer (Redis)
- Docker containerization
- Browser extension for real-time detection
- Integration with content management systems
Built with ❤️ for detecting AI-generated text