The Plagiarism Detector Using String Matching Algorithms is a full-stack Data Structures & Algorithms project designed to identify exact, near-duplicate, and partially modified content between two documents. The system combines classic string matching algorithms with text similarity techniques to generate plagiarism scores, matched phrases, risk levels, and detailed reports.
The project pairs a FastAPI backend, which handles text processing and plagiarism analysis, with a React + Vite frontend whose dark-themed dashboard supports document upload, manual text input, similarity visualization, highlighted matches, and report generation.
This project demonstrates how DSA concepts can be applied to a real-world problem faced by academic institutions, publishing platforms, EdTech systems, and content verification workflows.
Plagiarism is a major challenge in educational institutions, research organizations, publishing industries, and content platforms. Manually checking documents for copied content becomes inefficient as document size and volume increase.
The objective of this project is to automate plagiarism detection by comparing an original document with a submitted document and generating similarity scores, plagiarism percentages, matched content, and algorithm-wise analysis reports.
- Detect copied content between documents
- Identify exact phrase matches
- Detect near-duplicate content
- Calculate plagiarism percentage
- Generate similarity reports
- Highlight matched content
- Demonstrate practical DSA applications
- Build a GitHub-ready portfolio project
- University Assignment Checking
- Academic Integrity Systems
- Research Paper Verification
- Journal Submission Screening
- Online Learning Platforms
- Content Publishing Platforms
- Corporate Document Verification
- Blog Originality Checking
- Recruitment Assessment Platforms
- Educational Technology Solutions
This project demonstrates the practical implementation of:
- String Matching Algorithms
- Pattern Searching
- Hashing
- Rolling Hash
- Sliding Window
- Arrays
- Hash Maps
- Hash Sets
- Text Tokenization
- Shingling
- Fingerprinting
- Similarity Scoring
- Document Comparison
- Report Generation
Naive string matching compares a pattern against every possible position in the text.
Time Complexity: O(n × m)
Use in Project: Used as a baseline approach to identify exact matches.
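The baseline described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code; the function name `naive_search` is hypothetical.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    matches = []
    # Try every alignment of the pattern against the text.
    for i in range(n - m + 1):
        j = 0
        # Compare character by character until a mismatch or a full match.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(i)
    return matches
```

Each of the n - m + 1 alignments may need up to m comparisons, which is where the O(n × m) bound comes from.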
The Knuth-Morris-Pratt (KMP) algorithm improves matching efficiency by using the Longest Prefix Suffix (LPS) array to skip comparisons that are already known to match.
Time Complexity: O(n + m)
Use in Project: Efficient detection of exact phrase matches.
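A minimal sketch of KMP follows, assuming plain Python strings as input. The names `build_lps` and `kmp_search` are illustrative and may not match the project's backend.

```python
def build_lps(pattern: str) -> list[int]:
    """lps[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    lps = [0] * len(pattern)
    length = 0  # length of the current matched prefix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length:
            # Fall back to the next shorter candidate prefix.
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1
    return lps


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return every start index at which pattern occurs in text."""
    if not pattern:
        return []
    lps = build_lps(pattern)
    matches = []
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while j and ch != pattern[j]:
            j = lps[j - 1]  # reuse the matched prefix instead of restarting
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = lps[j - 1]  # keep going to find overlapping matches
    return matches
```

The text index never moves backwards, so the search itself is linear in the text length once the LPS array is built.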
Rabin-Karp uses rolling hash techniques to compare text windows efficiently.
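Time Complexity: O(n + m) on average; O(n × m) in the worst case, when frequent hash collisions force character-by-character verification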
Use in Project: Detection of copied content through hash-based matching.
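The rolling-hash idea can be sketched as below. The `base` and `mod` values and the function name `rabin_karp` are assumptions chosen for illustration, not values taken from the project.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, mod: int = 1_000_000_007) -> list[int]:
    """Return every start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the leading character in a window
    p_hash = w_hash = 0
    for k in range(m):
        p_hash = (p_hash * base + ord(pattern[k])) % mod
        w_hash = (w_hash * base + ord(text[k])) % mod
    matches = []
    for i in range(n - m + 1):
        # Equal hashes can be a collision, so confirm with a direct comparison.
        if w_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            w_hash = ((w_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches
```

Updating the window hash takes constant time per shift, so most positions are rejected without comparing any characters.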
Shingling divides documents into overlapping groups of words.
Example:
Machine Learning improves automation systems
3-word shingles:
Machine Learning improves
Learning improves automation
improves automation systems
Use in Project: Near-duplicate content detection.
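A short sketch of word-level shingling, using the example sentence above. The function name `word_shingles` is hypothetical, and lowercasing the words before shingling is an assumption; the project may normalize text differently.

```python
def word_shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-word shingles in text."""
    words = text.lower().split()  # assumption: case-insensitive comparison
    # A document with fewer than k words yields an empty set.
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


shingles = word_shingles("Machine Learning improves automation systems")
# Contains exactly: "machine learning improves",
# "learning improves automation", "improves automation systems"
```

Because the shingles overlap, changing a single word only disturbs the k shingles that contain it, which is what makes the technique useful for near-duplicate detection.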
Winnowing generates document fingerprints by hashing the shingles, sliding a fixed-size window over the hash sequence, and keeping the minimum hash from each window.
Use in Project: Efficient similarity detection and fingerprint comparison.
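The selection step can be sketched as follows. The window size, the use of SHA-256 for a stable hash, and the names `stable_hash` and `winnow` are all assumptions for illustration. Ties are broken by taking the rightmost minimum, which is a common convention for winnowing.

```python
import hashlib


def stable_hash(shingle: str) -> int:
    """Hash a shingle to a 64-bit integer that is the same on every run
    (Python's built-in hash() is randomized for strings)."""
    digest = hashlib.sha256(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def winnow(hashes: list[int], window: int = 4) -> set[tuple[int, int]]:
    """Return (hash, position) fingerprints: the minimum hash of each window.
    hashes must be in document order."""
    if not hashes:
        return set()
    window = min(window, len(hashes))
    fingerprints = set()
    for start in range(len(hashes) - window + 1):
        win = hashes[start:start + window]
        smallest = min(win)
        # Rightmost occurrence of the minimum within this window.
        offset = window - 1 - win[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints
```

Adjacent windows usually share the same minimum, so the fingerprint set is much smaller than the full hash list. Two documents can then be compared by intersecting the hash values of their fingerprints.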
Formula:
Jaccard Similarity = |A ∩ B| / |A ∪ B| (the number of shared elements divided by the number of distinct elements across both sets)
Use in Project: Compares shingle sets between documents.
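A minimal sketch of the formula applied to two sets. The function name `jaccard_similarity` is illustrative, and returning 0.0 for two empty sets is an assumed convention.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 0.0  # assumption: two empty documents count as dissimilar
    return len(a & b) / len(a | b)


# Two of the four distinct items are shared.
score = jaccard_similarity({"a b c", "b c d", "c d e"},
                           {"b c d", "c d e", "d e f"})
# score == 0.5
```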
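TF-IDF weights each word by how often it appears in a document and how rare it is across documents; cosine similarity then compares the resulting document vectors by the angle between them.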
Use in Project: Measures overall document similarity even when wording changes slightly.
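The sketch below computes TF-IDF vectors and their cosine similarity using only the standard library. It assumes whitespace tokenization and a smoothed IDF of ln((1 + N) / (1 + df)) + 1; without smoothing, every word shared by both documents in a two-document comparison would get zero weight. The function names are hypothetical and the project's weighting scheme may differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Because the vectors ignore word order, reordering sentences or swapping a few words leaves the score high even when exact phrase matches disappear.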
- Upload Original Document
- Upload Submitted Document
- Manual Text Input
- Dynamic User Inputs
- Exact Match Detection
- Near-Duplicate Detection
- Plagiarism Percentage Calculation
- Similarity Gauge
- Risk Level Identification
- Highlighted Matched Content
- Algorithm Score Breakdown
- JSON Report Generation
- CSV Report Generation
- PDF Report Generation
- Premium Dark Theme Dashboard
- Responsive User Interface
- Lightweight Architecture
- Beginner-Friendly Project Structure
- React
- Vite
- JavaScript
- CSS
- Python
- FastAPI
- Uvicorn
- Naive Matching
- KMP
- Rabin-Karp
- Rolling Hash
- Shingling
- Winnowing
- Jaccard Similarity
- TF-IDF Cosine Similarity
- VS Code
- Git
- GitHub
- PowerShell
Input Documents
↓
Text Preprocessing
↓
Tokenization
↓
Naive Matching
↓
KMP Matching
↓
Rabin-Karp Matching
↓
Shingling
↓
Winnowing
↓
Jaccard Similarity
↓
TF-IDF Similarity
↓
Similarity Calculation
↓
Matched Content Extraction
↓
Plagiarism Percentage
↓
Report Generation
↓
Dashboard Visualization
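The first two stages of the flow above, text preprocessing and tokenization, might look like the sketch below. The normalization rules shown here (lowercasing and replacing anything other than ASCII letters, digits, and whitespace) are assumptions, and `preprocess` is a hypothetical name.

```python
import re


def preprocess(text: str) -> list[str]:
    """Lowercase the text, drop punctuation, and split into word tokens."""
    text = text.lower()
    # Assumption: only ASCII letters and digits are kept as token characters.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()
```

Normalizing both documents the same way before matching keeps differences in case or punctuation from hiding otherwise identical phrases.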
Plagiarism-Detector-String-Matching/
│
├── backend/
│ ├── main.py
│ ├── requirements.txt
│ ├── outputs/
│ └── reports/
│
├── frontend/
│ ├── src/
│ │ ├── App.jsx
│ │ ├── main.jsx
│ │ └── style.css
│ ├── index.html
│ └── package.json
│
├── images/
│ ├── upload-page.png
│ ├── documents-loaded.png
│ ├── plagiarism-analysis.png
│ ├── highlighted-matches.png
│ └── report-generated.png
│
├── README.md
├── .gitignore
└── docs/
cd backend
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --reload
Backend URL:
http://127.0.0.1:8000
cd frontend
npm install
npm run dev
Frontend URL:
http://127.0.0.1:5173
Artificial Intelligence is transforming modern industries.
Machine Learning is a subset of Artificial Intelligence.
Data structures and algorithms are important for software development.
Plagiarism detection systems help maintain academic integrity.
Artificial Intelligence is transforming modern industries.
Machine Learning is a branch of Artificial Intelligence.
Data structures and algorithms are important for software development.
Plagiarism detection systems help maintain academic integrity.
Overall Similarity: 68%
Risk Level: MEDIUM
Matched Phrases: 3
Algorithms Used:
- KMP
- Rabin-Karp
- Shingling
- Winnowing
- Jaccard Similarity
- TF-IDF Cosine Similarity
backend/outputs/similarity_result.json
backend/outputs/analysis_summary.csv
backend/outputs/matched_phrases.txt
backend/reports/plagiarism_report.pdf
These files provide a persistent record of the plagiarism analysis outside the dashboard.
Through this project I learned:
- Practical implementation of string matching algorithms
- Hashing and rolling hash techniques
- Document fingerprinting using Winnowing
- Similarity scoring using Jaccard and TF-IDF
- FastAPI backend development
- React frontend integration
- Building GitHub-ready DSA projects
- Applying DSA concepts to solve real-world problems
- DOCX Support
- Improved PDF Parsing
- MinHash Implementation
- Locality Sensitive Hashing (LSH)
- Semantic Similarity Models
- Multi-document Comparison
- Database Integration
- User Authentication
- Historical Analysis Storage
- Cloud Deployment
The Plagiarism Detector Using String Matching Algorithms demonstrates how Data Structures and Algorithms can be applied to solve a practical and widely used real-world problem. By combining exact string matching, rolling hash techniques, fingerprinting methods, and similarity scoring algorithms, the system provides an efficient plagiarism detection pipeline while showcasing important DSA concepts for academic and professional applications.




