VaidehiDeore/Plagiarism-Detector-String-Matching

Plagiarism Detector Using String Matching Algorithms

Project Overview

The Plagiarism Detector Using String Matching Algorithms is a full-stack Data Structures & Algorithms project designed to identify exact, near-duplicate, and partially modified content between two documents. The system combines classic string matching algorithms with text similarity techniques to generate plagiarism scores, matched phrases, risk levels, and detailed reports.

The project uses a FastAPI backend for text processing and plagiarism analysis and a React + Vite frontend with a premium dark-themed dashboard for document upload, manual text input, similarity visualization, highlighted matches, and report generation.

This project demonstrates how DSA concepts can be applied to solve a real-world problem used in academic institutions, publishing platforms, EdTech systems, and content verification workflows.


Problem Statement

Plagiarism is a major challenge in educational institutions, research organizations, publishing industries, and content platforms. Manually checking documents for copied content becomes inefficient as document size and volume increase.

The objective of this project is to automate plagiarism detection by comparing an original document with a submitted document and generating similarity scores, plagiarism percentages, matched content, and algorithm-wise analysis reports.


Objectives

  • Detect copied content between documents
  • Identify exact phrase matches
  • Detect near-duplicate content
  • Calculate plagiarism percentage
  • Generate similarity reports
  • Highlight matched content
  • Demonstrate practical DSA applications
  • Build a GitHub-ready portfolio project

Real-World Applications

  • University Assignment Checking
  • Academic Integrity Systems
  • Research Paper Verification
  • Journal Submission Screening
  • Online Learning Platforms
  • Content Publishing Platforms
  • Corporate Document Verification
  • Blog Originality Checking
  • Recruitment Assessment Platforms
  • Educational Technology Solutions

DSA Concepts Used

This project demonstrates the practical implementation of:

  • String Matching Algorithms
  • Pattern Searching
  • Hashing
  • Rolling Hash
  • Sliding Window
  • Arrays
  • Hash Maps
  • Hash Sets
  • Text Tokenization
  • Shingling
  • Fingerprinting
  • Similarity Scoring
  • Document Comparison
  • Report Generation

Algorithms Used

1. Naive String Matching

Naive string matching compares a pattern against every possible position in the text.

Time Complexity: O(n × m) in the worst case, where n is the text length and m is the pattern length

Use in Project: Used as a baseline approach to identify exact matches.
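A minimal sketch of the baseline approach, not taken from the project's main.py. The function name naive_search is hypothetical and is used here only for illustration.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    matches = []
    # Try every alignment of the pattern against the text.
    for i in range(n - m + 1):
        j = 0
        # Compare character by character until a mismatch or a full match.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(i)
    return matches
```

Each of the n - m + 1 alignments can cost up to m comparisons, which is where the O(n × m) bound comes from.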


2. KMP Algorithm (Knuth-Morris-Pratt)

KMP improves matching efficiency using the Longest Prefix Suffix (LPS) array to avoid repeated comparisons.

Time Complexity: O(n + m)

Use in Project: Efficient detection of exact phrase matches.
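The following is an illustrative sketch of KMP with an LPS array, written for this README rather than copied from the backend. The names build_lps and kmp_search are assumptions.

```python
def build_lps(pattern: str) -> list[int]:
    """lps[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    lps = [0] * len(pattern)
    length = 0  # length of the current matched prefix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length > 0:
            # Fall back to the next shorter candidate prefix.
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1
    return lps


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return every start index of pattern in text without re-scanning text."""
    if not pattern:
        return []
    lps = build_lps(pattern)
    matches = []
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = lps[j - 1]  # keep going to find overlapping matches
    return matches
```

The text index never moves backwards; on a mismatch only the pattern index falls back through the LPS array, which gives the O(n + m) running time.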


3. Rabin-Karp Algorithm

Rabin-Karp uses rolling hash techniques to compare text windows efficiently.

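Time Complexity: O(n + m) on average; O(n × m) in the worst case, when frequent hash collisions force character-by-character verification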

Use in Project: Detection of copied content through hash-based matching.
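A rough sketch of Rabin-Karp with a polynomial rolling hash. The function name rabin_karp and the base and mod defaults are assumptions chosen for illustration; the project may use different values.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, mod: int = 1_000_000_007) -> list[int]:
    """Return every start index of pattern in text using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # Equal hashes can still be a collision, so verify the characters.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches
```

Updating the hash takes constant time per shift, so the full character comparison runs only when the hashes agree.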


4. Shingling

Shingling divides a document into overlapping sequences of k consecutive words, called shingles.

Example:

Machine Learning improves automation systems

3-word shingles:

Machine Learning improves
Learning improves automation
improves automation systems

Use in Project: Near-duplicate content detection.
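The example above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the name word_shingles is hypothetical, and lowercasing the input is an assumed normalization step, so the output is in lowercase.

```python
def word_shingles(text: str, k: int = 3) -> list[str]:
    """Return the overlapping k-word shingles of text, in order."""
    words = text.lower().split()  # assumed normalization: lowercase only
    # A text with fewer than k words produces no shingles.
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


shingles = word_shingles("Machine Learning improves automation systems")
# ['machine learning improves',
#  'learning improves automation',
#  'improves automation systems']
```

Wrapping the result in set() gives the shingle set that a set-based measure such as Jaccard similarity operates on.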


5. Winnowing

Winnowing generates a compact document fingerprint: it hashes each shingle, slides a fixed-size window over the resulting hash sequence, and keeps the minimum hash from each window.

Use in Project: Efficient similarity detection and fingerprint comparison.
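A minimal sketch of winnowing, assuming character k-grams, zlib.crc32 as the hash function, and the parameters k=5 and window=4. None of these choices is confirmed by the project; the name winnow is also hypothetical.

```python
import zlib


def winnow(text: str, k: int = 5, window: int = 4) -> set[tuple[int, int]]:
    """Return a set of (hash, position) fingerprints for text."""
    # Assumed normalization: keep only letters and digits, lowercased.
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    hashes = [zlib.crc32(cleaned[i:i + k].encode("utf-8"))
              for i in range(len(cleaned) - k + 1)]
    fingerprints = set()
    # Inputs that yield fewer than `window` hashes produce no fingerprints.
    for start in range(max(len(hashes) - window + 1, 0)):
        win = hashes[start:start + window]
        min_hash = min(win)
        # On ties, keep the rightmost minimum in the window.
        offset = window - 1 - win[::-1].index(min_hash)
        fingerprints.add((min_hash, start + offset))
    return fingerprints


doc_a = winnow("the quick brown fox jumps")
doc_b = winnow("xyz the quick brown fox zzz")
shared = {h for h, _ in doc_a} & {h for h, _ in doc_b}
```

Positions differ between documents, so the comparison uses the hash values only. Any shared run of at least window + k - 1 normalized characters is guaranteed to contribute at least one common fingerprint.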


6. Jaccard Similarity

Formula:

Jaccard Similarity = |A ∩ B| / |A ∪ B|, where A and B are the shingle sets of the two documents

Use in Project: Compares shingle sets between documents.
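As a sketch, the formula maps directly onto Python set operations. The name jaccard_similarity is hypothetical, and returning 0.0 when both sets are empty is an assumed convention (some implementations return 1.0 instead).

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


original = {"machine learning improves", "learning improves automation"}
submitted = {"machine learning improves", "learning boosts automation"}
score = jaccard_similarity(original, submitted)  # 1 shared of 3 total
```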


7. TF-IDF Cosine Similarity

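TF-IDF weights each word by how often it appears in a document and how rare it is across documents; cosine similarity then measures the angle between the two documents' weight vectors.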

Use in Project: Measures overall document similarity even when wording changes slightly.
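A standard-library sketch of TF-IDF cosine similarity for two documents. The backend may rely on a library instead; the function name tfidf_cosine, the whitespace tokenization, and the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 are assumptions. The smoothing matters with only two documents, because an unsmoothed IDF would give every shared word a weight of zero.

```python
import math
from collections import Counter


def tfidf_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between the TF-IDF vectors of two documents."""
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    vocab = set(tf_a) | set(tf_b)
    n_docs = 2

    def idf(term: str) -> float:
        # Document frequency: in how many of the two documents the term appears.
        df = (term in tf_a) + (term in tf_b)
        return math.log((1 + n_docs) / (1 + df)) + 1.0  # assumed smoothing

    # Counter returns 0 for terms missing from a document.
    vec_a = {t: tf_a[t] * idf(t) for t in vocab}
    vec_b = {t: tf_b[t] * idf(t) for t in vocab}
    dot = sum(vec_a[t] * vec_b[t] for t in vocab)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical documents score 1.0 and documents with no words in common score 0.0. A single substituted word lowers the score only slightly, which is why this measure tolerates small wording changes.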


Features

  • Upload Original Document
  • Upload Submitted Document
  • Manual Text Input
  • Dynamic User Inputs
  • Exact Match Detection
  • Near-Duplicate Detection
  • Plagiarism Percentage Calculation
  • Similarity Gauge
  • Risk Level Identification
  • Highlighted Matched Content
  • Algorithm Score Breakdown
  • JSON Report Generation
  • CSV Report Generation
  • PDF Report Generation
  • Premium Dark Theme Dashboard
  • Responsive User Interface
  • Lightweight Architecture
  • Beginner-Friendly Project Structure

Tech Stack

Frontend

  • React
  • Vite
  • JavaScript
  • CSS

Backend

  • Python
  • FastAPI
  • Uvicorn

Algorithms

  • Naive Matching
  • KMP
  • Rabin-Karp
  • Rolling Hash
  • Shingling
  • Winnowing
  • Jaccard Similarity
  • TF-IDF Cosine Similarity

Development Tools

  • VS Code
  • Git
  • GitHub
  • PowerShell

System Workflow

Input Documents
      ↓
Text Preprocessing
      ↓
Tokenization
      ↓
Naive Matching
      ↓
KMP Matching
      ↓
Rabin-Karp Matching
      ↓
Shingling
      ↓
Winnowing
      ↓
Jaccard Similarity
      ↓
TF-IDF Similarity
      ↓
Similarity Calculation
      ↓
Matched Content Extraction
      ↓
Plagiarism Percentage
      ↓
Report Generation
      ↓
Dashboard Visualization
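The first two stages of the workflow above, text preprocessing and tokenization, can be sketched as follows. The exact normalization rules used by the backend are not documented here, so the function names and the rules themselves (lowercasing, replacing non-alphanumeric ASCII characters with spaces, collapsing whitespace) are assumptions.

```python
import re


def preprocess(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    # Assumed rule: anything that is not a-z, 0-9, or whitespace becomes a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Split preprocessed text into word tokens."""
    return preprocess(text).split()


tokens = tokenize("Machine Learning is a subset of Artificial Intelligence.")
# ['machine', 'learning', 'is', 'a', 'subset', 'of',
#  'artificial', 'intelligence']
```

Normalizing both documents the same way before matching keeps differences in case, punctuation, and spacing from hiding otherwise identical phrases.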

Folder Structure

Plagiarism-Detector-String-Matching/
│
├── backend/
│   ├── main.py
│   ├── requirements.txt
│   ├── outputs/
│   └── reports/
│
├── frontend/
│   ├── src/
│   │   ├── App.jsx
│   │   ├── main.jsx
│   │   └── style.css
│   ├── index.html
│   └── package.json
│
├── images/
│   ├── upload-page.png
│   ├── documents-loaded.png
│   ├── plagiarism-analysis.png
│   ├── highlighted-matches.png
│   └── report-generated.png
│
├── README.md
├── .gitignore
└── docs/

Screenshots

  • Upload Page (images/upload-page.png)
  • Documents Loaded (images/documents-loaded.png)
  • Plagiarism Analysis (images/plagiarism-analysis.png)
  • Highlighted Matches (images/highlighted-matches.png)
  • Report Generated (images/report-generated.png)


How To Run

Backend Setup

cd backend

python -m venv venv

venv\Scripts\activate

On macOS or Linux, activate the environment with source venv/bin/activate instead.

pip install -r requirements.txt

uvicorn main:app --reload

Backend URL:

http://127.0.0.1:8000

Frontend Setup

cd frontend

npm install

npm run dev

Frontend URL:

http://127.0.0.1:5173 (Vite prints the exact local URL in the terminal when the dev server starts)

Sample Input

Original Document

Artificial Intelligence is transforming modern industries.

Machine Learning is a subset of Artificial Intelligence.

Data structures and algorithms are important for software development.

Plagiarism detection systems help maintain academic integrity.

Submitted Document

Artificial Intelligence is transforming modern industries.

Machine Learning is a branch of Artificial Intelligence.

Data structures and algorithms are important for software development.

Plagiarism detection systems help maintain academic integrity.

Sample Output

Overall Similarity: 68%

Risk Level: MEDIUM

Matched Phrases: 3

Algorithms Used:
- KMP
- Rabin-Karp
- Shingling
- Winnowing
- Jaccard Similarity
- TF-IDF Cosine Similarity

Outputs Generated

backend/outputs/similarity_result.json
backend/outputs/analysis_summary.csv
backend/outputs/matched_phrases.txt
backend/reports/plagiarism_report.pdf

These files persist the analysis results so they can be reviewed and shared outside the dashboard.


Learning Outcomes

Through this project I learned:

  • Practical implementation of string matching algorithms
  • Hashing and rolling hash techniques
  • Document fingerprinting using Winnowing
  • Similarity scoring using Jaccard and TF-IDF
  • FastAPI backend development
  • React frontend integration
  • Building GitHub-ready DSA projects
  • Applying DSA concepts to solve real-world problems

Future Enhancements

  • DOCX Support
  • Improved PDF Parsing
  • MinHash Implementation
  • Locality Sensitive Hashing (LSH)
  • Semantic Similarity Models
  • Multi-document Comparison
  • Database Integration
  • User Authentication
  • Historical Analysis Storage
  • Cloud Deployment

Conclusion

The Plagiarism Detector Using String Matching Algorithms demonstrates how Data Structures and Algorithms can be applied to solve a practical and widely used real-world problem. By combining exact string matching, rolling hash techniques, fingerprinting methods, and similarity scoring algorithms, the system provides an efficient plagiarism detection pipeline while showcasing important DSA concepts for academic and professional applications.


Author

Vaidehi Deore

About

A DSA-based plagiarism detection system using KMP, Rabin-Karp, Shingling, Winnowing, Jaccard Similarity, and TF-IDF with FastAPI backend and premium React dashboard.
