Advanced Source Code Plagiarism Detection Engine using Clang Tokenization, Rolling Hashing, Winnowing Fingerprinting, Suffix Automata and Levenshtein Similarity.
CopyCat Crusher is a high-performance source code plagiarism detection system designed to identify copied, modified, and structurally similar programs.
Unlike simple text comparison tools, CopyCat Crusher analyzes normalized source code structures and combines multiple plagiarism detection techniques to detect similarities even when variable names, literals, formatting, or comments have been changed.
The system is particularly useful for:
- Academic plagiarism detection
- Programming assignment evaluation
- Coding competition integrity checks
- Source code similarity analysis
- Software audit and review systems
Traditional plagiarism detectors often fail when students apply simple obfuscation techniques such as:
- Renaming variables
- Renaming functions
- Changing numeric constants
- Modifying formatting
- Rearranging comments
- Altering whitespace
Example:
int sum = 0;
for (int i = 0; i < 100; i++) {
    sum += i;
}

int result = 0;
for (int index = 0; index < 100; index++) {
    result += index;
}

Although these programs are nearly identical, a naïve text comparison system may treat them as different.
CopyCat Crusher normalizes source code before analysis and can identify such similarities effectively.
- Clang-based source code tokenization
- Language-aware token normalization
- Rolling Hash based sequence matching
- Winnowing Fingerprinting algorithm
- Suffix Automaton based longest common block detection
- Levenshtein Distance similarity scoring
- Weighted similarity fusion engine
- Automated batch analysis
- HTML report generation
- Multi-file comparison support
- C++17 implementation
- LLVM/libclang integration
Source Files
│
▼
Clang Tokenizer
│
▼
Token Normalizer
│
▼
─────────────────────────────
│ Rolling Hash │
│ Winnowing Fingerprinting │
│ Suffix Automaton │
│ Levenshtein Similarity │
─────────────────────────────
│
▼
Similarity Fusion Engine
│
▼
HTML Report Generator
│
▼
Final Plagiarism Report
The project uses LLVM's libclang API to generate language-aware tokens from source code.
Benefits:
- Ignores formatting differences
- Understands C++ syntax
- Produces reliable token streams
- Handles complex language constructs
Identifiers and literals are normalized before comparison.
Example:
Original:
int marks = 95;
Normalized:
TYPE VAR = NUM_LITERAL ;
This removes superficial differences and focuses on program structure.
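The normalization step can be sketched as a simple mapping from token categories to placeholders. This is a minimal illustration, not the project's normalizer: the `TokenKind` enum, the `Token` struct, the list of type keywords, and the `normalize` function are all hypothetical names chosen for the example. Only the placeholder spellings (`TYPE`, `VAR`, `NUM_LITERAL`) come from the example above.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical token categories; a real tokenizer's kinds may differ.
enum class TokenKind { Keyword, Identifier, NumericLiteral, Punctuation };

struct Token {
    TokenKind kind;
    std::string text;
};

// Assumed (incomplete) set of built-in type keywords mapped to TYPE.
inline bool isTypeKeyword(const std::string& text) {
    static const std::unordered_set<std::string> types = {
        "int", "long", "short", "char", "float", "double", "bool", "void"};
    return types.count(text) != 0;
}

// Replace type keywords, identifiers and numeric literals with placeholders;
// every other token keeps its original spelling.
std::vector<std::string> normalize(const std::vector<Token>& tokens) {
    std::vector<std::string> out;
    out.reserve(tokens.size());
    for (const Token& t : tokens) {
        if (t.kind == TokenKind::Keyword && isTypeKeyword(t.text)) {
            out.push_back("TYPE");
        } else if (t.kind == TokenKind::Identifier) {
            out.push_back("VAR");
        } else if (t.kind == TokenKind::NumericLiteral) {
            out.push_back("NUM_LITERAL");
        } else {
            out.push_back(t.text);
        }
    }
    return out;
}
```

Feeding the five tokens of `int marks = 95;` through `normalize` yields the sequence `TYPE VAR = NUM_LITERAL ;`, so two statements that differ only in names and constants normalize to the same stream.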
Rolling Hash is used to efficiently generate hash values for token windows.
Benefits:
- Fast similarity detection
- O(N) processing
- Supports large submissions
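A rolling hash over token windows can be sketched as follows. This is an illustrative polynomial hash, not necessarily the scheme the project uses: the function name `rollingHashes`, the base 257, the modulus 1,000,000,007, and the use of `std::hash` to turn each token into a number are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Returns one hash per window of k consecutive tokens.
// The result is empty when k is zero or there are fewer than k tokens.
std::vector<std::uint64_t> rollingHashes(const std::vector<std::string>& tokens,
                                         std::size_t k) {
    constexpr std::uint64_t MOD = 1'000'000'007ULL;  // assumed modulus
    constexpr std::uint64_t BASE = 257ULL;           // assumed base
    std::vector<std::uint64_t> hashes;
    const std::size_t n = tokens.size();
    if (k == 0 || n < k) return hashes;

    // Map each token to a number below MOD.
    std::vector<std::uint64_t> ids(n);
    for (std::size_t i = 0; i < n; ++i) {
        ids[i] = std::hash<std::string>{}(tokens[i]) % MOD;
    }

    // BASE^(k-1) mod MOD, used to remove the token leaving the window.
    std::uint64_t highPow = 1;
    for (std::size_t i = 1; i < k; ++i) highPow = highPow * BASE % MOD;

    // Hash of the first window, computed directly.
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < k; ++i) h = (h * BASE + ids[i]) % MOD;
    hashes.push_back(h);

    // Slide the window: drop the oldest token, append the newest, in O(1).
    for (std::size_t i = k; i < n; ++i) {
        h = (h + MOD - ids[i - k] * highPow % MOD) % MOD;
        h = (h * BASE + ids[i]) % MOD;
        hashes.push_back(h);
    }
    return hashes;
}
```

Because each slide costs constant time, hashing all windows of a file with N tokens takes O(N) time regardless of the window size k.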
The Winnowing algorithm selects representative fingerprints from rolling hashes.
Benefits:
- Noise reduction
- Efficient storage
- Robust plagiarism detection
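The selection step of winnowing can be sketched like this: slide a window of `w` hashes, keep the minimum of each window (the rightmost one on ties), and record it only when the selected position changes. The function name `winnow`, the `Fingerprint` alias, and the choice of a monotonic deque to keep the pass linear are assumptions for the example; the project's implementation may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// A fingerprint is a (hash, position) pair.
using Fingerprint = std::pair<std::uint64_t, std::size_t>;

// Select the minimum hash of every window of w consecutive hashes.
// Returns nothing if there are fewer than w hashes (no full window).
std::vector<Fingerprint> winnow(const std::vector<std::uint64_t>& hashes,
                                std::size_t w) {
    std::vector<Fingerprint> picked;
    if (w == 0) return picked;

    // Indices whose hashes are strictly increasing from front to back,
    // so the front is always the minimum of the current window.
    std::deque<std::size_t> candidates;
    for (std::size_t i = 0; i < hashes.size(); ++i) {
        // Removing equal hashes too makes ties resolve to the rightmost one.
        while (!candidates.empty() && hashes[candidates.back()] >= hashes[i]) {
            candidates.pop_back();
        }
        candidates.push_back(i);
        // Discard the index that has slid out of the window [i-w+1, i].
        if (candidates.front() + w <= i) candidates.pop_front();

        if (i + 1 >= w) {  // the window is full
            const std::size_t m = candidates.front();
            if (picked.empty() || picked.back().second != m) {
                picked.emplace_back(hashes[m], m);
            }
        }
    }
    return picked;
}
```

For the hash sequence 77 74 42 17 98 50 17 98 8 88 67 39 77 74 42 17 98 with `w = 4`, this selects the hashes 17, 17, 8, 39, 17 at positions 3, 6, 8, 11 and 15, so five fingerprints stand in for seventeen hashes.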
A suffix automaton is used to detect the longest common token block shared between two submissions.
Benefits:
- Detects copied code segments
- Linear-time processing
- Effective against partial plagiarism
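A minimal sketch of this technique: build a suffix automaton over one token sequence, then walk the other sequence through it while tracking the length of the current match. Tokens are represented here as integer IDs, and the class name `SuffixAutomaton` and method `longestCommonBlock` are hypothetical; transitions are stored in `std::map`, which adds a logarithmic factor over the array-based variant.

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

class SuffixAutomaton {
    struct State {
        int len = 0;    // length of the longest string in this state
        int link = -1;  // suffix link
        std::map<int, int> next;
    };
    std::vector<State> st;
    int last = 0;

    void extend(int c) {
        const int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && st[p].next.count(c) == 0) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            const int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                // Split q: the clone keeps q's transitions with a shorter len.
                State copy = st[q];
                copy.len = st[p].len + 1;
                const int clone = static_cast<int>(st.size());
                st.push_back(std::move(copy));
                while (p != -1) {
                    auto it = st[p].next.find(c);
                    if (it == st[p].next.end() || it->second != q) break;
                    it->second = clone;
                    p = st[p].link;
                }
                st[q].link = clone;
                st[cur].link = clone;
            }
        }
        last = cur;
    }

public:
    explicit SuffixAutomaton(const std::vector<int>& seq) {
        st.reserve(2 * seq.size() + 1);
        st.emplace_back();  // root state
        for (int c : seq) extend(c);
    }

    // Length of the longest contiguous run of `other` that also occurs
    // in the sequence the automaton was built from.
    int longestCommonBlock(const std::vector<int>& other) const {
        int v = 0, len = 0, best = 0;
        for (int c : other) {
            // Shorten the current match until it can be extended by c.
            while (v != 0 && st[v].next.count(c) == 0) {
                v = st[v].link;
                len = st[v].len;
            }
            auto it = st[v].next.find(c);
            if (it != st[v].next.end()) {
                v = it->second;
                ++len;
            } else {
                len = 0;  // c does not occur in the indexed sequence at all
            }
            best = std::max(best, len);
        }
        return best;
    }
};
```

Building the automaton from one submission and querying it with the other visits each token a bounded number of times, which is where the linear-time behaviour comes from.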
Levenshtein distance measures the edit distance between normalized token sequences.
Benefits:
- Detects near matches
- Handles minor modifications
- Captures structural similarity
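Token-level edit distance can be sketched with the standard two-row dynamic programming table. The function names below are hypothetical, and converting the distance to a similarity as `1 - distance / max(length)` is one common convention assumed here, not something the project documents.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Minimum number of token insertions, deletions and substitutions
// needed to turn sequence a into sequence b.
std::size_t levenshtein(const std::vector<std::string>& a,
                        const std::vector<std::string>& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            cur[j] = std::min({prev[j] + 1,          // delete a[i-1]
                               cur[j - 1] + 1,       // insert b[j-1]
                               prev[j - 1] + cost}); // match or substitute
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Assumed normalization: 1.0 for identical sequences, 0.0 when every
// position must be edited.
double levenshteinSimilarity(const std::vector<std::string>& a,
                             const std::vector<std::string>& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 1.0;
    return 1.0 - static_cast<double>(levenshtein(a, b)) /
                     static_cast<double>(longest);
}
```

Only two rows of the table are kept at a time, so memory stays proportional to the shorter dimension chosen for the inner loop while time remains O(N × M).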
The final plagiarism score is computed as a weighted combination of the individual algorithm scores, where w1 to w4 denote the weight given to each technique:
Final Score =
w1 × Exact Match Score
+ w2 × Fingerprint Score
+ w3 × Longest Match Score
+ w4 × Levenshtein Score
The combined score provides a more reliable plagiarism estimate than any individual technique.
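A weighted fusion of the four component scores could look like the sketch below. The struct and function names are hypothetical, and the equal default weights of 0.25 are placeholders chosen for illustration; the weights the project actually uses are not stated here. Each component score is assumed to lie in the range 0 to 1, and the result is reported as a percentage.

```cpp
// Component scores, each assumed to be in [0, 1].
struct ComponentScores {
    double exactMatch;
    double fingerprint;
    double longestMatch;
    double levenshtein;
};

// Hypothetical weights; equal by default purely for illustration.
struct Weights {
    double exactMatch = 0.25;
    double fingerprint = 0.25;
    double longestMatch = 0.25;
    double levenshtein = 0.25;
};

// Weighted average of the component scores, scaled to 0..100.
// Dividing by the weight total keeps the result in range even if the
// weights do not sum to exactly 1.
double fuseScores(const ComponentScores& s, const Weights& w) {
    const double total =
        w.exactMatch + w.fingerprint + w.longestMatch + w.levenshtein;
    if (total <= 0.0) return 0.0;
    const double weighted = w.exactMatch * s.exactMatch +
                            w.fingerprint * s.fingerprint +
                            w.longestMatch * s.longestMatch +
                            w.levenshtein * s.levenshtein;
    return 100.0 * weighted / total;
}
```

With this shape, raising one weight relative to the others makes the final score track that technique more closely without changing the 0 to 100 scale.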
CopyCatCrusher/
│
├── include/
│ ├── algorithms/
│ ├── core/
│ ├── models/
│ ├── normalizer/
│ ├── report/
│ ├── similarity/
│ └── tokenizer/
│
├── src/
│ ├── algorithms/
│ ├── core/
│ ├── normalizer/
│ ├── report/
│ ├── similarity/
│ ├── tokenizer/
│ └── main.cpp
│
├── submissions/
├── reports/
│
├── CMakeLists.txt
├── LICENSE
└── README.md
- C++17 Compatible Compiler
- CMake 3.20+
- LLVM
- libclang
git clone https://github.com/7vik2005/CopyCat-Crushers.git
cd CopyCat-Crushers

Build with the MinGW Makefiles generator:
cmake -G "MinGW Makefiles" -S . -B build
cmake --build build

Or build with CMake's default generator and make:
mkdir build
cd build
cmake ..
make

Place all source files inside:
submissions/
Example:
submissions/
├── student1.cpp
├── student2.cpp
└── student3.cpp
Run:
./build/CopyCatCrusher.exe submissions reports

View Results
Console Output:
============================================================
CopyCat Crusher Results
============================================================
Comparisons Generated: 3
Source Target Score (%)
student1.cpp student2.cpp 100.00
student1.cpp student3.cpp 81.62
student2.cpp student3.cpp 81.62
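The three comparisons in this output are consistent with comparing every unordered pair of files once, which gives n(n-1)/2 comparisons for n submissions. A minimal sketch of that enumeration follows; the function name `allPairs` is hypothetical, and whether the tool enumerates pairs in exactly this way is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Enumerate every unordered pair (i, j) with i < j, in listing order.
std::vector<std::pair<std::string, std::string>> allPairs(
    const std::vector<std::string>& files) {
    std::vector<std::pair<std::string, std::string>> pairs;
    for (std::size_t i = 0; i < files.size(); ++i) {
        for (std::size_t j = i + 1; j < files.size(); ++j) {
            pairs.emplace_back(files[i], files[j]);
        }
    }
    return pairs;
}
```

For `student1.cpp`, `student2.cpp` and `student3.cpp` this produces three pairs in the same order as the table above. The pair count grows quadratically, so a batch of 100 submissions would need 4,950 comparisons.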
For every comparison, an HTML report is automatically generated.
Example:
reports/
├── comparison_0.html
├── comparison_1.html
└── comparison_2.html
Each report contains:
- Similarity Score
- Risk Level
- Exact Match Score
- Fingerprint Similarity
- Longest Common Block
- Levenshtein Similarity
- Token Statistics
| Score | Risk Level |
|---|---|
| 0% - 30% | Low |
| 30% - 60% | Medium |
| 60% - 80% | High |
| 80% - 100% | Critical |
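Mapping a score to a risk level can be sketched as a chain of threshold checks. The table's ranges share their boundary values (30%, 60%, 80%), so this sketch assumes a boundary score belongs to the higher band; that choice and the function name `riskLevel` are assumptions, not documented behaviour.

```cpp
#include <string>

// Classify a similarity score given as a percentage (0..100).
// Assumption: a score exactly on a boundary falls into the higher band.
std::string riskLevel(double scorePercent) {
    if (scorePercent < 30.0) return "Low";
    if (scorePercent < 60.0) return "Medium";
    if (scorePercent < 80.0) return "High";
    return "Critical";
}
```

Under this assumption, the 81.62 and 100.00 scores from the sample console output would both be reported as Critical.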
Typical Complexity:
| Component | Complexity |
|---|---|
| Tokenization | O(N) |
| Normalization | O(N) |
| Rolling Hash | O(N) |
| Winnowing | O(N) |
| Suffix Automaton | O(N) |
| Levenshtein | O(N × M) |
Where:
- N = Tokens in Source File
- M = Tokens in Target File
Potential extensions:
- AST-based similarity analysis
- Cross-language plagiarism detection
- PDF report generation
- Similarity heatmaps
- Parallel batch processing
- Web dashboard
- Machine Learning based plagiarism classification
This project is licensed under the MIT License.
See the LICENSE file for details.
Satvik Jambagi
- LLVM Project
- Clang Team
- CMake
- GNU GCC
- Open Source Community
Built with C++17, LLVM and a passion for detecting code plagiarism.