CopyCat Crusher

Advanced Source Code Plagiarism Detection Engine using Clang Tokenization, Rolling Hashing, Winnowing Fingerprinting, Suffix Automata and Levenshtein Similarity.

Overview

CopyCat Crusher is a high-performance source code plagiarism detection system designed to identify copied, modified, and structurally similar programs.

Unlike simple text comparison tools, CopyCat Crusher analyzes normalized source code structures and combines multiple plagiarism detection techniques to detect similarities even when variable names, literals, formatting, or comments have been changed.

The system is particularly useful for:

Academic plagiarism detection
Programming assignment evaluation
Coding competition integrity checks
Source code similarity analysis
Software audit and review systems

Problem Statement

Traditional plagiarism detectors often fail when students apply simple obfuscation techniques such as:

Renaming variables
Renaming functions
Changing numeric constants
Modifying formatting
Rearranging comments
Altering whitespace

Example:

Submission A

int sum = 0;

for(int i=0;i<100;i++){
    sum += i;
}

Submission B

int result = 0;

for(int index=0;index<100;index++){
    result += index;
}

Although these programs are structurally identical and differ only in identifier names, a naïve text comparison may treat them as different.

CopyCat Crusher normalizes source code before analysis, so both submissions reduce to the same token sequence and the match is detected.

Features

Clang-based source code tokenization
Language-aware token normalization
Rolling Hash based sequence matching
Winnowing Fingerprinting algorithm
Suffix Automaton based longest common block detection
Levenshtein Distance similarity scoring
Weighted similarity fusion engine
Automated batch analysis
HTML report generation
Multi-file comparison support
C++17 implementation
LLVM/libclang integration

System Architecture

Source Files
      │
      ▼
Clang Tokenizer
      │
      ▼
Token Normalizer
      │
      ▼
┌──────────────────────────┐
│ Rolling Hash             │
│ Winnowing Fingerprinting │
│ Suffix Automaton         │
│ Levenshtein Similarity   │
└──────────────────────────┘
      │
      ▼
Similarity Fusion Engine
      │
      ▼
HTML Report Generator
      │
      ▼
Final Plagiarism Report

Core Algorithms

1. Clang Tokenization

The project uses LLVM's libclang API to generate language-aware tokens from source code.

Benefits:

Ignores formatting differences
Understands C++ syntax
Produces reliable token streams
Handles complex language constructs

2. Token Normalization

Identifiers and literals are normalized before comparison.

Example:

Original:

int marks = 95;

Normalized:

TYPE VAR = NUM_LITERAL ;

This removes superficial differences and focuses on program structure.
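
The normalization step can be sketched as follows. This is a minimal illustration, assuming the tokens have already been classified by the tokenizer; the `Token`, `TokenKind`, and `normalize` names are hypothetical and not taken from the project's source, and the `STR_LITERAL` placeholder is an assumption added for completeness.

```cpp
#include <string>
#include <vector>

// Hypothetical token model; the project's actual types may differ.
enum class TokenKind { TypeKeyword, Identifier, NumericLiteral, StringLiteral, Other };

struct Token {
    TokenKind kind;
    std::string text;
};

// Replace type keywords, identifiers, and literals with placeholders.
// Operators and punctuation are kept verbatim because they carry structure.
std::vector<std::string> normalize(const std::vector<Token>& tokens) {
    std::vector<std::string> out;
    out.reserve(tokens.size());
    for (const Token& t : tokens) {
        switch (t.kind) {
            case TokenKind::TypeKeyword:    out.push_back("TYPE"); break;
            case TokenKind::Identifier:     out.push_back("VAR"); break;
            case TokenKind::NumericLiteral: out.push_back("NUM_LITERAL"); break;
            case TokenKind::StringLiteral:  out.push_back("STR_LITERAL"); break;  // assumed placeholder
            case TokenKind::Other:          out.push_back(t.text); break;
        }
    }
    return out;
}
```

Under these assumptions, `int marks = 95;` and `int total = 40;` produce the same normalized sequence, which is why renaming variables or changing constants does not hide a copy.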

3. Rolling Hash

A rolling hash computes a hash value for every fixed-length window of tokens, updating each hash from the previous one in constant time instead of rehashing the whole window.

Benefits:

Fast similarity detection
O(N) processing
Supports large submissions
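
A minimal sketch of a polynomial rolling hash over token windows is shown here. It assumes each normalized token has already been mapped to an integer id; the function name `rollingHashes`, the base constant, and the use of unsigned 64-bit wrap-around as the modulus are illustrative choices, not the project's confirmed parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hash every window of k consecutive token ids.
// Arithmetic is done modulo 2^64 through unsigned overflow (an illustrative choice).
std::vector<std::uint64_t> rollingHashes(const std::vector<std::uint32_t>& ids, std::size_t k) {
    assert(k > 0);
    std::vector<std::uint64_t> hashes;
    if (ids.size() < k) return hashes;  // too short to form a single window

    constexpr std::uint64_t BASE = 1000003ULL;
    std::uint64_t highPower = 1;  // BASE^(k-1), the weight of the oldest token in a window
    for (std::size_t i = 1; i < k; ++i) highPower *= BASE;

    // Hash of the first window, computed directly.
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < k; ++i) h = h * BASE + ids[i];
    hashes.push_back(h);

    // Each later window is derived from the previous one in O(1).
    for (std::size_t i = k; i < ids.size(); ++i) {
        h -= ids[i - k] * highPower;  // remove the token leaving the window
        h = h * BASE + ids[i];        // shift and add the token entering the window
        hashes.push_back(h);
    }
    return hashes;
}
```

Identical token windows always yield identical hashes, so two submissions can be compared by intersecting their hash sets rather than comparing windows token by token.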

4. Winnowing Fingerprinting

The Winnowing algorithm selects a representative subset of the rolling hashes as fingerprints by keeping the minimum hash in each sliding window of hashes.

Benefits:

Noise reduction
Efficient storage
Robust plagiarism detection
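
The selection rule can be sketched as below. This is an illustrative version of winnowing that keeps the rightmost minimum of every window of `w` hashes and uses a monotonic deque so the whole pass stays linear; the `Fingerprint` struct and `winnow` function are hypothetical names, and the window size is a free parameter here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct Fingerprint {
    std::uint64_t hash;
    std::size_t position;  // index of the hash in the input sequence
};

// Select the rightmost minimum hash of each window of w consecutive hashes.
// A position is recorded only once, even if it is the minimum of several windows.
std::vector<Fingerprint> winnow(const std::vector<std::uint64_t>& hashes, std::size_t w) {
    assert(w > 0);
    std::vector<Fingerprint> selected;
    std::deque<std::size_t> candidates;  // indices whose hashes strictly increase front to back

    for (std::size_t i = 0; i < hashes.size(); ++i) {
        // An earlier hash that is >= the new one can never be the rightmost minimum again.
        while (!candidates.empty() && hashes[candidates.back()] >= hashes[i]) {
            candidates.pop_back();
        }
        candidates.push_back(i);

        // Drop the front index once it has slid out of the window [i - w + 1, i].
        if (candidates.front() + w <= i) candidates.pop_front();

        if (i + 1 >= w) {  // a full window is available
            std::size_t best = candidates.front();
            if (selected.empty() || selected.back().position != best) {
                selected.push_back({hashes[best], best});
            }
        }
    }
    return selected;
}
```

In this sketch, inputs shorter than one window produce no fingerprints. Storing positions alongside hashes makes it possible to map a matching fingerprint back to the region of code it came from.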

5. Suffix Automaton

A suffix automaton built over one submission's token sequence is used to find the longest common token block shared between two submissions.

Benefits:

Detects copied code segments
Linear-time processing
Effective against partial plagiarism
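
The following sketch shows one way to compute the longest common block with a suffix automaton. It assumes tokens are represented as integer ids; the class name `SuffixAutomaton` and the method `longestCommonBlock` are hypothetical. Transitions are stored in `std::map`, which adds a logarithmic factor in the number of distinct tokens compared with the linear bound for a fixed alphabet.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

class SuffixAutomaton {
public:
    // Build the automaton over one token sequence.
    explicit SuffixAutomaton(const std::vector<int>& tokens) {
        states_.reserve(2 * tokens.size() + 1);
        states_.push_back(State{});  // initial state: len 0, link -1
        for (int t : tokens) extend(t);
    }

    // Length of the longest contiguous run of `other` that also occurs in the indexed sequence.
    std::size_t longestCommonBlock(const std::vector<int>& other) const {
        int state = 0, length = 0, best = 0;
        for (int t : other) {
            // Fall back along suffix links until a transition on t exists or the root is reached.
            while (state != 0 && states_[state].next.count(t) == 0) {
                state = states_[state].link;
                length = states_[state].len;
            }
            auto it = states_[state].next.find(t);
            if (it != states_[state].next.end()) {
                state = it->second;
                ++length;
            } else {
                length = 0;  // t does not occur in the indexed sequence at all
            }
            best = std::max(best, length);
        }
        return static_cast<std::size_t>(best);
    }

private:
    struct State {
        int len = 0;    // length of the longest string ending in this state
        int link = -1;  // suffix link
        std::map<int, int> next;
    };
    std::vector<State> states_;
    int last_ = 0;

    void extend(int t) {
        int cur = static_cast<int>(states_.size());
        states_.push_back(State{});
        states_[cur].len = states_[last_].len + 1;
        int p = last_;
        while (p != -1 && states_[p].next.count(t) == 0) {
            states_[p].next[t] = cur;
            p = states_[p].link;
        }
        if (p == -1) {
            states_[cur].link = 0;
        } else {
            int q = states_[p].next[t];
            if (states_[p].len + 1 == states_[q].len) {
                states_[cur].link = q;
            } else {
                // Split q: the clone keeps q's transitions but a shorter length.
                int clone = static_cast<int>(states_.size());
                State copy = states_[q];
                copy.len = states_[p].len + 1;
                states_.push_back(copy);
                while (p != -1) {
                    auto it = states_[p].next.find(t);
                    if (it == states_[p].next.end() || it->second != q) break;
                    it->second = clone;
                    p = states_[p].link;
                }
                states_[q].link = clone;
                states_[cur].link = clone;
            }
        }
        last_ = cur;
    }
};
```

Building the automaton once per submission and scanning every other submission against it avoids recomputing shared structure for each pair.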

6. Levenshtein Similarity

Measures the edit distance (the minimum number of token insertions, deletions, and substitutions) between normalized token sequences.

Benefits:

Detects near matches
Handles minor modifications
Captures structural similarity
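
A minimal sketch of token-level Levenshtein distance is given here, using the standard two-row dynamic programming table. Converting the distance to a similarity by dividing by the longer sequence length is an assumption made for illustration; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Edit distance between two token sequences, O(N x M) time and O(M) memory.
std::size_t levenshteinDistance(const std::vector<std::string>& a,
                                const std::vector<std::string>& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    std::iota(prev.begin(), prev.end(), std::size_t{0});  // distance from empty prefix of a

    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // delete all i tokens of a
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // match or substitution
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Similarity in [0, 1]; normalizing by the longer length is an assumed convention.
double levenshteinSimilarity(const std::vector<std::string>& a,
                             const std::vector<std::string>& b) {
    std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 1.0;  // two empty sequences are identical
    return 1.0 - static_cast<double>(levenshteinDistance(a, b)) / static_cast<double>(longest);
}
```

Because the cost grows with the product of the two lengths, this measure is the most expensive of the four on large submissions.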

Similarity Score Calculation

The final plagiarism score is a weighted combination of the four component scores.

Final Score =
    w1 × Exact Match Score
  + w2 × Fingerprint Score
  + w3 × Longest Match Score
  + w4 × Levenshtein Score

Here w1 to w4 are the weights assigned to each component. The combined score provides a more reliable plagiarism estimate than any individual technique.
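
The weighted combination can be sketched as follows. The equal weights of 0.25 are placeholders chosen only for illustration, since this README does not state the actual weights; the struct and function names are likewise hypothetical. Component scores are assumed to lie in the range 0 to 1.

```cpp
#include <cassert>

// Component scores, each assumed to be in [0, 1].
struct ComponentScores {
    double exactMatch;
    double fingerprint;
    double longestMatch;
    double levenshtein;
};

// Placeholder weights; the project's real weights are not documented here.
struct FusionWeights {
    double exactMatch = 0.25;
    double fingerprint = 0.25;
    double longestMatch = 0.25;
    double levenshtein = 0.25;
};

// Weighted average of the components, returned as a percentage.
double fuseScores(const ComponentScores& s, FusionWeights w = FusionWeights{}) {
    double totalWeight = w.exactMatch + w.fingerprint + w.longestMatch + w.levenshtein;
    assert(totalWeight > 0.0);
    double weighted = w.exactMatch * s.exactMatch
                    + w.fingerprint * s.fingerprint
                    + w.longestMatch * s.longestMatch
                    + w.levenshtein * s.levenshtein;
    return 100.0 * weighted / totalWeight;  // dividing keeps the result within 0 to 100
}
```

Dividing by the total weight means the weights do not need to sum to one, which makes them easier to tune independently.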

Project Structure

CopyCatCrusher/
│
├── include/
│   ├── algorithms/
│   ├── core/
│   ├── models/
│   ├── normalizer/
│   ├── report/
│   ├── similarity/
│   └── tokenizer/
│
├── src/
│   ├── algorithms/
│   ├── core/
│   ├── normalizer/
│   ├── report/
│   ├── similarity/
│   ├── tokenizer/
│   └── main.cpp
│
├── submissions/
├── reports/
│
├── CMakeLists.txt
├── LICENSE
└── README.md

Installation

Prerequisites

C++17 Compatible Compiler
CMake 3.20+
LLVM
libclang

Clone Repository

git clone https://github.com/7vik2005/CopyCat-Crushers.git

cd CopyCat-Crushers

Build Instructions

Windows (MSYS2 UCRT64)

cmake -G "MinGW Makefiles" -S . -B build

cmake --build build

Linux

mkdir build

cd build

cmake ..

make

Usage

Step 1

Place all source files inside:

submissions/

Example:

submissions/
├── student1.cpp
├── student2.cpp
└── student3.cpp

Step 2

Run the executable, passing the submissions directory and the reports directory as arguments:

./build/CopyCatCrusher.exe submissions reports

Step 3

View Results

Console Output:

============================================================
                CopyCat Crusher Results
============================================================

Comparisons Generated: 3

Source                        Target                        Score (%)

student1.cpp                  student2.cpp                  100.00
student1.cpp                  student3.cpp                  81.62
student2.cpp                  student3.cpp                  81.62
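
The output above is consistent with comparing every unordered pair of submissions, which gives n(n-1)/2 comparisons for n files (3 for 3 files). A minimal sketch of that enumeration, assuming this is how the batch is formed, is shown here; `allPairs` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Enumerate every unordered pair (i < j) of submission file names.
std::vector<std::pair<std::string, std::string>> allPairs(const std::vector<std::string>& files) {
    std::vector<std::pair<std::string, std::string>> pairs;
    for (std::size_t i = 0; i < files.size(); ++i) {
        for (std::size_t j = i + 1; j < files.size(); ++j) {
            pairs.emplace_back(files[i], files[j]);
        }
    }
    return pairs;
}
```

The number of pairs grows quadratically with the number of submissions, which is worth keeping in mind for large batches.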

HTML Reports

For every comparison, an HTML report is automatically generated.

Example:

reports/
├── comparison_0.html
├── comparison_1.html
└── comparison_2.html

Each report contains:

Similarity Score
Risk Level
Exact Match Score
Fingerprint Similarity
Longest Common Block
Levenshtein Similarity
Token Statistics

Risk Levels

Score         Risk Level
0% - 30%      Low
30% - 60%     Medium
60% - 80%     High
80% - 100%    Critical
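
The mapping from score to risk level can be sketched as below. The table does not say which band a boundary value such as 30% belongs to, so this sketch assumes each band includes its lower bound and excludes its upper bound, with 100% counted as Critical. The function name `riskLevel` is hypothetical.

```cpp
#include <string>

// Map a score in percent to a risk level.
// Assumption: a boundary value belongs to the higher band (30.0 is Medium, 80.0 is Critical).
std::string riskLevel(double scorePercent) {
    if (scorePercent < 30.0) return "Low";
    if (scorePercent < 60.0) return "Medium";
    if (scorePercent < 80.0) return "High";
    return "Critical";
}
```

Under this assumption, the 81.62 score from the sample console output falls in the Critical band.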

Performance

Typical Complexity:

Component           Complexity
Tokenization        O(N)
Normalization       O(N)
Rolling Hash        O(N)
Winnowing           O(N)
Suffix Automaton    O(N)
Levenshtein         O(N × M)

Where:

N = Tokens in Source File
M = Tokens in Target File

Future Improvements

Potential extensions:

AST-based similarity analysis
Cross-language plagiarism detection
PDF report generation
Similarity heatmaps
Parallel batch processing
Web dashboard
Machine Learning based plagiarism classification

License

This project is licensed under the MIT License.

See the LICENSE file for details.

Author

Satvik Jambagi

Acknowledgements

LLVM Project
Clang Team
CMake
GNU GCC
Open Source Community

Built with C++17, LLVM and a passion for detecting code plagiarism.
