This project was developed as part of a course at the Technical University of Denmark (DTU).
The repository contains a CLI tool for benchmarking and comparing different methods of measuring source code similarity (similarity tools). The methods are primarily based on Normalized Compression Distance (NCD) and Inclusion Compression Divergence (ICD), as described in the paper "A comparison of code similarity analysers" by Ragkhitwetsagul et al.
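As a rough illustration of the NCD idea, the sketch below computes the standard formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. It uses `bz2` from the Python standard library. The function names `compressed_size` and `ncd` are hypothetical and are not taken from this repository, whose implementation may differ.

```python
import bz2


def compressed_size(data: bytes) -> int:
    """Return the length in bytes of the bzip2-compressed input."""
    return len(bz2.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Values near 0 suggest the inputs share most of their information,
    values near 1 suggest they share very little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = b"public class A { int f(int x) { return x + 1; } }\n" * 20
    b = b"public class B { int g(int y) { return y + 1; } }\n" * 20
    print(f"NCD(a, a) = {ncd(a, a):.3f}")
    print(f"NCD(a, b) = {ncd(a, b):.3f}")
```

Real compressors add fixed overhead, so the distance between a file and itself is typically small but not exactly zero.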
The CLI tool has the following features:
- Creating heatmaps showing the pairwise similarity scores between Java source code files from the Project CodeNet Java250 dataset.
  Created with `py src/main.py 30 30 -c bzip2 -NCD -PH`.
- Plotting F-scores of different similarity tools in a comparative line chart.
  Created with `py src/main.py 4 300 -c bzip2 zstd zstandard zlib gzip -NCD -ICD -PF`.
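How the tool derives its F-scores is not described in this README. The sketch below shows one plausible way to compute an F-score from a pairwise distance matrix, under two assumptions: two files count as truly similar when they come from the same dataset directory, and they count as predicted similar when their distance is at or below a threshold. The function name `f_score` and its parameters are hypothetical.

```python
from typing import Sequence


def f_score(
    distances: Sequence[Sequence[float]],
    labels: Sequence[int],
    threshold: float,
) -> float:
    """F1 score over all unordered file pairs.

    distances[i][j] is the distance between files i and j,
    labels[i] identifies the directory that file i came from (assumption).
    """
    tp = fp = fn = 0
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            predicted = distances[i][j] <= threshold
            actual = labels[i] == labels[j]
            if predicted and actual:
                tp += 1
            elif predicted and not actual:
                fp += 1
            elif actual and not predicted:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Sweeping `threshold` over a range of values and recording the result for each similarity tool would give the kind of curves a comparative line chart can display.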
- Python version >= 3.12
NOTE: Older versions of Python may also work, but 3.12 was the version used during development.
To use the tool contained in this repository, follow this setup:
- Install the required Python libraries specified in the `requirements.txt` file, e.g. `pip install -r requirements.txt`
- Download the Project CodeNet Java250 dataset and place the unzipped `Project_CodeNet_Java250` folder in the root of this project, i.e. at the same level as this README.
The CLI tool is invoked by executing `main.py`. The following options are available:
| Argument | Type | Explanation |
|---|---|---|
| Positional (required) | - | - |
| `num_dirs` | `int` | Number of directories to process from the dataset. |
| `num_files` | `int` | Number of files to process in each directory. |
| Options | - | - |
| `-c`, `--compressors` | Multi-choice | Specify compressor(s). Options: [`bzip2`, `gzip`, `zlib`, `zstandard`, `zstd`]. |
| Flags | - | - |
| `-NCD` | Flag | Use Normalized Compression Distance (NCD) for similarity calculation. |
| `-ICD` | Flag | Use Inclusion Compression Divergence (ICD) for similarity calculation. |
| `-PH`, `--plot-heatmap` | Flag | Generate heatmaps of the similarity matrices. |
| `-PF`, `--plot-fscores` | Flag | Plot F-scores for the similarity tools. |
| `-CL`, `--cluster`, `--no-cluster` | Flag | Enable/Disable clustering of the similarity matrices (default: Enabled). |
| `-h`, `--help` | Flag | Show this help message and exit. |
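For orientation, the following is a minimal `argparse` sketch that would accept the arguments listed in the table. It is an assumption about how such a parser could be built, not a copy of `cli.py`. The function name `build_parser` and the `dest` names are hypothetical.

```python
import argparse

COMPRESSORS = ["bzip2", "gzip", "zlib", "zstandard", "zstd"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    # Positional (required) arguments
    parser.add_argument("num_dirs", type=int,
                        help="Number of directories to process from the dataset.")
    parser.add_argument("num_files", type=int,
                        help="Number of files to process in each directory.")
    # Multi-choice option: one or more compressor names
    parser.add_argument("-c", "--compressors", nargs="+", choices=COMPRESSORS,
                        help="Specify compressor(s).")
    # Boolean flags
    parser.add_argument("-NCD", dest="ncd", action="store_true",
                        help="Use Normalized Compression Distance (NCD).")
    parser.add_argument("-ICD", dest="icd", action="store_true",
                        help="Use Inclusion Compression Divergence (ICD).")
    parser.add_argument("-PH", "--plot-heatmap", action="store_true",
                        help="Generate heatmaps of the similarity matrices.")
    parser.add_argument("-PF", "--plot-fscores", action="store_true",
                        help="Plot F-scores for the similarity tools.")
    # BooleanOptionalAction generates the matching --no-cluster switch
    parser.add_argument("-CL", "--cluster", action=argparse.BooleanOptionalAction,
                        default=True,
                        help="Enable/Disable clustering of the similarity matrices.")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Option names are case-sensitive in `argparse`, so under this sketch `-c` and `-C` would not be interchangeable.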
For example, creating heatmaps for the first 10 files in the first 5 directories of the Java250 dataset using the NCD-based similarity tools with bzip2 and zstd:
`py src/main.py 5 10 -NCD -c bzip2 zstd -PH`
- `main.py`: Main entry point of the tool.
- `cli.py`
- `classification.py`
- `data.py`
- `logging.py`
- `minify.py`
- `similarity.py`
- `plots.py`
- `tools/compressor.py`
- `tools/plagiarism.py`
To regenerate `requirements.txt`, use pipreqs. Docs: https://github.com/bndr/pipreqs
# Install pipreqs
pip install pipreqs
# Create requirements.txt
pipreqs .\src --force --savepath requirements.txt