BilboBackend/CodeSimilarity

Code Similarity

This project was developed as part of a course at the Technical University of Denmark (DTU).

The repository contains a CLI tool for benchmarking and comparing different methods of measuring source code similarity (similarity tools). The methods are primarily based on Normalized Compression Distance (NCD) and Inclusion Compression Divergence (ICD), as described in the paper "A comparison of code similarity analysers" by Ragkhitwetsagul et al.
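To make the NCD idea concrete, here is a minimal sketch using the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. The function name `ncd`, the choice of `zlib` as default compressor, and the sample inputs are assumptions for illustration and are not taken from this repository's `similarity.py`. ICD is not shown here.

```python
import zlib
from typing import Callable


def ncd(x: bytes, y: bytes, compress: Callable[[bytes], bytes] = zlib.compress) -> float:
    """Normalized Compression Distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they have little in common.
    """
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample inputs: two near-identical Java snippets and unrelated text.
java_a = b"public class Main { public static void main(String[] args) { System.out.println(1 + 2); } }"
java_b = java_a.replace(b"1 + 2", b"3 + 4")
other = b"The quick brown fox jumps over the lazy dog while reading about compression."

print(f"similar:   {ncd(java_a, java_b):.3f}")
print(f"unrelated: {ncd(java_a, other):.3f}")
```

Any compressor that maps bytes to bytes can be passed in, for example `bz2.compress` from the standard library.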

The CLI tool has the following features:

  1. Creating heatmaps showing the pairwise similarity scores between Java source code files from the Project CodeNet Java250 dataset.
    Similarity heatmap example
    Created with py src/main.py 30 30 -c bzip2 -NCD -PH.

  2. Plotting F-scores of different similarity tools in a comparative line chart.
    F-scores example
    Created with py src/main.py 4 300 -c bzip2 zstd zstandard zlib gzip -NCD -ICD -PF.
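This README does not state how the F-scores are computed, so the following is only a sketch of one plausible approach. It assumes that each dataset directory corresponds to one problem, that a pair of files counts as "similar" when its distance is at or below a threshold, and that the ground truth is whether both files come from the same directory. The function name `pairwise_f_score` and the toy matrix are hypothetical.

```python
from itertools import combinations


def pairwise_f_score(distance: list[list[float]], labels: list[str], threshold: float) -> float:
    """F1 score for predicting 'same label' whenever distance <= threshold."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        predicted_same = distance[i][j] <= threshold
        actually_same = labels[i] == labels[j]
        if predicted_same and actually_same:
            tp += 1
        elif predicted_same:
            fp += 1
        elif actually_same:
            fn += 1
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly matched
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy distance matrix for four files: two from directory "p1", two from "p2".
labels = ["p1", "p1", "p2", "p2"]
distance = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.6],
    [0.8, 0.9, 0.6, 0.0],
]

for threshold in (0.5, 0.65, 0.75):
    print(threshold, round(pairwise_f_score(distance, labels, threshold), 3))
```

Sweeping the threshold per similarity tool and keeping the best score is one way to obtain a single value per tool for a comparative chart.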

Requirements

  • Python version >= 3.12
    NOTE: Older versions of Python may work, but 3.12 was the version used during development.

Setup

To use the tool contained in this repository, follow this setup:

  1. Install the required Python libraries specified in the requirements.txt file, e.g.
    pip install -r requirements.txt
  2. Download the Project CodeNet Java250 dataset and place the unzipped Project_CodeNet_Java250 folder in the root of this project, i.e. at the same level as this README.

Usage

The CLI tool is invoked by executing main.py. The following options are available:

Argument | Type | Explanation
Positional (required) | - | -
num_dirs | int | Number of directories to process from the dataset.
num_files | int | Number of files to process in each directory.
Options | - | -
-c, --compressors | Multi-choice | Specify compressor(s). Options: [bzip2, gzip, zlib, zstandard, zstd].
Flags | - | -
-NCD | Flag | Use Normalized Compression Distance (NCD) for similarity calculation.
-ICD | Flag | Use Inclusion Compression Divergence (ICD) for similarity calculation.
-PH, --plot-heatmap | Flag | Generate heatmaps of the similarity matrices.
-PF, --plot-fscores | Flag | Plot F-scores for the similarity tools.
-CL, --cluster, --no-cluster | Flag | Enable/Disable clustering of the similarity matrices (default: Enabled).
-h, --help | Flag | Show this help message and exit.
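For readers who want to see how such an interface can be declared, the following is a sketch of an `argparse` parser that accepts the arguments listed above. It is reconstructed from the table only; the actual `cli.py` may be structured differently, and the `dest` names and the function name `build_parser` are assumptions.

```python
import argparse

COMPRESSORS = ["bzip2", "gzip", "zlib", "zstandard", "zstd"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Benchmark code similarity tools.")
    # Required positional arguments
    parser.add_argument("num_dirs", type=int, help="Number of directories to process from the dataset.")
    parser.add_argument("num_files", type=int, help="Number of files to process in each directory.")
    # One or more compressors, restricted to the documented choices
    parser.add_argument("-c", "--compressors", nargs="+", choices=COMPRESSORS, help="Specify compressor(s).")
    # Similarity measures
    parser.add_argument("-NCD", dest="ncd", action="store_true", help="Use NCD for similarity calculation.")
    parser.add_argument("-ICD", dest="icd", action="store_true", help="Use ICD for similarity calculation.")
    # Plot selection
    parser.add_argument("-PH", "--plot-heatmap", action="store_true", help="Generate heatmaps.")
    parser.add_argument("-PF", "--plot-fscores", action="store_true", help="Plot F-scores.")
    # BooleanOptionalAction adds the matching --no-cluster switch automatically
    parser.add_argument("-CL", "--cluster", action=argparse.BooleanOptionalAction, default=True,
                        help="Enable/Disable clustering of the similarity matrices.")
    return parser


args = build_parser().parse_args("5 10 -NCD -c bzip2 zstd -PH".split())
print(args)
```

Note that `argparse` option strings are case-sensitive, so `-c` (compressors) and `-CL` (clustering) are distinct options.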

For example, to create heatmaps for the first 10 files in each of the first 5 directories of the Java250 dataset, using the NCD-based similarity tools with bzip2 and zstd:

py src/main.py 5 10 -NCD -c bzip2 zstd -PH
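The selection of "the first N files in the first M directories" can be sketched as below. This assumes the dataset root contains one subdirectory per problem with `.java` files directly inside, and that "first" means sorted by name; neither detail is confirmed by this README, and `select_files` is a hypothetical helper rather than a function from `data.py`.

```python
from pathlib import Path


def select_files(dataset_root: Path, num_dirs: int, num_files: int) -> dict[str, list[Path]]:
    """Pick the first num_files .java files from each of the first num_dirs directories."""
    selected: dict[str, list[Path]] = {}
    # Sort by name so the selection is reproducible across runs and platforms.
    problem_dirs = sorted(p for p in dataset_root.iterdir() if p.is_dir())
    for problem_dir in problem_dirs[:num_dirs]:
        java_files = sorted(problem_dir.glob("*.java"))
        selected[problem_dir.name] = java_files[:num_files]
    return selected
```

With the dataset placed as described in Setup, `select_files(Path("Project_CodeNet_Java250"), 5, 10)` would return up to 10 files for each of 5 directories, keyed by directory name.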

Development

Source files and contents

  • main.py: Main entry point of the tool.

  • cli.py

  • classification.py

  • data.py

  • logging.py

  • minify.py

  • similarity.py

  • plots.py

  • tools/compressor.py

  • tools/plagiarism.py

Creating requirements.txt

Use pipreqs. Docs: https://github.com/bndr/pipreqs

# Install pipreqs 
pip install pipreqs

# Create requirements.txt
pipreqs .\src --force --savepath requirements.txt
