Classification of Source Code Plagiarism Instances

This repository contains the source code and results for the paper "The impact of pre-trained models and adapters for the classification of source code plagiarism instances".

Requirements

The required packages are listed in requirements.txt and can be installed with pip install -r requirements.txt.

Dataset

The dataset folder contains the following datasets: irplag, conplag1 (raw), conplag2 (template-free) and Progpedia19. All files are in .csv format. All credits go to the authors of the datasets.
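Because the column layout of the .csv files is not described here, a quick way to get oriented is to print each file's header row. The following is a minimal sketch, assuming only that the files sit somewhere under the dataset folder; the helper name csv_headers is hypothetical and not part of the repository.

```python
import csv
from pathlib import Path


def csv_headers(folder="dataset"):
    """Return {file name: header row} for every .csv file under `folder`."""
    headers = {}
    for path in sorted(Path(folder).rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            # The first row is assumed to be the header; empty files yield [].
            headers[path.name] = next(csv.reader(handle), [])
    return headers


if __name__ == "__main__":
    for name, columns in csv_headers().items():
        print(name, columns)
```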

Running Scripts

The main code is in run_scpd.py. A sample run:

python run_scpd.py --epoch 10 --seed 42 --batch_size 32 --max_seq_length 512 --dataset conplag1 --learning_rate 5e-4 --model_name codeberta --tune_type peft --adapter_type lora --report wandb

The arguments are:

epoch: The number of epochs. The default is 30.
seed: The random seed. The default is 42.
batch_size: The batch size. The default is 16.
max_seq_length: The maximum sequence length for the tokenizer. The default is 512.
dataset_version: The dataset version, 1 or 2. The default is 1.
learning_rate: The learning rate. The default is 5e-4.
model_name: The model to use. Choices are: [ "codebert", "graphcodebert", "unixcoder", "codeberta", "codet5", "plbart" ]
tune_type: Whether to use full fine-tuning (fft) or parameter-efficient fine-tuning (peft).
adapter_type: The adapter to be trained and merged. Choices are: [ "houlsby", "pfeiffer", "lora" ]
report: The experiment tracker. The default is none; the other choice is wandb.

Other parameters are defined in the parse_arguments function in run_scpd.py.
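The arguments listed above could be declared with argparse roughly as follows. This is a sketch reconstructed from the list and the sample command, not the actual parse_arguments function in run_scpd.py. Two points are assumptions: the dataset flag is spelled --dataset as in the sample command (the list names it dataset_version), and arguments whose defaults are not documented default to None here.

```python
import argparse

MODELS = ["codebert", "graphcodebert", "unixcoder", "codeberta", "codet5", "plbart"]
ADAPTERS = ["houlsby", "pfeiffer", "lora"]


def build_parser():
    """Build a parser mirroring the documented arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Source code plagiarism classification")
    parser.add_argument("--epoch", type=int, default=30)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--max_seq_length", type=int, default=512)
    # Assumption: flag name taken from the sample command; default not documented.
    parser.add_argument("--dataset", type=str, default=None)
    parser.add_argument("--learning_rate", type=float, default=5e-4)
    # Defaults for the next three are not documented; None is a placeholder.
    parser.add_argument("--model_name", choices=MODELS, default=None)
    parser.add_argument("--tune_type", choices=["fft", "peft"], default=None)
    parser.add_argument("--adapter_type", choices=ADAPTERS, default=None)
    parser.add_argument("--report", choices=["none", "wandb"], default="none")
    return parser


if __name__ == "__main__":
    print(vars(build_parser().parse_args()))
```

With choices set, argparse rejects an unknown model or adapter name before any training starts.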

There are two other scripts:

run_all_models.sh: Runs the Full Fine-Tuning (FFT) on all models for all the datasets.
run_all_adapters.sh: Runs all adapters on all models for all datasets.
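The sweep performed by run_all_models.sh can be pictured as a nested loop over datasets and models. The sketch below only builds the command lines, in Python rather than shell; it is not the content of the script. The flag names come from the sample run above, and the dataset identifiers other than conplag1 are assumed to match the names in the Dataset section.

```python
import itertools
import shlex

MODELS = ["codebert", "graphcodebert", "unixcoder", "codeberta", "codet5", "plbart"]
# Assumption: identifiers accepted by --dataset match the names listed under Dataset.
DATASETS = ["irplag", "conplag1", "conplag2", "Progpedia19"]


def fft_commands(models=MODELS, datasets=DATASETS):
    """Return one full fine-tuning command per (dataset, model) pair."""
    commands = []
    for dataset, model in itertools.product(datasets, models):
        commands.append([
            "python", "run_scpd.py",
            "--dataset", dataset,
            "--model_name", model,
            "--tune_type", "fft",
        ])
    return commands


if __name__ == "__main__":
    for command in fft_commands():
        print(shlex.join(command))
```

To execute the runs instead of printing them, each list could be passed to subprocess.run(command, check=True). An adapter sweep would add a third loop over the adapter types with --tune_type peft.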

wandb

To use wandb for experiment tracking, set "report" to "wandb". The script requires the API key to be stored in the wandb_api.key file in the root directory.
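Reading the key file and logging in might look like the sketch below. It assumes the file holds only the key as plain text; the helper names are hypothetical, and the way run_scpd.py actually loads the key may differ.

```python
from pathlib import Path


def read_wandb_key(path="wandb_api.key"):
    """Read the API key from `path`, stripping surrounding whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty")
    return key


def login_to_wandb(path="wandb_api.key"):
    """Authenticate with wandb using the key stored in `path`."""
    # Imported lazily so read_wandb_key works even when wandb is not installed.
    import wandb

    wandb.login(key=read_wandb_key(path))
```

Keeping the key in a separate file avoids hard-coding it in the script; the file should stay out of version control.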

Acknowledgement

This repository uses code and ideas presented in the following repositories:
