FahadEbrahim/SCPC

Classification of Source Code Plagiarism Instances

This repository contains the source code and results for the paper "The impact of pre-trained models and adapters for the classification of source code plagiarism instances".

Requirements

The required packages are listed in requirements.txt and can be installed using pip install -r requirements.txt.

Dataset

The datasets are available in the dataset folder: irplag, conplag1 (raw), conplag2 (template-free), and Progpedia19. The files are in .csv format. All credit goes to the authors of the datasets.

Running Scripts

The main code is in run_scpd.py. A sample run:

python run_scpd.py --epoch 10 --seed 42 --batch_size 32 --max_seq_length 512 --dataset conplag1 --learning_rate 5e-4 --model_name codeberta --tune_type peft --adapter_type lora --report wandb

The arguments are:

  • epoch: The number of epochs. The default is 30.
  • seed: For randomization. The default is 42.
  • batch_size: The batch size. The default is 16.
  • max_seq_length: The maximum sequence length for the tokenizer. The default is 512.
  • dataset_version: The dataset version, either 1 or 2. The default is 1.
  • learning_rate: The learning rate to be used. The default is 5e-4.
  • model_name: The model name to be selected. Choices are: [ "codebert", "graphcodebert", "unixcoder", "codeberta", "codet5", "plbart" ]
  • tune_type: Whether to use full fine-tuning (fft) or parameter-efficient fine-tuning (peft).
  • adapter_type: The adapter to be trained and merged. Choices are: [ "houlsby", "pfeiffer", "lora" ]
  • report: The default is none. The other choice is wandb.

Other parameters are defined in the parse_arguments function in run_scpd.py.
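As a rough illustration of how the documented arguments could map onto argparse, here is a minimal sketch. It is not the repository's parse_arguments function: the name build_parser is hypothetical, arguments with no documented default are left without one, and the dataset argument is omitted because only its documented behaviour, not its exact flag, is known here.

```python
import argparse

# Choices copied from the argument list above.
MODELS = ["codebert", "graphcodebert", "unixcoder", "codeberta", "codet5", "plbart"]
ADAPTERS = ["houlsby", "pfeiffer", "lora"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the run_scpd.py arguments")
    parser.add_argument("--epoch", type=int, default=30, help="number of epochs")
    parser.add_argument("--seed", type=int, default=42, help="random seed")
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--max_seq_length", type=int, default=512)
    parser.add_argument("--learning_rate", type=float, default=5e-4)
    # No default is documented for the next three, so none is set here.
    parser.add_argument("--model_name", choices=MODELS)
    parser.add_argument("--tune_type", choices=["fft", "peft"])
    parser.add_argument("--adapter_type", choices=ADAPTERS)
    parser.add_argument("--report", choices=["none", "wandb"], default="none")
    return parser


if __name__ == "__main__":
    # Parse a fixed argument list so the sketch runs without a command line.
    args = build_parser().parse_args(
        ["--model_name", "codeberta", "--tune_type", "peft", "--adapter_type", "lora"]
    )
    print(args)
```

Restricting model_name and adapter_type with choices makes argparse reject a misspelled value before any training starts.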

There are two other scripts:

  1. run_all_models.sh: Runs the Full Fine-Tuning (FFT) on all models for all the datasets.
  2. run_all_adapters.sh: Runs all adapters on all models for all datasets.
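The grid that run_all_models.sh covers can be pictured with a short Python sketch that only builds the command lines and does not run them. This is an assumption about the shape of the sweep, not the contents of the shell script: the function name build_fft_commands is hypothetical, the --dataset flag is taken from the sample run above, and the dataset values (in particular progpedia19) are guesses at what the flag accepts.

```python
import shlex
from itertools import product

# Model names from the argument list above.
MODELS = ["codebert", "graphcodebert", "unixcoder", "codeberta", "codet5", "plbart"]
# Assumed flag values for the four datasets; the exact spellings may differ.
DATASETS = ["irplag", "conplag1", "conplag2", "progpedia19"]


def build_fft_commands(models=MODELS, datasets=DATASETS):
    """Return one full fine-tuning command per (model, dataset) pair."""
    commands = []
    for model, dataset in product(models, datasets):
        parts = [
            "python", "run_scpd.py",
            "--model_name", model,
            "--dataset", dataset,
            "--tune_type", "fft",
        ]
        # shlex.join quotes each part safely for a POSIX shell.
        commands.append(shlex.join(parts))
    return commands


if __name__ == "__main__":
    for command in build_fft_commands():
        print(command)
```

An adapter sweep would add a third axis over the adapter choices with --tune_type peft and --adapter_type set accordingly.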

To use wandb for experiment tracking, set report to wandb. The script requires the API key to be stored in a wandb_api.key file in the root directory.
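One way a key file like this can be consumed is sketched below. It is an assumption about the mechanism, not the code in run_scpd.py: the helper load_wandb_key is hypothetical, and it exports the key through the WANDB_API_KEY environment variable, which the wandb client reads, rather than calling wandb directly.

```python
import os
from pathlib import Path


def load_wandb_key(key_file="wandb_api.key"):
    """Read the API key from key_file and export it as WANDB_API_KEY."""
    # Strip the trailing newline that most editors add to the file.
    key = Path(key_file).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{key_file} is empty")
    os.environ["WANDB_API_KEY"] = key
    return key


if __name__ == "__main__":
    try:
        load_wandb_key()
        print("WANDB_API_KEY set from wandb_api.key")
    except FileNotFoundError:
        print("wandb_api.key not found in the current directory")
```

Keeping the key in a separate file lets it stay out of version control, for example by listing wandb_api.key in .gitignore.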

Acknowledgement

This repository uses code and ideas from the following repositories:

  1. Adapters
  2. PEFT
  3. FineTuner
