This repository contains the source code and results for the paper: "The impact of pre-trained models and adapters for the classification of source code plagiarism instances"
The requirements and packages are listed in requirements.txt, which can be installed using pip install -r requirements.txt.
The datasets are available in the dataset folder. It has the following datasets: irplag, conplag1 (raw), conplag2 (template-free) and Progpedia19. The files are in .csv format.
All credits go to the authors of the datasets.
The main code is in run_scpd.py. A sample run of the Python script:
python run_scpd.py --epoch 10 --seed 42 --batch_size 32 --max_seq_length 512 --dataset conplag1 --learning_rate 5e-4 --model_name codeberta --tune_type peft --adapter_type lora --report wandb
The arguments are:
- epoch: The number of epochs. The default is 30.
- seed: For randomization. The default is 42.
- batch_size: The batch size. The default is 16.
- max_seq_length: The maximum sequence length for the tokenizer. The default is 512.
- dataset_version: Whether it's 1 or 2. The default is 1.
- learning_rate: The learning rate to be used. The default is 5e-4.
- model_name: The model name to be selected. Choices are: [ "codebert", "graphcodebert", "unixcoder", "codeberta", "codet5", "plbart" ]
- tune_type: Whether to run full fine-tuning (fft) or parameter-efficient fine-tuning (peft).
- adapter_type: The adapter to be trained and merged. Choices are: [ "houlsby", "pfeiffer", "lora"]
- report: Where to report the experiment. The default is none. The other choice is wandb.
Other parameters are available in the parse_arguments function within the run_scpd.py file.
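As a rough illustration of how the arguments listed above could be declared, the sketch below uses argparse with the documented defaults and choices. It is not the actual parse_arguments function from run_scpd.py: the name parse_arguments_sketch is hypothetical, the flag names --epoch and --dataset follow the sample run rather than the list, and the None defaults for dataset, model_name, tune_type and adapter_type are assumptions because this README does not state them.

```python
import argparse


def parse_arguments_sketch(argv=None):
    """Hypothetical sketch of the documented options; the real parser may differ."""
    parser = argparse.ArgumentParser(description="Sketch of the run_scpd.py options")
    # Defaults below are the ones documented in this README.
    parser.add_argument("--epoch", type=int, default=30)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--max_seq_length", type=int, default=512)
    parser.add_argument("--learning_rate", type=float, default=5e-4)
    # Assumption: the flag is spelled --dataset, as in the sample run; default unknown.
    parser.add_argument("--dataset", type=str, default=None)
    # Assumption: no default is documented for the next three options.
    parser.add_argument(
        "--model_name",
        choices=["codebert", "graphcodebert", "unixcoder", "codeberta", "codet5", "plbart"],
        default=None,
    )
    parser.add_argument("--tune_type", choices=["fft", "peft"], default=None)
    parser.add_argument("--adapter_type", choices=["houlsby", "pfeiffer", "lora"], default=None)
    parser.add_argument("--report", choices=["none", "wandb"], default="none")
    return parser.parse_args(argv)


# Parse the options of the sample run shown earlier.
sample = (
    "--epoch 10 --seed 42 --batch_size 32 --max_seq_length 512 --dataset conplag1 "
    "--learning_rate 5e-4 --model_name codeberta --tune_type peft "
    "--adapter_type lora --report wandb"
)
args = parse_arguments_sketch(sample.split())
print(args.model_name, args.tune_type, args.adapter_type)
```

Restricting model_name, tune_type and adapter_type with choices makes a mistyped value fail at startup instead of partway through a training run.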
There are two other scripts:
- run_all_models.sh: Runs the Full Fine-Tuning (FFT) on all models for all the datasets.
- run_all_adapters.sh: Runs all adapters on all models for all datasets.
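To give a sense of the size of the adapter sweep, the sketch below enumerates one command per dataset, model and adapter combination. It does not reproduce run_all_adapters.sh: the function build_adapter_commands is hypothetical, the dataset identifiers are copied from the dataset list in this README and may not match the exact values the script accepts, and the --dataset flag spelling is taken from the sample run.

```python
from itertools import product

# Model and adapter names are the documented choices.
MODELS = ["codebert", "graphcodebert", "unixcoder", "codeberta", "codet5", "plbart"]
ADAPTERS = ["houlsby", "pfeiffer", "lora"]
# Assumption: dataset identifiers as written in this README.
DATASETS = ["irplag", "conplag1", "conplag2", "Progpedia19"]


def build_adapter_commands():
    """Return one argument list per (dataset, model, adapter) combination."""
    commands = []
    for dataset, model, adapter in product(DATASETS, MODELS, ADAPTERS):
        commands.append([
            "python", "run_scpd.py",
            "--dataset", dataset,
            "--model_name", model,
            "--tune_type", "peft",
            "--adapter_type", adapter,
        ])
    return commands


commands = build_adapter_commands()
print(len(commands))
print(" ".join(commands[0]))
```

The commands are only built and printed here, not executed; each list could be passed to subprocess.run to launch the corresponding training job.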
To use wandb for experiment tracking, set "report" to "wandb" (--report wandb, as in the sample run).
The Python script requires the API key to be available in the wandb_api.key file in the root directory.
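One possible way to handle the key file is sketched below, assuming the file holds only the key as plain text. The helper load_wandb_key is hypothetical and is not taken from run_scpd.py; it exports the key through the WANDB_API_KEY environment variable, which the wandb client reads, so the sketch itself does not need wandb installed.

```python
import os
from pathlib import Path


def load_wandb_key(key_path="wandb_api.key"):
    """Export the key stored in key_path as WANDB_API_KEY.

    Returns True if a non-empty key was loaded, False otherwise.
    """
    path = Path(key_path)
    if not path.is_file():
        return False
    # Strip the trailing newline that editors usually add.
    key = path.read_text(encoding="utf-8").strip()
    if not key:
        return False
    os.environ["WANDB_API_KEY"] = key
    return True


if load_wandb_key():
    print("wandb key loaded")
else:
    print("wandb_api.key not found or empty; wandb reporting would fail")
```

Checking for the file before training starts surfaces a missing key immediately rather than when the first metrics are logged.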
This repository uses code and ideas presented in the following repositories: