Leveraging Large Language Models for Scientific Text Summarization: Fine-Tuning LED and Performance Evaluation
This repository contains the implementation for an INFO7016 Postgraduate Project B research project on scientific text summarization using the Longformer Encoder-Decoder (LED) model. The project evaluates whether parameter-efficient fine-tuning using Quantized Low-Rank Adaptation (QLoRA) improves the summarization performance of a base LED model on scientific papers.
The project compares two model settings:
- Base LED model without fine-tuning
- QLoRA fine-tuned LED model
Both models are evaluated on the same fixed test dataset of 370 cleaned arXiv papers. The generated summaries are compared against the original abstracts using ROUGE-L, BERTScore F1, and a proposed sentence-level evaluation metric called the Hungarian Summary Similarity Metric (HSSM).
The main aim of this project is to evaluate whether QLoRA fine-tuning improves the scientific summarization performance of a base LED model.
The project also proposes HSSM as a complementary sentence-level evaluation metric for comparing generated summaries with reference abstracts.
This project makes three main contributions:
- A controlled before-and-after comparison of a base LED model and a QLoRA fine-tuned LED model for scientific text summarization.
- The implementation of the Hungarian Summary Similarity Metric (HSSM), a sentence-level metric based on sentence embeddings, cosine similarity, and Hungarian one-to-one alignment.
- A multi-metric evaluation framework using ROUGE-L, BERTScore F1, HSSM, Wilcoxon signed-rank significance testing, and inter-metric correlation analysis.
The QLoRA fine-tuned LED model achieved higher average scores than the base LED model across all three evaluation metrics.
| Metric | Base LED | LED + QLoRA |
|---|---|---|
| ROUGE-L | 0.1525 | 0.1780 |
| BERTScore F1 | 0.6831 | 0.7172 |
| HSSM | 0.5454 | 0.5820 |
A paired Wilcoxon signed-rank test showed that the improvements were statistically significant across ROUGE-L, BERTScore F1, and HSSM.
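A paired test of this kind can be sketched with `scipy.stats.wilcoxon`. The function name `paired_wilcoxon`, the 0.05 threshold, and the score lists are assumptions made for illustration. The numbers are made up and are not results from this project.

```python
from scipy.stats import wilcoxon


def paired_wilcoxon(base_scores, finetuned_scores, alpha=0.05):
    """Two-sided Wilcoxon signed-rank test on paired per-paper scores."""
    if len(base_scores) != len(finetuned_scores):
        raise ValueError("Both score lists must cover the same papers.")
    statistic, p_value = wilcoxon(finetuned_scores, base_scores)
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),
    }


# Illustrative values only (NOT results from this project).
base = [0.10, 0.12, 0.14, 0.16, 0.18, 0.11, 0.13, 0.15, 0.17, 0.19]
tuned = [0.11, 0.14, 0.17, 0.20, 0.23, 0.17, 0.20, 0.23, 0.26, 0.29]
result = paired_wilcoxon(base, tuned)
print(result)
```

The test is paired because each paper is summarized by both models, so the per-paper score differences are what get ranked.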
scientific_text_summarization_llms/
│
├── README.md
├── requirements.txt
├── .gitignore
│
├── Model_Finetune/
│   └── src/
│       ├── download_arxiv_papers.py
│       ├── download_fixed_papers.py
│       ├── create_fixed_paper_list.py
│       ├── pymupdf4llm_extraction.py
│       ├── filter_json_to_standard_dataset.py
│       ├── build_test_dataset.py
│       ├── prepare_train_data.py
│       ├── download_led_base_model.py
│       ├── train_led_qlora_test.py
│       ├── train_led_qlora_full.py
│       ├── run_base_led_inference_test.py
│       ├── run_base_led_inference_full.py
│       ├── run_finetuned_led_inference_test.py
│       ├── run_finetuned_led_inference_full.py
│       ├── evaluate_predictions.py
│       ├── run_significance_tests.py
│       ├── analyse_metric_correlations.py
│       └── build_evaluation_table.py
│
└── hss_metric/
    ├── README.md
    └── src/
        └── hss_metric.py
The following large files are intentionally excluded from this GitHub repository using .gitignore:
- downloaded arXiv PDFs
- processed datasets
- Hugging Face model files
- trained QLoRA adapters
- generated prediction files
- evaluation outputs and charts
- virtual environment files
The project follows a controlled before-and-after experimental design.
First, the base LED model generates summaries for the fixed 370-paper test dataset created for this project. The test dataset was created by selecting fixed arXiv papers, downloading their PDFs, extracting the title, introduction, conclusion, and abstract, and then cleaning the extracted text for evaluation. Then, the same base LED model is fine-tuned using QLoRA on a separate 5,000-record arXiv summarization training dataset sourced from the Hugging Face ccdv/arxiv-summarization dataset. After fine-tuning, the QLoRA-adapted LED model generates summaries for the same 370 test papers.
Both models use the same input format and generation settings to ensure a fair comparison.
The evaluation input consists of:
title + introduction + conclusion
The original abstract is held out and used as the reference summary.
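A minimal sketch of how such an input could be assembled is shown below. The record field names, the blank-line separator, and the whitespace cleanup are assumptions for illustration. The repository's scripts may format the input differently.

```python
def build_model_input(title, introduction, conclusion):
    """Join title, introduction, and conclusion into one input string."""
    parts = [title, introduction, conclusion]
    # Skip empty sections and collapse whitespace runs left by PDF extraction.
    cleaned = [" ".join(part.split()) for part in parts if part and part.strip()]
    return "\n\n".join(cleaned)


# Hypothetical record layout, for illustration only.
record = {
    "title": "A Study of  Things",
    "introduction": "We study things.\nThey matter.",
    "conclusion": "Things were studied.",
    "abstract": "This paper studies things.",
}

model_input = build_model_input(
    record["title"], record["introduction"], record["conclusion"]
)
reference = record["abstract"]  # held out: never part of the model input
print(model_input)
```

Keeping the abstract out of `model_input` is what makes it usable as an unseen reference summary.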
The base model used in this project is:
allenai/led-base-16384
QLoRA was used for parameter-efficient fine-tuning. During fine-tuning, only 589,824 parameters were trainable out of 162,434,304 total parameters, representing approximately 0.3631% of the model parameters.
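The reported percentage can be checked with a few lines of arithmetic. The helper `lora_param_count` is hypothetical, and the rank and module count in the last step are assumptions. They are simply one combination that reproduces the trainable-parameter count, and they are not confirmed as the configuration used in this project.

```python
def lora_param_count(rank, d_in, d_out, num_modules):
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each adapted module."""
    return num_modules * rank * (d_in + d_out)


trainable = 589_824
total = 162_434_304
percent = 100 * trainable / total
print(f"{percent:.4f}%")

# ASSUMPTION: rank 16 on 24 square 768 x 768 projections. 768 is the hidden
# size of allenai/led-base-16384; the rank and module count are illustrative.
example = lora_param_count(rank=16, d_in=768, d_out=768, num_modules=24)
print(example)
```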
Both the base LED model and the QLoRA fine-tuned LED model used the same decoding settings:
- maximum input length: 4096 tokens
- beam size: 4
- minimum summary length: 50 tokens
- maximum summary length: 180 tokens
- no_repeat_ngram_size: 4
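The settings above could be passed to the Hugging Face `generate` method roughly as follows. This is a sketch, not the repository's inference script. Mapping the summary length limits to `min_length` and `max_length`, and placing global attention on the first token only, are assumptions. The function name `summarize` is hypothetical.

```python
MAX_INPUT_TOKENS = 4096

# Decoding settings shared by the base and fine-tuned models.
GENERATION_KWARGS = {
    "num_beams": 4,
    "min_length": 50,            # ASSUMPTION: minimum summary length
    "max_length": 180,           # ASSUMPTION: maximum summary length
    "no_repeat_ngram_size": 4,
}


def summarize(model, tokenizer, text):
    """Generate one summary with the shared decoding settings."""
    import torch  # imported lazily so the settings can be inspected without PyTorch

    inputs = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=MAX_INPUT_TOKENS
    )
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
    # ASSUMPTION: global attention on the first token only.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    with torch.no_grad():
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            global_attention_mask=global_attention_mask,
            **GENERATION_KWARGS,
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Calling `summarize` with the same `GENERATION_KWARGS` for both models keeps decoding identical, so score differences reflect the model weights rather than the generation settings.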
Three evaluation metrics were used:
ROUGE-L measures lexical overlap using the longest common subsequence between the generated summary and the reference abstract.
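The longest-common-subsequence idea can be illustrated with a small pure-Python sketch. It uses lowercased whitespace tokens and an F1 combination of precision and recall, which are simplifying assumptions. The `rouge-score` package applies its own tokenization, so its values can differ from this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 from LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "the model summarizes long papers",
    "the model summarizes scientific papers well",
)
print(round(score, 4))  # LCS is 4 tokens, so F1 = 8/11
```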
BERTScore measures semantic similarity using contextual token embeddings.
HSSM is a proposed sentence-level evaluation metric. It works by:
- Splitting the generated summary and reference abstract into sentences
- Encoding each sentence using sentence-transformer embeddings
- Computing cosine similarity between generated-summary sentences and reference-abstract sentences
- Applying the Hungarian algorithm to find the optimal one-to-one sentence alignment
- Averaging the cosine similarity scores of the matched sentence pairs
HSSM is not intended to replace ROUGE-L or BERTScore. It is used as a complementary metric for analysing sentence-level semantic alignment.
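The alignment and averaging steps can be sketched as follows. This is not the code in `hss_metric.py`. It assumes that sentence splitting and sentence-transformer encoding have already produced one embedding per sentence, and it uses toy 2-dimensional vectors in place of real embeddings. The function name `hssm_from_embeddings` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def hssm_from_embeddings(generated_embeddings, reference_embeddings):
    """Mean cosine similarity over the optimal one-to-one sentence matching."""
    gen = np.asarray(generated_embeddings, dtype=float)
    ref = np.asarray(reference_embeddings, dtype=float)
    if gen.size == 0 or ref.size == 0:
        return 0.0
    # Row-normalise so that a dot product equals cosine similarity.
    gen = gen / np.clip(np.linalg.norm(gen, axis=1, keepdims=True), 1e-12, None)
    ref = ref / np.clip(np.linalg.norm(ref, axis=1, keepdims=True), 1e-12, None)
    similarity = gen @ ref.T
    # Hungarian-style assignment that maximises total similarity.
    rows, cols = linear_sum_assignment(similarity, maximize=True)
    return float(similarity[rows, cols].mean())


# Toy embeddings standing in for sentence-transformer outputs.
generated = [[1.0, 0.0], [1.0, 1.0]]
reference = [[0.0, 1.0], [1.0, 0.0]]
score = hssm_from_embeddings(generated, reference)
print(round(score, 4))
```

When the two texts have different sentence counts, `linear_sum_assignment` matches only as many pairs as the shorter side allows, so this sketch averages over the matched pairs and ignores unmatched sentences.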
git clone https://github.com/ikhimwinemmanuel/scientific_text_summarization_llms.git
cd scientific_text_summarization_llms
For Windows:
python -m venv venv
venv\Scripts\activate
For Linux or HPC environments:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
The data preparation scripts are located in:
Model_Finetune/src/
The main scripts used for data preparation include:
python Model_Finetune/src/download_arxiv_papers.py
python Model_Finetune/src/pymupdf4llm_extraction.py
python Model_Finetune/src/filter_json_to_standard_dataset.py
python Model_Finetune/src/build_test_dataset.py
python Model_Finetune/src/prepare_train_data.py
python Model_Finetune/src/download_led_base_model.py
python Model_Finetune/src/train_led_qlora_test.py
python Model_Finetune/src/train_led_qlora_full.py
python Model_Finetune/src/run_base_led_inference_full.py
python Model_Finetune/src/run_finetuned_led_inference_full.py
python Model_Finetune/src/evaluate_predictions.py
python Model_Finetune/src/run_significance_tests.py
python Model_Finetune/src/analyse_metric_correlations.py
python Model_Finetune/src/build_evaluation_table.py
The full experiments were run on the Western Sydney University Wolffe High Performance Computing (HPC) environment using an NVIDIA A30 GPU through the ampere24 partition. GPU jobs were submitted through the Slurm scheduler.
- Python
- PyTorch
- Hugging Face Transformers
- Hugging Face Datasets
- PEFT
- bitsandbytes
- Sentence-Transformers
- SciPy
- scikit-learn
- pandas
- NumPy
- matplotlib
- rouge-score
- bert-score
- Git and GitHub
- Slurm
- Wolffe HPC
The project implementation is complete. The repository contains the main scripts used for data preparation, QLoRA fine-tuning, inference, evaluation, statistical significance testing, and HSSM analysis.
- Ikhimwin Osakpamwan Emmanuel
- Master of Artificial Intelligence
- Western Sydney University