Welcome to the G-SciEdBERT repository! This project contains Jupyter notebooks for pre-training and fine-tuning the G-SciEdBERT model, an innovative large language model designed specifically for scoring German-written responses to science tasks.
G-SciEdBERT is an advanced contextualized large language model (LLM) developed to automate the scoring of German science responses. Traditional models like German BERT (G-BERT) face challenges in contextual understanding, especially when dealing with complex scientific content. G-SciEdBERT addresses these limitations by incorporating domain-specific knowledge and contextual learning principles into its training process.
- Contextual Pre-training: G-SciEdBERT is pre-trained on a corpus of 30,000 German-written science responses from the Programme for International Student Assessment (PISA) 2018, encompassing over 3 million tokens (see the pre-training sketch after this list).
- Fine-tuning on Science Tasks: The model is fine-tuned using 27 assessment items from PISA 2015 to enhance its scoring accuracy, achieving an 8.5% improvement in quadratic weighted Kappa over G-BERT.
- Enhanced Scoring Accuracy: By specializing in German science education, G-SciEdBERT significantly outperforms general-purpose models, making it a valuable tool for automated educational assessments.
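The contextual pre-training step can be illustrated with a minimal sketch using the HuggingFace `Trainer` API. This is not the repository's exact code: the starting checkpoint `bert-base-german-cased` (G-BERT) and the input file `responses.txt` (one student response per line) are assumptions for illustration, since the PISA response data itself is not redistributable.

```python
# Minimal sketch: continued masked-language-model (MLM) pre-training of G-BERT
# on German science responses. Checkpoint name and data file are assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "bert-base-german-cased"  # assumed G-BERT starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical corpus file: one student response per line.
dataset = load_dataset("text", data_files={"train": "responses.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="g-sciedbert-pretrained",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    # Masks 15% of tokens at random, the standard BERT MLM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("g-sciedbert-pretrained")
```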
To get started with the notebooks provided in this repository, follow the instructions below:
Ensure you have the following installed (a quick environment check follows this list):
- Python 3.7 or higher
- Jupyter Notebook
- CUDA (for GPU training)
- PyTorch
- Transformers (HuggingFace library)
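A small script like the following (a convenience sketch, not part of the repository) confirms the installed versions and GPU visibility before you open the notebooks:

```python
# Sanity-check the environment before running the notebooks.
import sys
import torch
import transformers

print("Python:", sys.version.split()[0])             # expect 3.7 or higher
print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())  # True if a usable GPU is detected
```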
1. Clone the Repository:

   ```bash
   git clone https://github.com/your-repo/G-SciEdBERT.git
   cd G-SciEdBERT
   ```

2. Install Dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Run Jupyter Notebook:

   ```bash
   jupyter notebook
   ```
The repository includes the following notebooks:

- `Pre_train_German_BERT.ipynb`: Details the pre-training of G-SciEdBERT on the PISA 2018 dataset, including data preprocessing, model training, and evaluation.
- `Pre_train_German_BERT_torch.ipynb`: Similar to the first notebook, but implements the pre-training directly in PyTorch as an alternative implementation.
- `German_BERT_MultiClass_Classification.ipynb`: Demonstrates fine-tuning G-SciEdBERT on the PISA 2015 dataset, covering data preparation, model training, and performance evaluation (see the fine-tuning sketch after this list).
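The core idea of the fine-tuning notebook, treating scoring as multi-class classification and evaluating with quadratic weighted Kappa (QWK, the metric behind the 8.5% figure above), can be sketched as follows. The label count, texts, and gold scores are placeholders, and the checkpoint name is an assumption, not the notebook's actual configuration.

```python
# Sketch: score responses via multi-class classification, then evaluate with
# quadratic weighted Kappa (QWK). All data below are placeholders.
import torch
from sklearn.metrics import cohen_kappa_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

base = "bert-base-german-cased"  # assumed checkpoint; in practice, the pre-trained G-SciEdBERT
num_labels = 3                   # assumption: score levels 0/1/2 for one PISA item
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=num_labels)

# ... fine-tune with Trainer as in the pre-training sketch, then predict:
texts = ["Antwort eins ...", "Antwort zwei ...", "Antwort drei ...", "Antwort vier ..."]
human_scores = [0, 1, 2, 1]      # placeholder gold labels from human raters

inputs = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    predicted = model(**inputs).logits.argmax(dim=-1).tolist()

# weights="quadratic" penalizes large score disagreements more heavily than near misses.
print("QWK:", cohen_kappa_score(human_scores, predicted, weights="quadratic"))
```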
The G-SciEdBERT model is hosted on HuggingFace and can be accessed via the following link: G-SciEdBERT on HuggingFace
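Once you have the model ID from that page, loading the model follows the standard Transformers pattern. The ID below is a placeholder; substitute the actual one from the link above.

```python
# Load the released model from the HuggingFace Hub.
from transformers import AutoModel, AutoTokenizer

model_id = "<hf-org>/G-SciEdBERT"  # placeholder; use the ID from the link above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
```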
If you use this model or code in your research, please cite our paper:
```bibtex
@article{Latif2024GSciEdBERT,
  title={G-SciEdBERT: A Contextualized LLM for Science Assessment Tasks in German},
  author={Latif, Ehsan and Lee, Gyeong-Geon and Neumann, Knut and Kastorff, Tamara and Zhai, Xiaoming},
  journal={arXiv preprint arXiv:2402.06584},
  year={2024}
}
```
Thank you for using G-SciEdBERT! If you have any questions or feedback, please feel free to open an issue in this repository.