This repository contains the code and GPT annotations for CLTL's submission (team LotusOrchid) to the NER subtask of the MultiClinAI shared task.
curl "https://zenodo.org/records/18772832/files/MultiClinAI-training_data_v1.1-260225.zip?download=1" -o data/MultiClinAI-training_data_v1.1-260225.zip
curl "https://zenodo.org/records/19098018/files/MultiClinAI-training+NER_test_bg_v1.2_260318.zip?download=1" -o data/MultiClinAI-training+NER_test_bg_v1.2_260318.zip
cd data
unzip 'MultiClinAI*.zip'
unzip ann_gpt.zip
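The download and extraction steps above can also be wrapped in two small shell helpers that skip archives already on disk and extract each archive with its own unzip call. This is a minimal sketch, not part of the repository: the function names fetch and extract_all are hypothetical, and it assumes curl and unzip are installed.

```shell
# Hypothetical helper: download URL $1 to path $2 unless the file already exists.
fetch() {
  if [ -f "$2" ]; then
    echo "skip: $2 already exists"
    return 0
  fi
  mkdir -p "$(dirname "$2")"
  # -f: fail on HTTP errors; -L: follow redirects; -o: write to the given file
  curl -fL "$1" -o "$2"
}

# Hypothetical helper: extract every .zip in directory $1, one archive per unzip call.
extract_all() {
  for archive in "$1"/*.zip; do
    [ -e "$archive" ] || continue   # the glob matched nothing
    # -o: overwrite without prompting; -q: quiet; -d: extraction directory
    unzip -o -q "$archive" -d "$1"
  done
}

# Usage, with the URLs listed above:
#   fetch "https://zenodo.org/records/18772832/files/MultiClinAI-training_data_v1.1-260225.zip?download=1" \
#         data/MultiClinAI-training_data_v1.1-260225.zip
#   extract_all data
```

Quoting the URL keeps the shell from treating the ? as a glob character, and looping over the archives avoids passing several archive names to a single unzip call, where every name after the first is read as a member to extract from the first archive.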
This project uses uv.
Run scripts with:
uv run script.py
Extract untokenized datasets from annotations:
uv run src/preprocess.py
(see ./dvc.yaml for example calls)
Or extract all datasets with DVC:
dvc repro
The main script for finetuning is ./src/main.py. See ./cfg/fitting*.yaml for configuration files, and ./scripts/finetune*.sh for example calls.
The main script for training is also ./src/main.py. The main difference from finetuning is that we save a checkpoint at the end. See ./cfg/training*.yaml for configuration files, and ./scripts/train*.sh for example calls.
The main script for predicting is ./src/predict.py. See ./cfg/predict*.example for configuration files, and ./scripts/predict*.sh for example calls.