Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences
Accepted by NAACL 2024 Main Conference (Oral Presentation).
- Python 3.8 (Ubuntu 20.04)
- PyTorch 1.11.0 & CUDA 11.3
Here are the basic steps to set up the environment.
Step 1: Create a dedicated Conda environment and install Python and PyTorch (with CUDA support) at the specified versions.
```bash
conda create -n [ENV_NAME] python=3.8
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
```
Step 2: Install the Python packages required by the repository:
```bash
pip install -r requirements.txt
```
Step 3: Install the NLTK data. Run the Python interpreter and type the following commands:
```python
>>> import nltk
>>> nltk.download("punkt")
```
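After these steps, a quick sanity check can confirm that the GPU build of PyTorch and the NLTK tokenizer data are in place. This is a minimal sketch, assuming the environment created above:
```python
import nltk
import torch

# Check the installed PyTorch build and CUDA availability (expected: 1.11.0 / 11.3).
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())

# Check that the "punkt" tokenizer data was downloaded; raises LookupError if missing.
nltk.data.find("tokenizers/punkt")
print(nltk.word_tokenize("Named entity recognition under domain shift."))
```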
All the datasets used in the paper have been uploaded to the Hugging Face Hub at Lhtie/Bio-Domain-Transfer. Download them with:
```bash
git lfs install
git clone https://huggingface.co/datasets/Lhtie/Bio-Domain-Transfer
```
The folder contains the biomedical datasets Pathway Curation, Cancer Genetics, and Infectious Diseases, and the chemical datasets CHEMDNER, BC5CDR, and DrugProt.
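As an alternative to git-lfs, the same dataset repository can be fetched programmatically with the `huggingface_hub` client. This is an illustrative sketch; the `./Bio-Domain-Transfer` target directory is an assumption and should match `dataset_dir` in the config:
```python
from huggingface_hub import snapshot_download

# Download the full dataset repository (same content as the git clone above).
local_dir = snapshot_download(
    repo_id="Lhtie/Bio-Domain-Transfer",
    repo_type="dataset",
    local_dir="./Bio-Domain-Transfer",  # assumed target path; point dataset_dir here
)
print("Datasets downloaded to", local_dir)
```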
All the models used (BERT, SapBERT, S-PubMedBert-MS-MARCO-SCIFACT) can be downloaded from their Hugging Face repositories:
```bash
git lfs install
git clone https://huggingface.co/bert-base-uncased
git clone https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext
git clone https://huggingface.co/pritamdeka/S-PubMedBert-MS-MARCO-SCIFACT
```
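To confirm the local clones are usable, each checkpoint should load with the transformers AutoClasses. A minimal sketch, assuming the repositories were cloned into the current directory (adjust the paths otherwise):
```python
from transformers import AutoModel, AutoTokenizer

# Assumed local clone locations; adjust to wherever the repositories live.
model_paths = [
    "./bert-base-uncased",
    "./SapBERT-from-PubMedBERT-fulltext",
    "./S-PubMedBert-MS-MARCO-SCIFACT",
]

for path in model_paths:
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModel.from_pretrained(path)
    print(path, "hidden size:", model.config.hidden_size)
```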
- `dataConfig` contains the data processing scripts
  - DataConfig: modify `dataset_dir` in `dataConfig/config.py`: the directory path to the datasets (e.g. `./Bio-Domain-Transfer`)
  - ModelConfig: modify `sapbert_path`, `sentbert_path`, and `bert_path` in `dataConfig/config.py`: the directory paths to the respective models (see the sketch after this list)
- `configs/para` contains configuration files for the different experiment scenarios
  - `few-shot_bert.yaml`: Target Only
  - `oracle_bert.yaml`: Target Only with full training data
  - `transfer_learning.yaml`: Direct Transfer
  - `transfer_learning_eg.yaml`: EG (fill in `DATA.BIOMEDICAL.SIM_METHOD` to switch between `concat` and `sentEnc`)
  - `transfer_learning_disc.yaml`: ED
  - `transfer_learning_eg_disc.yaml`: EG+ED
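The exact layout of `dataConfig/config.py` is repository-specific, but the edits above amount to pointing a few path variables at the local downloads. A hypothetical sketch, assuming plain module-level assignments (the real file may group them under DataConfig and ModelConfig instead):
```python
# dataConfig/config.py (illustrative sketch only; variable grouping may differ)
dataset_dir = "./Bio-Domain-Transfer"                # DataConfig: cloned dataset folder
sapbert_path = "./SapBERT-from-PubMedBERT-fulltext"  # ModelConfig: SapBERT checkpoint
sentbert_path = "./S-PubMedBert-MS-MARCO-SCIFACT"    # ModelConfig: sentence encoder
bert_path = "./bert-base-uncased"                    # ModelConfig: base BERT
```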
Train
Run the `train.py` script (multi-process) with the following command:
```bash
torchrun --nnodes=1 --nproc_per_node=<# gpus> train.py \
    --cfg_file <configuration file>
```
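For example, an EG+ED run on two GPUs would look like `torchrun --nnodes=1 --nproc_per_node=2 train.py --cfg_file configs/para/transfer_learning_eg_disc.yaml`, with `--nproc_per_node` set to the number of available GPUs.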
Test
Run the `eval.py` script to evaluate the fine-tuned models:
```bash
python eval.py --cfg_file <configuration file>
```
Citation
```bibtex
@inproceedings{liu-etal-2024-named,
title = "Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences",
author = "Liu, Hongyi and
Wang, Qingyun and
Karisani, Payam and
Ji, Heng",
editor = "Duh, Kevin and
Gomez, Helena and
Bethard, Steven",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-long.1",
pages = "1--21",
}
```
