---
license: apache-2.0
pretty_name: HantaBERT Orthohantavirus Genomic Dataset
language:
- en
tags:
- biology
- genomics
- virology
- hantavirus
- orthohantavirus
- nucleotide-sequence
- rna
- ncbi
- genbank
task_categories:
- text-classification
size_categories:
- 1K<n<10K
annotations_creators:
- machine-generated
language_creators:
- found
source_datasets:
- original
configs:
- config_name: default
  data_files:
  - split: train
    path: data/processed/final_hantavirus_dataset.csv
- config_name: raw
  data_files:
  - split: train
    path: data/raw/raw_hantavirus_ncbi.csv
- config_name: interim
  data_files:
  - split: train
    path: data/interim/interim_hantavirus.csv
dataset_info:
- config_name: default
  features:
  - name: accession_id
    dtype: string
  - name: species_label
    dtype: string
  - name: raw_host
    dtype: string
  - name: lokasi_geografis_name
    dtype: string
  - name: segment_type
    dtype: string
  - name: sequence_length
    dtype: int64
  - name: sequence
    dtype: string
  - name: host_label
    dtype:
      class_label:
        names:
          '0': Rodent
          '1': Human
          '2': Others
          '3': Unknown
  - name: geo_label_broad
    dtype:
      class_label:
        names:
          '0': Others
          '1': Europe
          '2': Americas
          '3': Asia
          '4': Unknown
  - name: lokasi_geografis_koordinat
    dtype: string
  splits:
  - name: train
    num_examples: 9846
- config_name: raw
  features:
  - name: accession_id
    dtype: string
  - name: species_label
    dtype: string
  - name: raw_host
    dtype: string
  - name: lokasi_geografis_name
    dtype: string
  - name: segment_type
    dtype: string
  - name: sequence_length
    dtype: int64
  - name: sequence
    dtype: string
  splits:
  - name: train
    num_examples: 9950
- config_name: interim
  features:
  - name: accession_id
    dtype: string
  - name: species_label
    dtype: string
  - name: raw_host
    dtype: string
  - name: lokasi_geografis_name
    dtype: string
  - name: segment_type
    dtype: string
  - name: sequence_length
    dtype: int64
  - name: sequence
    dtype: string
  - name: host_label
    dtype: string
  - name: geo_label_broad
    dtype: string
  splits:
  - name: train
    num_examples: 9846
---

HantaBERT Data Pipeline

This repository contains the full pipeline for collecting, cleaning, and standardizing Orthohantavirus genomic data for the HantaBERT project. It automates data extraction from NCBI GenBank and produces a dataset that is ready to use for machine learning.

Key Features

Extraction Automation: Uses Biopython to fetch thousands of sequences for the S, M, and L genome segments, together with their metadata, in batches from NCBI GenBank.
Multi-task Labeling: Standardizes host, species, and geography labels from unstructured raw data.
Geocoding: Integrates with the Nominatim API to convert country names into geographical coordinates.
Quality Control: Filters sequences based on minimum length and metadata completeness.
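
The batched extraction listed above could be sketched as follows. This is a minimal illustration and not the code in src/01_fetch_ncbi.py: the helper names `chunked` and `fetch_batch`, the batch size, and the email argument are assumptions introduced here. Only `fetch_batch` needs Biopython and network access.

```python
from typing import Iterator, List


def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids` with at most `size` items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_batch(ids: List[str], email: str) -> list:
    """Download one batch of GenBank records as Biopython SeqRecord objects.

    Hypothetical helper: requires Biopython and network access.
    """
    # Imported lazily so that `chunked` works without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.efetch(
        db="nucleotide", id=",".join(ids), rettype="gb", retmode="text"
    )
    try:
        return list(SeqIO.parse(handle, "genbank"))
    finally:
        handle.close()
```

A caller would loop over `chunked(all_ids, 200)` and pass each slice to `fetch_batch`. The value 200 is only an example; pick a batch size that respects NCBI's usage guidelines.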

Data Structure

The final output is final_hantavirus_dataset.csv, which includes:

accession_id: Unique NCBI ID.
host_label: Host classification (Rodent, Human, Others, Unknown).
geo_label_broad: Regional classification (Americas, Europe, Asia, Others, Unknown).
sequence: Nucleotide sequence of the RNA genome segment, stored with the A/C/G/T alphabet.
lokasi_geografis_koordinat: Lat/Lon points for map visualization.
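
Because lokasi_geografis_koordinat is stored as a string, it has to be parsed before plotting. The sketch below assumes the value is a comma-separated "lat, lon" pair and that unresolved rows are empty or missing. The exact on-disk format is not documented here, so check a few rows first. The function name `parse_coordinates` is hypothetical.

```python
from typing import Optional, Tuple


def parse_coordinates(value: object) -> Optional[Tuple[float, float]]:
    """Parse a "lat, lon" string; return None for empty or malformed values."""
    # Missing CSV cells often load as NaN (a float), so accept only strings.
    if not isinstance(value, str) or not value.strip():
        return None
    parts = [part.strip() for part in value.split(",")]
    if len(parts) != 2:
        return None
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    # Reject values outside the valid latitude/longitude ranges.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return lat, lon
```

Returning None instead of raising keeps the unresolved rows easy to filter out before building a map.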

Preparation & Installation

Use Python 3.12 or later:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

How to Run the Pipeline

Run the scripts sequentially according to the data dependency flow:

python src/01_fetch_ncbi.py: Fetches raw data.
python src/02_clean_labels.py: Cleans data and generates labels.
python src/03_geocoding.py: Determines location coordinates.
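
Since each step depends on the output of the previous one, a small runner that stops at the first failure can be convenient. The sketch below is not part of the repository. The script order comes from the list above, while `build_commands` and `run_pipeline` are hypothetical helpers.

```python
import subprocess
import sys
from pathlib import Path
from typing import List

# Script order taken from the README; everything else here is illustrative.
PIPELINE_STEPS = [
    "src/01_fetch_ncbi.py",
    "src/02_clean_labels.py",
    "src/03_geocoding.py",
]


def build_commands(steps: List[str], python: str = sys.executable) -> List[List[str]]:
    """Turn script paths into argument lists for subprocess.run."""
    return [[python, step] for step in steps]


def run_pipeline(steps: List[str]) -> None:
    """Run each script in order, stopping as soon as one fails."""
    for command in build_commands(steps):
        script = Path(command[1])
        if not script.exists():
            raise FileNotFoundError(f"missing pipeline script: {script}")
        print("running", script)
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed step prevents the dependent steps from running.
        subprocess.run(command, check=True)
```

Calling `run_pipeline(PIPELINE_STEPS)` from the repository root with the virtual environment active runs the three stages in order.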

Open notebooks/eda_hantavirus.ipynb to view the data distribution analysis.

Dataset Card

Configurations

The Hub exposes three configurations matching the pipeline stages:

| Config | File | Rows | Description |
| --- | --- | --- | --- |
| `default` | `data/processed/final_hantavirus_dataset.csv` | 9,846 | Fully cleaned, labeled, and geocoded data (recommended). |
| `raw` | `data/raw/raw_hantavirus_ncbi.csv` | 9,950 | Unprocessed records as fetched from NCBI GenBank. |
| `interim` | `data/interim/interim_hantavirus.csv` | 9,846 | Cleaned and labeled, prior to geocoding. |

Usage

from datasets import load_dataset

# Recommended: the fully processed data (config "default")
ds = load_dataset("<namespace>/<dataset-name>")

# Or load a specific pipeline stage
raw = load_dataset("<namespace>/<dataset-name>", "raw")
interim = load_dataset("<namespace>/<dataset-name>", "interim")

Data Fields

accession_id (string): Unique NCBI GenBank accession ID.
species_label (string): Orthohantavirus species (e.g. Orthohantavirus andesense).
raw_host (string): Original host annotation from the source record (e.g. Homo sapiens).
lokasi_geografis_name (string): Raw strain/location string from the source record.
segment_type (string): RNA genome segment — S, M, or L.
sequence_length (int64): Nucleotide length of the sequence.
sequence (string): Nucleotide sequence (A/C/G/T).
host_label (class_label in default, string in interim): Standardized host class. One of Rodent, Human, Others, Unknown.
geo_label_broad (class_label in default, string in interim): Broad region. One of Others, Europe, Americas, Asia, Unknown.
lokasi_geografis_koordinat (string, default only): lat, lon coordinates from Nominatim geocoding (empty when unresolved).
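
The dataset metadata declares host_label and geo_label_broad as class_label features in the `default` config, so these columns may load as integer ids rather than names. The sketch below shows the id-to-name mapping with the name order copied from that metadata. The helper functions are hypothetical. When loading through the `datasets` library, `ds["train"].features["host_label"].int2str(...)` should give the same result.

```python
from typing import List

# Name order copied from the dataset metadata for the `default` config.
HOST_LABEL_NAMES: List[str] = ["Rodent", "Human", "Others", "Unknown"]
GEO_LABEL_NAMES: List[str] = ["Others", "Europe", "Americas", "Asia", "Unknown"]


def decode_label(class_id: int, names: List[str]) -> str:
    """Map an integer class id to its label name."""
    if not 0 <= class_id < len(names):
        raise ValueError(f"class id {class_id} is out of range for {names}")
    return names[class_id]


def encode_label(name: str, names: List[str]) -> int:
    """Map a label name to its integer class id (raises ValueError if unknown)."""
    return names.index(name)
```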

Label Distribution (`default`, n=9,846)

host_label: Rodent 6,451 · Human 1,495 · Others 1,321 · Unknown 579
geo_label_broad: Others 9,505 · Europe 255 · Unknown 51 · Americas 22 · Asia 13
segment_type: S 4,175 · L 2,931 · M 2,729 (plus a few unnormalized variants)
coordinates resolved: 2,535 / 9,846 rows

Source & Curation

Sequences and metadata are fetched from NCBI GenBank via Biopython, then standardized (host/species/geography labels), quality-filtered (minimum length and metadata completeness), and geocoded with the Nominatim API. See src/ for the reproducible pipeline.
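
To give a feel for the host standardization step, the sketch below maps a free-text raw_host value to one of the four host classes. The keyword sets and the function name `standardize_host` are assumptions made for illustration. The actual rules live in src/02_clean_labels.py and may differ.

```python
import re
from typing import Optional

# Hypothetical keyword sets; the real pipeline may use different rules.
HUMAN_TOKENS = {"homo", "human", "patient"}
RODENT_TOKENS = {
    "rodent", "mouse", "rat", "vole",
    "rattus", "mus", "apodemus", "peromyscus",
    "myodes", "microtus", "oligoryzomys",
}


def standardize_host(raw_host: Optional[str]) -> str:
    """Map a raw host annotation to Human, Rodent, Others, or Unknown."""
    if not isinstance(raw_host, str) or not raw_host.strip():
        return "Unknown"  # missing annotation
    # Match on whole words so that "rat" does not fire inside "laboratory".
    tokens = set(re.findall(r"[a-z]+", raw_host.lower()))
    if tokens & HUMAN_TOKENS:
        return "Human"
    if tokens & RODENT_TOKENS:
        return "Rodent"
    return "Others"
```

Whole-word matching is a deliberate choice here: substring checks are shorter to write but mislabel hosts whose names merely contain a keyword.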

Considerations

Labels are machine-generated heuristics from unstructured source metadata and may contain noise.
Classes are imbalanced (notably geo_label_broad, dominated by Others); account for this when training.
The underlying records originate from NCBI GenBank and are subject to GenBank's terms of use.
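
One common way to account for the imbalance is to weight each class inversely to its frequency. The sketch below uses the counts from the Label Distribution section and the formula weight = N / (K * n_c), the same heuristic scikit-learn applies for class_weight="balanced". It is one option among several (resampling is another) and is not something the pipeline itself prescribes.

```python
from typing import Dict

# Counts copied from the Label Distribution section (`default` config).
HOST_COUNTS = {"Rodent": 6451, "Human": 1495, "Others": 1321, "Unknown": 579}
GEO_COUNTS = {"Others": 9505, "Europe": 255, "Unknown": 51, "Americas": 22, "Asia": 13}


def inverse_frequency_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Return weight_c = N / (K * n_c) for each class c."""
    if any(n <= 0 for n in counts.values()):
        raise ValueError("every class needs at least one example")
    total = sum(counts.values())  # N: number of examples
    num_classes = len(counts)     # K: number of classes
    return {label: total / (num_classes * n) for label, n in counts.items()}


host_weights = inverse_frequency_weights(HOST_COUNTS)
geo_weights = inverse_frequency_weights(GEO_COUNTS)
```

Rare classes such as Asia receive very large weights under this scheme, so it may be worth capping the weights or merging sparse classes before training.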

Licensing

This dataset and pipeline are released under the Apache-2.0 license.
