This repository contains the full pipeline for collecting, cleaning, and standardizing Orthohantavirus genomic data for the HantaBERT project. It automates data extraction from NCBI GenBank and produces a ready-to-use dataset for machine learning.
Key Features
Extraction Automation: Uses Biopython to fetch thousands of sequences for the three genome segments (S, M, L), together with their metadata, in batches from NCBI GenBank.
Multi-task Labeling: Standardizes host, species, and geography labels from unstructured raw data.
Geocoding: Integrates with the Nominatim API to convert country names into geographical coordinates.
Quality Control: Filters sequences based on minimum length and metadata completeness.
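As an illustration of the batched extraction step, here is a minimal sketch using Biopython's `Entrez.efetch` and `SeqIO.parse`. It is not the repository's actual code: the function names, the batch size of 200, and the pause between requests are assumptions chosen for the example, and the caller is expected to supply a list of accession IDs and a contact email.

```python
from typing import Iterator, List, Sequence


def batched(ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


def fetch_genbank_records(ids, email, batch_size=200, pause_s=0.4):
    """Yield GenBank records for `ids`, fetched in batches (hypothetical helper).

    Biopython is imported lazily so that `batched` works without it installed.
    """
    import time
    from Bio import Entrez, SeqIO  # requires `pip install biopython`

    Entrez.email = email  # NCBI asks clients to identify themselves
    for chunk in batched(ids, batch_size):
        handle = Entrez.efetch(db="nucleotide", id=",".join(chunk),
                               rettype="gb", retmode="text")
        try:
            yield from SeqIO.parse(handle, "genbank")
        finally:
            handle.close()
        time.sleep(pause_s)  # assumed pause to stay under NCBI's rate limit
```

Splitting the ID list before calling `efetch` keeps each request small and makes it easy to retry a single failed batch instead of the whole download.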
Data Structure
The final output is final_hantavirus_dataset.csv (9,846 rows). Summary of its main columns:
geo_label_broad: Others 9,505 · Europe 255 · Unknown 51 · Americas 22 · Asia 13
segment_type: S 4,175 · L 2,931 · M 2,729 (plus a few unnormalized variants)
coordinates resolved: 2,535 / 9,846 rows
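The counts above can be recomputed from the CSV with a few lines of standard-library Python. The sketch below assumes only the column names `geo_label_broad` and `segment_type` given in this section; the `accession` column and the sample rows are made up so the example runs without the real file.

```python
import csv
import io
from collections import Counter


def summarize(csv_file, columns=("geo_label_broad", "segment_type")):
    """Return the row count and per-column value counts of a CSV file object."""
    counts = {name: Counter() for name in columns}
    n_rows = 0
    for row in csv.DictReader(csv_file):
        n_rows += 1
        for name in columns:
            # Empty or absent values are grouped under a placeholder label.
            counts[name][row.get(name) or "<missing>"] += 1
    return n_rows, counts


# Tiny in-memory stand-in for final_hantavirus_dataset.csv (invented rows).
sample = io.StringIO(
    "accession,segment_type,geo_label_broad\n"
    "X1,S,Europe\n"
    "X2,L,Others\n"
    "X3,S,Others\n"
)
n_rows, counts = summarize(sample)
print(n_rows, counts["segment_type"].most_common())
```

To run it on the real dataset, pass `open("final_hantavirus_dataset.csv", newline="")` instead of the in-memory sample.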
Source & Curation
Sequences and metadata are fetched from NCBI GenBank via Biopython, then standardized (host/species/geography labels), quality-filtered (minimum length and metadata completeness), and geocoded with the Nominatim API. See src/ for the reproducible pipeline.
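The filtering and geocoding steps might look roughly like the sketch below. It is an illustration under stated assumptions rather than the code in src/: the record fields (`sequence`, `host`, `country`), the 500-nucleotide threshold, and the User-Agent string are all hypothetical. The lookup is cached per country name so the public Nominatim server is queried once per distinct country instead of once per row.

```python
import json
import time
import urllib.parse
import urllib.request

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def passes_quality(record, min_length=500, required=("host", "country")):
    """Keep a record only if it is long enough and has all required fields."""
    if len(record.get("sequence", "")) < min_length:
        return False
    return all(str(record.get(field) or "").strip() for field in required)


def nominatim_lookup(country, user_agent="hantabert-pipeline-example"):
    """Query Nominatim for one country name; return (lat, lon) or None."""
    query = urllib.parse.urlencode(
        {"country": country, "format": "json", "limit": 1})
    request = urllib.request.Request(f"{NOMINATIM_URL}?{query}",
                                     headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        hits = json.load(response)
    time.sleep(1.0)  # the public server allows at most one request per second
    return (float(hits[0]["lat"]), float(hits[0]["lon"])) if hits else None


def geocode_countries(records, lookup=nominatim_lookup):
    """Attach coordinates, calling `lookup` once per distinct country name."""
    cache = {}
    for record in records:
        country = record["country"].strip()
        if country not in cache:
            cache[country] = lookup(country)
        record["coordinates"] = cache[country]  # None if unresolved
    return records
```

Passing `lookup` as a parameter lets the geocoder be exercised offline with a stub function, which is also a convenient place to plug in a persistent cache.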
Considerations
Labels are machine-generated heuristics from unstructured source metadata and may contain noise.
Classes are imbalanced (notably geo_label_broad, dominated by Others); account for this when training.
The underlying records originate from NCBI GenBank and are subject to GenBank's terms of use.
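One common way to account for the class imbalance noted above is to weight each class inversely to its frequency. The sketch below applies the formula total / (n_classes × count), the same heuristic scikit-learn uses for `class_weight="balanced"`, to the `geo_label_broad` counts listed under Data Structure. The function name is hypothetical, and whether such weighting suits HantaBERT training is left open here.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count) per class."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# geo_label_broad counts as reported in the Data Structure section.
geo_counts = {"Others": 9505, "Europe": 255, "Unknown": 51,
              "Americas": 22, "Asia": 13}
weights = balanced_class_weights(geo_counts)
for label, weight in weights.items():
    print(f"{label:<9}{weight:8.3f}")
```

With these weights every class contributes equally to a weighted loss in aggregate, so the dominant Others class no longer overwhelms the rare Asia and Americas classes.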
Licensing
This dataset and pipeline are released under the Apache-2.0 license.
About
Data collection and preprocessing pipeline for the HantaBERT Orthohantavirus genomic dataset from NCBI GenBank.