GenoML Munging Failure: TypeError when Processing Sample IDs during Imputation #46

@ensiferum877

Bug Description

During the discrete supervised munging step using GenoML's example data, the pipeline fails with a TypeError. The error occurs when trying to apply median imputation to what appears to be the sample ID column, which should be handled as identifiers rather than numeric data.

Environment Setup

Using custom Dockerfile based on jupyter/datascience-notebook:

Build and run the Docker container

docker build -t my-datascience-notebook .
docker run -it my-datascience-notebook

Inside the container, clone GenoML repository

git clone https://github.com/GenoML/genoml2.git
cd genoml2

Modified requirements

I had to remove the version pin for xgboost in requirements.txt because of Python version compatibility issues.
Original requirement: xgboost==2.0.3
Modified to: xgboost

Install requirements

pip install -r requirements.txt

Using GenoML's example data

genoml discrete supervised munge \
  --prefix outputs \
  --geno examples/discrete/training \
  --pheno examples/discrete/training_pheno.csv \
  --addit examples/discrete/training_addit.csv

Error message

Pipeline successfully completes PLINK dependency check
Successfully exports genotype data
Completes SNP pruning (12 of 500 variants removed)
Fails during the final data munging step with TypeError

TypeError: Cannot convert [['sample81' 'sample158' 'sample216' ...]] to numeric

The error occurs in the following sequence:

File ".../genoml/cli/munging.py", line 75
    df = munger.plink_inputs()
File ".../genoml/preprocessing/munging.py", line 213
    raw_df = raw_df.fillna(raw_df.median())

The pipeline should recognize sample ID columns as identifiers and exclude them from numeric operations like median imputation. This appears to be an issue with column handling during the munging process rather than with the input data format.
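
A minimal sketch of the expected behavior, outside of GenoML. In pandas 2.0 and later, `DataFrame.median()` no longer silently drops non-numeric columns, so calling it on a frame that still contains string sample IDs raises a TypeError. That may be what happens at line 213, though I have not confirmed which pandas version the container installs. The column names (`ID`, `snp_a`, `age`) and the helper `impute_numeric_median` are made up for illustration and are not taken from GenoML's code.

```python
import numpy as np
import pandas as pd


def impute_numeric_median(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values with per-column medians, numeric columns only.

    Hypothetical helper for illustration; not part of GenoML.
    """
    # numeric_only=True skips string columns such as sample IDs,
    # so the median is never computed over identifiers.
    medians = df.median(numeric_only=True)
    # fillna with a Series fills each column by matching label;
    # columns absent from `medians` (the ID column) are left untouched.
    return df.fillna(medians)


# Toy frame shaped like the munged data: one ID column plus numeric features.
raw_df = pd.DataFrame(
    {
        "ID": ["sample81", "sample158", "sample216"],  # assumed column name
        "snp_a": [0.0, np.nan, 2.0],
        "age": [61.0, 70.0, np.nan],
    }
)

imputed = impute_numeric_median(raw_df)
print(imputed)
```

Passing `numeric_only=True` is one option; setting the sample ID column as the index before imputation would be another way to keep identifiers out of numeric operations.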
