[Question] Ensemble approach using multiple runs from checkpoint files #19

@Khreat0205

Background

I'm currently using scTab for cell type annotation on my research dataset with the checkpoint file cap-sctab-service-ckpts.tar.gz downloaded from here.

Current Approach

The checkpoint appears to contain 5 runs with different random seeds. I'm implementing an ensemble approach as follows:

  1. Prediction aggregation: Use all 5 runs to predict cell types for each cell
  2. Label assignment: Assign the most frequent (mode) predicted label across the 5 runs
  3. Tie-breaking: If several labels tie for the mode, compute the average prediction probability for each tied label and assign the label with the highest average probability
  4. Confidence scoring: Use the average probability from runs that predicted the assigned label as the final confidence score
  5. Filtering: Plan to filter cells based on this confidence score
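
For concreteness, steps 1 to 4 above can be sketched roughly as follows. This is a minimal NumPy illustration, not scTab code: the function name `ensemble_vote` and the input layout (an array of per-run softmax probabilities of shape `(n_runs, n_cells, n_classes)`) are hypothetical, and loading the checkpoints and running inference is assumed to happen elsewhere. For tie-breaking, the sketch averages each tied label's probability over all runs, which is one possible reading of step 3.

```python
import numpy as np


def ensemble_vote(probs):
    """Majority-vote ensemble over per-run class probabilities.

    probs: array-like of shape (n_runs, n_cells, n_classes).
    Returns (labels, confidence), each of shape (n_cells,).
    """
    probs = np.asarray(probs, dtype=float)
    n_runs, n_cells, n_classes = probs.shape
    cells = np.arange(n_cells)

    # Step 1: each run predicts its most probable class for every cell.
    preds = probs.argmax(axis=2)  # (n_runs, n_cells)

    # Step 2: count votes per cell and class.
    votes = np.zeros((n_cells, n_classes), dtype=int)
    for run_preds in preds:
        votes[cells, run_preds] += 1

    # Step 3: among labels tied for the most votes, pick the one with the
    # highest probability averaged over all runs (assumption, see lead-in).
    mean_probs = probs.mean(axis=0)  # (n_cells, n_classes)
    tied = votes == votes.max(axis=1, keepdims=True)
    labels = np.where(tied, mean_probs, -np.inf).argmax(axis=1)

    # Step 4: confidence = mean probability of the assigned label,
    # averaged only over the runs that predicted that label.
    agree = preds == labels[None, :]  # (n_runs, n_cells)
    p_label = probs[:, cells, labels]  # (n_runs, n_cells)
    confidence = (p_label * agree).sum(axis=0) / agree.sum(axis=0)
    return labels, confidence
```

Step 5 would then reduce to a boolean mask such as `confidence >= threshold`, where `threshold` is exactly the value I'm unsure about.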

Questions

  1. Is this ensemble approach appropriate and recommended? Or would it be better to use a specific single run from the checkpoint?
  2. Are there any best practices or recommendations for handling multiple runs in scTab checkpoints?
  3. Is the tie-breaking method reasonable, or should I consider alternative approaches (e.g., using prediction entropy, maximum probability across runs, etc.)?
  4. For confidence-based filtering, what threshold ranges have you found effective in practice?
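
To make the alternatives in question 3 concrete, here is a rough sketch of one of them: averaging the probabilities across runs (soft voting), then using the maximum mean probability and the normalized entropy of the mean distribution as uncertainty measures. The function name and input layout are hypothetical, as above, and I don't know whether scTab recommends anything along these lines.

```python
import numpy as np


def soft_vote_with_entropy(probs):
    """Soft-voting ensemble with two per-cell uncertainty measures.

    probs: array-like of shape (n_runs, n_cells, n_classes), n_classes > 1.
    Returns (labels, max_prob, norm_entropy), each of shape (n_cells,).
    """
    probs = np.asarray(probs, dtype=float)

    # Average the class probabilities over runs; no ties in vote counts
    # can occur because labels come directly from the mean distribution.
    mean_probs = probs.mean(axis=0)  # (n_cells, n_classes)
    labels = mean_probs.argmax(axis=1)
    max_prob = mean_probs.max(axis=1)

    # Shannon entropy of the mean distribution, scaled to [0, 1] by
    # dividing by log(n_classes); eps guards against log(0).
    eps = 1e-12
    entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=1)
    norm_entropy = entropy / np.log(mean_probs.shape[1])
    return labels, max_prob, norm_entropy
```

With this variant, a cell could be filtered on either a low `max_prob` or a high `norm_entropy`; which of the two behaves better in practice is part of what I'm asking.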

Any guidance on the optimal strategy for utilizing multiple runs would be greatly appreciated!
