Background
I'm currently using scTab for cell type annotation on my research dataset with the checkpoint file cap-sctab-service-ckpts.tar.gz downloaded from here.
Current Approach
The checkpoint appears to contain 5 runs with different random seeds. I'm implementing an ensemble approach as follows:
- Prediction aggregation: Use all 5 runs to predict cell types for each cell
- Label assignment: Assign the most frequent (mode) predicted label across the 5 runs
- Tie-breaking: If several labels tie for the mode, compute the average predicted probability of each tied label and assign the label with the highest average
- Confidence scoring: Use the average probability from runs that predicted the assigned label as the final confidence score
- Filtering: Plan to filter cells based on this confidence score
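For concreteness, here is a minimal sketch of the steps above. It is not scTab code: the function name `ensemble_vote` and the input layout (a NumPy array of per-run softmax probabilities with shape `(n_runs, n_cells, n_classes)`) are my own assumptions. I am also assuming that, for tie-breaking, the average for a tied label is taken only over the runs that predicted it, the same way as for the confidence score.

```python
import numpy as np

def ensemble_vote(probs):
    """Majority vote over runs with probability-based tie-breaking.

    probs: array of shape (n_runs, n_cells, n_classes) holding the
    per-run class probabilities (hypothetical layout, not an scTab API).
    Returns (labels, confidence), each of shape (n_cells,).
    """
    probs = np.asarray(probs, dtype=float)
    n_runs, n_cells, n_classes = probs.shape

    # Label predicted by each run for each cell: (n_runs, n_cells)
    run_labels = probs.argmax(axis=2)

    # One-hot encode the per-run predictions and count votes per class
    onehot = np.eye(n_classes)[run_labels]   # (n_runs, n_cells, n_classes)
    votes = onehot.sum(axis=0)               # (n_cells, n_classes)

    # Mean probability of each class over the runs that voted for it
    prob_sum = (probs * onehot).sum(axis=0)
    voter_mean = np.divide(
        prob_sum, votes, out=np.zeros_like(prob_sum), where=votes > 0
    )

    # Among the classes with the top vote count, pick the one with the
    # highest mean probability (this resolves ties in the mode)
    tied = votes == votes.max(axis=1, keepdims=True)
    labels = np.where(tied, voter_mean, -np.inf).argmax(axis=1)

    # Confidence: mean probability from the runs that predicted the label
    confidence = voter_mean[np.arange(n_cells), labels]
    return labels, confidence
```

Filtering would then be something like `keep = confidence >= threshold`, where `threshold` is the value I am asking about below.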
Questions
- Is this ensemble approach appropriate and recommended? Or would it be better to use a specific single run from the checkpoint?
- Are there any best practices or recommendations for handling multiple runs in scTab checkpoints?
- Is the tie-breaking method reasonable, or should I consider alternative approaches (e.g., using prediction entropy, maximum probability across runs, etc.)?
- For confidence-based filtering, what threshold ranges have you found effective in practice?
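To make the alternatives I have in mind more concrete, here is a rough sketch of a soft-voting variant: average the probabilities across runs, take the argmax, and use either the maximum averaged probability or the normalized entropy of the averaged distribution as the confidence signal. As before, the function name `soft_vote` and the `(n_runs, n_cells, n_classes)` input layout are assumptions of mine, not anything provided by scTab.

```python
import numpy as np

def soft_vote(probs):
    """Average probabilities across runs, then pick the top class.

    probs: array of shape (n_runs, n_cells, n_classes)
    (hypothetical layout, not an scTab API).
    Returns (labels, max_prob, entropy), each of shape (n_cells,).
    """
    probs = np.asarray(probs, dtype=float)

    # Mean class distribution per cell across runs: (n_cells, n_classes)
    mean_probs = probs.mean(axis=0)

    labels = mean_probs.argmax(axis=1)
    max_prob = mean_probs.max(axis=1)

    # Shannon entropy of the averaged distribution, divided by
    # log(n_classes) so that 0 = fully certain and 1 = uniform.
    # eps avoids log(0) for classes with zero probability.
    eps = 1e-12
    entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=1)
    entropy /= np.log(mean_probs.shape[1])
    return labels, max_prob, entropy
```

With this variant there are no ties in a mode to resolve, since the label comes directly from the averaged probabilities.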
Any guidance on the optimal strategy for utilizing multiple runs would be greatly appreciated!