
Model Questions and Missing Information on Tutorial #241

@danielagandrade

Hi CAFE5 developers, I am new to your software and excited by the questions it allows me to ask. Thank you for developing it!

I have a couple of clarifying questions that I would appreciate getting your input on. Additionally, I wanted to bring a couple of things to your attention.

  1. Does CAFE5 work well with gene families in which several species have 0 gene counts, or is it best to limit CAFE5 to orthogroups shared by all species? With OrthoFinder, I identified a total of 33,123 orthogroups, but only ~4300 are shared by all my species. These are the gene families I have chosen as CAFE5's input. However, I wonder whether I am losing a lot of informative data by excluding families that contain zeros, so I would like to know whether CAFE5 is robust to 0 gene counts in gene families.

  2. The CAFE5 manuscript (Mendes et al. 2021) makes a couple of important observations about K that I would like to clarify. It says that K = 1 usually underestimates lambda and that K = 2-3 slightly overestimates lambda. Is there a good way to select the value of K? I have seen in a couple of discussion posts that there is no clear way to determine which K is best and that the likelihoods of models with different Ks cannot be compared. However, Fig. 1A in the manuscript marks with an * the K for which the highest likelihood was estimated, so I am a bit confused. Is the maximum likelihood a good indicator of a proper K? If not, how does one determine that an optimal or "good enough" K has been found for the model?

  3. Can models with different numbers of lambdas be considered nested? In other words, can I use a likelihood ratio test to compare models with 1-4 lambdas and determine which fits my data best? The species I am comparing are all from the same Class, but some diverged over 200 million years ago, so I do not expect the rates to be the same across the entire phylogenetic tree.

  4. Would you suggest determining the best number for K first or the best number of lambdas first?

  5. It is unclear to me how to use the simulation function of CAFE5 and how it can help me determine which values are best for my model. The tutorial (http://github.com/hahnlab/CAFE5/blob/master/docs/tutorial/tutorial.md#simulation) refers to an example and to commands that are not shown:

"Simulation
Here, the genfamily command simulates the datasets (in the example above, we are asking for 100 simulations with -t 100). It estimates λ from the observed data to simulate gene families. Then the likelihoods of the two competing models are calculated with the lhtest function, which takes the multi-λ tree structure, and the estimated λ value using the global-λ model."

Is there another document I should refer to or do you have any suggestions?
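
To make question 3 concrete, this is the comparison I have in mind: a minimal sketch of a standard likelihood ratio test in Python. It assumes the models really are nested and that the chi-square approximation holds for CAFE5 models, which is exactly what I am asking about. The function names and the log-likelihood values are hypothetical placeholders, not CAFE5 output or part of CAFE5.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: start from the df = 1 tail, then add one correction term
    # for each additional pair of degrees of freedom.
    tail = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for i in range(1, (df - 1) // 2 + 1):
        tail += term
        term *= x / (2 * i + 1)
    return tail


def likelihood_ratio_test(lnl_null, lnl_alt, extra_params):
    """Return (statistic, p-value) for nested models.

    lnl_null and lnl_alt are log-likelihoods (lnL). If the values at hand
    are negative log-likelihoods (-lnL), flip their sign first.
    extra_params is the number of additional free parameters in the
    richer model, e.g. 1 when going from one lambda to two.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, extra_params)


# Placeholder values for illustration only (not real results).
stat, p_value = likelihood_ratio_test(lnl_null=-100.0, lnl_alt=-97.0, extra_params=2)
print(f"LRT statistic = {stat:.3f}, p = {p_value:.4f}")
```

If this kind of test is not valid here, I assume the simulation-based comparison described in the tutorial is meant to replace the chi-square step, which is why the missing commands matter to me.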

Once again, thank you for developing CAFE5 and for all of your help!
