Hi CAFE5 developers, I am new to your software and excited by the questions it allows me to ask. Thank you for developing it!
I have a couple of clarifying questions that I would appreciate getting your input on. Additionally, I wanted to bring a couple of things to your attention.
1. Does CAFE5 work well with gene families in which several species have a gene count of 0, or is it best to limit CAFE5 to orthogroups shared by all species? With OrthoFinder, I identified a total of 33,123 orthogroups, but only ~4300 are shared by all my species. These are the gene families I have chosen as CAFE5's input, but I wonder whether I am losing a lot of informative data by avoiding zeros, so I would like to know whether CAFE5 is robust to gene families that contain 0 counts.
2. The CAFE5 manuscript (Mendes et al. 2021) makes a couple of important observations about K that I would like to clarify. It states that K = 1 usually underestimates lambda and that K = 2-3 slightly overestimates lambda. Is there a good way to choose K? I have seen in a couple of discussion posts that there is no clear way to determine which K is best and that the likelihoods of models with different K values cannot be compared. However, Fig. 1A in the manuscript marks with an * the K for which the highest likelihood was estimated, so I am a bit confused. Is the maximum likelihood a good indicator of an appropriate K? If not, how does one determine that an optimal or "good enough" K has been found for the model?
3. Can models with different numbers of lambdas be considered nested? In other words, can I use a likelihood ratio test on models with 1-4 lambdas to determine which fits my data best? The species I am comparing are all from the same Class, but some diverged over 200 million years ago, so I do not expect the rates to be the same across the entire phylogenetic tree.
4. Would you suggest determining the best K first or the best number of lambdas first?
5. It is unclear to me how to use the simulation function of CAFE5 and how it can help me determine which values are best for my model. The tutorial (http://github.com/hahnlab/CAFE5/blob/master/docs/tutorial/tutorial.md#simulation) refers to an example and to commands that are not shown:
"Simulation
Here, the genfamily command simulates the datasets (in the example above, we are asking for 100 simulations with -t 100). It estimates λ from the observed data to simulate gene families. Then the likelihoods of the two competing models are calculated with the lhtest function, which takes the multi-λ tree structure, and the estimated λ value using the global-λ model."
Is there another document I should refer to or do you have any suggestions?
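To make question 3 concrete, the comparison I have in mind can be sketched as below. This is only a minimal illustration, assuming that the models really are nested, that the scores being compared are negative log-likelihoods (-lnL), and that the test statistic is approximately chi-squared distributed with degrees of freedom equal to the number of extra lambda parameters. Whether those assumptions hold for CAFE5's multi-lambda models is exactly what I am asking. The function names and the -lnL values are hypothetical placeholders, not CAFE5 output.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared distribution (integer df >= 1)."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from a closed form: Q(1, y) = exp(-y) for even df,
    # Q(1/2, y) = erfc(sqrt(y)) for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Recurrence for the regularized upper incomplete gamma function:
    # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(neg_lnl_null, neg_lnl_alt, extra_params):
    """Return (statistic, p-value) for a simpler (null) vs richer (alt) model.

    Both scores are assumed to be negative log-likelihoods, so a better fit
    means a smaller value.
    """
    stat = 2.0 * (neg_lnl_null - neg_lnl_alt)
    stat = max(stat, 0.0)  # guard against small negative values from optimizer noise
    return stat, chi2_sf(stat, extra_params)


if __name__ == "__main__":
    # Placeholder scores: one global lambda vs. two lambdas (one extra parameter).
    stat, p = likelihood_ratio_test(neg_lnl_null=1000.0, neg_lnl_alt=995.0, extra_params=1)
    print(f"LRT statistic = {stat:.3f}, p = {p:.4g}")
```

If the chi-squared approximation is not appropriate here, I assume the simulation-based approach described in the tutorial is meant to provide the null distribution instead.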
Once again, thank you for developing CAFE5 and for all of your help!