Subfamilies as a way to resolve too large families #253
Hello CAFE5 people! To tackle this problem without losing useful data, I was wondering whether it would be advisable to further split such families into subfamilies/suborthogroups, which would reduce the number of sequences belonging to the most represented species. I was thinking of doing so either with InterProScan, hoping to find domain patterns that could help me fragment these big families, or with a reciprocal BLASTp (as in the pipeline presented in the CAFE5 tutorial) with more or less strict parameters. Personally, I find the former option more suitable, since it can actually take big families apart, while the latter seems more prone to focusing on point mutations and failing to take into account the evolutionary history of the family and its components. Maybe this could at least be a first step toward a solution? Do you think that by creating these artificial families I will fail to capture the actual evolutionary signal I'm trying to study? What do you think? Happy coding :)
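As a rough illustration of the InterProScan route, one could group each family's proteins by their ordered Pfam domain architecture and treat each architecture as a subfamily. This is only a sketch: the protein IDs and Pfam accessions below are made up, the row tuples mimic a subset of InterProScan 5 TSV columns (protein accession, analysis, signature accession, start, stop), and restricting to Pfam hits is an assumption, not something from this thread.

```python
from collections import defaultdict

def split_by_architecture(rows):
    """Group protein IDs by their ordered Pfam domain architecture.

    `rows` are tuples mimicking a subset of InterProScan 5 TSV columns:
    (protein_id, analysis, signature_accession, start, stop).
    Proteins sharing the same ordered domain string form one subfamily.
    """
    domains = defaultdict(list)  # protein_id -> [(start, accession)]
    for pid, analysis, acc, start, stop in rows:
        if analysis == "Pfam":   # assumption: use Pfam hits only
            domains[pid].append((start, acc))
    subfamilies = defaultdict(list)  # architecture string -> [protein_ids]
    for pid, hits in domains.items():
        # order domains along the sequence, then join into a signature
        arch = "-".join(acc for _, acc in sorted(hits))
        subfamilies[arch].append(pid)
    return dict(subfamilies)

# Toy family: two proteins share a two-domain architecture, one lost a domain.
rows = [
    ("sp1_g1", "Pfam", "PF00001", 10, 80),
    ("sp1_g1", "Pfam", "PF00002", 100, 180),
    ("sp2_g1", "Pfam", "PF00001", 12, 82),
    ("sp2_g1", "Pfam", "PF00002", 95, 175),
    ("sp3_g1", "Pfam", "PF00001", 9, 79),
]
print(split_by_architecture(rows))
# → {'PF00001-PF00002': ['sp1_g1', 'sp2_g1'], 'PF00001': ['sp3_g1']}
```

Each resulting subfamily could then be recounted per species and fed to CAFE5 as its own row in the family counts table.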
Replies: 1 comment
Hello,

I think this is a perfectly reasonable thing to do, with one caveat: while this will allow you to better infer ancestral states (and gains and losses per family), I worry about the consequences for inferring overall lambdas. We don't really know what would happen if you cut different trees off at different heights (which is essentially what is happening) for the estimates of lambda. It's kind of like clustering different parts of your dataset differently. It might have no effect or only a subtle effect, but I've never tested it!

matt