Experimental Design Factors Controlling Power to Detect Episodic Fitness Shifts using Comparative Genomic Sequencing Data
Date:
A central challenge in comparative genomics is identifying the genomic basis for lineage-specific adaptations to changing functional requirements. One approach to addressing this problem using genomic sequence data alone, which overcomes limitations of traditional dN/dS approaches, is to model site-heterogeneous sequence-fitness relationships and to infer changes to such relationships along the branches of a phylogeny (inference of ‘fitness shifts’). This requires very parameter-rich models and correspondingly large datasets. However, the factors shaping the detectability of fitness shifts, as well as the distinguishability of fitness shifts from other time-heterogeneous forces such as variation in effective population sizes across lineages, are unclear. Here, we develop a framework for identifying such factors by measuring the asymptotic distinguishability of models of sequence evolution along a fixed phylogeny. We developed an efficient C++ library for modelling and inference of time-heterogeneous (Markov-modulated) mutation-selection codon substitution models – a class of models with an explicit population genetics basis. Using this framework, we measured distinguishability of fitness-shift from time-homogeneous evolution, and fitness-shift from changes in the effective population size in terms of the Kullback-Leibler divergence between models. Using these measurements, we show how asymptotic power analysis can be easily performed to assess minimum sample sizes needed to achieve reasonable power levels. Notably, we focused our analysis on the SARS-CoV-2 phylogeny, given the pressing need to identify functional shifts among variants of concern amidst the ongoing pandemic.