Key results are inflated by an error in the code.
The authors find that the SARS-CoV-2 lineages "A" and "B" are the result of at least two separate cross-species transmission events into humans. This finding is based on Bayes factors calculated in the Jupyter notebook cladeAnalysis.ipynb. This notebook has a bug in the function clade_analysis_updated
. Re-running the main analysis without the bug reduces the Bayes factors by a factor of six. This significantly reduces confidence in the authors' findings.
The sensitivity analyses should be re-run without the bug (this isn't feasible without the data, which have not been published).
There are other deficiencies. In particular, the different hypotheses are tested against different evidence (more specifically, against different conditions derived from the evidence), so that the Bayes factors are heavily biased in favour of the multiple spillover hypothesis. This issue will be addressed in a separate comment.
The Bayes factor calculations require likelihoods of the observed data being produced by the different hypotheses. For the single introduction hypothesis, this is evaluated by generating 1100 simulated viral phylogenies and counting those with topologies that are deemed compatible with the observed data. The compatible topologies are denoted CC
and AB
and correspond, respectively, to the center and right topologies in Figure 2 of the paper (reproduced below).
As shown in Figure 2, the CC
topology has two clades on branches with one mutation from the MRCA ('one-mutation clades'). The AB
topology has a clade on a branch with two mutations from the MRCA (a "two-mutation" clade).
For each simulated viral phylogeny XXXX, the details of the one- and two-mutation clades are collated by the script stableCoalescence_cladeAnalysis.py into the files XXXX_clade_analysis_CC_polytomy.txt
and XXXX_clade_analysis_AB_polytomy.txt
, respectively. The function clade_analysis_updated
extracts these details to clade_analyses_CC_d[XXXX] and clade_analyses_AB_d[XXXX]. The details list each clade's size (under the label 'clade_sizes') and the sizes of its subclades (under the label 'subclade_sizes').
Note that the one-mutation clades include all clades on branches with at least one mutation from the MRCA. Likewise, the two-mutation clades include all clades on branches with at least two mutations from the MRCA. This means that the two-mutation clades are a subset of the one-mutation clades. However, the details of a given two-mutation clade will not normally be stored at the same position in the lists of clade_analyses_CC_d[XXXX] and clade_analyses_AB_d[XXXX].
A simulated viral phylogeny is deemed to have an AB topology if it has:

- a two-mutation clade comprising between 30% and 70% of all taxa;
- a basal polytomy; and
- a polytomy at the base of the two-mutation clade.

These requirements are tested under # A/B analysis in the function clade_analysis_updated in the notebook cladeAnalysis.ipynb. The relevant code is reproduced below.
# A/B analysis
ab_count_30perc = 0 # interested in 2 mutations clade that are at least 30% of all taxa
ab_count_30perc_polytomy = 0 # interested in 2 mutations clade that are at least 30% of all taxa + has a basal polytomy
ab_count_30perc_twoPolytomies = 0 # interested in 2 mutations clade that are at least 30% of all taxa + has a basal polytomy + polytomy at 2 mutation clade
lower_constraint = 0.3 # the 2-mutation clade must be at least 30% of all taxa
upper_constraint = 0.7 # the 2-mutation clade must be at most 70% of all taxa
for run in clade_analyses_AB_d:
num_leaves = sum(clade_analyses_CC_d[run]['clade_sizes'])
base_polytomy_size = len(clade_analyses_CC_d[run]['clade_sizes'])
clade_sizes = clade_analyses_AB_d[run]['clade_sizes']
subclade_sizes = clade_analyses_CC_d[run]['subclade_sizes'].copy()
if not clade_sizes: # no 2 mutation clades
continue
if max(clade_sizes) >= lower_constraint*num_leaves and max(clade_sizes) <= upper_constraint*num_leaves: # clades match size restrictions
if len(clade_sizes) == 1:
ab_count_30perc += 1
if base_polytomy_size >= min_polytomy_size: # basal polytomy
ab_count_30perc_polytomy += 1
if len(subclade_sizes[0]) >= min_polytomy_size: # two-mutation clade has polytomy
ab_count_30perc_twoPolytomies += 1
else:
clade2 = sorted(clade_sizes, reverse=True)[1]
ab_count_30perc += 1
if base_polytomy_size >= min_polytomy_size: # basal polytomy
ab_count_30perc_polytomy += 1
max2mutCladeLoc = clade_sizes.index(max(clade_sizes))
if len(subclade_sizes[max2mutCladeLoc]) >= min_polytomy_size: # two mutation clade has polytomy
ab_count_30perc_twoPolytomies += 1
Although the code is somewhat messy, it can be seen that the size of the polytomy at the two-mutation clade is determined in two different ways:

- where there is only one two-mutation clade (len(clade_sizes) == 1), the size of the polytomy is taken as len(subclade_sizes[0]); and
- otherwise, the position of the largest two-mutation clade is found (max2mutCladeLoc = clade_sizes.index(max(clade_sizes))) and the size of the polytomy is taken as len(subclade_sizes[max2mutCladeLoc]).

It can also be seen that clade_sizes is taken from the two-mutation list (i.e. clade_sizes = clade_analyses_AB_d[run]['clade_sizes']), but subclade_sizes is taken from the one-mutation list (i.e. subclade_sizes = clade_analyses_CC_d[run]['subclade_sizes'].copy()).
This means that, rather than testing whether the two-mutation clade has a sufficiently large polytomy, the code instead tests a one-mutation clade that is effectively selected at random, depending on how the clades are ordered in clade_analyses_CC_d[run] and where the largest (or only) clade is located in clade_analyses_AB_d[run].

This does not agree with the AB topology requirements described in the text and the supplementary materials. Moreover, it makes no sense. There can be no valid reason for introducing randomness in this way. Rather, this seems to be the result of a simple copy-paste error, where subclade_sizes = clade_analyses_CC_d[run]['subclade_sizes'].copy() should have been changed to subclade_sizes = clade_analyses_AB_d[run]['subclade_sizes'].copy().
The bug can be reproduced with code and data from the GitHub repository.
wget https://raw.githubusercontent.com/sars-cov-2-origins/multi-introduction/main/notebooks/cladeAnalysis.ipynb
wget https://github.com/sars-cov-2-origins/multi-introduction/blob/71ed420fe11ecdbe589568255ec90ca56d6e221c/FAVITES-COVID-Lite/cumulative_results/clade_analyses_CC.zip
wget https://github.com/sars-cov-2-origins/multi-introduction/blob/71ed420fe11ecdbe589568255ec90ca56d6e221c/FAVITES-COVID-Lite/cumulative_results/clade_analyses_AB.zip
unzip simulations_cumulative_results/clade_analyses_CC.zip
unzip simulations_cumulative_results/clade_analyses_AB.zip
After launching the notebook, it may be necessary to comment out any unnecessary imports that cause errors, e.g.
#import treeswift
#from utils import *
#import seaborn as sns
The second and sixth code cells must be updated so that clade_analyses_CC_dir
and clade_analyses_AB_dir
point to the newly unzipped clade_analyses_CC/
and clade_analyses_AB/
directories, e.g.
clade_analyses_CC_dir = 'clade_analyses_CC/'
clade_analyses_AB_dir = 'clade_analyses_AB/'
The seventh and eighth code cells may be deleted, because they require data that have not been published (alternatively, the errors they throw may be ignored).
Re-running the notebook should reproduce the published results of the main analysis.
The assignment of subclade_sizes can then be corrected to use the two-mutation list.
for run in clade_analyses_AB_d:
num_leaves = sum(clade_analyses_CC_d[run]['clade_sizes'])
base_polytomy_size = len(clade_analyses_CC_d[run]['clade_sizes'])
clade_sizes = clade_analyses_AB_d[run]['clade_sizes']
####### CC has been changed to AB in the line below###########################
subclade_sizes = clade_analyses_AB_d[run]['subclade_sizes'].copy()
Re-running the notebook should now produce the intended results of the main analysis.
Correcting the error increases the likelihood of a single introduction producing the AB
topology from 0.5% to 3%, and reduces the Bayes factors from ~60 to less than 10.
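As a rough back-of-the-envelope check (treating the AB likelihood as the dominant term in the Bayes factor's denominator), a six-fold increase in that likelihood translates into roughly a six-fold reduction in the Bayes factor:

# rough check only, using the approximate figures quoted above
p_ab_buggy = 0.005              # ~0.5% with the bug
p_ab_fixed = 0.03               # ~3% without the bug
bf_buggy = 60                   # approximate published Bayes factor
bf_fixed = bf_buggy * p_ab_buggy / p_ab_fixed
print(round(bf_fixed))          # ~10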
#1 A related question-- A two-crossover model has more parameters than a one-crossover model. There's one extra crossover date and there may be more implicit adjustability in the early dynamics, although I haven't thought hard about that second point. It's not kosher to use simple likelihood factors to compare hypotheses with different numbers of parameters, since the best-fit version of the model that adds a parameter will always get higher likelihood. Usually a penalty is added for parameters, e.g. in AIC, etc. If Pekar's BF is down to about 10, and they omitted a parameter penalty, then very little effective BF would be left even before considering issues with the model. Does anybody know if the programs have some intrinsic parameter penalty, or was that really omitted?
Aha, I think this issue can be handled. In the Supplement Pekar et al. say they assume equal prior probabilities for one crossover and two and that they simply exclude all other possibilities. Those are peculiar priors.
Let's assume realistically that the number N of crossovers is described by a Poisson distribution with expectation value x. A conventional non-informative PDF for x would be proportional to 1/x. (The logarithmic divergence of its integral will drop out soon.) This prior seems generous in the context of zero identified wildlife hosts, but at least it's conventional. Then it's trivial to integrate the Poisson probabilities for N=1 and N=2 over x, getting P(2)/P(1)= 1/2. Assuming the bug correction is also right, that reduces the Bayes posterior odds to 5.
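For readers checking the arithmetic, the integral is a standard gamma integral. With the Poisson likelihood and the 1/x prior,

P(N=n) \propto \int_0^\infty \frac{e^{-x}x^{n}}{n!}\,\frac{dx}{x} = \frac{\Gamma(n)}{n!} = \frac{1}{n}, \qquad \text{so} \qquad \frac{P(N=2)}{P(N=1)} = \frac{1}{2}.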
Whether the epidemiological model used is realistic and whether N has any relevance to the origins issue are more questions for experts.
The key results are further inflated by another error in the notebook cladeAnalysis.ipynb.
The function calculate_bf
takes an array of values as simulation_results
and divides them by their sum to produce pr_3_topos
- an array of probabilities for the three topologies obtainable from a single introduction:
def calculate_bf(asr_results, simulation_results):
...
# FAVITES simulation information
# the 3 trees are in the order (t_p, t_1C, t_2C)
pr_3_topos = np.array(simulation_results)/sum(simulation_results)
This would be an appropriate way to compute probabilities if the array contained exhaustive count data, but it actually contains probabilities already, as computed in clade_analysis_updated:
def clade_analysis_updated(clade_analyses_CC_dir, clade_analyses_AB_dir, label, min_polytomy_size=100, _print_=False):
...
polytomy_result = count_atLeastMinDescendants/1100
…
cc_result = cc_count_30perc_twoPolytomies/1100
ab_result = ab_count_30perc_twoPolytomies/1100
…
simulation_results = [polytomy_result, ab_result, cc_result]
…
bf_unconstrained = calculate_bf(unconstrained_results, simulation_results)
bf_recCA = calculate_bf(recCA_results, simulation_results)
The array of probabilities is not exhaustive; it contains only the probabilities for the topologies in Figure 2 (reproduced below), though in a different order. Dividing these probabilities by their sum inflates them, because the probability of producing a topology not depicted in Figure 2 is discarded. In the primary analysis, this increases the probabilities used for the single introduction topologies from [0.475, 0.005, 0] to [0.991, 0.009, 0] and increases the probabilities used for the two introduction topologies from [0.226, 0.002, 0, 0.002, 0.00002, 0, 0, 0, 0] to [0.981, 0.009, 0, 0.009, 0.0001, 0, 0, 0, 0]. The net result is a doubling of the Bayes factor.
This is obviously wrong. The correct calculation, as described on page 13 of the Supplementary Materials, does not include these increases.
This mistake might have been caused by miscommunication causing the programmer of calculate_bf
to expect count data while the programmer of clade_analysis_update
thought probabilities were required.
As in #1, the bug can be reproduced with code and data from the GitHub repository.
wget https://raw.githubusercontent.com/sars-cov-2-origins/multi-introduction/main/notebooks/cladeAnalysis.ipynb
wget https://github.com/sars-cov-2-origins/multi-introduction/blob/71ed420fe11ecdbe589568255ec90ca56d6e221c/FAVITES-COVID-Lite/cumulative_results/clade_analyses_CC.zip
wget https://github.com/sars-cov-2-origins/multi-introduction/blob/71ed420fe11ecdbe589568255ec90ca56d6e221c/FAVITES-COVID-Lite/cumulative_results/clade_analyses_AB.zip
unzip simulations_cumulative_results/clade_analyses_CC.zip
unzip simulations_cumulative_results/clade_analyses_AB.zip
Unnecessary imports may be commented out, e.g.
#import treeswift
#from utils import *
#import seaborn as sns
The second and sixth code cells must be updated so that clade_analyses_CC_dir
and clade_analyses_AB_dir
point to the newly unzipped clade_analyses_CC/
and clade_analyses_AB/
directories, e.g.
clade_analyses_CC_dir = 'clade_analyses_CC/'
clade_analyses_AB_dir = 'clade_analyses_AB/'
The function calculate_bf
can be edited to print the values used for P(t_p|I_1) and P((t_p,t_p)|I_2):
# FAVITES simulation information
# the 3 trees are in the order (t_p, t_1C, t_2C)
pr_3_topos = np.array(simulation_results)/sum(simulation_results)
pr_trees_given_I1 = np.concatenate([pr_3_topos, np.array([0]*9)])
pr_trees_given_I2 = np.concatenate([np.array([0]*3), np.outer(pr_3_topos, pr_3_topos).flatten()])
### print statements added below ###
print('Value used as P(t_p|I_1):')
print(str(pr_trees_given_I1[0]))
print('Value used as P((t_p,t_p)|I_2):')
print(str(pr_trees_given_I2[3]))
Re-running the notebook should reproduce the published results of the main analysis, but the values used for P(t_p|I_1) and P((t_p,t_p)|I_2) are not those shown on page 13 of the Supplementary Materials:
The function calculate_bf
can be edited again to comment out the mistaken division:
# FAVITES simulation information
# the 3 trees are in the order (t_p, t_1C, t_2C)
### the mistaken division is commented out of the line below ###
pr_3_topos = np.array(simulation_results) #/sum(simulation_results)
Re-running the notebook should now show that the values used for P(t_p|I_1) and P((t_p,t_p)|I_2) are the same as those shown on page 13 of the Supplementary Materials. Also, the Bayes factors are halved:
Adding the correction from #1 and re-running once more...
The combined corrections reduce the Bayes factors from ~60 to less than 5.
Another error increases the same key results.
The Bayes factor calculations marginalize probabilities over a list of topologies. The first three topologies correspond to the topologies A, C and B depicted in Figure 2 (reproduced in #1 and #6 above). In the code's notation they are t_p (a basal polytomy), t_1C (a clade on a branch two mutations from the MRCA) and t_2C (two clades).

The topologies are not distinct; a phylogeny with a basal polytomy and a descendant polytomy on a two-mutation branch is counted as both t_p and t_1C.

The next nine topologies are the pairwise combinations (t_p, t_p), (t_p, t_1C), (t_p, t_2C), (t_1C, t_p), (t_1C, t_1C), (t_1C, t_2C), (t_2C, t_p), (t_2C, t_1C) and (t_2C, t_2C).

Four different MRCA haplotypes are considered; their compatibility with the different topologies is set out on page 13 of the Supplementary Materials.
Posterior probabilities of the different MRCA haplotypes (taken from BEAST) are multiplied with the topology likelihoods (taken from the simulations) through a compatibility matrix that distributes the likelihoods across compatible MRCA haplotypes. Because t_p includes t_1C, the likelihoods for t_1C are counted twice in these calculations: firstly as part of t_p in (t_p, t_p), and then separately in (t_p, t_1C), (t_1C, t_p) and (t_1C, t_1C).

This duplication can be corrected by negating the compatibility of (t_p, t_1C), (t_1C, t_p) and (t_1C, t_1C), e.g. by changing the compatibility matrix from:
to
Combining the corrections from #1 and #6, the notebook can be re-run to obtain the correct Bayes factors.
Removing the duplicated likelihoods reduces the Bayes factors by a further ~12%.
A revised version of cladeAnalysis.ipynb has been uploaded to the GitHub repository.
The programming error that undercounted AB topologies (#1) is corrected, as indicated in the Erratum.
The improper normalization of the single-introduction likelihoods (#6) is removed:
The double-counting of two-introduction topologies (#7) is corrected by making t_p and t_1C exclusive:

This is mathematically identical to retaining the original t_p and negating compatibility with (t_p, t_1C), (t_1C, t_p) and (t_1C, t_1C), as in #7 (the marginalization over the two-introduction topologies is superfluous).
Frequencies and Bayes factors are updated in Table 5 of the Supplementary Materials:
The revised results for the primary analysis are slightly different to those in #7 because the revision also removes an undocumented requirement that the two-mutation clade of AB topologies be the largest, as discussed here and verified here.
#9 Really embarrassing correction to #4: The original factor of 2 was correct. P(N=2)/P(N=1) = 1/2 starting with non-informative priors on the expectation value, x. Conditioning on N>0 does eliminate the small-x logarithmic divergence of the normalization but does not change the odds ratio for N=2 vs. N=1, since those probabilities already excluded all N=0 cases. Like I said, it's embarrassing to make a mistake and then have to undo it.
There are discrepancies between the methods described in the text and those carried out by the code.
The effects are difficult to quantify because the stochastic simulations are not reproducible and the results are sensitive to sampling noise. However, resampling the final stochastic phase of the simulations, using corrected code, reduces the Bayes factors of the primary analysis by ~15%.
The simulated phylogenies are constructed by coalescing lineages sampled from the first 50,000 simulated cases. Partial and delayed sampling is simulated by sampling only a portion of simulated cases, and by ignoring samples that precede the first simulated hospitalization (page 8 of the Supplementary Materials).
Samples are also ignored if they are on branches ancestral to a stable coalescence (page 10 of the Supplementary Materials).
Contrary to the described method, the function coalescent_timing
in the script stableCoalescence_cladeAnalysis.py does not stop the tMRCA calculations when 50,000 individuals have been infected. Calculations continue until the last day of the simulation.
def coalescent_timing(time_inf_dict, current_inf_dict, total_inf_dict, tree, num_days=100):
# determine the stable coalescence of the tree
...
times = list(time_inf_dict.keys()) # the number of days in the sim
...
for index, time in enumerate(times):
if time > num_days: # sometimes gemf goes past the limit but we don't always know when
break
labels = time_inf_dict[time] # currently circulating infections
if time == 0:
...
else: # get height of the subtree; the tMRCA
subtree = tree.extract_tree_with(labels)
height_subtree = distance_from_zero_dict[subtree.root.label]
heights.append(height_subtree)
...
coalescent_timing_results = [used_times, heights, total_inf, current_inf, current_samples]
return coalescent_timing_results
For example, the first successful simulation run 0001
reaches 50,000 cases on day 39, when the tMRCA of active sampled cases is 0.000333 years (~3 hours). The calculations continue until the end of the simulation, after another 61 days and 1.3 million cases. By then, only 710 sampled cases are still active and their tMRCA is 0.016277 years (~6 days). This can be seen in the file simulations_01/0001/coalData_parameterized.txt
in the published data on Zenodo.
time coalescence time total infected currently infected current samples
...
38 0.000333 45444 33079 5731
39 0.000333 53606 38638 6175
...
99 0.016277 1363477 149166 743
100 0.016277 1371985 144107 710
The tMRCA from the last day of the simulation is used as the time of stable coalescence. This can be seen by comparing the coalescence times in each simulation's coalData_parameterized.txt
to those recorded in the summary data on GitHub.
Run Coalescent time First ascertained First unascertained First hospitalized infections
0001 0.016277 0.003101 0.018548 0.038307 3
...
This means the code removes basal lineages that do not have active sampled cases at the end of the simulation (day 100), even if they do have active sampled cases at the end of the sampling period (the infection of the 50,000th case). This does not agree with the method described in the text.
As described in the text, the stable coalescence is based on the MRCA of lineages that have a sampled case that is active when the 50,000th case is infected. Lineages are more likely to have a sampled case that is active at this time if they undergo more early growth, so they are more likely to be associated with an early polytomy. By rooting the tree at the stable coalescence, basal lineages that are less likely to be associated with an early polytomy are removed. If the next structure is a polytomy, it is made basal.
As implemented in the code, the stable coalescence is the MRCA of lineages that have a sampled case that is active at the end of the simulation, even though the last sampled case might have been infected months earlier. Lineages are more likely to have a sampled case that is active at this later time if they undergo a lot more early growth, so they are much more likely to be associated with an early polytomy. This means more basal lineages are removed, and more polytomies are made basal.
Lineages are not removed if they descend from the stable coalescence.
The epidemic simulation script FAVITES-COVID-Lite_noSeqgen.py accepts commands to initialise the primary case in arbitrary states. However, the script is hard-coded to initialise the primary case in an infectious state (P1
), so that transmissions begin immediately.
# write GEMF status file
out_file = open(out_fn, 'w')
status_fn = "%s/%s" % (gemf_tmp_dir, GEMF_STATUS_FN)
status_file = open(status_fn, 'w')
seeds = set(sample(range(max_node_label+1), k=num_seeds))
gemf_state = list()
for node in range(max_node_label+1):
if node in seeds:
status_file.write("%d\n" % gemf_state_to_num['P1'])
gemf_state.append(gemf_state_to_num['P1'])
out_file.write("None\t%d\t0\n" % node)
else:
status_file.write("%d\n" % gemf_state_to_num['S'])
gemf_state.append(gemf_state_to_num['S'])
status_file.close()
print_log("Wrote GEMF '%s' file: %s" % (GEMF_STATUS_FN, status_fn))
This is despite the published command indicating that the primary case should start out in the exposed but non-infectious state (--tn_freq_e 0.00000020
).
~/scripts/FAVITES-COVID-Lite-updated.py --gzip_output --path_ngg_barabasi_albert ngg_barabasi_albert --path_gemf GEMF --path_coatran_constant coatran_constant --path_seqgen seq-gen --cn_n 5000000 --cn_m 8 --tn_s_to_e_seed 0 --tn_e_to_p1 125.862069 --tn_p1_to_p2 999999999 --tn_p2_to_i1 23.804348 --tn_p2_to_a1 134.891304 --tn_i1_to_i2 62.931034 --tn_i1_to_h 0.000000 --tn_i1_to_r 62.931034 --tn_i2_to_h 45.061728 --tn_i2_to_r 0.000000 --tn_a1_to_a2 9999999999 --tn_a2_to_r 125.862069 --tn_h_to_r 12.166667 --tn_s_to_e_by_e 0 --tn_s_to_e_by_p1 0 --tn_s_to_e_by_p2 3.513125 --tn_s_to_e_by_i1 6.387500 --tn_s_to_e_by_i2 6.387500 --tn_s_to_e_by_a1 0 --tn_s_to_e_by_a2 3.513125 --tn_freq_s 0.99999980 --tn_freq_e 0.00000020 --tn_freq_p1 0 --tn_freq_p2 0 --tn_freq_i1 0 --tn_freq_i2 0 --tn_freq_a1 0 --tn_freq_a2 0 --tn_freq_h 0 --tn_freq_r 0 --tn_end_time 0.273973 --tn_num_seeds 1 --pt_eff_pop_size 1 --pm_mut_rate 0.00092 --o
That is, the code forces simulations to skip the latent phase of the primary case.
For simulations where the stable coalescence is in the primary case (~20% of simulations), this compresses coalescence events around the stable coalescence, reducing the likelihood of an early mutation breaking up a basal polytomy.
For simulations where the stable coalescence is not in the primary case (~80% of simulations), this pushes the estimated time of introduction forward, and decreases its variance.
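For reference, a minimal sketch (not the authors' code) of the change that would instead seed the primary case in the exposed compartment, assuming 'E' is a key of gemf_state_to_num, as the compartment names in the published command suggest:

# sketch only: inside the seed-writing loop of FAVITES-COVID-Lite_noSeqgen.py
if node in seeds:
    status_file.write("%d\n" % gemf_state_to_num['E'])  # was gemf_state_to_num['P1']
    gemf_state.append(gemf_state_to_num['E'])           # was gemf_state_to_num['P1']
    out_file.write("None\t%d\t0\n" % node)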
The main
function in the script stableCoalescence_cladeAnalysis.py restores basal lineages if their MRCA is sufficiently close to the time of stable coalescence.
# main function
...
coal_timing = coalescent_timing(time_inf_dict, current_inf_dict, total_inf_dict, subtree, args.num_days)
# prepare for clade analysis; get the subtree with the stable coalescence (MRCA) root
eps = 1e-8
stable_coalescence = coal_timing[1][-1]
subtree_sc_leaves = []
for n in subtree.distances_from_root():
if abs(n[1] - stable_coalescence) < eps:
# print(n[0].label)
subtree_sc_leaves += [n.label for n in subtree.extract_subtree(n[0]).traverse_leaves()]
subtree_sc_leaves = set(subtree_sc_leaves)
subtree_sc = tree.extract_tree_with(subtree_sc_leaves)
This can increase the size of basal polytomies, but only in rare cases where coalescence events are compressed closely enough around the stable coalescence.
In one simulation (0823) this occurs in the primary case, where the compression was amplified by the elision of the latent phase.
Clumsy removal of the primary case sample also removes any other leaves whose label string happens to contain the primary case's node number. (The primary case is always sampled to ensure CoaTran roots the phylogeny in the primary case, and is removed for clade analysis.)
# get subtree that excludes index case
tree = treeswift.read_tree_newick(args.time_tree)
subtree_leaves = []
for n in tree.traverse_leaves():
if index_case not in n.label:
subtree_leaves.append(n.label)
subtree = tree.extract_tree_with(subtree_leaves)
subtree.suppress_unifurcations()
This is a very minor error, mentioned here only because it was corrected in subsequent reanalyses.
The Bayes factors are described on pages 11 to 14 of the Supplementary Materials. Combined and expanded, their equations can be written as:
where:
The data include RNG seeds for the epidemic simulations, but not for the sample times, coalescence, or mutation simulations. Therefore, the simulations cannot be reproduced. However, assuming the published likelihoods are sufficiently accurate, the results of the 1100 simulations can be replicated by sampling appropriate binomial distributions (t_2C is neglected here because its measured likelihood was zero).
import numpy as np
np.random.seed(42)
def sample_likelihoods():
p_tau_p_given_i1 = 0.475 # from Fig. 2
p_tau_1c_given_i1 = 0.031 # from Fig. 2
# sample the number of simulations that have basal polytomies
n_tau_p = np.random.binomial(1100,p_tau_p_given_i1)
# sample the number of basal polytomies that have another on a two mutation branch
p_tau_1c_given_tau_p = p_tau_1c_given_i1/p_tau_p_given_i1
n_tau_1c = np.random.binomial(n_tau_p,p_tau_1c_given_tau_p)
# return the likelihoods
return n_tau_p/1100, n_tau_1c/1100
The Bayes factor equation can be written in terms of the likelihoods and posterior probabilities.
def compute_bfs(p_tau_p_given_i1, p_tau_1c_given_i1, posteriors):
bf = 0.25*p_tau_p_given_i1**2*sum(posteriors)/(0.5*p_tau_1c_given_i1*sum(posteriors[:2]))
return bf
Repeatedly resampling the likelihoods and computing the Bayes factors then provides a distribution of results that would be expected from replicating the analysis.
unconstrained_posteriors = np.array([1.68, 80.85, 10.32, 0.92])/100 # from Table 1
results = []
for i in range(20000): # sample 20000 times
p_tau_p_given_i1, p_tau_1c_given_i1 = sample_likelihoods()
results.append(compute_bfs(p_tau_p_given_i1, p_tau_1c_given_i1, unconstrained_posteriors))
results.sort()
print(f'95% CDI of Bayes factors with unconstrained rooting: {results[500]:.1f}, {results[19500]:.1f}')
95% CDI of Bayes factors with unconstrained rooting: 3.1, 6.0
The central 95% of the distribution spans a range comparable in size to the Bayes factors themselves.
The epidemic simulation script FAVITES-COVID-Lite_noSeqgen.py can be corrected to initialise the primary case in the exposed compartment (E
). However, rather than repeating the epidemic simulations, the published data stored on Zenodo can be reused by sampling times for the latent periods and adding them to the existing transmission and sampling times. CoaTran can then be rerun to obtain corrected phylogenies.
import os
import subprocess  # needed for the coatran_constant call below
import numpy as np
from gzip import open as gopen
np.random.seed(42)
# go through each of the 1100 simulations
for i in range(1,1101):
# generate a latent period
latent_period = np.random.exponential(2.9/365) # expected value is 2.9 days, i.e. 2.9/365 years (Table S3)
# set up file paths, open files and write out the corrected data for transmission network
old_tn_path = f'simulations/{i:04d}/transmission_network.subsampled.txt.gz'
new_tn_path = f'simulations/{i:04d}/transmission_network.subsampled.corrected.txt'
with gopen(old_tn_path, 'rt') as old_tn, open(new_tn_path, 'w') as new_tn:
for line in old_tn:
u, v, t = line.strip().split('\t')
new_tn.write(f'{u}\t{v}\t{float(t) + latent_period:.6f}\n')
# set up file paths, open files and write out the corrected data for sample times
old_samples_path = f'simulations/{i:04d}/subsample_times.txt.gz'
new_samples_path = f'simulations/{i:04d}/subsample_times.corrected.txt'
with gopen(old_samples_path, 'rt') as old_samples, open(new_samples_path, 'w') as new_samples:
for line in old_samples:
u, t = line.strip().split('\t')
new_samples.write(f'{u}\t{float(t) + latent_period}\n')
# setup and run CoaTRan to generate a corrected time tree
command = ['coatran_constant', new_tn_path, new_samples_path, '1']
env = os.environ.copy()
coatran_rng_seed = np.random.randint(low=0, high=2**31) # ensure it is reproducible
env["COATRAN_RNG_SEED"] = str(coatran_rng_seed)
result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=env)
tree_path = f'simulations/{i:04d}/tree.time.subsampled.corrected.nwk'
with open(tree_path, 'w') as file_handle:
file_handle.write(result.stdout.decode())
The script stableCoalescence_cladeAnalysis.py can be corrected to determine the stable coalescence properly by:

- stopping coalescent_timing once 50,000 individuals have been infected, walking back to find the first day when the tMRCA of active sampled cases jumps forward by less than one day, and returning the MRCA of that day as the stable coalescence (a sketch follows below); and
- updating the main function to extract the subtree rooted at the stable coalescence.

Correcting the removal of the primary case sample is not really necessary, but simple enough.
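Returning to the first point, a minimal sketch (one reading of the intended rule, not the authors' code) of the walk-back: total_inf is the cumulative infection count per recorded day and heights is the corresponding tMRCA of active sampled cases (in years), as collected by coalescent_timing.

def stable_coalescence_index(total_inf, heights, max_infections=50000):
    # stop at the first day on which 50,000 individuals have been infected
    end = next(i for i, n in enumerate(total_inf) if n >= max_infections)
    # walk back to the earliest day whose tMRCA is within one day (1/365 years)
    # of the tMRCA at the stopping point
    start = end
    while start > 0 and heights[end] - heights[start - 1] < 1/365:
        start -= 1
    return start  # index of the day whose MRCA is taken as the stable coalescence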
Complete code and instructions for reproducibly generating corrected time trees are published in this branch of the authors' repository. In order to suppress sampling noise, the code also automates resampling of the mutation simulations and re-analysis of the resulting clades, using the correct stable coalescence, 1000 times.
The corrections and resampling reduce the Bayes factors by ~15%.
The derivation of the Bayes factors is described over pages 11 to 14 of the Supplementary Materials. The approach is set out on page 11:
That is, the Bayes factors compare the likelihoods of the sequence data arising from two introductions (I_2) versus one introduction (I_1):
Page 14 describes how the Bayes factor is taken from the combination of the posterior and prior odds:
Page 12 rewrites the posterior probabilities and breaks them down over the relevant MRCA haplotypes:

Page 13 breaks down these probabilities over the relevant topologies:
Page 13 then describes the compatibility between the different topologies and MRCA haplotypes:
As noted in #7, the topology (t_p, t_p) encompasses the sub-topologies (t_p, t_1C), (t_1C, t_p) and (t_1C, t_1C). Therefore, it is sufficient to state:
Page 13 then describes how the compatibility statements define the probability of each MRCA haplotype given a topology:
This description is wrong. The renormalization is applied when multiple MRCA haplotypes are compatible with a topology. The compatibility statements allow the renormalised compatibility vectors for the three relevant topologies to be written out as:
These allow the equation to be expanded out, with the priors cancelled by their inverses:

This is mathematically identical to the authors' equations, but simpler because it avoids the superfluous marginalization over the sub-topologies (t_p, t_1C), (t_1C, t_p) and (t_1C, t_1C).
The authors inferred the posterior probabilities of the four candidate MRCA haplotypes using the phylodynamic software BEAST, producing the results shown in Table 1. With the results for the recCA and unconstrained rootings, the Bayes factor equations become:

The authors measured the likelihoods P(t_p|I_1), P(t_1C|I_1) and P(t_2C|I_1) from the frequencies with which successful simulated introductions produced the topologies t_p, t_1C and t_2C.
P(t_p|I_1) is defined on page 10:

P((t_p, t_p)|I_2) was then calculated as P(t_p|I_1)^2. Therefore, P((t_p, t_p)|I_2) is the likelihood of two successful introductions producing two clades where there is a polytomy at the root of each clade.

P(t_1C|I_1) and P(t_2C|I_1) are defined on page 11:
Therefore, P(t_1C|I_1) and P(t_2C|I_1) are the likelihoods of a single successful introduction producing two clades where there is a polytomy at the root of each clade, and additionally: (1) the clades satisfy the relative size constraint (each comprising 30% to 70% of the taxa); and (2) the clades are separated by two mutations.

The additional conditions 1 and 2 reduce the single introduction likelihoods. This makes the published Bayes factors (4.2 and 4.3) meaningless, since they may be caused by this reduction rather than by a higher plausibility of the two introduction hypothesis.
This can be corrected by applying conditions 1 and 2 to the two-introduction likelihood P((t_p, t_p)|I_2).
This comment examines the effects of condition 1. A subsequent comment will examine condition 2.
Two simulations can be tested against condition 1 by going through the first 50,000 infections across both combined, and comparing the number of samples from each, e.g.:
def test_sizes(run_1,run_2):
# read samples into separate sets for each run
sample_dict = {run_1: set(), run_2: set()}
# read transmissions into one list for both runs
transmission_list = []
for run in [run_1, run_2]:
with open(f'simulations/{run:04d}/subsample_times.corrected.txt', 'r') as file:
for line in file:
u, t = line.strip().split('\t')
sample_dict[run].add(u)
with open(f'simulations/{run:04d}/transmission_network.subsampled.corrected.txt', 'r') as file:
for line in file:
u,v,t = line.strip().split()
if u != v:
transmission_list.append((v,float(t),run))
# sort the transmissions
transmission_list = sorted(transmission_list, key=lambda x: x[1])
# start sample counts from -1 to compensate for the primary sample
n_valid_samples = {run_1: -1, run_2: -1}
# go through the transmissions from earliest to latest
for event_no, transmission in enumerate(transmission_list):
# check if the transmission is amongst the samples of its run
if transmission[0] in sample_dict[transmission[2]]:
# increment the count for that run
n_valid_samples[transmission[2]] += 1
# stop after the first 50,000
if event_no == 50000:
break
# compare the numbers of sampled transmissions to the relative size condition
if 0.3 <= n_valid_samples[run_1]/(n_valid_samples[run_1] + n_valid_samples[run_2]) <= 0.7:
return True
else:
return False
The likelihood of two successful introductions producing basal polytomies and satisfying condition 1 can be measured by repeatedly drawing random pairs of simulations and testing them, e.g.:
# sample 1100 topologies, as for the others
for i in range(1100):
# draw two simulations
run_1, run_2 = np.random.choice(range(1, 1101), 2, replace=True)
# get the clade sizes
clade_sizes_1 = clade_analyses_CC_d[run_1]['clade_sizes']
clade_sizes_2 = clade_analyses_CC_d[run_2]['clade_sizes']
# test them for basal polytomies
if len(clade_sizes_1) >= min_polytomy_descendants and len(clade_sizes_2) >= min_polytomy_descendants:
# test their sizes
if test_sizes(run_1, run_2):
count_atLeastMinDescendants += 1
The function clade_analysis_updated
in the notebook cladeAnalysis.ipynb can be adapted to do this instead of counting the number of simulations with basal polytomies. The function calculate_bf
can then be adapted to use the resulting likelihood as P((t_p, t_p)|I_2). The topology (t_p, t_p) encompasses all topologies compatible with the two introduction hypothesis, so the likelihoods of the other two-introduction topologies must be set to zero to avoid double counting.
def calculate_bf(asr_results, simulation_results):
# Let t_p be a polytomy, t_1C be one clade, and t_2c be two clades. Note that t_p includes t_1c. t_p equals all topologies with a basal polytomy (Fig. 2a).
# trees are in the order
# (t_p, t_1C, t_2C, (t_p,(t_p,t_1C,t_2C)), (t_1C,(t_p,t_1C,t_2C)), (t_2c,(t_p,t_1C,t_2C)))
...
pr_trees_given_I2 = np.concatenate([np.array([0]*3), np.array([simulation_results[0]]), np.array([0]*8)])
The notebook can then be rerun to produce the following results.
Inconsistent treatment of the different hypotheses inflated the Bayes factors by a factor of six.
Note that the authors measured the two week doubling times of the simulated epidemics at the 1000th infection, and found a 95% highest density interval (HDI) of 1.35 to 5.44 days. This range of early growth rates suggests that two introductions are unlikely to grow to similar sizes within the first 50,000 infections.
Code and instructions for reproducing these results are available at this branch of the authors' repository.
- test_sizes assumes both introductions are simultaneous, which maximises the likelihood of the two introductions having similar sizes.

Overall, these limitations increase the likelihood of two simulations producing two basal polytomies and satisfying condition 1, so the corrected Bayes factors may be considered a rough upper bound.
It is theoretically possible that applying condition 2 to the two introduction likelihoods could increase the Bayes factors. This would require an increase in the likelihood of two introductions producing topologies compatible with the more probable MRCA haplotypes (i.e. A or B). This is considered unlikely, but will be confirmed either way in the next comment.
The Pekar et al. simulations were independently (and somewhat differently) re-coded by Peter Miller. Contrary to Neoacanthoparyphium echinatoides, Miller confirmed Pekar et al.'s conclusions and orders of magnitude.
Proportion of simulations with two main lineages:
https://x.com/tgof137/status/1772418937579311526?s=20
Timing of the introduction of SARS-CoV-2 in humans:
https://x.com/tgof137/status/1772417301561708753?s=20
#13 The results of the different simulations can be compared directly. For the proportion of simulations with two main lineages, they are:
These results do not seem contradictory.
Miller hasn’t published details of his analyses or his code, but it seems he compares the likelihood of a single successful introduction producing two basal polytomies that are two mutations apart and have similarly sized clades, against the likelihood of two successful introductions each having a basal polytomy.
It therefore seems that Miller repeats the logic errors described in #12.
Miller’s simulations and analyses seem significantly different from those of Pekar et al. For example, rather than simulating the epidemic over a scale-free network, Miller uses a branching process with a negative binomial offspring distribution. It also seems that, unlike Pekar et al., Miller neglects basal polytomies that have root mutations. The effects of these differences might be interesting, but it is difficult to assess their significance without knowing more details of the analyses and code.
Further to #12, two successful introductions can satisfy the two-mutation separation constraint either as:

These correspond to the single introduction topologies t_1C and t_2C. They have the same compatibility:
With these more specific two-introduction topologies, the compatibility vectors become:
The Bayes factor equations for the unconstrained and recCA rootings then become:
The mutations that determine the topology of two successful introductions can be sampled using the molecular clock and the times between the MRCA and each clade root. Pekar et al.'s simulations provide the times between each introduction and the subsequent clade root. The times between the MRCA and each introduction require an additional model. Absent information on upstream population sizes, structures and dynamics, a reasonable model samples the time between the MRCA and the first introduction from an exponential distribution whose expected value represents the upstream effective population size. Similarly, absent information on the interface between the source and the human population, a reasonable model samples the time between introductions from an exponential distribution whose expected value represents the intensity of introductions.
The time between introductions can also be incorporated into the relative size test, replacing the maximum likelihood assumption of simultaneous introductions.
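A minimal sketch of how such timings could be combined with the molecular clock to sample the number of mutations separating the two clade roots: mu is taken from the published command (--pm_mut_rate 0.00092), assumed here to be in substitutions per site per year; the genome length of 29,903 sites and the expected waiting times are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(42)

def sample_separation(t_root_1, t_root_2, mean_mrca_to_intro=4/365,
                      mean_between_intros=2/365, mu=0.00092, genome_length=29903):
    # time from the upstream MRCA to the first successful introduction
    t_a = rng.exponential(mean_mrca_to_intro)
    # time between the first and second successful introductions
    t_b = rng.exponential(mean_between_intros)
    # total branch length (in years) on the path between the two clade roots:
    # clade root 1 -> introduction 1 -> MRCA -> introduction 2 -> clade root 2
    total_time = t_root_1 + t_a + (t_a + t_b) + t_root_2
    # number of mutations on that path under a Poisson molecular clock
    return rng.poisson(mu*genome_length*total_time)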
Note there is some ambiguity about the separation constraint. Readers may assume that the two clades must be separated by exactly two mutations. In the code, the constraint is two or more mutations. For completeness, results for both constraints are presented here. The stricter constraint reduces the single introduction likelihoods: P(t_2C|I_1) from 0.2% to 0.1%, and P(t_1C|I_1) from 3.3% to 2.5%.
Using an accordingly modified version of Pekar et al.'s code (available at this branch of their repository), Bayes factors were calculated for all combinations of the constraint interpretation (two and two or more mutations), phylogeny rooting (recCA and unconstrained), a range of plausible expected values (representing the upstream effective population size) and a range of plausible expected values (representing the introduction intensity).
Two or more mutation separation and recCA rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.21 | 0.22 | 0.23 | 0.23 | 0.23 | 0.21 | 0.16 |
1 day | 0.23 | 0.24 | 0.25 | 0.25 | 0.25 | 0.21 | 0.17 |
2 days | 0.24 | 0.24 | 0.24 | 0.25 | 0.24 | 0.22 | 0.17 |
4 days | 0.25 | 0.26 | 0.25 | 0.26 | 0.26 | 0.22 | 0.16 |
7 days | 0.26 | 0.27 | 0.26 | 0.26 | 0.26 | 0.21 | 0.16 |
14 days | 0.27 | 0.27 | 0.26 | 0.26 | 0.24 | 0.21 | 0.16 |
28 days | 0.24 | 0.24 | 0.24 | 0.24 | 0.23 | 0.19 | 0.14 |
Two or more mutation separation and unconstrained rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.21 | 0.21 | 0.22 | 0.23 | 0.22 | 0.20 | 0.16 |
1 day | 0.22 | 0.23 | 0.24 | 0.24 | 0.24 | 0.20 | 0.16 |
2 days | 0.23 | 0.23 | 0.23 | 0.24 | 0.23 | 0.21 | 0.16 |
4 days | 0.24 | 0.25 | 0.24 | 0.25 | 0.25 | 0.21 | 0.15 |
7 days | 0.25 | 0.26 | 0.25 | 0.24 | 0.25 | 0.20 | 0.15 |
14 days | 0.25 | 0.25 | 0.24 | 0.24 | 0.22 | 0.19 | 0.14 |
28 days | 0.22 | 0.22 | 0.22 | 0.22 | 0.20 | 0.17 | 0.13 |
Exactly two mutation separation and recCA rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.16 | 0.16 | 0.17 | 0.17 | 0.16 | 0.13 | 0.10 |
1 day | 0.17 | 0.17 | 0.18 | 0.17 | 0.17 | 0.13 | 0.10 |
2 days | 0.18 | 0.17 | 0.17 | 0.17 | 0.16 | 0.13 | 0.09 |
4 days | 0.17 | 0.17 | 0.17 | 0.17 | 0.16 | 0.13 | 0.08 |
7 days | 0.16 | 0.17 | 0.16 | 0.15 | 0.15 | 0.11 | 0.07 |
14 days | 0.14 | 0.14 | 0.13 | 0.12 | 0.11 | 0.09 | 0.06 |
28 days | 0.09 | 0.09 | 0.09 | 0.09 | 0.08 | 0.06 | 0.04 |
Exactly two mutation separation and unconstrained rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.16 | 0.16 | 0.16 | 0.16 | 0.15 | 0.13 | 0.09 |
1 day | 0.17 | 0.17 | 0.18 | 0.17 | 0.16 | 0.13 | 0.10 |
2 days | 0.17 | 0.17 | 0.17 | 0.17 | 0.15 | 0.13 | 0.09 |
4 days | 0.17 | 0.17 | 0.16 | 0.17 | 0.16 | 0.13 | 0.08 |
7 days | 0.16 | 0.17 | 0.16 | 0.15 | 0.14 | 0.11 | 0.07 |
14 days | 0.14 | 0.14 | 0.13 | 0.12 | 0.11 | 0.09 | 0.06 |
28 days | 0.09 | 0.09 | 0.09 | 0.08 | 0.07 | 0.06 | 0.04 |
Irrespective of how the constraint is interpreted, how the phylogeny is rooted, and what priors are assumed for the upstream effective population size and introduction intensity, the Bayes factor never exceeds 0.3.
When the analysis is corrected so that the two-introduction model satisfies the same constraints as the single introduction model, the Bayes factors provide at least moderate support (< 0.3) for the single introduction hypothesis.
I think it has not been noted above that the error described in #1 was not present in the preprinted version of the work: https://zenodo.org/records/6342616. Figure 5 of the preprint shows the correct probability of the topology shown in (C), which matches the currently corrected version of the paper at Science. The error was therefore introduced during the review process. This is probably important to note, primarily because the conclusions of the peer reviewed paper are not different from those of the preprint.
It is also worth noting that the Bayes Factor analysis described in #12 and #15 was also not included in the preprint version of the work, which may again be valuable to note as the conclusions of the peer reviewed paper are not different from the preprint.
The alternative modeling proposed above in #12 and #15 adds an additional model for the two spillovers scenario, simulating two simultaneous successful introductions, and then two successful introductions spaced apart. This modeling is essentially telling us: "The observed tree structure is very unlikely under a single introduction; however, it is also unlikely under two introductions". However, it is rather interesting that neither of the models proposed in #12 or #15 can explain the observed data well. To me, this could indicate that the model proposed above for the multiple spillover scenario is inappropriate.
The question that immediately arises is therefore: what model could explain the data well? One model that might do so is multiple index cases (maybe simulated from 5-20?) of each of the two genetically distinct spillovers. My guess is this would be more likely to result in the two similarly sized polytomies observed than either of the models compared in #12 and #15. Such a model would also be a better representation of zoonotic spillovers than that proposed by #12, as when a close group of infected animals are together moved into a confined space with many humans, multiple humans could be expected to be infected. Actually, under such a scenario, it would be very strange to expect just one index case! Such a model could be tested to see how well it explains the data.
I believe such a model is in line with the thinking of the original paper, which consistently says "at least two" introductions, which can be presumed to refer to the two easily discernable genetically distinct lineages A and B, but leaves open the possibility of multiple human index cases of each. If the model above did explain the data better than either the single introduction single index case or the two introductions each with one index case models, then that would fascinatingly provide some degree of phylodynamic evidence for multiple index cases of each lineage.
Another piece of evidence is provided in the original paper for multiple spillovers: "However, the inability to reconcile the molecular clock at the outset of the COVID-19 pandemic with a lineage A ancestor without information from related sarbecoviruses (such as the recCA) requires us to question the assumption that both lineages A and B resulted from a single introduction.". And further "Further complicating this matter is the molecular clock of SARS-CoV-2 in humans, which rejects a single-introduction origin of the pandemic from a lineage A virus.". As linked above by another commenter, Peter Miller attempted to quantify this argument which exists in a qualitative form: https://x.com/tgof137/status/1772422322760139243. However, the modeling described in the original paper, #1, #12, and #15, do not include a quantification of this evidence, and so this could represent an important avenue for future work to develop a more quantitative whole picture of the question of the number of spillovers.
Tangentially, I am confused by how the results described in #12 and #15 could give a BF lower than 1. I can understand how that model could give BF close to 1, but not substantially less. I am curious if anyone has an intuitive explanation for why that might be.
I think the issues described above in my points 4 and 6 would be the most pertinent updates to the modeling proposed by #12 and #15, and indeed, may substantially advance the findings of the original work as well. However, there are also "real world" challenges to the modeling here: if there were multiple spillovers, they likely did not occur in a truly independent fashion. This may present an inherent challenge to correctly modeling what multiple spillovers "should" look like. There is also a chance that we may have to be content to say that a single introduction model cannot well explain the observed data, a finding that does not seem disputed by #12 or #15 above.
(Apologies, I forgot my initial sign-in code, but have saved to use the one under this username)
count_atLeastMinDescendants = 0
count_atLeastMinDescendants_even2 = 0
for i in range(1100):
run_1, run_2 = np.random.choice(range(1, 1101), 2, replace=True)
clade_sizes_1 = clade_analyses_CC_d[run_1]['clade_sizes']
clade_sizes_2 = clade_analyses_CC_d[run_2]['clade_sizes']
num_leaves_1 = sum(clade_analyses_CC_d[run_1]['clade_sizes'])
num_leaves_2 = sum(clade_analyses_CC_d[run_2]['clade_sizes'])
total_leaves = num_leaves_1 + num_leaves_2
# test them for basal polytomies
if len(clade_sizes_1) >= min_polytomy_descendants and len(clade_sizes_2) >= min_polytomy_descendants:
# test their sizes
count_atLeastMinDescendants += 1
if num_leaves_1 > total_leaves*0.30 and num_leaves_1 < total_leaves*0.70:
if num_leaves_2 > total_leaves*0.30 and num_leaves_2 < total_leaves*0.70:
count_atLeastMinDescendants_even2 += 1
In this code, I am comparing the sizes of the two sampled runs directly, instead of going back to the transmission chain, as is done with the single introduction constraint. My current understanding is that the only difference with #12 this would introduce is that the two lineage scenario would be sampling to 100,000 instead of 50,000 cases: is there another aspect I am forgetting?
My concern with the implementation in #12 is that it is introducing a constraint that correspondingly does not exist in the one intro scenario. In the one intro AB scenario, the constraint is performed directly on the number of observed taxa, not on the total number of simulated cases. I am not sure if one would expected to be different, but it is just an inconsistency in this implementation. Perhaps the author could explain.
My code above returns these proportions:
Proportion with a polytomy at the base: 0.475455
Proportion of two intros with two polytomies at the base: 0.227273
Proportion of two intros with two even polytomies at the base: 0.226364
Which are largely in line with the original work. So, two (single index case) introductions quite commonly produce two polytomies (at approximately 0.47^2, as expected), and only slightly fewer examples of two (single index case) intros result in drastically different sizes.
Presumably, this means that the main reason why #12 gets such different numbers is in whether the polytomies produced by two successful intros are of similar sizes. But conceptually, why might this be happening based on the code presented in #12 and not in the code presented above? I think it's just helpful to walk through why the results in #12 were so different.
Previously I was confused about how #12 could derive a BF < 1, and am still interested in an intuitive explanation of why this might be. But I now just realized that the % of simulations passing for the two intro scenario and the one intro scenario in #12 are exactly the same: 3.34%. Does the author have an explanation for why that might be? It is hard to come up with a mental model as to why the probabilities under two and one simultaneous intros would be completely identical, and clearly it deviates from my result above.
Finally, I was thinking more about the additional modeling proposed in #15 about the timing of the spillovers. Previously I noted that these models assume an independence of spillovers that may be violated in the real world, but I now think they might be wholly inappropriate given the hypothesis put forth in the original paper. The hypothesis is essentially that two separate introductions of A and B explain their incongruence with the molecular clock. Therefore, by the definition of the hypothesis, the timing of the spillovers did not occur straightforwardly or according to an exponential distribution, but due to some (presumably hard to model) "weirdness". The hypothesis is that this "weirdness" was the unexpected dynamics in the unusual bottlenecks that may be expected to occur upstream of a spillover event, when a small number of animals would be moved from source populations to human-animal interfaces.
#17 Unlike the single-introduction constraint and #12, you are not comparing the number of taxa sampled from two clades in a single subsample (the first fifty thousand infections). You are comparing the number of taxa sampled from two equally sized subsamples (the first fifty thousand infections of each clade). Except for rare occasions when a simulation ends before filling the subsample, two equally sized subsamples yield similar numbers of taxa. That is why your result is different.
Please reconsider your comments.
Oh yes, I see now that you have to go back to the simulations before the sub-sampling when comparing two introductions, but not one, because the two clades are only subsampled on an equal footing within a single simulation: this resolves my question posed in #17, point 1, thank you!
While there are many points raised in #16 points 1-8 and #17 points 2-3, I assume readers can digest them and they don't have to be repeated, so we can focus on the crux of the matter after the resolution above. The major question then is why the tree ended up with two similarly sized polytomies with a discrepancy in the molecular clock. What Pekar et al. 2022 showed, and what the comments above replicate well, is that the former was very unlikely to occur under a model of a single introduction. The paper suggested, but did not formally quantify, that the latter aspect was also unlikely to occur under a model of a single introduction, while the twitter analysis by Peter Miller linked above did.
What the comments above in #12, #15, and #16 further show is that under a newly proposed model of two successful spillovers started each by one index case, two polytomies are very likely to occur, but unlikely to be evenly sized. This all leaves open the question of what model does explain how you get two evenly sized polytomies at the start of the epidemic.
I think there are two straightforward hypotheses:
Multiple index cases of each lineage: this would essentially guarantee a polytomy right at the start of the transmission chain in each case, making it more likely for the introductions to become similarly sized.
Non-independence of the emergence of each lineage: if the lineages emerged in the same environment, one which was poised for rapid spread at the onset, then that could cause them to both form polytomies right at the start of emergence, making them more likely to be evenly sized.
I am curious what other models can be proposed that could explain the data, given that a single introduction does not. If the results of #12 and #15 can be interpreted as evidence for either of the hypotheses above, that is most interesting, I think. Perhaps this was hinted at indirectly by the original paper, but it wasn't explicit, so there is room for future work to build on this (and of course modeling of the clock discrepancy).
#19 Pekar et al. quantify support for their conclusions using a model that assumes the two lineages emerged independently from two introductions.
It might be possible to salvage support for Pekar et al.'s conclusions by, as you suggest, relinquishing those assumptions.
I don't think it would be appropriate to make such a drastic change to a published analysis.
I uploaded the wrong version of the script stableCoalescence_cladeAnalysis.py to the GitHub repository for reproducing #15.
I've now fixed it and am re-running the analysis. I should have results in a few days.
I'm very sorry to anyone trying to reproduce #15 using my code.
Results from the rerun are practically identical to #15. There are some small differences that are probably due to resource handling variations allowing different parallel processes to get ahead and draw different pseudorandom numbers. The next analysis will seed separate generators for each process.
One of the authors, Robert Garry, made the following comments in his testimony to the U.S. Senate Committee on Homeland Security and Governmental Affairs:
Robert Garry's contributions to the paper are listed as "writing - review and editing". It is possible that he was misinformed about the extent of the amendments to the code, which are summarized in #8.
It is nice to see the updated results in #22, which importantly again confirm that the observed data would only rarely be expected under a single introduction (or the author of #22's particular model of two introductions). Needless to say, but to do so nonetheless, the points raised in #16 1-8 and #17 points 2-3 are therefore still relevant to the results of #22.
Garry's testimony would appear to be correct, because as noted in #16, the correction issued by the paper simply matches the numbers provided in the preprint version of the work. As the preprint's conclusions did not differ from the conclusions of the published paper, Garry's assessment of the extent of the amendments as limited in how they impact conclusions is entirely consistent. Indeed, given this, it would be challenging to imagine another interpretation thereof.
#15 sampled timings for the two successful introductions using simple models (the continuous approximation of the coalescent under a fixed effective population size for the time between the MRCA and the first successful introduction, and a Poisson process for the time between the first successful introduction and the second). These have high variance, which is not completely inappropriate given the lack of information.
Rerunning the analysis with fixed timings removes this variance, so that the Bayes factor topographies become more pronounced.
Two or more mutation separation and recCA rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.21 | 0.22 | 0.23 | 0.24 | 0.26 | 0.25 | 0.14 |
1 day | 0.22 | 0.23 | 0.23 | 0.25 | 0.27 | 0.26 | 0.14 |
2 days | 0.24 | 0.25 | 0.25 | 0.26 | 0.27 | 0.25 | 0.13 |
4 days | 0.26 | 0.27 | 0.27 | 0.28 | 0.28 | 0.26 | 0.13 |
7 days | 0.28 | 0.29 | 0.29 | 0.29 | 0.29 | 0.25 | 0.12 |
14 days | 0.30 | 0.29 | 0.29 | 0.29 | 0.27 | 0.22 | 0.10 |
28 days | 0.25 | 0.24 | 0.24 | 0.23 | 0.21 | 0.17 | 0.08 |
Two or more mutation separation and unconstrained rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.21 | 0.21 | 0.22 | 0.23 | 0.25 | 0.24 | 0.13 |
1 day | 0.22 | 0.22 | 0.23 | 0.24 | 0.26 | 0.24 | 0.13 |
2 days | 0.23 | 0.24 | 0.24 | 0.25 | 0.26 | 0.24 | 0.12 |
4 days | 0.25 | 0.25 | 0.26 | 0.26 | 0.26 | 0.24 | 0.12 |
7 days | 0.27 | 0.27 | 0.28 | 0.28 | 0.27 | 0.23 | 0.11 |
14 days | 0.28 | 0.27 | 0.27 | 0.27 | 0.25 | 0.21 | 0.09 |
28 days | 0.22 | 0.22 | 0.21 | 0.21 | 0.19 | 0.15 | 0.07 |
Exactly two mutation separation and recCA rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.16 | 0.17 | 0.17 | 0.18 | 0.18 | 0.15 | 0.05 |
1 day | 0.17 | 0.17 | 0.17 | 0.18 | 0.18 | 0.15 | 0.05 |
2 days | 0.18 | 0.18 | 0.18 | 0.18 | 0.18 | 0.14 | 0.04 |
4 days | 0.19 | 0.18 | 0.18 | 0.18 | 0.17 | 0.13 | 0.04 |
7 days | 0.18 | 0.18 | 0.18 | 0.17 | 0.16 | 0.11 | 0.03 |
14 days | 0.14 | 0.14 | 0.13 | 0.12 | 0.10 | 0.07 | 0.02 |
28 days | 0.05 | 0.04 | 0.04 | 0.04 | 0.03 | 0.02 | 0.00 |
Exactly two mutation separation and unconstrained rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.16 | 0.16 | 0.17 | 0.17 | 0.18 | 0.15 | 0.05 |
1 day | 0.17 | 0.17 | 0.17 | 0.17 | 0.18 | 0.15 | 0.05 |
2 days | 0.18 | 0.18 | 0.18 | 0.18 | 0.18 | 0.14 | 0.04 |
4 days | 0.18 | 0.18 | 0.18 | 0.18 | 0.17 | 0.13 | 0.03 |
7 days | 0.17 | 0.18 | 0.17 | 0.17 | 0.15 | 0.11 | 0.03 |
14 days | 0.13 | 0.13 | 0.13 | 0.12 | 0.10 | 0.06 | 0.02 |
28 days | 0.05 | 0.04 | 0.04 | 0.04 | 0.03 | 0.02 | 0.00 |
Note that a Bayes factor should be marginalized over prior distributions for each of the two expected values. However, even if there were information that could justify a degenerate prior at the parameter values that maximize the Bayes factor (e.g. expected values of 14 days and 0 days), there would still be support (Bayes factor ≤ 0.3) for the single introduction hypothesis.
This discussion is getting lost in details and seems to be missing the main point of the paper being discussed.
Quick questions for Neoacanthoparyphium echinatoides:
Are the data compatible with a single introduction?