Key results are inflated by an error in the code.
The authors find that the SARS-CoV-2 lineages "A" and "B" are the result of at least two separate cross-species transmission events into humans. This finding is based on Bayes factors calculated in the Jupyter notebook cladeAnalysis.ipynb. This notebook has a bug in the function clade_analysis_updated
. Re-running the main analysis without the bug reduces the Bayes factors by a factor of six. This significantly reduces confidence in the authors' findings.
The sensitivity analyses should be re-run without the bug (this isn't feasible without the data, which have not been published).
There are other deficiencies. In particular, the different hypotheses are tested against different evidence (more specifically, against different conditions derived from the evidence), so that the Bayes factors are heavily biased in favour of the multiple spillover hypothesis. This issue will be addressed in a separate comment.
The Bayes factor calculations require likelihoods of the observed data being produced by the different hypotheses. For the single introduction hypothesis, this is evaluated by generating 1100 simulated viral phylogenies and counting those with topologies that are deemed compatible with the observed data. The compatible topologies are denoted CC
and AB
and correspond, respectively, to the center and right topologies in Figure 2 of the paper (reproduced below).
As shown in Figure 2, the CC
topology has two clades on branches with one mutation from the MRCA ('one-mutation clades'). The AB
topology has a clade on a branch with two mutations from the MRCA (a "two-mutation" clade).
For each simulated viral phylogeny XXXX, the details of the one- and two-mutation clades are collated by the script stableCoalescence_cladeAnalysis.py into the files XXXX_clade_analysis_CC_polytomy.txt
and XXXX_clade_analysis_AB_polytomy.txt
, respectively. The function clade_analysis_updated
extracts these details to clade_analyses_CC_d[XXXX] and clade_analyses_AB_d[XXXX]. The details list each clade's size (under the label 'clade_sizes') and the sizes of its subclades (under the label 'subclade_sizes').
Note that the one-mutation clades include all clades on branches with at least one mutation from the MRCA. Likewise, the two-mutation clades include all clades on branches with at least two mutations from the MRCA. This means that the two-mutation clades are a subset of the one-mutation clades. However, the details of a given two-mutation clade will not normally be stored at the same position in the lists of clade_analyses_CC_d[XXXX] and clade_analyses_AB_d[XXXX].
A simulated viral phylogeny is deemed to have an AB topology if it has:

- a two-mutation clade comprising between 30% and 70% of all taxa;
- a basal polytomy; and
- a polytomy at the base of the two-mutation clade.

These requirements are tested under # A/B analysis in the function clade_analysis_updated in the notebook cladeAnalysis.ipynb. The relevant code is reproduced below.
# A/B analysis
ab_count_30perc = 0 # interested in 2 mutations clade that are at least 30% of all taxa
ab_count_30perc_polytomy = 0 # interested in 2 mutations clade that are at least 30% of all taxa + has a basal polytomy
ab_count_30perc_twoPolytomies = 0 # interested in 2 mutations clade that are at least 30% of all taxa + has a basal polytomy + polytomy at 2 mutation clade
lower_constraint = 0.3 # the 2-mutation clade must be at least 30% of all taxa
upper_constraint = 0.7 # the 2-mutation clade must be at most 70% of all taxa
for run in clade_analyses_AB_d:
num_leaves = sum(clade_analyses_CC_d[run]['clade_sizes'])
base_polytomy_size = len(clade_analyses_CC_d[run]['clade_sizes'])
clade_sizes = clade_analyses_AB_d[run]['clade_sizes']
subclade_sizes = clade_analyses_CC_d[run]['subclade_sizes'].copy()
if not clade_sizes: # no 2 mutation clades
continue
if max(clade_sizes) >= lower_constraint*num_leaves and max(clade_sizes) <= upper_constraint*num_leaves: # clades match size restrictions
if len(clade_sizes) == 1:
ab_count_30perc += 1
if base_polytomy_size >= min_polytomy_size: # basal polytomy
ab_count_30perc_polytomy += 1
if len(subclade_sizes[0]) >= min_polytomy_size: # two-mutation clade has polytomy
ab_count_30perc_twoPolytomies += 1
else:
clade2 = sorted(clade_sizes, reverse=True)[1]
ab_count_30perc += 1
if base_polytomy_size >= min_polytomy_size: # basal polytomy
ab_count_30perc_polytomy += 1
max2mutCladeLoc = clade_sizes.index(max(clade_sizes))
if len(subclade_sizes[max2mutCladeLoc]) >= min_polytomy_size: # two mutation clade has polytomy
ab_count_30perc_twoPolytomies += 1
Although the code is somewhat messy, it can be seen that the size of the polytomy at the two-mutation clade is determined in two different ways:

- where there is only one two-mutation clade (len(clade_sizes) == 1), the size of the polytomy is taken as len(subclade_sizes[0]); and
- otherwise, the position of the largest two-mutation clade is found (max2mutCladeLoc = clade_sizes.index(max(clade_sizes))) and the size of the polytomy is taken as len(subclade_sizes[max2mutCladeLoc]).

It can also be seen that clade_sizes is taken from the two-mutation list (i.e. clade_sizes = clade_analyses_AB_d[run]['clade_sizes']), but subclade_sizes is taken from the one-mutation list (i.e. subclade_sizes = clade_analyses_CC_d[run]['subclade_sizes'].copy()).
This means that, rather than testing whether the two-mutation clade has a sufficiently large polytomy, the code instead tests a one-mutation clade that is effectively selected at random, depending on how the clades are ordered in clade_analyses_CC_d[run] and where the largest (or only) clade is located in clade_analyses_AB_d[run].

This does not agree with the AB topology requirements described in the text and the supplementary materials. Moreover, it makes no sense. There can be no valid reason for introducing randomness in this way. Rather, this seems to be the result of a simple copy-paste error, where subclade_sizes = clade_analyses_CC_d[run]['subclade_sizes'].copy() should have been changed to subclade_sizes = clade_analyses_AB_d[run]['subclade_sizes'].copy().
The bug can be reproduced with code and data from the GitHub repository.
wget https://raw.githubusercontent.com/sars-cov-2-origins/multi-introduction/main/notebooks/cladeAnalysis.ipynb
wget https://github.com/sars-cov-2-origins/multi-introduction/blob/71ed420fe11ecdbe589568255ec90ca56d6e221c/FAVITES-COVID-Lite/cumulative_results/clade_analyses_CC.zip
wget https://github.com/sars-cov-2-origins/multi-introduction/blob/71ed420fe11ecdbe589568255ec90ca56d6e221c/FAVITES-COVID-Lite/cumulative_results/clade_analyses_AB.zip
unzip simulations_cumulative_results/clade_analyses_CC.zip
unzip simulations_cumulative_results/clade_analyses_AB.zip
After launching the notebook, it may be necessary to comment out any unnecessary imports that cause errors, e.g.
#import treeswift
#from utils import *
#import seaborn as sns
The second and sixth code cells must be updated so that clade_analyses_CC_dir
and clade_analyses_AB_dir
point to the newly unzipped clade_analyses_CC/
and clade_analyses_AB/
directories, e.g.
clade_analyses_CC_dir = 'clade_analyses_CC/'
clade_analyses_AB_dir = 'clade_analyses_AB/'
The seventh and eighth code cells may be deleted, because they require data that have not been published (alternatively, the errors they throw may be ignored).
Re-running the notebook should reproduce the published results of the main analysis.
The assignment of subclade_sizes can then be corrected to use the two-mutation list.
for run in clade_analyses_AB_d:
num_leaves = sum(clade_analyses_CC_d[run]['clade_sizes'])
base_polytomy_size = len(clade_analyses_CC_d[run]['clade_sizes'])
clade_sizes = clade_analyses_AB_d[run]['clade_sizes']
####### CC has been changed to AB in the line below###########################
subclade_sizes = clade_analyses_AB_d[run]['subclade_sizes'].copy()
Re-running the notebook should now produce the intended results of the main analysis.
Correcting the error increases the likelihood of a single introduction producing the AB
topology from 0.5% to 3%, and reduces the Bayes factors from ~60 to less than 10.
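As a rough back-of-the-envelope check (treating the AB likelihood as the dominant term in the Bayes factor's denominator), a six-fold increase in that likelihood translates into roughly a six-fold reduction in the Bayes factor:

# rough check only, using the approximate figures quoted above
p_ab_buggy = 0.005              # ~0.5% with the bug
p_ab_fixed = 0.03               # ~3% without the bug
bf_buggy = 60                   # approximate published Bayes factor
bf_fixed = bf_buggy * p_ab_buggy / p_ab_fixed
print(round(bf_fixed))          # ~10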
#1 A related question-- A two-crossover model has more parameters than a one-crossover model. There's one extra crossover date and there may be more implicit adjustability in the early dynamics, although I haven't thought hard about that second point. It's not kosher to use simple likelihood factors to compare hypotheses with different numbers of parameters, since the best-fit version of the model that adds a parameter will always get higher likelihood. Usually a penalty is added for parameters, e.g. in AIC, etc. If Pekar's BF is down to about 10, and they omitted a parameter penalty, then very little effective BF would be left even before considering issues with the model. Does anybody know if the programs have some intrinsic parameter penalty, or was that really omitted?
Aha, I think this issue can be handled. In the Supplement Pekar et al. say they assume equal prior probabilities for one crossover and two and that they simply exclude all other possibilities. Those are peculiar priors.
Let's assume realistically that the number N of crossovers is described by a Poisson distribution with expectation value x. A conventional non-informative PDF for x would be proportional to 1/x. (The logarithmic divergence of its integral will drop out soon.) This prior seems generous in the context of zero identified wildlife hosts, but at least it's conventional. Then it's trivial to integrate the Poisson probabilities for N=1 and N=2 over x, getting P(2)/P(1)= 1/2. Assuming the bug correction is also right, that reduces the Bayes posterior odds to 5.
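For readers checking the arithmetic, the integral is a standard gamma integral. With the Poisson likelihood and the 1/x prior,

P(N=n) \propto \int_0^\infty \frac{e^{-x}x^{n}}{n!}\,\frac{dx}{x} = \frac{\Gamma(n)}{n!} = \frac{1}{n}, \qquad \text{so} \qquad \frac{P(N=2)}{P(N=1)} = \frac{1}{2}.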
Whether the epidemiological model used is realistic and whether N has any relevance to the origins issue are more questions for experts.
The key results are further inflated by another error in the notebook cladeAnalysis.ipynb.
The function calculate_bf
takes an array of values as simulation_results
and divides them by their sum to produce pr_3_topos
- an array of probabilities for the three topologies obtainable from a single introduction:
def calculate_bf(asr_results, simulation_results):
...
# FAVITES simulation information
# the 3 trees are in the order (t_p, t_1C, t_2C)
pr_3_topos = np.array(simulation_results)/sum(simulation_results)
This would be an appropriate way to compute probabilities if the array contained exhaustive count data, but it actually contains probabilities already, as computed in clade_analysis_updated:
def clade_analysis_updated(clade_analyses_CC_dir, clade_analyses_AB_dir, label, min_polytomy_size=100, _print_=False):
...
polytomy_result = count_atLeastMinDescendants/1100
…
cc_result = cc_count_30perc_twoPolytomies/1100
ab_result = ab_count_30perc_twoPolytomies/1100
…
simulation_results = [polytomy_result, ab_result, cc_result]
…
bf_unconstrained = calculate_bf(unconstrained_results, simulation_results)
bf_recCA = calculate_bf(recCA_results, simulation_results)
The array of probabilities is not exhaustive; it contains only the probabilities for the topologies in Figure 2 (reproduced below), though in a different order. Dividing these probabilities by their sum inflates them, because the probability of producing a topology not depicted in Figure 2 is discarded. In the primary analysis, this increases the probabilities used for the single introduction topologies from [0.475, 0.005, 0] to [0.991, 0.009, 0] and increases the probabilities used for the two introduction topologies from [0.226, 0.002, 0, 0.002, 0.00002, 0, 0, 0, 0] to [0.981, 0.009, 0, 0.009, 0.0001, 0, 0, 0, 0]. The net result is a doubling of the Bayes factor.
This is obviously wrong. The correct calculation, as described on page 13 of the Supplementary Materials, does not include these increases.
This mistake might have been caused by miscommunication causing the programmer of calculate_bf
to expect count data while the programmer of clade_analysis_update
thought probabilities were required.
As in #1, the bug can be reproduced with code and data from the GitHub repository.
wget https://raw.githubusercontent.com/sars-cov-2-origins/multi-introduction/main/notebooks/cladeAnalysis.ipynb
wget https://github.com/sars-cov-2-origins/multi-introduction/blob/71ed420fe11ecdbe589568255ec90ca56d6e221c/FAVITES-COVID-Lite/cumulative_results/clade_analyses_CC.zip
wget https://github.com/sars-cov-2-origins/multi-introduction/blob/71ed420fe11ecdbe589568255ec90ca56d6e221c/FAVITES-COVID-Lite/cumulative_results/clade_analyses_AB.zip
unzip simulations_cumulative_results/clade_analyses_CC.zip
unzip simulations_cumulative_results/clade_analyses_AB.zip
Unnecessary imports may be commented out, e.g.
#import treeswift
#from utils import *
#import seaborn as sns
The second and sixth code cells must be updated so that clade_analyses_CC_dir
and clade_analyses_AB_dir
point to the newly unzipped clade_analyses_CC/
and clade_analyses_AB/
directories, e.g.
clade_analyses_CC_dir = 'clade_analyses_CC/'
clade_analyses_AB_dir = 'clade_analyses_AB/'
The function calculate_bf
can be edited to print the values used for P(t_p|I_1) and P((t_p,t_p)|I_2):
# FAVITES simulation information
# the 3 trees are in the order (t_p, t_1C, t_2C)
pr_3_topos = np.array(simulation_results)/sum(simulation_results)
pr_trees_given_I1 = np.concatenate([pr_3_topos, np.array([0]*9)])
pr_trees_given_I2 = np.concatenate([np.array([0]*3), np.outer(pr_3_topos, pr_3_topos).flatten()])
### print statements added below ###
print('Value used as P(t_p|I_1):')
print(str(pr_trees_given_I1[0]))
print('Value used as P((t_p,t_p)|I_2):')
print(str(pr_trees_given_I2[3]))
Re-running the notebook should reproduce the published results of the main analysis, but the values used for P(t_p|I_1) and P((t_p,t_p)|I_2) are not those shown on page 13 of the Supplementary Materials:
The function calculate_bf
can be edited again to comment out the mistaken division:
# FAVITES simulation information
# the 3 trees are in the order (t_p, t_1C, t_2C)
### the mistaken division is commented out of the line below ###
pr_3_topos = np.array(simulation_results) #/sum(simulation_results)
Re-running the notebook should now show that the values used for P(t_p|I_1) and P((t_p,t_p)|I_2) are the same as those shown on page 13 of the Supplementary Materials. Also, the Bayes factors are halved:
Adding the correction from #1 and re-running once more...
The combined corrections reduce the Bayes factors from ~60 to less than 5.
Another error increases the same key results.
The Bayes factor calculations marginalize probabilities over a list of topologies. The first three topologies correspond to the topologies A, C and B depicted in Figure 2 (reproduced in #1 and #6 above). In the code's notation they are t_p (a basal polytomy), t_1C (a clade on a branch two mutations from the MRCA) and t_2C (two clades).

The topologies are not distinct; a phylogeny with a basal polytomy and a descendant polytomy on a two-mutation branch is counted as both t_p and t_1C.

The next nine topologies are the pairwise combinations (t_p, t_p), (t_p, t_1C), (t_p, t_2C), (t_1C, t_p), (t_1C, t_1C), (t_1C, t_2C), (t_2C, t_p), (t_2C, t_1C) and (t_2C, t_2C).

Four different MRCA haplotypes are considered; their compatibility with the different topologies is set out on page 13 of the Supplementary Materials.
Posterior probabilities of the different MRCA haplotypes (taken from BEAST) are multiplied with the topology likelihoods (taken from the simulations) through a compatibility matrix that distributes the likelihoods across compatible MRCA haplotypes. Because t_p includes t_1C, the likelihoods for t_1C are counted twice in these calculations: firstly as part of t_p in (t_p, t_p), and then separately in (t_p, t_1C), (t_1C, t_p) and (t_1C, t_1C).

This duplication can be corrected by negating the compatibility of (t_p, t_1C), (t_1C, t_p) and (t_1C, t_1C), e.g. by changing the compatibility matrix from:
to
Combining the corrections from #1 and #6, the notebook can be re-run to obtain the correct Bayes factors.
Removing the duplicated likelihoods reduces the Bayes factors by a further ~12%.
A revised version of cladeAnalysis.ipynb has been uploaded to the GitHub repository.
The programming error that undercounted AB topologies (#1) is corrected, as indicated in the Erratum.
The improper normalization of the single-introduction likelihoods (#6) is removed:
The double-counting of two-introduction topologies (#7) is corrected by making t_p and t_1C exclusive:

This is mathematically identical to retaining the original t_p and negating compatibility with (t_p, t_1C), (t_1C, t_p) and (t_1C, t_1C), as in #7 (the marginalization over the two-introduction topologies is superfluous).
Frequencies and Bayes factors are updated in Table 5 of the Supplementary Materials:
The revised results for the primary analysis are slightly different to those in #7 because the revision also removes an undocumented requirement that the two-mutation clade of AB topologies be the largest, as discussed here and verified here.
#9 Really embarrassing correction to #4: The original factor of 2 was correct. P(N=2)/P(N=1) = 1/2 starting with non-informative priors on the expectation value, x. Conditioning on N>0 does eliminate the small-x logarithmic divergence of the normalization but does not change the odds ratio for N=2 vs. N=1, since those probabilities already excluded all N=0 cases. Like I said, it's embarrassing to make a mistake and then have to undo it.
There are discrepancies between the methods described in the text and those carried out by the code.
The effects are difficult to quantify because the stochastic simulations are not reproducible and the results are sensitive to sampling noise. However, resampling the final stochastic phase of the simulations, using corrected code, reduces the Bayes factors of the primary analysis by ~15%.
The simulated phylogenies are constructed by coalescing lineages sampled from the first 50,000 simulated cases. Partial and delayed sampling is simulated by sampling only a portion of simulated cases, and by ignoring samples that precede the first simulated hospitalization (page 8 of the Supplementary Materials).
Samples are also ignored if they are on branches ancestral to a stable coalescence (page 10 of the Supplementary Materials).
Contrary to the described method, the function coalescent_timing
in the script stableCoalescence_cladeAnalysis.py does not stop the tMRCA calculations when 50,000 individuals have been infected. Calculations continue until the last day of the simulation.
def coalescent_timing(time_inf_dict, current_inf_dict, total_inf_dict, tree, num_days=100):
# determine the stable coalescence of the tree
...
times = list(time_inf_dict.keys()) # the number of days in the sim
...
for index, time in enumerate(times):
if time > num_days: # sometimes gemf goes past the limit but we don't always know when
break
labels = time_inf_dict[time] # currently circulating infections
if time == 0:
...
else: # get height of the subtree; the tMRCA
subtree = tree.extract_tree_with(labels)
height_subtree = distance_from_zero_dict[subtree.root.label]
heights.append(height_subtree)
...
coalescent_timing_results = [used_times, heights, total_inf, current_inf, current_samples]
return coalescent_timing_results
For example, the first successful simulation run 0001
reaches 50,000 cases on day 39, when the tMRCA of active sampled cases is 0.000333 years (~3 hours). The calculations continue until the end of the simulation, after another 61 days and 1.3 million cases. By then, only 710 sampled cases are still active and their tMRCA is 0.016277 years (~6 days). This can be seen in the file simulations_01/0001/coalData_parameterized.txt
in the published data on Zenodo.
time coalescence time total infected currently infected current samples
...
38 0.000333 45444 33079 5731
39 0.000333 53606 38638 6175
...
99 0.016277 1363477 149166 743
100 0.016277 1371985 144107 710
The tMRCA from the last day of the simulation is used as the time of stable coalescence. This can be seen by comparing the coalescence times in each simulation's coalData_parameterized.txt
to those recorded in the summary data on GitHub.
Run Coalescent time First ascertained First unascertained First hospitalized infections
0001 0.016277 0.003101 0.018548 0.038307 3
...
This means the code removes basal lineages that do not have active sampled cases at the end of the simulation (day 100), even if they do have active sampled cases at the end of the sampling period (the infection of the 50,000th case). This does not agree with the method described in the text.
As described in the text, the stable coalescence is based on the MRCA of lineages that have a sampled case that is active when the 50,000th case is infected. Lineages are more likely to have a sampled case that is active at this time if they undergo more early growth, so they are more likely to be associated with an early polytomy. By rooting the tree at the stable coalescence, basal lineages that are less likely to be associated with an early polytomy are removed. If the next structure is a polytomy, it is made basal.
As implemented in the code, the stable coalescence is the MRCA of lineages that have a sampled case that is active at the end of the simulation, even though the last sampled case might have been infected months earlier. Lineages are more likely to have a sampled case that is active at this later time if they undergo a lot more early growth, so they are much more likely to be associated with an early polytomy. This means more basal lineages are removed, and more polytomies are made basal.
Lineages are not removed if they descend from the stable coalescence.
The epidemic simulation script FAVITES-COVID-Lite_noSeqgen.py accepts commands to initialise the primary case in arbitrary states. However, the script is hard-coded to initialise the primary case in an infectious state (P1
), so that transmissions begin immediately.
# write GEMF status file
out_file = open(out_fn, 'w')
status_fn = "%s/%s" % (gemf_tmp_dir, GEMF_STATUS_FN)
status_file = open(status_fn, 'w')
seeds = set(sample(range(max_node_label+1), k=num_seeds))
gemf_state = list()
for node in range(max_node_label+1):
if node in seeds:
status_file.write("%d\n" % gemf_state_to_num['P1'])
gemf_state.append(gemf_state_to_num['P1'])
out_file.write("None\t%d\t0\n" % node)
else:
status_file.write("%d\n" % gemf_state_to_num['S'])
gemf_state.append(gemf_state_to_num['S'])
status_file.close()
print_log("Wrote GEMF '%s' file: %s" % (GEMF_STATUS_FN, status_fn))
This is despite the published command indicating that the primary case should start out in the exposed but non-infectious state (--tn_freq_e 0.00000020
).
~/scripts/FAVITES-COVID-Lite-updated.py --gzip_output --path_ngg_barabasi_albert ngg_barabasi_albert --path_gemf GEMF --path_coatran_constant coatran_constant --path_seqgen seq-gen --cn_n 5000000 --cn_m 8 --tn_s_to_e_seed 0 --tn_e_to_p1 125.862069 --tn_p1_to_p2 999999999 --tn_p2_to_i1 23.804348 --tn_p2_to_a1 134.891304 --tn_i1_to_i2 62.931034 --tn_i1_to_h 0.000000 --tn_i1_to_r 62.931034 --tn_i2_to_h 45.061728 --tn_i2_to_r 0.000000 --tn_a1_to_a2 9999999999 --tn_a2_to_r 125.862069 --tn_h_to_r 12.166667 --tn_s_to_e_by_e 0 --tn_s_to_e_by_p1 0 --tn_s_to_e_by_p2 3.513125 --tn_s_to_e_by_i1 6.387500 --tn_s_to_e_by_i2 6.387500 --tn_s_to_e_by_a1 0 --tn_s_to_e_by_a2 3.513125 --tn_freq_s 0.99999980 --tn_freq_e 0.00000020 --tn_freq_p1 0 --tn_freq_p2 0 --tn_freq_i1 0 --tn_freq_i2 0 --tn_freq_a1 0 --tn_freq_a2 0 --tn_freq_h 0 --tn_freq_r 0 --tn_end_time 0.273973 --tn_num_seeds 1 --pt_eff_pop_size 1 --pm_mut_rate 0.00092 --o
That is, the code forces simulations to skip the latent phase of the primary case.
For simulations where the stable coalescence is in the primary case (~20% of simulations), this compresses coalescence events around the stable coalescence, reducing the likelihood of an early mutation breaking up a basal polytomy.
For simulations where the stable coalescence is not in the primary case (~80% of simulations), this pushes the estimated time of introduction forward, and decreases its variance.
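For reference, a minimal sketch (not the authors' code) of the change that would instead seed the primary case in the exposed compartment, assuming 'E' is a key of gemf_state_to_num, as the compartment names in the published command suggest:

# sketch only: inside the seed-writing loop of FAVITES-COVID-Lite_noSeqgen.py
if node in seeds:
    status_file.write("%d\n" % gemf_state_to_num['E'])  # was gemf_state_to_num['P1']
    gemf_state.append(gemf_state_to_num['E'])           # was gemf_state_to_num['P1']
    out_file.write("None\t%d\t0\n" % node)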
The main
function in the script stableCoalescence_cladeAnalysis.py restores basal lineages if their MRCA is sufficiently close to the time of stable coalescence.
# main function
...
coal_timing = coalescent_timing(time_inf_dict, current_inf_dict, total_inf_dict, subtree, args.num_days)
# prepare for clade analysis; get the subtree with the stable coalescence (MRCA) root
eps = 1e-8
stable_coalescence = coal_timing[1][-1]
subtree_sc_leaves = []
for n in subtree.distances_from_root():
if abs(n[1] - stable_coalescence) < eps:
# print(n[0].label)
subtree_sc_leaves += [n.label for n in subtree.extract_subtree(n[0]).traverse_leaves()]
subtree_sc_leaves = set(subtree_sc_leaves)
subtree_sc = tree.extract_tree_with(subtree_sc_leaves)
This can increase the size of basal polytomies, but only in rare cases where coalescence events are compressed closely enough around the stable coalescence.
In one simulation (0823) this occurs in the primary case, where the compression was amplified by the elision of the latent phase.
Clumsy removal of the primary case sample also removes any other leaves whose label string happens to contain the primary case's node number. (The primary case is always sampled to ensure CoaTran roots the phylogeny in the primary case, and is removed for clade analysis.)
# get subtree that excludes index case
tree = treeswift.read_tree_newick(args.time_tree)
subtree_leaves = []
for n in tree.traverse_leaves():
if index_case not in n.label:
subtree_leaves.append(n.label)
subtree = tree.extract_tree_with(subtree_leaves)
subtree.suppress_unifurcations()
This is a very minor error, mentioned here only because it was corrected in subsequent reanalyses.
The Bayes factors are described on pages 11 to 14 of the Supplementary Materials. Combined and expanded, their equations can be written as:
where:
The data include RNG seeds for the epidemic simulations, but not for the sample times, coalescence, or mutation simulations. Therefore, the simulations cannot be reproduced. However, assuming the published likelihoods are sufficiently accurate, the results of the 1100 simulations can be replicated by sampling appropriate binomial distributions (t_2C is neglected here because its measured likelihood was zero).
import numpy as np
np.random.seed(42)
def sample_likelihoods():
p_tau_p_given_i1 = 0.475 # from Fig. 2
p_tau_1c_given_i1 = 0.031 # from Fig. 2
# sample the number of simulations that have basal polytomies
n_tau_p = np.random.binomial(1100,p_tau_p_given_i1)
# sample the number of basal polytomies that have another on a two mutation branch
p_tau_1c_given_tau_p = p_tau_1c_given_i1/p_tau_p_given_i1
n_tau_1c = np.random.binomial(n_tau_p,p_tau_1c_given_tau_p)
# return the likelihoods
return n_tau_p/1100, n_tau_1c/1100
The Bayes factor equation can be written in terms of the likelihoods and posterior probabilities.
def compute_bfs(p_tau_p_given_i1, p_tau_1c_given_i1, posteriors):
bf = 0.25*p_tau_p_given_i1**2*sum(posteriors)/(0.5*p_tau_1c_given_i1*sum(posteriors[:2]))
return bf
Repeatedly resampling the likelihoods and computing the Bayes factors then provides a distribution of results that would be expected from replicating the analysis.
unconstrained_posteriors = np.array([1.68, 80.85, 10.32, 0.92])/100 # from Table 1
results = []
for i in range(20000): # sample 20000 times
p_tau_p_given_i1, p_tau_1c_given_i1 = sample_likelihoods()
results.append(compute_bfs(p_tau_p_given_i1, p_tau_1c_given_i1, unconstrained_posteriors))
results.sort()
print(f'95% CDI of Bayes factors with unconstrained rooting: {results[500]:.1f}, {results[19500]:.1f}')
95% CDI of Bayes factors with unconstrained rooting: 3.1, 6.0
The central 95% of the distribution spans a range comparable in size to the Bayes factors themselves.
The epidemic simulation script FAVITES-COVID-Lite_noSeqgen.py can be corrected to initialise the primary case in the exposed compartment (E
). However, rather than repeating the epidemic simulations, the published data stored on Zenodo can be reused by sampling times for the latent periods and adding them to the existing transmission and sampling times. CoaTran can then be rerun to obtain corrected phylogenies.
import os
import subprocess  # needed for the coatran_constant call below
import numpy as np
from gzip import open as gopen
np.random.seed(42)
# go through each of the 1100 simulations
for i in range(1,1101):
# generate a latent period
latent_period = np.random.exponential(2.9/365) # expected value is 2.9 days, i.e. 2.9/365 years (Table S3)
# set up file paths, open files and write out the corrected data for transmission network
old_tn_path = f'simulations/{i:04d}/transmission_network.subsampled.txt.gz'
new_tn_path = f'simulations/{i:04d}/transmission_network.subsampled.corrected.txt'
with gopen(old_tn_path, 'rt') as old_tn, open(new_tn_path, 'w') as new_tn:
for line in old_tn:
u, v, t = line.strip().split('\t')
new_tn.write(f'{u}\t{v}\t{float(t) + latent_period:.6f}\n')
# set up file paths, open files and write out the corrected data for sample times
old_samples_path = f'simulations/{i:04d}/subsample_times.txt.gz'
new_samples_path = f'simulations/{i:04d}/subsample_times.corrected.txt'
with gopen(old_samples_path, 'rt') as old_samples, open(new_samples_path, 'w') as new_samples:
for line in old_samples:
u, t = line.strip().split('\t')
new_samples.write(f'{u}\t{float(t) + latent_period}\n')
# setup and run CoaTRan to generate a corrected time tree
command = ['coatran_constant', new_tn_path, new_samples_path, '1']
env = os.environ.copy()
coatran_rng_seed = np.random.randint(low=0, high=2**31) # ensure it is reproducible
env["COATRAN_RNG_SEED"] = str(coatran_rng_seed)
result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=env)
tree_path = f'simulations/{i:04d}/tree.time.subsampled.corrected.nwk'
with open(tree_path, 'w') as file_handle:
file_handle.write(result.stdout.decode())
The script stableCoalescence_cladeAnalysis.py can be corrected to determine the stable coalescence properly by:

- stopping coalescent_timing once 50,000 individuals have been infected, walking back to find the first day when the tMRCA of active sampled cases jumps forward by less than one day, and returning the MRCA of that day as the stable coalescence (a sketch follows below); and
- updating the main function to extract the subtree rooted at the stable coalescence.

Correcting the removal of the primary case sample is not really necessary, but simple enough.
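Returning to the first point, a minimal sketch (one reading of the intended rule, not the authors' code) of the walk-back: total_inf is the cumulative infection count per recorded day and heights is the corresponding tMRCA of active sampled cases (in years), as collected by coalescent_timing.

def stable_coalescence_index(total_inf, heights, max_infections=50000):
    # stop at the first day on which 50,000 individuals have been infected
    end = next(i for i, n in enumerate(total_inf) if n >= max_infections)
    # walk back to the earliest day whose tMRCA is within one day (1/365 years)
    # of the tMRCA at the stopping point
    start = end
    while start > 0 and heights[end] - heights[start - 1] < 1/365:
        start -= 1
    return start  # index of the day whose MRCA is taken as the stable coalescence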
Complete code and instructions for reproducibly generating corrected time trees are published in this branch of the authors' repository. In order to suppress sampling noise, the code also automates resampling of the mutation simulations and re-analysis of the resulting clades, using the correct stable coalescence, 1000 times.
The corrections and resampling reduce the Bayes factors by ~15%.
The derivation of the Bayes factors is described over pages 11 to 14 of the Supplementary Materials. The approach is set out on page 11:
That is, the Bayes factors compare the likelihoods of the sequence data arising from two introductions (I_2) versus one introduction (I_1):
Page 14 describes how the Bayes factor is taken from the combination of the posterior and prior odds:
Page 12 rewrites the posterior probabilities and breaks them down over the relevant MRCA haplotypes:

Page 13 breaks down these probabilities over the relevant topologies:
Page 13 then describes the compatibility between the different topologies and MRCA haplotypes:
As noted in #7, the topology (t_p, t_p) encompasses the sub-topologies (t_p, t_1C), (t_1C, t_p) and (t_1C, t_1C). Therefore, it is sufficient to state:
Page 13 then describes how the compatibility statements define the probability of each MRCA haplotype given a topology:
This description is wrong. The renormalization is applied when multiple MRCA haplotypes are compatible with a topology. The compatibility statements allow the renormalised compatibility vectors for the three relevant topologies to be written out as:
These allow the equation to be expanded out, with the priors cancelled by their inverses:

This is mathematically identical to the authors' equations, but simpler because it avoids the superfluous marginalization over the sub-topologies (t_p, t_1C), (t_1C, t_p) and (t_1C, t_1C).
The authors inferred the posterior probabilities of the four candidate MRCA haplotypes using the phylodynamic software BEAST, producing the results shown in Table 1. With the results for the recCA and unconstrained rootings, the Bayes factor equations become:

The authors measured the likelihoods P(t_p|I_1), P(t_1C|I_1) and P(t_2C|I_1) from the frequencies with which successful simulated introductions produced the topologies t_p, t_1C and t_2C.
P(t_p|I_1) is defined on page 10:

P((t_p, t_p)|I_2) was then calculated as P(t_p|I_1)^2. Therefore, P((t_p, t_p)|I_2) is the likelihood of two successful introductions producing two clades where there is a polytomy at the root of each clade.

P(t_1C|I_1) and P(t_2C|I_1) are defined on page 11:
Therefore, P(t_1C|I_1) and P(t_2C|I_1) are the likelihoods of a single successful introduction producing two clades where there is a polytomy at the root of each clade, and additionally: (1) the clades satisfy the relative size constraint (each comprising 30% to 70% of the taxa); and (2) the clades are separated by two mutations.

The additional conditions 1 and 2 reduce the single introduction likelihoods. This makes the published Bayes factors (4.2 and 4.3) meaningless, since they may be caused by this reduction rather than by a higher plausibility of the two introduction hypothesis.
This can be corrected by applying conditions 1 and 2 to the two-introduction likelihood P((t_p, t_p)|I_2).
This comment examines the effects of condition 1. A subsequent comment will examine condition 2.
Two simulations can be tested against condition 1 by going through the first 50,000 infections across both combined, and comparing the number of samples from each, e.g.:
def test_sizes(run_1,run_2):
# read samples into separate sets for each run
sample_dict = {run_1: set(), run_2: set()}
# read transmissions into one list for both runs
transmission_list = []
for run in [run_1, run_2]:
with open(f'simulations/{run:04d}/subsample_times.corrected.txt', 'r') as file:
for line in file:
u, t = line.strip().split('\t')
sample_dict[run].add(u)
with open(f'simulations/{run:04d}/transmission_network.subsampled.corrected.txt', 'r') as file:
for line in file:
u,v,t = line.strip().split()
if u != v:
transmission_list.append((v,float(t),run))
# sort the transmissions
transmission_list = sorted(transmission_list, key=lambda x: x[1])
# start sample counts from -1 to compensate for the primary sample
n_valid_samples = {run_1: -1, run_2: -1}
# go through the transmissions from earliest to latest
for event_no, transmission in enumerate(transmission_list):
# check if the transmission is amongst the samples of its run
if transmission[0] in sample_dict[transmission[2]]:
# increment the count for that run
n_valid_samples[transmission[2]] += 1
# stop after the first 50,000
if event_no == 50000:
break
# compare the numbers of sampled transmissions to the relative size condition
if 0.3 <= n_valid_samples[run_1]/(n_valid_samples[run_1] + n_valid_samples[run_2]) <= 0.7:
return True
else:
return False
The likelihood of two successful introductions producing basal polytomies and satisfying condition 1 can be measured by repeatedly drawing random pairs of simulations and testing them, e.g.:
# sample 1100 topologies, as for the others
for i in range(1100):
# draw two simulations
run_1, run_2 = np.random.choice(range(1, 1101), 2, replace=True)
# get the clade sizes
clade_sizes_1 = clade_analyses_CC_d[run_1]['clade_sizes']
clade_sizes_2 = clade_analyses_CC_d[run_2]['clade_sizes']
# test them for basal polytomies
if len(clade_sizes_1) >= min_polytomy_descendants and len(clade_sizes_2) >= min_polytomy_descendants:
# test their sizes
if test_sizes(run_1, run_2):
count_atLeastMinDescendants += 1
The function clade_analysis_updated
in the notebook cladeAnalysis.ipynb can be adapted to do this instead of counting the number of simulations with basal polytomies. The function calculate_bf
can then be adapted to use the resulting likelihood as P((t_p, t_p)|I_2). The topology (t_p, t_p) encompasses all topologies compatible with the two introduction hypothesis, so the likelihoods of the other two-introduction topologies must be set to zero to avoid double counting.
def calculate_bf(asr_results, simulation_results):
# Let t_p be a polytomy, t_1C be one clade, and t_2c be two clades. Note that t_p includes t_1c. t_p equals all topologies with a basal polytomy (Fig. 2a).
# trees are in the order
# (t_p, t_1C, t_2C, (t_p,(t_p,t_1C,t_2C)), (t_1C,(t_p,t_1C,t_2C)), (t_2c,(t_p,t_1C,t_2C)))
...
pr_trees_given_I2 = np.concatenate([np.array([0]*3), np.array([simulation_results[0]]), np.array([0]*8)])
The notebook can then be rerun to produce the following results.
Inconsistent treatment of the different hypotheses inflated the Bayes factors by a factor of six.
Note that the authors measured the two week doubling times of the simulated epidemics at the 1000th infection, and found a 95% highest density interval (HDI) of 1.35 to 5.44 days. This range of early growth rates suggests that two introductions are unlikely to grow to similar sizes within the first 50,000 infections.
Code and instructions for reproducing these results are available at this branch of the authors' repository.
- test_sizes assumes both introductions are simultaneous, which maximises the likelihood of the two introductions having similar sizes.

Overall, these limitations increase the likelihood of two simulations producing two basal polytomies and satisfying condition 1, so the corrected Bayes factors may be considered a rough upper bound.
It is theoretically possible that applying condition 2 to the two introduction likelihoods could increase the Bayes factors. This would require an increase in the likelihood of two introductions producing topologies compatible with the more probable MRCA haplotypes (i.e. A or B). This is considered unlikely, but will be confirmed either way in the next comment.
The Pekar et al. simulations were independently (and somewhat differently) re-coded by Peter Miller. Contrary to Neoacanthoparyphium echinatoides, Miller confirmed Pekar et al.'s conclusions and orders of magnitude.
Proportion of simulations with two main lineages:
https://x.com/tgof137/status/1772418937579311526?s=20
Timing of the introduction of SARS-CoV-2 in humans:
https://x.com/tgof137/status/1772417301561708753?s=20
#13 The results of the different simulations can be compared directly. For the proportion of simulations with two main lineages, they are:
These results do not seem contradictory.
Miller hasn’t published details of his analyses or his code, but it seems he compares the likelihood of a single successful introduction producing two basal polytomies that are two mutations apart and have similarly sized clades, against the likelihood of two successful introductions each having a basal polytomy.
It therefore seems that Miller repeats the logic errors described in #12.
Miller’s simulations and analyses seem significantly different from those of Pekar et al. For example, rather than simulating the epidemic over a scale-free network, Miller uses a branching process with a negative binomial offspring distribution. It also seems that, unlike Pekar et al., Miller neglects basal polytomies that have root mutations. The effects of these differences might be interesting, but it is difficult to assess their significance without knowing more details of the analyses and code.
Further to #12, two successful introductions can satisfy the two-mutation separation constraint either as:

These correspond to the single introduction topologies t_1C and t_2C. They have the same compatibility:
With these more specific two-introduction topologies, the compatibility vectors become:
The Bayes factor equations for the unconstrained and recCA rootings then become:
The mutations that determine the topology of two successful introductions can be sampled using the molecular clock and the times between the MRCA and each clade root. Pekar et al.'s simulations provide the times between each introduction and the subsequent clade root. The times between the MRCA and each introduction require an additional model. Absent information on upstream population sizes, structures and dynamics, a reasonable model samples the time between the MRCA and the first introduction from an exponential distribution whose expected value represents the upstream effective population size. Similarly, absent information on the interface between the source and the human population, a reasonable model samples the time between introductions from an exponential distribution whose expected value represents the intensity of introductions.
The time between introductions can also be incorporated into the relative size test, replacing the maximum likelihood assumption of simultaneous introductions.
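A minimal sketch of how such timings could be combined with the molecular clock to sample the number of mutations separating the two clade roots: mu is taken from the published command (--pm_mut_rate 0.00092), assumed here to be in substitutions per site per year; the genome length of 29,903 sites and the expected waiting times are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(42)

def sample_separation(t_root_1, t_root_2, mean_mrca_to_intro=4/365,
                      mean_between_intros=2/365, mu=0.00092, genome_length=29903):
    # time from the upstream MRCA to the first successful introduction
    t_a = rng.exponential(mean_mrca_to_intro)
    # time between the first and second successful introductions
    t_b = rng.exponential(mean_between_intros)
    # total branch length (in years) on the path between the two clade roots:
    # clade root 1 -> introduction 1 -> MRCA -> introduction 2 -> clade root 2
    total_time = t_root_1 + t_a + (t_a + t_b) + t_root_2
    # number of mutations on that path under a Poisson molecular clock
    return rng.poisson(mu*genome_length*total_time)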
Note there is some ambiguity about the separation constraint. Readers may assume that the two clades must be separated by exactly two mutations. In the code, the constraint is two or more mutations. For completeness, results for both constraints are presented here. The stricter constraint reduces the single introduction likelihoods: P(t_2C|I_1) from 0.2% to 0.1%, and P(t_1C|I_1) from 3.3% to 2.5%.
Using an accordingly modified version of Pekar et al.'s code (available at this branch of their repository), Bayes factors were calculated for all combinations of the constraint interpretation (two and two or more mutations), phylogeny rooting (recCA and unconstrained), a range of plausible expected values (representing the upstream effective population size) and a range of plausible expected values (representing the introduction intensity).
Two or more mutation separation and recCA rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.21 | 0.22 | 0.23 | 0.23 | 0.23 | 0.21 | 0.16 |
1 day | 0.23 | 0.24 | 0.25 | 0.25 | 0.25 | 0.21 | 0.17 |
2 days | 0.24 | 0.24 | 0.24 | 0.25 | 0.24 | 0.22 | 0.17 |
4 days | 0.25 | 0.26 | 0.25 | 0.26 | 0.26 | 0.22 | 0.16 |
7 days | 0.26 | 0.27 | 0.26 | 0.26 | 0.26 | 0.21 | 0.16 |
14 days | 0.27 | 0.27 | 0.26 | 0.26 | 0.24 | 0.21 | 0.16 |
28 days | 0.24 | 0.24 | 0.24 | 0.24 | 0.23 | 0.19 | 0.14 |
Two or more mutation separation and unconstrained rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.21 | 0.21 | 0.22 | 0.23 | 0.22 | 0.20 | 0.16 |
1 day | 0.22 | 0.23 | 0.24 | 0.24 | 0.24 | 0.20 | 0.16 |
2 days | 0.23 | 0.23 | 0.23 | 0.24 | 0.23 | 0.21 | 0.16 |
4 days | 0.24 | 0.25 | 0.24 | 0.25 | 0.25 | 0.21 | 0.15 |
7 days | 0.25 | 0.26 | 0.25 | 0.24 | 0.25 | 0.20 | 0.15 |
14 days | 0.25 | 0.25 | 0.24 | 0.24 | 0.22 | 0.19 | 0.14 |
28 days | 0.22 | 0.22 | 0.22 | 0.22 | 0.20 | 0.17 | 0.13 |
Exactly two mutation separation and recCA rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.16 | 0.16 | 0.17 | 0.17 | 0.16 | 0.13 | 0.10 |
1 day | 0.17 | 0.17 | 0.18 | 0.17 | 0.17 | 0.13 | 0.10 |
2 days | 0.18 | 0.17 | 0.17 | 0.17 | 0.16 | 0.13 | 0.09 |
4 days | 0.17 | 0.17 | 0.17 | 0.17 | 0.16 | 0.13 | 0.08 |
7 days | 0.16 | 0.17 | 0.16 | 0.15 | 0.15 | 0.11 | 0.07 |
14 days | 0.14 | 0.14 | 0.13 | 0.12 | 0.11 | 0.09 | 0.06 |
28 days | 0.09 | 0.09 | 0.09 | 0.09 | 0.08 | 0.06 | 0.04 |
Exactly two mutation separation and unconstrained rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.16 | 0.16 | 0.16 | 0.16 | 0.15 | 0.13 | 0.09 |
1 day | 0.17 | 0.17 | 0.18 | 0.17 | 0.16 | 0.13 | 0.10 |
2 days | 0.17 | 0.17 | 0.17 | 0.17 | 0.15 | 0.13 | 0.09 |
4 days | 0.17 | 0.17 | 0.16 | 0.17 | 0.16 | 0.13 | 0.08 |
7 days | 0.16 | 0.17 | 0.16 | 0.15 | 0.14 | 0.11 | 0.07 |
14 days | 0.14 | 0.14 | 0.13 | 0.12 | 0.11 | 0.09 | 0.06 |
28 days | 0.09 | 0.09 | 0.09 | 0.08 | 0.07 | 0.06 | 0.04 |
Irrespective of how the constraint is interpreted, how the phylogeny is rooted, and what priors are assumed for the upstream effective population size and introduction intensity, the Bayes factor never exceeds 0.3.
When the analysis is corrected so that the two-introduction model satisfies the same constraints as the single introduction model, the Bayes factors provide at least moderate support (< 0.3) for the single introduction hypothesis.
I think it has not been noted above that the error described in #1 was not present in the preprinted version of the work: https://zenodo.org/records/6342616. Figure 5 of the preprint shows the correct probability of the topology shown in (C), which matches the currently corrected version of the paper at Science. The error was therefore introduced during the review process. This is probably important to note, primarily because the conclusions of the peer reviewed paper are not different from those of the preprint.
It is also worth noting that the Bayes Factor analysis described in #12 and #15 was also not included in the preprint version of the work, which may again be valuable to note as the conclusions of the peer reviewed paper are not different from the preprint.
The alternative modeling proposed above in #12 and #15 adds an additional model for the two spillovers scenario, simulating two simultaneous successful introductions, and then two successful introductions spaced apart. This modeling is essentially telling us: "The observed tree structure is very unlikely under a single introduction; however, it is also unlikely under two introductions". However, it is rather interesting that neither of the models proposed in #12 or #15 can explain the observed data well. To me, this could indicate that the model proposed above for the multiple spillover scenario is inappropriate.
The question that immediately arises is therefore: what model could explain the data well? One model that might do so is multiple index cases (maybe simulated from 5-20?) of each of the two genetically distinct spillovers. My guess is this would be more likely to result in the two similarly sized polytomies observed than either of the models compared in #12 and #15. Such a model would also be a better representation of zoonotic spillovers than that proposed by #12, as when a close group of infected animals are together moved into a confined space with many humans, multiple humans could be expected to be infected. Actually, under such a scenario, it would be very strange to expect just one index case! Such a model could be tested to see how well it explains the data.
I believe such a model is in line with the thinking of the original paper, which consistently says "at least two" introductions, which can be presumed to refer to the two easily discernable genetically distinct lineages A and B, but leaves open the possibility of multiple human index cases of each. If the model above did explain the data better than either the single introduction single index case or the two introductions each with one index case models, then that would fascinatingly provide some degree of phylodynamic evidence for multiple index cases of each lineage.
Another piece of evidence is provided in the original paper for multiple spillovers: "However, the inability to reconcile the molecular clock at the outset of the COVID-19 pandemic with a lineage A ancestor without information from related sarbecoviruses (such as the recCA) requires us to question the assumption that both lineages A and B resulted from a single introduction.". And further "Further complicating this matter is the molecular clock of SARS-CoV-2 in humans, which rejects a single-introduction origin of the pandemic from a lineage A virus.". As linked above by another commenter, Peter Miller attempted to quantify this argument which exists in a qualitative form: https://x.com/tgof137/status/1772422322760139243. However, the modeling described in the original paper, #1, #12, and #15, do not include a quantification of this evidence, and so this could represent an important avenue for future work to develop a more quantitative whole picture of the question of the number of spillovers.
Tangentially, I am confused by how the results described in #12 and #15 could give a BF lower than 1. I can understand how that model could give BF close to 1, but not substantially less. I am curious if anyone has an intuitive explanation for why that might be.
I think the issues described above in my points 4 and 6 would be the most pertinent updates to the modeling proposed by #12 and #15, and indeed, may substantially advance the findings of the original work as well. However, there are also "real world" challenges to the modeling here: if there were multiple spillovers, they likely did not occur in a truly independent fashion. This may present an inherent challenge to correctly modeling what multiple spillovers "should" look like. There is also a chance that we may have to be content to say that a single introduction model cannot well explain the observed data, a finding that does not seem disputed by #12 or #15 above.
(Apologies, I forgot my initial sign-in code, but have saved to use the one under this username)
count_atLeastMinDescendants = 0
count_atLeastMinDescendants_even2 = 0
for i in range(1100):
run_1, run_2 = np.random.choice(range(1, 1101), 2, replace=True)
clade_sizes_1 = clade_analyses_CC_d[run_1]['clade_sizes']
clade_sizes_2 = clade_analyses_CC_d[run_2]['clade_sizes']
num_leaves_1 = sum(clade_analyses_CC_d[run_1]['clade_sizes'])
num_leaves_2 = sum(clade_analyses_CC_d[run_2]['clade_sizes'])
total_leaves = num_leaves_1 + num_leaves_2
# test them for basal polytomies
if len(clade_sizes_1) >= min_polytomy_descendants and len(clade_sizes_2) >= min_polytomy_descendants:
# test their sizes
count_atLeastMinDescendants += 1
if num_leaves_1 > total_leaves*0.30 and num_leaves_1 < total_leaves*0.70:
if num_leaves_2 > total_leaves*0.30 and num_leaves_2 < total_leaves*0.70:
count_atLeastMinDescendants_even2 += 1
In this code, I am comparing the sizes of the two sampled runs directly, instead of going back to the transmission chain, as is done with the single introduction constraint. My current understanding is that the only difference with #12 this would introduce is that the two lineage scenario would be sampling to 100,000 instead of 50,000 cases: is there another aspect I am forgetting?
My concern with the implementation in #12 is that it is introducing a constraint that correspondingly does not exist in the one intro scenario. In the one intro AB scenario, the constraint is performed directly on the number of observed taxa, not on the total number of simulated cases. I am not sure if one would expected to be different, but it is just an inconsistency in this implementation. Perhaps the author could explain.
My code above returns these proportions:
Proportion with a polytomy at the base: 0.475455
Proportion of two intros with two polytomies at the base: 0.227273
Proportion of two intros with two even polytomies at the base: 0.226364
Which are largely in line with the original work. So, two (single index case) introductions quite commonly produce two polytomies (at approximately 0.47^2, as expected), and only slightly fewer examples of two (single index case) intros result in drastically different sizes.
Presumably, this means that the main reason why #12 gets such different numbers is in whether the polytomies produced by two successful intros are of similar sizes. But conceptually, why might this be happening based on the code presented in #12 and not in the code presented above? I think it's just helpful to walk through why the results in #12 were so different.
Previously I was confused about how #12 could derive a BF < 1, and am still interested in an intuitive explanation of why this might be. But I now just realized that the % of simulations passing for the two intro scenario and the one intro scenario in #12 are exactly the same: 3.34%. Does the author have an explanation for why that might be? It is hard to come up with a mental model as to why the probabilities under two and one simultaneous intros would be completely identical, and clearly it deviates from my result above.
Finally, I was thinking more about the additional modeling proposed in #15 about the timing of the spillovers. Previously I noted that these models assume an independence of spillovers that may be violated in the real world, but I now think they might be wholly inappropriate given the hypothesis put forth in the original paper. The hypothesis is essentially that two separate introductions of A and B explain their incongruence with the molecular clock. Therefore, by the definition of the hypothesis, the timing of the spillovers did not occur straightforwardly or according to an exponential distribution, but due to some (presumably hard to model) "weirdness". The hypothesis is that this "weirdness" was the unexpected dynamics in the unusual bottlenecks that may be expected to occur upstream of a spillover event, when a small number of animals would be moved from source populations to human-animal interfaces.
#17 Unlike the single-introduction constraint and #12, you are not comparing the number of taxa sampled from two clades in a single subsample (the first fifty thousand infections). You are comparing the number of taxa sampled from two equally sized subsamples (the first fifty thousand infections of each clade). Except for rare occasions when a simulation ends before filling the subsample, two equally sized subsamples yield similar numbers of taxa. That is why your result is different.
Please reconsider your comments.
Oh yes, I see now that you have to go back to the simulations before the sub-sampling when comparing two introductions, but not one, because the two clades are only subsampled on an equal footing within a single simulation: this resolves my question posed in #17, point 1, thank you!
While there are many points raised in #16 points 1-8 and #17 points 2-3, I assume readers can digest them and they don't have to be repeated, so we can focus on the crux of the matter after the resolution above. The major question then is why the tree ended up with two similarly sized polytomies with a discrepancy in the molecular clock. What Pekar et al. 2022 showed, and what the comments above replicate well, is that the former was very unlikely to occur under a model of a single introduction. The paper suggested, but did not formally quantify, that the latter aspect was also unlikely to occur under a model of a single introduction, while the twitter analysis by Peter Miller linked above did.
What the comments above in #12, #15, and #16 further show is that under a newly proposed model of two successful spillovers started each by one index case, two polytomies are very likely to occur, but unlikely to be evenly sized. This all leaves open the question of what model does explain how you get two evenly sized polytomies at the start of the epidemic.
I think there are two straightforward hypotheses:
Multiple index cases of each lineage: this would essentially guarantee a polytomy right at the start of the transmission chain in each case, making it more likely for the introductions to become similarly sized.
Non-independence of the emergence of each lineage: if the lineages emerged in the same environment, one which was poised for rapid spread at the onset, then that could cause them to both form polytomies right at the start of emergence, making them more likely to be evenly sized.
I am curious what other models can be proposed that could explain the data, given that a single introduction does not. If the results of #12 and #15 can be interpreted as evidence for either of the hypotheses above, that is most interesting, I think. Perhaps this was hinted at indirectly by the original paper, but it wasn't explicit, so there is room for future work to build on this (and of course modeling of the clock discrepancy).
#19 Pekar et al. quantify support for their conclusions using a model that assumes the two lineages emerged independently from two introductions.
It might be possible to salvage support for Pekar et al.'s conclusions by, as you suggest, relinquishing those assumptions.
I don't think it would be appropriate to make such a drastic change to a published analysis.
I uploaded the wrong version of the script stableCoalescence_cladeAnalysis.py to the GitHub repository for reproducing #15.
I've now fixed it and am re-running the analysis. I should have results in a few days.
I'm very sorry to anyone trying to reproduce #15 using my code.
Results from the rerun are practically identical to #15. There are some small differences that are probably due to resource handling variations allowing different parallel processes to get ahead and draw different pseudorandom numbers. The next analysis will seed separate generators for each process.
One of the authors, Robert Garry, made the following comments in his testimony to the U.S. Senate Committee on Homeland Security and Governmental Affairs:
Robert Garry's contributions to the paper are listed as "writing - review and editing". It is possible that he was misinformed about the extent of the amendments to the code, which are summarized in #8.
It is nice to see the updated results in #22, which importantly again confirm that the observed data would only rarely be expected under a single introduction (or the author of #22's particular model of two introductions). Needless to say, but to do so nonetheless, the points raised in #16 1-8 and #17 points 2-3 are therefore still relevant to the results of #22.
Garry's testimony would appear to be correct, because as noted in #16, the correction issued by the paper simply matches the numbers provided in the preprint version of the work. As the preprint's conclusions did not differ from the conclusions of the published paper, Garry's assessment of the extent of the amendments as limited in how they impact conclusions is entirely consistent. Indeed, given this, it would be challenging to imagine another interpretation thereof.
#15 sampled timings for the two successful introductions using simple models (the continuous approximation of the coalescent under a fixed effective population size for the time between the MRCA and the first successful introduction, and a Poisson process for the time between the first successful introduction and the second). These have high variance, which is not completely inappropriate given the lack of information.
Rerunning the analysis with fixed timings removes this variance, so that the Bayes factor topographies become more pronounced.
Two or more mutation separation and recCA rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.21 | 0.22 | 0.23 | 0.24 | 0.26 | 0.25 | 0.14 |
1 day | 0.22 | 0.23 | 0.23 | 0.25 | 0.27 | 0.26 | 0.14 |
2 days | 0.24 | 0.25 | 0.25 | 0.26 | 0.27 | 0.25 | 0.13 |
4 days | 0.26 | 0.27 | 0.27 | 0.28 | 0.28 | 0.26 | 0.13 |
7 days | 0.28 | 0.29 | 0.29 | 0.29 | 0.29 | 0.25 | 0.12 |
14 days | 0.30 | 0.29 | 0.29 | 0.29 | 0.27 | 0.22 | 0.10 |
28 days | 0.25 | 0.24 | 0.24 | 0.23 | 0.21 | 0.17 | 0.08 |
Two or more mutation separation and unconstrained rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.21 | 0.21 | 0.22 | 0.23 | 0.25 | 0.24 | 0.13 |
1 day | 0.22 | 0.22 | 0.23 | 0.24 | 0.26 | 0.24 | 0.13 |
2 days | 0.23 | 0.24 | 0.24 | 0.25 | 0.26 | 0.24 | 0.12 |
4 days | 0.25 | 0.25 | 0.26 | 0.26 | 0.26 | 0.24 | 0.12 |
7 days | 0.27 | 0.27 | 0.28 | 0.28 | 0.27 | 0.23 | 0.11 |
14 days | 0.28 | 0.27 | 0.27 | 0.27 | 0.25 | 0.21 | 0.09 |
28 days | 0.22 | 0.22 | 0.21 | 0.21 | 0.19 | 0.15 | 0.07 |
Exactly two mutation separation and recCA rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.16 | 0.17 | 0.17 | 0.18 | 0.18 | 0.15 | 0.05 |
1 day | 0.17 | 0.17 | 0.17 | 0.18 | 0.18 | 0.15 | 0.05 |
2 days | 0.18 | 0.18 | 0.18 | 0.18 | 0.18 | 0.14 | 0.04 |
4 days | 0.19 | 0.18 | 0.18 | 0.18 | 0.17 | 0.13 | 0.04 |
7 days | 0.18 | 0.18 | 0.18 | 0.17 | 0.16 | 0.11 | 0.03 |
14 days | 0.14 | 0.14 | 0.13 | 0.12 | 0.10 | 0.07 | 0.02 |
28 days | 0.05 | 0.04 | 0.04 | 0.04 | 0.03 | 0.02 | 0.00 |
Exactly two mutation separation and unconstrained rooting:
\ | 0 days | 1 day | 2 days | 4 days | 7 days | 14 days | 28 days |
---|---|---|---|---|---|---|---|
0 days | 0.16 | 0.16 | 0.17 | 0.17 | 0.18 | 0.15 | 0.05 |
1 day | 0.17 | 0.17 | 0.17 | 0.17 | 0.18 | 0.15 | 0.05 |
2 days | 0.18 | 0.18 | 0.18 | 0.18 | 0.18 | 0.14 | 0.04 |
4 days | 0.18 | 0.18 | 0.18 | 0.18 | 0.17 | 0.13 | 0.03 |
7 days | 0.17 | 0.18 | 0.17 | 0.17 | 0.15 | 0.11 | 0.03 |
14 days | 0.13 | 0.13 | 0.13 | 0.12 | 0.10 | 0.06 | 0.02 |
28 days | 0.05 | 0.04 | 0.04 | 0.04 | 0.03 | 0.02 | 0.00 |
Note that a Bayes factor should be marginalized over prior distributions for each of the two expected values. However, even if there were information that could justify a degenerate prior at the parameter values that maximize the Bayes factor (e.g. expected values of 14 days and 0 days), there would still be support (Bayes factor ≤ 0.3) for the single introduction hypothesis.
This discussion is getting lost in details and seems to be missing the main point of the paper being discussed.
Quick questions for Neoacanthoparyphium echinatoides:
Are the data compatible with a single introduction?