Deep Learning for Molecular Epidemiology

Molecular epidemiology is based on the phylogeny of pathogen strains (e.g. HIV strains) sampled from a host population (e.g. a given country or risk group). This phylogeny is built with now-standard methods from the genetic sequences of the virus or bacterium under study. Each leaf of the phylogeny corresponds to a strain sampled from a given patient, and each internal node corresponds to the transmission of the pathogen from one patient to another. Using the sampling dates of the strains, we can date all the nodes (that is, the transmissions) of the tree. These data, which are easily acquired with modern sequencing methods, are richer than classic prevalence data, since they provide information on transmissions between patients (Figure 1). They are widely used to study the spread of epidemics such as Ebola, SARS-CoV-2 or tuberculosis. In particular, they make it possible to study how an epidemic spreads from one region to another, how quickly patients are sampled and treated, and whether transmission is faster in certain subpopulations. These results help to study epidemic outbreaks, to compare the impact of health policies and to design new ones. These approaches were widely used during the SARS-CoV-2 epidemic, with numerous phylogenies published in the mainstream press showing the emergence of new variants around the world.

More recently, “phylodynamics” has been developed. Its objective is to integrate classic epidemiological models, based on prevalence data, into a richer phylogenetic context where transmission trees are available. The difficulty of these approaches is mathematical: except for the simplest models, there are no simple mathematical expressions for calculating the likelihood of the data and estimating the parameters of the model. The authors of the publication took a radically different approach, which combines simulations with learning by deep neural networks. This type of approach is found in very different fields, such as weather forecasting. The model is not dissected mathematically, but simply used to simulate a large number of datasets corresponding to different values of the parameters. In a second step, a neural architecture learns from the simulated data (for which the parameter values are known) to predict the parameter values of real data. These architectures thus perform a form of non-linear interpolation between known simulated situations. The learning phase is computationally heavy, because a lot of data has to be simulated. The prediction phase, however, is extremely fast, which is key here because the major objective is epidemic surveillance.
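The "simulate, then learn" workflow described above can be illustrated with a deliberately tiny example. This is only a sketch under strong assumptions: the "epidemic model" is replaced by exponential waiting times with a single rate parameter, and the deep neural network is replaced by a linear least-squares regressor on hand-picked summaries. The function names (`simulate`, `features`, `predict_rate`) are hypothetical and do not come from the publication or from PhyloDeep.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(rate, n=500):
    """Toy stand-in for an epidemic simulator: n waiting times with the given rate."""
    return rng.exponential(scale=1.0 / rate, size=n)

def features(data):
    """Hand-picked summaries (assumption); a deep network would work on a raw encoding."""
    return np.array([1.0, np.log(data.mean()), np.log(np.median(data))])

# Step 1: simulate many datasets for which the parameter value is known.
train_rates = rng.uniform(0.5, 5.0, size=2000)
X = np.stack([features(simulate(r)) for r in train_rates])
y = np.log(train_rates)

# Step 2: learn the mapping from data to parameter (the costly step, done once).
# Linear least squares is used here in place of a neural network.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_rate(data):
    """Step 3: prediction on a new dataset is a single dot product, hence very fast."""
    return float(np.exp(features(data) @ weights))

observed = simulate(2.0)           # pretend this is a real dataset with unknown rate
estimate = predict_rate(observed)  # should land close to the true value 2.0
print(f"estimated rate: {estimate:.2f}")
```

No likelihood is ever written down: the regressor only sees simulated datasets and their known parameters, which is the point of the approach.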

The difficulty with this approach in the context of molecular epidemiology is that the data is a phylogeny, that is, a tree. The usual neural architectures, however, take as input a vector (or series) of real numbers. It was therefore necessary to encode the phylogenetic trees as vectors, in a form as well suited to learning as possible. This work, at the heart of Jakub Voznica’s thesis, consisted in testing several classic encodings, without success, before proposing a new, high-performing encoding combined with a convolutional neural architecture, close to the architectures behind the success of deep learning in image analysis. With this encoding and this architecture, the results are more precise than those obtained with the classic Bayesian methods, which are the reference in the field but are very heavy in computation time (several days), even with limited data (a few hundred pathogen sequences). With the approach published and implemented in the “PhyloDeep” software, it is possible to analyze phylogenies covering thousands of sequences in a few minutes. This software has been successfully applied to data sampled from MSM (men who have sex with men) in the city of Zürich. PhyloDeep demonstrated the existence of a subpopulation (the super-spreaders) that is limited in size but plays a major role in the spread of the epidemic, owing to the number of its partners and the frequency of contacts.

FIGURE 1: HIV-1 transmission tree among MSM in the city of Zürich

Phylogeny of 200 strains of HIV-1 sampled from the MSM population of the city of Zürich. Each leaf corresponds to a sampled strain, each internal node of the tree to a transmission between two patients. The tree is dated; the concentric scale (from 0 to 40) is in months. Visual analysis of the tree shows that two transmission modes coexist: (1) regular transmissions (for example in the blue circle) and (2) much faster transmissions (for example in the red ellipse). PhyloDeep predicts that around 8% of the population are super-spreaders who transmit the virus almost 10 times faster than the rest of the population.

FIGURE 2: Tree coding and learning

This figure from the publication summarizes the tree encoding (top) and the convolutional neural network (bottom) used to learn from this encoding. The trees are first reordered (the deepest leaves are placed on the left), then transformed into vectors by recording, alternately, the distance to the root of the leaves and of the internal nodes. This encoding, called CBLV (Compact Bijective Ladderized Vector), makes it possible to reconstruct the tree and therefore loses no information. The architecture takes a tree encoding as input and gives one of two types of output: either the best model for the dataset considered (for example a super-spreader type model), or the estimated parameters for these data under the model chosen by the user. The architecture is a relatively standard convolutional network, which extracts the salient characteristics of the tree, followed by a classic multilayer network (FFNN, for feed-forward neural network).
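The reorder-then-traverse idea behind this tree coding can be sketched in a few lines. The code below is a simplified illustration based only on the description above, not the PhyloDeep implementation: the `Node` class and the functions `max_leaf_depth`, `ladderize` and `encode` are hypothetical, and the published CBLV encoding may differ in details such as exactly which distances are stored, rescaling, or padding to a fixed length.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    """Binary tree node; branch_length is the distance to the parent (0 for the root)."""
    branch_length: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def max_leaf_depth(node: Node) -> float:
    """Distance from the node's parent to the deepest leaf below the node."""
    if node.is_leaf():
        return node.branch_length
    return node.branch_length + max(max_leaf_depth(node.left),
                                    max_leaf_depth(node.right))

def ladderize(node: Node) -> None:
    """Reorder children in place so the subtree with the deepest leaf is on the left."""
    if node.is_leaf():
        return
    if max_leaf_depth(node.right) > max_leaf_depth(node.left):
        node.left, node.right = node.right, node.left
    ladderize(node.left)
    ladderize(node.right)

def encode(root: Node) -> List[float]:
    """In-order traversal: leaves and internal nodes alternate, each stored as
    its distance to the root (simplified variant, see the caveats above)."""
    ladderize(root)
    vector: List[float] = []

    def visit(node: Node, parent_dist: float) -> None:
        dist = parent_dist + node.branch_length
        if node.is_leaf():
            vector.append(dist)
            return
        visit(node.left, dist)
        vector.append(dist)
        visit(node.right, dist)

    visit(root, 0.0)
    return vector

# Toy tree ((A:1,B:2):1,C:1); leaf depths are A=2, B=3, C=1.
tree = Node(0.0,
            Node(1.0, Node(1.0), Node(2.0)),  # internal node with leaves A and B
            Node(1.0))                        # leaf C
vector = encode(tree)
print(vector)  # [3.0, 1.0, 2.0, 0.0, 1.0]
```

In a fully binary tree an in-order traversal always alternates leaf, internal node, leaf, and so on, which is why the vector interleaves the two kinds of distances. Because the traversal order is fixed by the ladderization, a vector of this kind can be read back into a tree, which is the property the text calls bijective.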
