Machine Learning to Understand How Infectious Disease is Operating in a System

I use machine learning to try and understand how infectious disease is operating in a system and whether we can make predictions about the types of organisms that present threats or high risk for infectious disease that might come from the environment.

Machine learning is a way of using data to discover patterns and processes. It’s a set of data methods that allow for pattern recognition and for drawing information out of large volumes of data. It’s useful in ecology because we often have a lot of data that are imperfect in many ways. There are multiple hidden interactions that are unknowable in advance, and so it’s often hard to control for those interactions.

There are multiple data streams that give us varying amounts of information about different aspects of a system. Having a way to combine them into one big analysis is really useful for gleaning more information than would be possible by looking at just one or a few data streams at once, and so it gives us a more holistic understanding.

I also think one of the strengths of machine learning is that it’s a little bit agnostic to process. It doesn’t assume a particular underlying process; it treats the process as kind of unknowable. Instead it takes more of a top-down approach where it looks at the data first and identifies patterns that are strongly present in those data, and in doing so it offers up hypotheses to the researcher for what might be going on in that system to generate the patterns you are seeing in the data.

So it’s kind of inherently data-driven in that way and sort of model-free. So machine learning has been applied in ecology in a lot of really interesting and creative ways. Some of my favorite examples are using machine learning to try and understand what types of organisms might be most at risk for extinction and how different groups of organisms are differentially vulnerable to extinction.

That’s work by Ana Davidson and colleagues that was published in PNAS a while ago. There’s another group of studies that thinks about invasion potential for different species, particularly in plants. There’s a lot of data to suggest that some plants are more likely to invade new areas than others, and some colleagues at the University of Georgia, John Paul Schmidt and John Drake, have done some really interesting work to try and understand what traits are predictive of invasive plant species and whether it’s possible to assign a value to controlling the risks posed by invasive plants.

And the work that I do has really focused on infectious disease and trying to understand what types of organisms are most likely to give rise to infectious diseases: what types of pathogens are more likely to infect humans compared to others, what types of mammals are more likely to carry those diseases and transmit them to humans, and what types of vectors, like ticks and mosquitoes, are involved. What is it about the species that vector human infectious pathogens to us that distinguishes them from the tons of species for which there’s no indication that they’re carrying or vectoring anything that’s harmful to humans?

So distinguishing those groups of organisms is important for public health, but it’s also ecologically just really interesting to think about. I think one of the major strengths of applying machine learning approaches in ecology is that it allows us, as ecologists, to draw from our domain expertise: the background understanding of what drives fitness, evolutionary strategies, life history strategies, how ecosystems function, and what happens when we perturb them.

Taking all of that domain expertise and matching it up with the patterns that the machine learning algorithms offer up from the data themselves does a couple of things. First, it allows us to really take advantage of the data streams that are available, and second, it allows the algorithms to systematically offer up hypotheses that are then testable by ecologists.

So rather than the traditional approach, which is to make observations on the landscape and then think of hypotheses intuitively for why those things might be happening, we start with a whole bunch of data that are increasingly available at our fingertips and generate patterns from those data. Those patterns suggest hypotheses about what is generating them, and then we can go out and test those hypotheses.

So one example of how we have applied machine learning to try and understand infectious disease systems is to think about rodents. As a group, rodents have over 2,000 species, and about 200 of them carry between one and 11 different infectious parasites or pathogens. These are parasites and pathogens that can infect humans and cause disease in humans, so these are zoonotic pathogens.

So we wanted to understand what it is about those 200 that carry diseases that distinguishes them from the rest of the more than 2,000 species. And so we trained the machine learning algorithms on just the intrinsic variables that distinguish one species from another: things like how many litters they have per year, how soon they reach sexual maturity, what they eat, how social they are, what their geographic range is, and whether they are diurnal or nocturnal. These are things that are really fundamental to distinguishing one species from another.

And we asked the algorithm to learn the features of species that carry many zoonoses, and when we trained the model to do that it gave us a really interesting picture of what a rodent reservoir looks like. It turns out that rodent reservoirs tend to be things with a faster life history strategy: they reach sexual maturity early, and they have larger litter sizes than the majority of other rodent species. Rodent reservoirs also tend to have geographic ranges that are depauperate of mammal diversity, so there are many fewer mammal species in the geographic ranges of rodent reservoirs compared to rodents that don’t carry any diseases.
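To make the idea of learning a trait profile concrete, here is a minimal sketch in Python. It is not the analysis described above: the talk does not name the algorithm used, and every number in the table is made up for illustration. The sketch assumes three hypothetical traits and uses a decision stump (a single yes/no split on one trait, the simplest building block of tree-based methods) to ask which trait on its own best separates reservoir species from the rest.

```python
# Hypothetical toy data: every value below is invented for illustration only.
# Each row: ((litters_per_year, days_to_maturity, mammal_richness), is_reservoir)
TRAITS = ["litters_per_year", "days_to_maturity", "mammal_richness"]
SPECIES = [
    ((5.0, 45, 20), 1),
    ((4.0, 50, 25), 1),
    ((4.5, 60, 60), 1),
    ((3.5, 55, 30), 1),
    ((2.0, 120, 70), 0),
    ((1.5, 150, 35), 0),
    ((1.0, 200, 80), 0),
    ((2.5, 90, 65), 0),
]


def best_stump(rows, col):
    """Find the one threshold on a single trait that best separates the classes.

    Returns (accuracy, threshold, direction). Direction ">" means species with
    values above the threshold are predicted to be reservoirs; "<=" means the
    opposite.
    """
    values = sorted({x[col] for x, _ in rows})
    # Candidate thresholds are midpoints between consecutive distinct values.
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = (0.0, None, None)
    for t in thresholds:
        # Correct predictions if "above threshold" means "reservoir".
        hits_above = sum((x[col] > t) == bool(y) for x, y in rows)
        # The opposite rule gets exactly the complementary rows right.
        for direction, hits in ((">", hits_above), ("<=", len(rows) - hits_above)):
            accuracy = hits / len(rows)
            if accuracy > best[0]:
                best = (accuracy, t, direction)
    return best


def rank_traits(rows):
    """Rank traits by how well a single split on each one separates the classes."""
    results = [(TRAITS[c],) + best_stump(rows, c) for c in range(len(TRAITS))]
    return sorted(results, key=lambda r: r[1], reverse=True)


ranking = rank_traits(SPECIES)
for name, accuracy, threshold, direction in ranking:
    print(f"{name}: reservoir if value {direction} {threshold} (accuracy {accuracy:.2f})")
```

On this invented table the two life history traits each separate the classes perfectly, while mammal richness misclassifies one species. A real analysis would use many more species and traits, combine many splits rather than one, and evaluate the model on held-out data.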

And so all of these traits come together to form a profile of what makes a good rodent reservoir, and it was really satisfying to see that it corroborates a lot of the independent field studies that have been done by scientists looking at specific systems: specific host species infected with specific parasites in a particular location. When you gather up those lines of evidence, a bigger picture emerges of why some rodents might be able to carry more zoonotic pathogens than others.

It also suggests some nice hypotheses that can be tested in follow-up studies, either in particular systems or in a comparative way, looking at hosts that differ in the particular traits that the machine learning model deems important, so that we can understand why and how those traits contribute to the number of zoonoses that they carry.

You can do this type of analysis for lots of different problems, not just infectious disease. I think that machine learning has huge potential for the field of ecology and has a lot to offer for forecasting in general.
