Journal Club: Doing Epidemiology, but with web searches
My first look at an industry article and a primer on classification
The currency of the research world is the journal paper: a brief write-up of a breakthrough in science. While papers are most often associated with university professors, professors are not the sole engines of research. Oftentimes, companies will come face to face with problems that haven’t been tackled before. Once they have a solution (and it’s not revealing valuable trade secrets), it’s often to their benefit to write up an article and submit it just like any professor would.
From my experience, university papers often feel too obscure, solving a problem only relevant to insiders of the field. In contrast, papers coming out of industry have a more “applied” feel, where the problem makes more sense to laymen. Of course, this isn’t a hard rule; industry papers may also solve an obtuse problem in a research field that most will not appreciate. My post-Ph.D. sights are set squarely on industry, so it’s in my best interest to start looking at these papers too.
This week’s paper comes from Google! While it’s definitely a company dedicated to search, a lot of interesting health research has spawned from its service. Back in 2008, Google launched the Google Flu Trends service. The idea behind this project was that people who started to feel flu-like symptoms would be more likely to search for terms related to the flu. Since Google has by far the highest market share of online searches, Google researchers surmised that they might be able to predict the incidence of the flu in a given population based on how many people were doing these types of searches. The project had mixed results, but it really highlighted an incredible data source: people’s search history.
This week’s article has a similar flavor. Researchers at Google developed an algorithm that predicts the incidence of Lyme disease in the United States. Admittedly, this was the most approachable paper I could find among Google’s publications, and one I could comfortably talk about. I hope I can teach you something as well.
It’s time for Lyme.
Why Lyme Disease?
The choice of Lyme disease is specific here. The authors note that the incidence of Lyme disease has been drastically underestimated in the US. At one point, people thought that about 30,000 Americans contracted it each year, but after a more careful study, the CDC found that closer to 300,000 Americans had the disease. That’s a tenfold underestimation.
The Lyme Disease problem highlights a core question in epidemiology: how many people are affected by a particular disease in a given time frame? For those who do not know, epidemiology is the study of the who, when, where and why of diseases. It’s very closely related to biostatistics, and it’s common for the two fields to rub shoulders. Epidemiologists typically understand how common a disease is from regular surveillance; hospitals send tests to laboratories which log positive and negative results, and the CDC compiles data from local and state-level health departments to get a final count on disease numbers.
This is the ideal process, but there are many places where counting can go awry. The authors take special care in describing the difficulties specific to Lyme disease. Each state has its own practices for logging and classifying Lyme disease cases, which lends itself to inconsistent reporting. Furthermore, each state also reports at different times, making it more difficult to understand the state of the disease at any given time point. Finally, the authors note that the CDC only releases final figures two years after all the data is aggregated. This delay makes it difficult for health departments to react quickly to surges in the disease and for epidemiologists to understand just where the disease is hitting the hardest.
These difficulties highlight the need for an alternative approach to understanding disease incidence, one that can happen in real time. What was the approach used by these Googlers (and Bostonians)? Analyze everyone’s web search history.
How did they do it?
At its core, the paper presents a model they’ve created and how well this model performed. With these types of papers, it’s important to ask yourself a few things:
What kind of model was used? What’s the input and output?
How did the team train the model?
What did the paper use to define model performance? Is it clear what it means when a model is “good”?
This section will lightly cover each of these questions.
Estimating a probability based on words
At the very end of the paper, the authors state that the model, Lymelight, was a log-linear maximum entropy model. This model takes in Google search queries (aka the things we write at the top of Google Chrome) and outputs a predicted probability of this query being related to Lyme disease (aka a number between 0 and 1).
You don’t really need to know the deeper details about the “log-linear maximum entropy model” other than the fact that it was a type of classification model, as evidenced by its output. Classification models are a staple for statisticians and data scientists alike. Given some data, the model tries to guess which of two (or more) groups the data belongs to. In this case, a query is either “Lyme disease related” or not; these are the two “groups”.
Aside: I find there’s a semantic weirdness when people try to compare these two positions, especially when they both must learn almost the same material. For the longest time, I thought data science was an over-hyped term, but I’ve grown to see that the term has some value. That’s an opinionated article for another day.
It’s worth taking some time to understand the input to the model: search queries. We’re getting into the realm of natural language processing, to which I’ve only had light exposure, so I’ll do my best to explain. In the end, search queries are just words. How do you include language in a mathematical model? The short answer is something called an indicator function. An indicator function can only take two values, 1 and 0, and it takes the value 1 when a particular condition is fulfilled.
You can code the presence of particular words using indicator functions, and that’s what’s done here. Many, many words (~50,000) were specially chosen from a massive collection of search queries and placed in the model as indicator functions. Given a query, you split it up into words and make note of which words are present. Then, you simply “turn on” each of the indicator functions associated with those words. I should note here that the model was trained on more information than the words of the query itself (e.g., which search results were returned and which results were clicked on).
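To make this concrete, here’s a toy sketch of indicator-function features in Python. The vocabulary and query here are made up for illustration; the real model drew its ~50,000 words from actual search logs.

```python
# Hypothetical toy vocabulary; the real model used ~50,000 words.
vocabulary = ["lyme", "tick", "rash", "bite", "weather", "recipe"]

def indicator_features(query):
    """Return a 0/1 vector: 1 if the vocabulary word appears in the query."""
    words = set(query.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

print(indicator_features("tick bite rash"))  # → [0, 1, 1, 1, 0, 0]
```

Each position in the output corresponds to one indicator function, “turned on” only when its word appears in the query.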
Now that we know about their model, we’ll see how it was trained.
When researchers “train” a model, it’s somewhat similar to how humans learn a new pattern. In school, we ideally do some homework to practice a new skill, and then we learn where we went wrong from an answer key. With enough problems, we get better at the skill until we can reliably solve not only the practice problems but also problems we’ve never seen before. This process involves learning all sorts of patterns and rules that need to be applied to different situations.
To this end, I admittedly am not entirely clear on how Lymelight was trained — as in, I do not know about the practice problems Lymelight was given. The model is described as a “supervised machine-learned classifier”. The key word here is “supervised”, which is a fancy way of saying that the machine had access to the correct answer (whether or not a query was Lyme disease related) for each query.
So what kinds of queries were considered Lyme disease related and which weren’t? Positive examples (aka related) were those that led to very specific pages with a lot of information about Lyme disease (e.g., the Wikipedia or CDC page on Lyme disease). Negative examples were taken from random queries in the dataset (queries that, with high likelihood, did not lead to such pages).
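To illustrate what supervised training looks like, here’s a toy, from-scratch sketch. Logistic regression is the two-class case of a log-linear maximum entropy model, so it stands in for Lymelight’s classifier here; the queries, labels, and vocabulary are all invented.

```python
import math

# Invented toy vocabulary and training data; label 1 = Lyme related, 0 = not.
vocabulary = ["lyme", "tick", "rash", "symptoms", "cake", "weather"]
data = [
    ("lyme disease symptoms", 1),
    ("tick bite rash", 1),
    ("chocolate cake recipe", 0),
    ("weather tomorrow", 0),
]

def features(query):
    """0/1 indicator features over the toy vocabulary."""
    words = set(query.lower().split())
    return [1.0 if w in words else 0.0 for w in vocabulary]

weights = [0.0] * len(vocabulary)
bias = 0.0

def predict(x):
    """Probability (between 0 and 1) that the query is Lyme related."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

# Supervised learning: plain gradient descent on the log-loss,
# nudging the weights toward the known correct answers.
for _ in range(500):
    for query, label in data:
        x = features(query)
        error = predict(x) - label
        bias -= 0.1 * error
        weights = [w - 0.1 * error * xi for w, xi in zip(weights, x)]

print(predict(features("lyme rash")))    # high probability
print(predict(features("cake recipe")))  # low probability
```

After training, the classifier assigns high probability to queries sharing words with the positive examples and low probability otherwise, which is exactly the “access to the correct answers” idea that “supervised” refers to.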
From here, it’s all about math. Once Lymelight was able to classify a query as “related” or “not related” to a reasonable degree, it could start classifying a large collection of queries. After this, queries could be aggregated by county using spatial data associated with each query. This detail was unclear to me, but I would hazard a guess that it has to do with IP addresses and your Internet provider.
After Lymelight produced its incidence estimates, the researchers looked to the CDC statistics on Lyme disease for 2015. The CDC figures act as the “ground truth” since they are verified with each state and county. The researchers compared what Lymelight predicted from the search query data against these figures and produced a summary value of how well the predictions match the “truth”.
The researchers also evaluated how well Lymelight’s 2014 incidence estimates could predict incidence for 2015. This was done to make sure that Lymelight’s predictions were stable over time.
What did they find?
In short, the model performed extraordinarily well, considering it was just trained on people’s search history!
The paper reports that Lymelight’s predictions had a 92% correlation with the official CDC county statistics, with a p-value of less than 0.0001. This p-value represents the probability of observing a correlation at least that high, assuming there was no correlation at all in the first place. In other words, it looks like the model produces some very good predictions that aren’t just due to chance. I’m surprised correlation is used here, because with classification problems you typically want to know the percentage of misclassifications. In this case, though, the value of interest is disease incidence, which is a number rather than a class label, so I can see the justification.
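As a sketch of what this comparison could look like, here’s a Pearson correlation (and its p-value) computed on invented county-level numbers. These figures are not the paper’s data; they just show the shape of the calculation.

```python
from scipy.stats import pearsonr

# Hypothetical per-county incidence numbers, invented for illustration.
cdc_incidence = [12, 85, 3, 47, 120, 9, 60, 33]      # "ground truth"
model_estimates = [15, 80, 5, 50, 110, 12, 55, 40]   # model predictions

r, p_value = pearsonr(cdc_incidence, model_estimates)
print(f"correlation: {r:.3f}, p-value: {p_value:.5f}")
```

A correlation near 1 with a tiny p-value says the predictions track the official counts far more closely than chance alone would explain.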
In terms of how well Lymelight’s 2014 figures help to predict 2015 incidence numbers, the model also performs well. The researchers report a prediction error of 0.0001571, which, in case you can’t see the decimal places, is small.
This paper was an exciting one to read, and it really highlights how powerful new technologies can be towards aiding public health causes. I can’t even count how many times I use Google in a day, but it’s cool to see that this behavior can be learned from in the aggregate.
Another thing I thought was great about this article was that the authors acknowledge that this modern “machine-learned epidemiology” should be thought of as a complement to current surveillance methods, not a replacement. While I am typically optimistic about how much value computers can contribute to human health, I am constantly reminded that these algorithms have errors and are imperfect. When a life is on the line, I don’t think humanity is at the point where it can completely trust a computer yet, but it can still be an invaluable tool.
Some critics may be wary of the privacy concerns of a tool like Lymelight. The authors go out of their way to highlight that the data was de-identified in a way that specific users could not be linked back to their search history. While I certainly have nothing to hide in mine -cough-, there are others who might, but they can rest assured that their searches serve to help rather than hurt. Privacy is important, especially with health issues, and it’s important for authors to address as many of these issues as they can in their model-building process.
Here’s to more industry articles in the future!