This article is the second half of a series discussing what statisticians do, from a graduate student’s perspective. Part I is here, and I’ll be using some of the same metaphors from the last one.
It’s week 9 here in Winter Quarter, and it’s about time to buckle down for finals. As such, I am fully locked in to writing these articles for you instead. I’m following on my last post about what statistics is, and I’ll be continuing that thread here.
In the last post, we discussed how statisticians are essentially the Sherlocks of the research world. Statisticians view data as being produced by a metaphorical “machine” that follows some fixed rule. We brought up this idea of the probabilist who knows what this rule is; this rule is described by some probability distribution function. The statistician “sees” randomness produced by the machine, but assumes that this randomness follows a fixed law. The ultimate job of the statistician is to give their best guess at what the original machine is.
For this week, we’ll extend the metaphor a bit to cover associations. Associations are described in terms of changes: if one variable X increases, does another variable Y increase or decrease too? Maybe it doesn’t change at all. Over time, statisticians have developed tools to help us understand how two phenomena are related to each other. These tools are called regressions.
Quick historical aside: “Regression” here is not the same as the dictionary definition. The term comes from the phrase “regression to the mean,” which traces back to Sir Francis Galton’s study of the relationship between the heights of parents and their children. He noticed that unusually tall parents tended to have children shorter than themselves, and vice versa; in other words, the kids “regressed” towards a more typical height relative to their parents. Today, statisticians use the term to refer to the study of relationships between phenomena.
Introducing Abby and Billy
In the original article, our metaphorical probabilists create machines that produce random data according to some law, aka a probability distribution function. We will call this first probabilist Abby. Now, we introduce a second probabilist who is somewhat less creative than his more enterprising peers. We’ll call him Billy. I resist the urge to call them Alice and Bob.
Instead of creating his own law for generating data, Billy decides that he will base his machine on Abby’s machine instead. For the purposes of this article, his rule will be simple. Whatever data he observes from Abby’s machine, he will double it… and some other stuff.
Like all of us, Billy is not perfect. He does his best to perfectly double all of the numbers that Abby’s machine produces, but he messes up… all the time, really. He can never perfectly double Abby’s numbers, but he at least has some control over his process. Billy is at least consistent with how much he messes up. His result will stay within a certain range of the actual doubled value: it can be off by any decimal between -2 and 2.
Based on what we’ve learned so far, it’s perfectly acceptable to say that Billy does have a rule for creating his data. Unlike Abby, whose numbers come from a probability distribution, Billy’s numbers come from a sort of “imperfect” function: he will always double Abby’s numbers, plus or minus some random error. The fact that Billy’s imperfections are random but constrained is very important, as we will learn in a bit. To tie this all together with more familiar notation, we’ll write Billy’s numbers as an “explicit” function of Abby’s numbers.
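One way to write Billy’s rule down is sketched below. The symbols are my own choice of notation, not something fixed by the story: X stands for Abby’s number, Y for Billy’s number, and ε for Billy’s error. Nothing here says how the error is spread out within its range, only that it stays inside it.

```latex
Y = 2X + \varepsilon, \qquad -2 \le \varepsilon \le 2
```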
This is all well and simple in the world of the probabilists, but remember that we live in the world of the statistician.
What the statistician sees
The statistician only sees the numbers that both Abby and Billy produce (aka the brown and orange boxes). These boxes always come in pairs, so the statistician suspects that there might be a relationship between these pairs of boxes. Can you guess what the statistician is going to try to do?
When the statistician was only dealing with Abby’s numbers (not that they know who Abby is), their goal was to try to make a good guess at what Abby’s “rule” was to produce her random set of data. Now that the statistician has observed both Abby and Billy’s numbers, their new goal is to make a good guess at what Billy’s function is. This might have been easy, but the problem here is that the true relationship (the doubling) between Abby and Billy’s numbers is muddied by imperfections.
There’s terminology for this. If Billy were perfect and his numbers always doubled Abby’s, we would call the relationship between his and her numbers a deterministic relationship. In other words, Abby’s numbers perfectly determine Billy’s. In our case, the relationship is still there, but there’s some added “noise” in Billy’s numbers. This imperfect relationship is called a statistical relationship. This naming is on purpose, since it’s the job of the statistician to try to separate the true association from the imperfections.
We will not delve into how a statistician will do this (tl;dr: it’s regression), but my aim here was to explain another fundamental problem in statistics: making educated guesses about associations between phenomena. I always fear that the metaphors miss the essential point, so we will look at some concrete examples.
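For readers who want to poke at the metaphor, here is a minimal sketch of the statistician’s situation in Python. Several things are assumptions of mine for illustration: Abby’s law is taken to be uniform on 0 to 10, Billy’s error is taken to be uniform on -2 to 2, and the guess is made with ordinary least squares, the standard way to fit a straight line. The function names `simulate_pairs` and `fit_line` are hypothetical.

```python
import random


def simulate_pairs(n, seed=0):
    """Generate n (abby, billy) pairs: Billy doubles Abby's number, plus an error."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        abby = rng.uniform(0, 10)   # assumption: Abby's law is uniform on [0, 10]
        error = rng.uniform(-2, 2)  # assumption: Billy's error is uniform on [-2, 2]
        pairs.append((abby, 2 * abby + error))
    return pairs


def fit_line(pairs):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# The statistician only ever sees the pairs, never the rule that made them.
pairs = simulate_pairs(1000)
intercept, slope = fit_line(pairs)
print(f"estimated intercept: {intercept:.2f}, estimated slope: {slope:.2f}")
```

The estimated slope is the statistician’s guess at Billy’s doubling, recovered from the noisy pairs alone.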
Grounding the metaphor
Abby and Billy’s numbers are really just arbitrary for the purposes of the metaphor, but your perspective can really change once they start becoming real-world examples.
Let’s say Abby’s numbers were the number of cigarettes a person smoked, and Billy’s numbers represented the number of asthma attacks they have. What interpretation does Billy’s function have? We might interpret it as how many more asthma attacks are associated with more cigarettes smoked.
Billy’s “imperfections” gain an interpretation here too. Cigarettes are not the only thing that gives people asthma attacks. Maybe some people have exercise-induced asthma, or others have excellent lung health. These extra factors can change how many asthma attacks a person has for the same number of cigarettes.
The doubling represents the “true” association that cigarettes have with asthma attacks. And this is usually what researchers are interested in assessing. Whenever you hear about a new scientific study, try to figure out which item represents “Abby’s numbers” and which represents “Billy’s numbers”. To use more scientific terms, Abby represents the independent variable that we want to investigate, and Billy represents the dependent variable or outcome that we think is associated with it.
I think it’s good to note that there was nothing really “special” about the fact that Billy doubled Abby’s numbers. That was just for easy computation. What does carry meaning is the sign of the multiplier, which we call the slope of Billy’s function. Doubling in the cigarette-asthma example means that more cigarettes are associated with more asthma attacks. This is a positive association. Likewise, in a nonsensical world where more cigarettes reduced asthma attacks, the slope would be negative, producing a negative association. What might be the most important case is no association at all! This corresponds to a slope of zero.
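To make the three cases concrete, Billy’s rule can be written with a general multiplier in place of the 2. The symbol β for the slope is my own notation, along with X for Abby’s number, Y for Billy’s number, and ε for the error; the story’s doubling is the case β = 2.

```latex
Y = \beta X + \varepsilon, \qquad
\begin{cases}
\beta > 0 & \text{positive association} \\
\beta < 0 & \text{negative association} \\
\beta = 0 & \text{no association}
\end{cases}
```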
As we wrap up this article, I want to give readers a slight preview of what I study as a Ph.D. student. It’s still essentially the same thing as I described here in the article, with one caveat. In Abby and Billy’s example, we would say that there is a linear relationship between their numbers. So, we call the process of making a good guess at Billy’s function a linear regression. Humans think in linear terms, and linear regressions are well-studied. Smart statisticians have shown that we can make very good guesses in linear regression, and that with more data, our guesses will be extremely close to the true association.
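The “more data, better guesses” claim can be checked with a small experiment. The sketch below rests on assumptions of mine: Abby’s numbers are uniform on 0 to 10, Billy’s error is uniform on -2 to 2, and the slope is guessed by ordinary least squares. The helper `estimate_slope` is a hypothetical name. Because a single run can get lucky or unlucky, the error is averaged over several repeated runs for each sample size.

```python
import random


def estimate_slope(n, seed):
    """Simulate n pairs from Billy's doubling rule and return the least squares slope."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]       # assumption: Abby is uniform on [0, 10]
    ys = [2 * x + rng.uniform(-2, 2) for x in xs]     # assumption: error is uniform on [-2, 2]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def average_error(n, runs=50):
    """Average distance between the guessed slope and the true slope of 2."""
    return sum(abs(estimate_slope(n, seed) - 2) for seed in range(runs)) / runs


for n in (10, 100, 1000):
    print(f"n = {n:>4}: average error in the slope guess = {average_error(n):.4f}")
```

The average error should tend to shrink as the sample size grows, which is the behavior the paragraph above describes.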
But things are not always linear. In fact, linearity is often an assumption that statisticians make, and in many cases, this assumption is very, very bad. Modern statistics deals with making sure that when we make guesses about non-linear relationships, these guesses are still valid and will still end up recreating the true relationship in the end. The Billy that I deal with looks somewhat like the following, where his function can be as wild as he wants.
For the ultra-nerdy who want a name for what I’m studying, it’s called Mathematical Statistics. A lot of the material I learn is called non-parametric regression, which is concerned with the estimation of entire functions. Functions have literally an infinite number of ways that they can be expressed, so you can imagine that the mathematical background needed to make sure your guesses are correct is much more rigorous.
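To give a flavor of what estimating an entire function can look like, here is a minimal sketch of one classic non-parametric method, the Nadaraya-Watson kernel smoother. This is just one textbook example of the genre, not necessarily what I work on day to day. The “wild” Billy here is a hypothetical one of my choosing (a sine curve plus an error between -0.5 and 0.5), and the bandwidth of 0.3 is an arbitrary pick.

```python
import math
import random


def kernel_smooth(x0, xs, ys, bandwidth=0.3):
    """Nadaraya-Watson guess at x0: a weighted average of ys, where points
    with x close to x0 get larger weights (Gaussian kernel)."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


rng = random.Random(0)
xs = [rng.uniform(0, 2 * math.pi) for _ in range(2000)]
# Hypothetical non-linear Billy: a sine curve plus a bounded random error.
ys = [math.sin(x) + rng.uniform(-0.5, 0.5) for x in xs]

for x0 in (math.pi / 2, math.pi, 3 * math.pi / 2):
    guess = kernel_smooth(x0, xs, ys)
    print(f"x = {x0:.2f}: guess = {guess:.2f}, truth = {math.sin(x0):.2f}")
```

Notice that the method never assumes a straight line. It only assumes that nearby values of Abby’s number lead to similar values of Billy’s.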
Hope this was interesting to you, see you next week!