The most important ideas in modern statistics
Statistically significant
This blog post is a lightly edited version of a video I posted on YouTube. If you'd like to watch instead, go for it! Otherwise, feel free to continue reading.
Most people only ever interact with statistics for a limited part of their lives. But statistics is a field of research, and like other fields, it has evolved: influential ideas have come along and changed its trajectory. As a student of biostatistics, it's my responsibility to be familiar with these revolutionary ideas. In this article, we'll talk about 8 innovations that shaped statistics as we know it today, and I'll do my best to explain what these innovations were and why they were so impactful, in a way that makes sense to a general audience.
In 2021, Andrew Gelman and Aki Vehtari published an article in the Journal of the American Statistical Association, or JASA. JASA is one of the most prestigious journals in the field of statistics, so publishing here is a big deal if you’re a statistician.
But instead of a research manuscript, Gelman and Vehtari published an essay titled "What are the most important statistical ideas of the past 50 years?" The article considers statistical innovations from roughly 1970 to 2021, so that's the time period I'll call "modern statistics." Maybe you have a different definition, but this period will be the focus of this post.
Counterfactual Causal Inference
In an ideal world, all data would be experimental data, where a researcher can control who receives an intervention and who doesn't. But we live in the real world, and the real world gives us observational data, where we can't control who receives a treatment and who doesn't. "Treatment" is a term mostly reserved for experimental settings; in observational settings, we more commonly talk about exposures. We can still perform statistical analyses on observational data, but traditionally we cannot make causal claims from them, only correlational ones.
That was until counterfactual causal inference came onto the scene. This framework allows us to take observational data and make adjustments in ways that get us closer to causal statements. How this works could fill an entire video of its own, so I'll give you the basic breakdown.
Let's use studying as an example. Imagine I have an upcoming test: I can choose to study a bit more for it, or I can choose not to. In this reality, I choose to study and I get some score on my test. If a supernatural statistician wanted to know whether this decision caused a change in my score, they would have to examine another reality. They would have to find the reality where I didn't choose to study, and measure the test score of that version of me.
The only difference between these two versions of me is that I chose to study in one reality but not in the other. The unobserved version of myself is called the counterfactual, because what happened to that version of me is "counter" to what actually happened. The causal effect of studying on my test score is then the difference between these two outcomes. The fundamental problem of causal inference is that we can only ever observe one reality, and therefore one outcome. In essence, it's a missing data problem.
The counterfactual framework is important because it gave statisticians a way to formalize causal effects in mathematical models. Under some assumptions, statisticians can adjust for variables that muddy the exposure-outcome relationship and get closer to a causal statement, even when the data are observational. This is huge because several fields of study rely heavily on observational data, including economics and psychology. So do tech companies, which have a vested interest in estimating causal effects among their users. It's internship season right now for PhD students, and many of these companies list causal inference as a required qualification.
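To make that a bit more concrete, here's a minimal sketch in R. The data, the confounding structure, and every number are invented for illustration: we simulate an observational dataset where a single confounder drives both the exposure and the outcome, then compare a naive difference in means to a regression-adjusted estimate.

```r
set.seed(42)
n <- 5000

# A confounder that affects both the exposure and the outcome
confounder <- rnorm(n)

# Exposure is more likely for units with high confounder values (observational, not randomized)
exposure <- rbinom(n, size = 1, prob = plogis(1.5 * confounder))

# The true causal effect of the exposure on the outcome is 2
outcome <- 2 * exposure + 3 * confounder + rnorm(n)

# Naive comparison: mixes the causal effect with confounding
mean(outcome[exposure == 1]) - mean(outcome[exposure == 0])

# Adjusting for the confounder recovers something close to the true effect of 2
coef(lm(outcome ~ exposure + confounder))["exposure"]
```

This is only the simplest kind of adjustment; the counterfactual framework is what tells us which variables need adjusting and what assumptions make the adjusted estimate interpretable as causal.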
Bootstrapping
If you've been with my channel for a while, you may be familiar with this one already. Bootstrapping was the topic of my Summer of Math Exposition entry. That video goes into more technical detail about the bootstrap, but I'll briefly explain what it is here.
The bootstrap is a general algorithm for estimating the sampling distribution of a statistic. Ordinarily, this would require gathering multiple datasets, which no one has time for, or a mathematical derivation, which I don't have time for. Rather than do either of these, the bootstrap takes the interesting approach of reusing data. From a single dataset, the bootstrap generates several "bootstrap datasets" by sampling with replacement from the original. For each of these bootstrapped datasets, the statistic of interest is recalculated, and the sampling distribution can be approximated from the entire collection. This is incredibly valuable because not only is it simple, and therefore accessible to more people, it's also applicable to many kinds of statistics. We can use the bootstrap to create confidence intervals for scalar parameters like a regression coefficient, and we can also create confidence bands for entire coefficient functions, like we might see in functional data analysis.
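Here's a bare-bones sketch of the percentile bootstrap in R, with simulated data and the sample median as the statistic of interest:

```r
set.seed(1)
x <- rexp(100, rate = 1)   # one observed dataset
B <- 2000                  # number of bootstrap datasets

# Resample with replacement and recompute the statistic each time
boot_medians <- replicate(B, median(sample(x, size = length(x), replace = TRUE)))

# The spread of the bootstrap statistics approximates the sampling distribution,
# so its quantiles give a rough 95% confidence interval for the median
quantile(boot_medians, probs = c(0.025, 0.975))
```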
Another example of simulation-based inference comes from Bayesian statistics. Bayesians encode knowledge in the form of priors, or probability distributions on statistical parameters. Using these priors, we can simulate data from the prior distribution and check whether the simulated data looks plausible compared to the data we actually collect. This is called a prior predictive check. The same can be done with the posterior distribution of a parameter, which makes it a posterior predictive check. These are incredibly useful for validating our models.
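Here's a rough sketch of a prior predictive check, using a toy model of adult heights where every number is invented for illustration: draw parameters from the prior, simulate data from those draws, and ask whether the result looks like data we could plausibly collect.

```r
set.seed(2)
n_sims <- 1000

# Prior beliefs about average adult height (in cm) and the spread of heights
mu    <- rnorm(n_sims, mean = 170, sd = 10)
sigma <- runif(n_sims, min = 1, max = 20)

# Simulate one fake "observed" height per draw from the prior
prior_pred <- rnorm(n_sims, mean = mu, sd = sigma)

# If the prior makes sense, the simulated heights should look like plausible human heights
summary(prior_pred)
hist(prior_pred, main = "Prior predictive heights", xlab = "Height (cm)")
```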
Overparameterized Models & Regularization
To understand this idea, we need some more context on statistical parameters. One way to view parameters is as representations of ideas that are important to us within statistical models. Another way to view them is as a mechanism for increasing model complexity, which enables a model to better approximate more complicated phenomena.
Neural networks are a prime example of this. Each connection in a neural network is associated with a parameter, or weight, and each node adds a bias parameter on top. We can easily overparameterize by making these networks very large, and by doing so, the universal approximation theorem tells us that these networks can approximate a wide variety of functions. More parameters, more flexibility.
And this extra flexibility is important because it lets us model a wider range of phenomena that simpler models just can't handle.
One problem with extremely flexible models is that they may start to approximate the data itself, rather than the more general phenomenon we want to learn about. To combat this, statisticians employ regularization techniques, which balance out this complexity by enforcing that the models maintain some degree of simplicity.
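As a small illustration, here's a sketch in R using plain ridge regression rather than a neural network, with made-up data: an overparameterized polynomial fit with and without a penalty on the coefficients.

```r
set.seed(3)
n <- 30
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)   # noisy observations of a smooth curve

# Flexible design matrix: a degree-15 polynomial basis for only 30 data points
X <- cbind(1, poly(x, degree = 15))

# Unregularized least squares: the flexible fit is free to chase the noise
beta_ols <- solve(t(X) %*% X, t(X) %*% y)

# Ridge regression: penalizing large coefficients pulls the fit back toward simplicity
lambda <- 1
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)

plot(x, y)
lines(x, X %*% beta_ols, lty = 2)    # wiggly, overfit curve
lines(x, X %*% beta_ridge, lty = 1)  # smoother, regularized curve
```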
Bayesian Multilevel Models
I've actually already mentioned this type of model, but they're so useful that Prof. Gelman and Prof. Vehtari dedicate an entire bullet point to them. Multilevel models, also known as hierarchical or mixed-effects models, are models that assume additional structure over the parameters. For example, multilevel models are commonly used to aggregate several N-of-1 trials together. Each individual is associated with their own treatment effect, which we'll denote theta-j to indicate that each individual has their own theta. These individuals form the second level of the model. The first level can be thought of as describing the distribution, or structure, of these individual effects. In an N-of-1 context, the first level might be a normal distribution, centered at some population-level treatment effect theta, with some variance sigma-squared. Each of the individual effects is generated from this distribution, and each individual's data is generated from their effect.
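To make that two-level structure concrete, here's a small simulation sketch in R that mirrors the N-of-1 setup just described; all of the numbers are invented.

```r
set.seed(4)
n_individuals <- 8    # individuals, one N-of-1 trial each
n_obs         <- 20   # measurements per individual

theta <- 1.5          # population-level treatment effect
sigma <- 0.5          # how much individual effects vary around theta

# First level: each individual's own effect theta_j is drawn around the population effect
theta_j <- rnorm(n_individuals, mean = theta, sd = sigma)

# Second level: each individual's measurements are generated from their own effect
dat <- do.call(rbind, lapply(seq_len(n_individuals), function(j) {
  data.frame(id = j, y = rnorm(n_obs, mean = theta_j[j], sd = 1))
}))

head(dat)
```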
In different contexts, the units in the second level of the model could be different things. In a study taking place across many locations, they may be different hospitals or cities, something that indicates a cluster of related units. In a basket trial, each second-level unit is a specific disease, and we suspect their treatment effects will be similar because they share a common mutation. In meta-analyses, the second-level units could be estimated effects from individual research studies! Near the end of this section, Andrew Gelman says that he views the multilevel model as a way to combine different sources of information into a single analysis. This kind of structure is incredibly common in statistics, and that's why multilevel models take a spot on the list.
That being said, multilevel models can be either frequentist or Bayesian, so why is Bayesian specifically mentioned in the article? The authors don't explicitly state why, but my guess is that the Bayesian framework allows us to incorporate prior knowledge into the models. This is particularly helpful when deciding on priors for the first-level parameters, especially the variance. Choosing a wide, uninformative prior encourages the resulting model to treat each second-level unit as independent of the others. On the other hand, choosing a narrow, informed prior allows us to pool data together, which can help estimate treatment effects for second-level units with small sample sizes. Being able to choose different priors gives statisticians much more flexibility in the modeling process.
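If you want to see roughly what that looks like in code, here's a sketch using the brms package on the simulated data from above. Nothing here comes from the Gelman and Vehtari article, and the specific prior values are placeholders; the point is just that the prior on the group-level standard deviation is where the pooling behavior gets controlled.

```r
library(brms)

# Reuses the simulated N-of-1 data (`dat` with columns id and y) from the sketch above.
# The prior on the group-level standard deviation ("sd") controls the pooling:
# a narrow prior pulls individual effects together, a wide one lets them stay apart.
fit <- brm(
  y ~ 1 + (1 | id),                                  # each individual gets their own effect
  data  = dat,
  prior = set_prior("normal(0, 1)", class = "sd")    # swap in normal(0, 10) for a much wider prior
)

summary(fit)
```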
Generic Computation Algorithms
A recurring theme among the top 8 ideas is the importance of computers and computational power to the development of statistics. Advances in technology have allowed more complex models to be invented for harder problems, and several important statistical algorithms have been developed to help fit them. An algorithm is just a set of steps that can be followed, and a statistical algorithm is just an algorithm designed to help solve a statistical problem.
But there are so many types of statistical problems out there that it's hard to get an appreciation for how useful these algorithms are. So, I'll explain one example to give you a taste.
One example is the Metropolis algorithm and its more modern descendants. The Metropolis algorithm is interesting because its roots actually stem from physics rather than statistics. It's significant because it lets us generate samples from very complex probability distributions. Generating random numbers according to some distribution may sound like a niche task, but it's something statisticians constantly need to do. Bayesian statistics is a branch that makes major use of an algorithm like Metropolis. The posterior distribution that comes from Bayes' Rule can turn ugly once we move away from conveniences like conjugate families, so ugly that we can't even write down a closed-form expression for it. But despite this, we can still generate samples from a complicated posterior thanks to the Metropolis algorithm.
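To show how simple the core recipe is, here's a toy Metropolis sampler in R. The target is just an unnormalized, two-bump density I made up for illustration; in a Bayesian analysis it would be the unnormalized posterior.

```r
set.seed(5)

# Any unnormalized density will do; this one is a lumpy mixture we can't easily sample directly
log_target <- function(x) log(0.3 * dnorm(x, -2, 0.5) + 0.7 * dnorm(x, 2, 1))

n_iter  <- 10000
samples <- numeric(n_iter)
current <- 0   # starting value

for (i in seq_len(n_iter)) {
  proposal <- current + rnorm(1, sd = 1)            # propose a small random step
  log_accept_prob <- log_target(proposal) - log_target(current)
  if (log(runif(1)) < log_accept_prob) {            # accept with probability min(1, ratio)
    current <- proposal
  }
  samples[i] <- current
}

hist(samples, breaks = 50)   # should resemble the two-bump target
```

Swap in a different log_target and the same loop keeps working, which previews the point about genericness below.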
The use of the word "generic" for this bullet point is purposeful. You might think that an algorithm built for a specific model would be limited in its usefulness. But as the Metropolis algorithm shows, these algorithms can be decoupled from their original use case and applied in other situations. This generality gives statisticians many different strategies for attacking the same problem.
Adaptive Decision Analysis
When statisticians design experiments, it used to be a set-and-forget type of thing: figure out the sample size, then run the experiment to completion. But midway through an experiment, we may want or need to stop it, and under a frequentist framework, this kind of unplanned stopping hurts our power and the interpretation of our p-values.
But in modern times, we have a way to do this.
Adaptive decision analysis is the idea that maybe we don't have to wait for the entire experiment to finish. Instead, we can "adapt" the experiment based on the data we collect in the interim. In the context of clinical trials, we may decide to stop a trial early if preliminary evidence suggests that a treatment sucks. Conversely, if a treatment shows early promise, we may stop the trial early for efficacy. But we can't just change the design however we like; these adaptations have to be decided ahead of time to make sure we still make good decisions overall.
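Here's a toy sketch of what a pre-specified interim look might involve. The stopping thresholds are completely made up for illustration; a real adaptive design would choose boundaries that keep the overall error rates under control.

```r
set.seed(6)

simulate_trial <- function(effect, n_total = 200, n_interim = 100) {
  treat   <- rnorm(n_total, mean = effect)   # outcomes under treatment
  control <- rnorm(n_total, mean = 0)        # outcomes under control

  # Pre-specified interim look halfway through enrollment (thresholds invented for illustration)
  p_interim <- t.test(treat[1:n_interim], control[1:n_interim])$p.value
  if (p_interim < 0.001) return("stopped early for efficacy")
  if (p_interim > 0.50)  return("stopped early for futility")

  # Otherwise continue to the planned sample size
  p_final <- t.test(treat, control)$p.value
  if (p_final < 0.05) "effective at final analysis" else "not effective at final analysis"
}

# How often does each decision happen for a modest true effect?
table(replicate(1000, simulate_trial(effect = 0.3)))
```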
It's not just clinical trials where we might need to make multiple decisions. Tech companies like Pinterest and Netflix need to keep investing in improving their platforms for their users. They may run hundreds or thousands of experiments, and each of these needs a decision. We always gotta be learning from data, and adaptive decision analysis helps to formalize these ideas.
Robust Inference
Statisticians have to make a lot of assumptions. If these assumptions are right or at least plausible, then we can feel comfortable trusting the results of statistical analyses. Stuff like confidence intervals or estimated values. But of course, assumptions won't always be right, and it's often hard to even know if they actually are or not.
And that's where robust statistics come in. Robust methods still provide trustworthy statistical analyses even in the face of violated assumptions, so if we have a robust model, we don't have to be so reliant on possibly shaky assumptions. The sample median is often cited as a robust estimator for the typical value of a distribution, compared to the mean. We often hear that the mean is unduly influenced by outliers in a dataset, and this is true. But what assumption do outliers violate? Many times, we assume a distribution to be Normal. Normal distributions have the property that most of their probability is concentrated near the mean; you often hear this phrased as the 68-95-99.7 rule. Outliers challenge this concentration. If there are too many of them, it suggests the data may come from a so-called heavy-tailed distribution, where extreme events are more likely, and that would violate the Normal assumption.
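Here's a quick R illustration of that mean-versus-median point, with made-up contaminated data: most observations are well behaved, but a handful come from a heavy-tailed source.

```r
set.seed(7)

# Mostly well-behaved data centered at 10, contaminated with a few heavy-tailed draws
clean   <- rnorm(95, mean = 10, sd = 1)
extreme <- rcauchy(5, location = 10, scale = 20)
x <- c(clean, extreme)

mean(x)    # can be dragged far from 10 by the handful of extreme values
median(x)  # barely moves; stays near 10
```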
The fewer assumptions we have to make, the better, but we also have to make sure our models can cope when the assumptions we do make turn out to be wrong.
Exploratory Data Analysis
Yes, you read that right: exploratory data analysis is the last idea on the list. We're done with the theory, we're done with the computation, we're going back to plots and visuals. Plots give us a way to examine our data and assess our statistical models. It's just easier to learn from your data if you can look at it, rather than leave it sitting in a CSV on your computer. Statistical models are meant to be approximations of the real world, so they had better match up with the general trends you see in the plots. If you're obsessed with hypothesis testing, exploratory data analysis might make you mad, but it's undeniable that this skill is an important part of any statistician's or data scientist's toolkit.
There's even an entire paradigm of R programming dedicated to formalizing exploratory data analysis. You have people who code in boring base R, but then you also have people who code using the tidyverse framework, popularized by the god Hadley Wickham. The tidyverse set of packages makes it extremely easy to get your data into R, clean it, and visualize it. If any of you ever check out the GitHub repo for the channel, you'll see that I stick mostly to programming under the tidyverse paradigm. It's how I first learned R, and it's one of the reasons my R skills have blown way past my Python. I highly recommend learning it, and I hope to have a more in-depth video on it in the future.
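As a tiny taste of that workflow, here's a sketch using a couple of tidyverse packages on one of R's built-in datasets (nothing specific to the channel's repo):

```r
library(dplyr)
library(ggplot2)

# Quick grouped summary, then a plot: the basic EDA loop
mtcars |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg), n = n())

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")
```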
Conclusion
What does it mean for an idea to be important? At first, I thought a statistical idea would be important if the paper that introduced it was cited many times. That's not the approach the article takes; the authors specifically mention avoiding citation counts. Rather, they view important ideas as those that influenced the "development of methods that have influenced statistical practice." I interpret this as ideas that have given birth to larger branches of statistics.
The article is not meant to be a literal answer to the question, but gives us some food for thought. Other articles have even performed actual statistical analyses to answer this question. If you think that the authors missed a cool idea, tell me about it in the comments!
I hope I've shown you that statistics didn't stop with the two-sample t-test and linear regression. New technologies create new types of data, and statistics needs to keep innovating to keep up.


