Issue #30: How I break down statistical concepts
My opinion on why statistics is so inaccessible to people
📰 What is this issue about?
Making explicit the steps I take in explaining concepts
📺 What’s happening with videos?
Recent: 5 questions for staying sharp with statistics, a new format that tests you on some basic statistics concepts
Upcoming: the last employee training video, on generalized linear models
This issue is a response to a comment I got on YouTube a while ago. The comment asked me to break down how I explain concepts in my videos. I hadn’t really thought about it until I’d seen this comment.
I had videos already planned for December by the time I’d seen this comment, but it was worth a shot to at least try to put it into words. It’ll help me with scripting later too.
TLDR: take all the mathematical statistics concepts, notation and language and translate their benefits into something an actual analyst would want to know.
The problem
First, I think it’d be a good idea to show you how I view (bio)statistics in the first place:
On one side, there’s mathematical statistics. This is the side of statistics that gives us all the theorems that help us justify the use of particular statistical methods (hypothesis tests, regressions, etc.). Like its name suggests, it’s heavily math-based. It’s what helps give statistics its rigor.
On the other side, statistics is something that people need to use in real-world applications. To me, this is applied statistics, where all the random variables start to have real-world meaning. For me, it’s human subjects in clinical trials.
Most people who use applied statistics (e.g. researchers, data scientists) are lightly trained in the bulk of methods that can answer the most common questions. To them, statistical methods are a means to an end: an answer to a research question. The underlying math and notation are more of an afterthought.
And that’s the crux of the problem.
The solution
From my perspective, the problem that most people have with statistics is notation and language.
Mathematical notation makes the underlying ideas more opaque. It doesn’t help that the idea of being “not built for math” floats around.
When statistical ideas are taught, their rationales are often taught from the perspective of mathematical statistics. It’s not always clear how good math properties lead to good real-world properties.
For instance, statistics students will know that the OLS estimators for linear regression have several optimal properties: consistency, minimum variance, etc. These properties are great and all, but what do they mean in terms of the bottom line for an actual real-world analysis?
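One way to put those properties in bottom-line terms is a small simulation. The sketch below is not from any of the videos; it assumes a made-up trial where the outcome rises by 0.5 units per unit of dose (the names `dose`, `outcome` and `true_effect`, and all the numbers, are hypothetical). It reruns the trial many times at different sample sizes and reports how far the OLS slope estimates scatter around the truth. Consistency shows up as the estimates homing in on the true effect as the trial gets bigger, and the spread is the quantity that the "minimum variance" property is about.

```python
import random
import statistics


def ols_slope(x, y):
    """Ordinary least squares slope for a simple regression of y on x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_trial(n, true_effect, rng):
    """One hypothetical trial: outcome = 2 + true_effect * dose + noise."""
    dose = [rng.uniform(0, 10) for _ in range(n)]
    outcome = [2.0 + true_effect * d + rng.gauss(0, 3) for d in dose]
    return ols_slope(dose, outcome)


def spread_of_estimates(n, true_effect=0.5, repeats=200, seed=30):
    """Repeat the trial and summarize the slope estimates (mean, std dev)."""
    rng = random.Random(seed)
    estimates = [simulate_trial(n, true_effect, rng) for _ in range(repeats)]
    return statistics.fmean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    for n in (20, 200, 2000):
        mean_est, sd_est = spread_of_estimates(n)
        print(f"n={n:5d}  average estimate={mean_est:.3f}  spread={sd_est:.3f}")
```

Read as an analyst would: with more participants, the estimated dose effect lands closer to the real one more often, so the conclusion in the report is less likely to be an accident of who happened to enroll.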
When I write a script to talk about a statistical model, I try to take all of the relevant notation and language of mathematical statistics and convert it into an applied statistics context. For me, usually that’s clinical trials because that’s what most people would be familiar with.
I’ve found that this translation can vary widely in terms of difficulty, even for the most basic hypothesis tests. Aside from editing, it’s probably what takes up the bulk of my time in making videos.
Many times, I find I have to scrap the original justification from mathematical statistics because its relevance to real-world results is too obscure. For instance, in the logistic regression video, I make a sort of “range matching” argument for why we look at the log-odds. A probability is stuck between 0 and 1, while continuous covariates can cover a much wider range of numbers. Log-odds help to match that range.
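The range-matching idea can be checked with a few lines of code. This is a minimal sketch rather than the argument exactly as it appears in the video: `log_odds` and `inverse_log_odds` are illustrative helper names, and the intercept, slope and ages are made-up numbers. It shows that probabilities squeezed into (0, 1) spread out over negative and positive numbers once converted to log-odds, and that any value a linear predictor produces maps back to a valid probability.

```python
import math


def log_odds(p):
    """Map a probability in (0, 1) to the log-odds scale (any real number)."""
    return math.log(p / (1 - p))


def inverse_log_odds(eta):
    """Map any real number back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-eta))


if __name__ == "__main__":
    # Probabilities near 0 or 1 become large negative or positive log-odds.
    for p in (0.01, 0.25, 0.5, 0.75, 0.99):
        print(f"p={p:.2f}  log-odds={log_odds(p):+.2f}")

    # Hypothetical linear predictor: intercept and slope are made-up values.
    intercept, slope = -3.0, 0.04
    for age in (20, 60, 100, 140):
        eta = intercept + slope * age
        print(f"age={age:3d}  linear predictor={eta:+.2f}  "
              f"probability={inverse_log_odds(eta):.3f}")
```

Modeling the probability directly with a straight line could produce values below 0 or above 1 for extreme covariate values. Modeling the log-odds avoids that.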
Statistics students would immediately know the log-odds come from the fact that they’re the natural parameter for the exponential family form of the Bernoulli distribution.
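For readers who want to see where that statement comes from, here is the standard rewrite of the Bernoulli probability mass function, with $p$ as the success probability and $y$ as the 0/1 outcome (the symbols are my choice of notation, not taken from the video):

```latex
\begin{align*}
P(Y = y) &= p^{y}(1 - p)^{1 - y}, \qquad y \in \{0, 1\} \\
         &= \exp\bigl( y \log p + (1 - y)\log(1 - p) \bigr) \\
         &= \exp\Bigl( y \,\log\frac{p}{1 - p} + \log(1 - p) \Bigr)
\end{align*}
```

Matching this to the exponential family form $\exp\bigl(\eta\, T(y) - A(\eta)\bigr)$ gives $T(y) = y$, natural parameter $\eta = \log\frac{p}{1 - p}$ (the log-odds), and $A(\eta) = \log(1 + e^{\eta}) = -\log(1 - p)$.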
But none of that will ever be important to someone using logistic regression. They want to know what the results mean for a paper or report. From my view, it’s more important that statistics users know and respect the intuition, meaning and assumptions behind a model. That way, they are much less likely to abuse it and can pass this wisdom on to others (in an ideal world).
The best statisticians are the ones who act as a bridge between the technical math stats and real-world applied stats. It’s the reason that 90% of statistician job postings require “excellent communication skills”.
The process is an exercise in empathy rather than flexing statistical knowledge.
That’s it for this one. See you when the next video comes out.
Footnotes
🧐 What am I enjoying right now?
Book: I’m re-reading a book that I’ve read many times before, A PhD Is Not Enough!: A Guide to Survival in Science (affiliate) by Peter J. Feibelman. A PhD student needs to take charge of their own skill and career development. Your department and your boss are not obligated to help you, but this book has some helpful tips for looking out for yourself.
📦 My other stuff
I wrote guided solutions to the problems in Andrew Gelman’s Bayesian Data Analysis. It’s for advanced self-learners teaching themselves Bayesian statistics.
I’m now on Ko-fi! YouTube and Substack are by far the best (and easiest) ways to support the channel, but if you feel like going the extra mile, this would be the place. It is always appreciated!