Office Hours #2: What the hell are degrees of freedom?
A common frustration for statistics students
Book summaries for learning faster (Sponsor)
I read a lot, but I’m sometimes hesitant to commit to buying a book when I’m not sure what it’s about. Thanks to Shortform’s book summaries, I can quickly get the gist of a book and better judge which ones deserve my limited time.
Shortform creates high-quality summaries of the main points of many books. Not only do they capture the essence of each book, they also provide perspectives from other authors that support or even challenge the book’s author. If you’re like me and prefer to listen to books, Shortform also provides audio readings of their summaries.
If you join through my link shortform.com/verynormal, you will receive a free trial of unlimited access and an additional 20% discount on an annual subscription. Give it a try, will ya?
In this issue…
I’ll be tackling one of the most common questions that I see from statistics students. I see it a lot on Reddit, and I got it a lot as a teaching assistant. The question is:
What the hell are degrees of freedom?
A look at the Wikipedia entry for degrees of freedom gives the following statistical definition:
“In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.”
Hmm… not that helpful.
“Free to vary”?
This is one of those definitions that’s much easier to understand if you have an example or two to draw from.
Let’s say that I gave a survey to 100 of my subscribers, and asked a single question:
What’s your highest educational achievement?
Less than high school
High school degree
College degree
Graduate degree
We can treat the number of people answering each category as a random variable: call them X1, X2, X3, and X4, one for each category in order.

I don’t know how many people will answer each particular category, but I have fixed the sample size to 100, so the counts must satisfy

X1 + X2 + X3 + X4 = 100

Even though there are four random variables here, this fixed sample size means that they can’t all be free.
Let’s say that I look at the data and see that 10 people who answered the survey have less than a high school education. One of the random variables has been “realized”, and we are left with the following expression:

10 + X2 + X3 + X4 = 100
The specific number of people with less than a high school degree doesn’t matter; what’s more important is that, before we looked, it could have taken any non-negative value up to 100.
The same can be said of two more of these variables, but not the last one. If we check the data and observe that 30 people have a high school degree and 40 have a college degree, we’re left with the following expression:

10 + 30 + 40 + X4 = 100
Unlike the other 3 random variables, once those are decided, the value of the last variable is automatically decided: it must be 100 − 10 − 30 − 40 = 20. It is not allowed to vary like the other 3. There are technically 4 random variables here, but only 3 degrees of freedom. That is, 3 of these variables can be observed to have any value (“free to vary”), but once they are, the last variable is determined.
The fixed sample size is a constraint that limits the degrees of freedom. If the sample size weren’t fixed, then all four random variables would be free to take whatever values they wanted.
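To make the bookkeeping concrete, here’s a tiny Python sketch of the survey above (the counts 10, 30, and 40 come from the example; everything else is just arithmetic):

```python
# Survey with four education categories and a fixed sample size of 100.
n = 100

# Three of the four counts were observed -- they were "free to vary":
less_than_hs = 10
high_school = 30
college = 40

# The fourth count is forced by the fixed sample size; it is NOT free to vary.
graduate = n - (less_than_hs + high_school + college)

print(graduate)  # 20, the only value consistent with the constraint
```

Four variables, one constraint, three degrees of freedom: the first three counts could have come out differently, but once they’re observed, the fourth is pinned down.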
This isn’t a statistics example, but it helps explain the specific wording of the definition.
As for a statistics example…
Degrees of freedom as “currency”
Degrees of freedom are relevant to most analysts when they need to estimate model parameters. I like to think of degrees of freedom as a sort of currency. It’s something I can “spend” on the analysis, and the total amount of currency I have is my sample size.
The more complex a model is (i.e., the more parameters it contains), the more of this currency I have to spend to estimate them all. Take, for example, a multiple linear regression:

y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₚxₚ + ε

Estimating the intercept and the p slopes costs p + 1 degrees of freedom out of the n I started with.
If I have thousands of observations, then the resulting t-distributions for the regression coefficient estimators are awfully close to Normal. In that case, the exact number of degrees of freedom doesn’t really matter.
But if my dataset is smaller – say on the order of tens of observations – then degrees of freedom become more relevant. As I “spend” more of the data to estimate more parameters (more regression coefficients), the degrees of freedom for the underlying t-distribution get smaller.
This widens the sampling distribution, which in turn reduces your power due to the higher variance. This is why statisticians sometimes prefer simpler models if they can get away with it, at least in an inferential setting.
Strictly speaking, the degrees of freedom here are just the parameter of the t-distribution. But that fact alone doesn’t tell us much about why we should care about them.
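As a rough sketch of the “spending” effect (assuming SciPy is available; the sample size of 30 and the predictor counts are made up for illustration), you can watch the two-sided 95% critical value of the t-distribution grow as the residual degrees of freedom shrink:

```python
from scipy import stats

n = 30  # a small, hypothetical dataset

# Each extra slope coefficient "spends" one more degree of freedom,
# leaving n - p - 1 after estimating an intercept and p slopes.
for p in [1, 5, 10, 20]:
    df = n - p - 1
    crit = stats.t.ppf(0.975, df)  # two-sided 95% critical value
    print(f"p = {p:2d} predictors -> df = {df:2d}, critical t = {crit:.3f}")
```

With huge samples the critical value sits near the Normal’s 1.96; with only a handful of residual degrees of freedom it inflates noticeably, widening intervals and eating into power.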
I hope this short article helps shed a little more light on this small detail in statistics.
See you in the next one.
Christian
Current State of The Channel
😵💫 What am I working on right now?
Editing a video on how statisticians use randomness to do some crazy stuff
🧐 What am I enjoying right now?
Book — I’m listening to Antifragile: Things That Gain from Disorder by Nassim Nicholas Taleb. It’s been a few years since I last read it, and I remember it being influential on my younger self. It was surprising to hear how much his ideas still persist in the things I do today.
Thing — I got a camera! You’ll be seeing my ugly mug a bit more, not just on YouTube but in other pieces of content I have in the works.
📺 What are my recent videos?
Explaining nonparametric statistics, part 1: a video about Wilcoxon’s Signed-Rank test, a nonparametric analog to the one-sample t-test. It’s also an introduction on why statistics has a “nonparametric” branch.
📦 My other stuff
I wrote guided solutions to problems in Andrew Gelman’s Bayesian Data Analysis. It’s for advanced self-learners teaching themselves Bayesian statistics.
Heads up! Some of the links here are affiliate links, so I may get a small amount of money if you buy something from them. I only link stuff I actually use.