This post is a part of the Statistical 52, a weekly series where I write about questions and concepts that I think aspiring statisticians should be comfortable with.
Question
Power is defined as the probability of rejecting the null hypothesis, given that it is actually false. In some simple cases, we can calculate it directly, but this may not always be the case. Alternatively, we can turn to simulation studies. Explain the process of calculating power this way.
Discussion
TLDR: Generate many independent datasets and leverage the law of large numbers to estimate power via the proportion of correctly rejected null hypotheses.
Power analysis is an important tool that statisticians use to plan experimental studies more efficiently. Although we’d like all of our studies to have large sample sizes, it helps to figure out exactly what counts as “large” enough.
Despite its importance, I’ve found that students don’t get a lot of practice with power or sample size calculations in university coursework. So, we’re gonna cover it today.
I’ve covered simulation studies extensively on the Very Normal channel. If you’d like a more detailed explanation, go watch this video. Here, you’ll get a more condensed version.
A simulation study involves the following:
Generate a dataset according to a specific alternative hypothesis
Perform the appropriate hypothesis test on this data
Record whether or not the null hypothesis was rejected
Repeat steps 1-3 a large number of times
Calculate the proportion of simulations where the null was rejected. This is your simulated (Monte Carlo) power.
Here’s what this would look like in code form. We’re doing a one-sample proportion test where the null hypothesis is that the response rate is 20%, and the data are generated under an alternative where the true rate is 40%. This might correspond to a simple Phase 2 study where we’re trying to figure out if a new treatment is actually effective enough (40%) or not (20%).
set.seed(1) # for reproducibility
n_sims = 100000 # number of simulated datasets/tests
rejected = logical(n_sims) # for storing results
# Step 4: Repeat steps 1-3 a large number of times (here, 100,000)
for (i in 1:n_sims) {
  # Step 1: Gather the "data"
  data = rbinom(1, size = 30, prob = 0.4)
  # Step 2: Perform the test
  test = prop.test(data, n = 30, p = 0.2, alternative = "greater")
  # Step 3: Record the result (reject if p-value less than 0.05)
  rejected[i] = test$p.value < 0.05
}
# Step 5: Calculate power via proportion of rejected results
mean(rejected) # = 0.70698
There are more efficient ways to program this, but I’ve found that a for-loop better drives home the point of generating independent datasets. There’s just no way that a dataset from one iteration is going to affect one from another.
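If you do want something more compact, here’s one possible vectorized version of the same idea (a sketch of my own, not code from the post): generate all of the simulated counts at once, then apply the test to each one. Since prop.test itself doesn’t use any randomness, this should land on essentially the same estimate.

# A vectorized sketch of the same simulation
set.seed(1)
counts = rbinom(n_sims, size = 30, prob = 0.4)
p_values = sapply(counts, function(x) {
  prop.test(x, n = 30, p = 0.2, alternative = "greater")$p.value
})
mean(p_values < 0.05) # simulated power, same idea as the for-loop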
How does that work?
In the end, power is a probability. According to the frequentist paradigm, a probability is just the “long-run frequency” of a given event, which in this case is just rejecting the null hypothesis, given that it’s false.
We can directly control this in our code because we choose how to generate the dataset. Our specified null hypothesis is that the success probability is 20%, but I generated the data with a true probability of 40%. By definition, the null hypothesis is false for this data.
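To see how much control that gives us, here’s a quick sketch (my own addition) where the same loop generates data under the null instead. The rejection proportion now estimates the Type I error rate rather than power, and it should sit near or below the 5% significance level.

# Same loop, but the data are generated under the null (true rate = 20%)
set.seed(1)
false_rejections = logical(n_sims)
for (i in 1:n_sims) {
  data = rbinom(1, size = 30, prob = 0.2) # the null is true here
  test = prop.test(data, n = 30, p = 0.2, alternative = "greater")
  false_rejections[i] = test$p.value < 0.05
}
mean(false_rejections) # estimates the Type I error rate, near or below 0.05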
To establish a long-run frequency, we need a large number of simulations, so I simulated 100,000 datasets. Thanks to the power of modern computing, this is relatively fast and easy. There’s no “right” number to establish a long-run frequency, but in my experience, the number I see the most is actually 10,000.
What connects the proportion of rejected nulls to the probability of rejecting the null is the Law of Large Numbers. It tells us that the proportion will converge to the true underlying probability as the number of trials grows. Here, that number is 100,000, which is plenty.
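If you want to see the Law of Large Numbers in action, you can plot the running proportion of rejections (again, my own illustrative addition) and watch the estimate settle down as the simulations pile up.

# Running estimate of power as the number of simulations grows
running_power = cumsum(rejected) / seq_along(rejected)
plot(running_power, type = "l",
     xlab = "Number of simulations", ylab = "Estimated power")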
One advantage of the simulation (aka Monte Carlo) approach is that we can use hypothesis tests that don’t have nice closed-form power calculations. Think nonparametric tests. This lets you be more flexible in the planning phase, or when you expect your data to violate classic statistical assumptions.
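For example, here’s what a power simulation could look like for a Wilcoxon rank-sum test comparing two groups. The sample sizes and effect size here are made up just to show that the recipe carries over unchanged.

# Power simulation for a Wilcoxon rank-sum test (illustrative numbers)
set.seed(1)
n_sims_w = 10000
rejected_w = logical(n_sims_w)
for (i in 1:n_sims_w) {
  control   = rnorm(30, mean = 0)   # control group
  treatment = rnorm(30, mean = 0.5) # treatment group, shifted under the alternative
  rejected_w[i] = wilcox.test(treatment, control)$p.value < 0.05
}
mean(rejected_w) # simulated power for the Wilcoxon test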
One disadvantage of this approach is that your calculated power is always an approximation. That’s usually okay, since power is more of a planning tool than anything else, but it’s something to keep in mind.
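The approximation error is easy to quantify, though. The rejection indicators are just Bernoulli draws, so the Monte Carlo standard error of the power estimate is roughly sqrt(p_hat * (1 - p_hat) / n_sims), which is tiny with 100,000 simulations.

# Monte Carlo standard error of the simulated power
p_hat = mean(rejected)
sqrt(p_hat * (1 - p_hat) / n_sims) # on the order of 0.001 here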
Were you able to answer it correctly? Was there anything that surprised you? Let me know in the comments! See you in the next question.
📦 Check out my other stuff!
Read through my Statistical Garden on the Very Normal website! This digital garden houses all the knowledge I gained as a biostatistics graduate student. It’ll grow as I learn more, and it’s free for you to look through.
You can support me on Ko-fi! YouTube and Substack are the best and easiest ways to support me, but if you feel like going the extra mile, this would be the place. Always appreciated!