Question 1: 50-50 Split
1 / 52
This post is a part of the Statistical 52, a weekly series where I write about questions and concepts that I think aspiring statisticians should be comfortable with.
Question
In a randomized clinical trial (RCT) or AB test, we strive to achieve a 50-50 sample split between the treatment and placebo/control group. Why is this the case?
Discussion
TLDR: We strive for 50-50 splits in RCTs because it is optimal in a statistical sense. For a fixed total sample size and equal variances in the two groups, it minimizes the variance of the difference in sample means, which gives us the highest efficiency possible.
RCTs (or AB tests) are experiments for comparative efficacy. We compare a treatment group against a control or placebo group to assess whether there is a meaningful difference between the two. Thanks to randomization, a non-zero difference suggests that the treatment is causing this difference.
More specifically, we are interested in a difference in the population means of these two groups. So, we use the difference in the sample means as an educated guess for the difference in population means:
$$\bar{X}_A - \bar{X}_B \approx \mu_A - \mu_B$$
where A and B indicate the treatment and control group, respectively.
Both of these means will vary slightly in value depending on the data we collect. So, there is a degree of variability in the difference we will actually observe. It’s in our best interest to minimize this variance.
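Below is an expression for this variance, where n_A and n_B are the sample sizes of the two groups:
$$\begin{aligned} \mathrm{Var}(\bar{X}_A - \bar{X}_B) &= \frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B} \\ &= \sigma^2 \left( \frac{1}{n_A} + \frac{1}{n_B} \right) \end{aligned}$$
The first line uses the fact that the two groups are independent and that a sample mean has variance σ²/n, the same scaling that appears in the Central Limit Theorem. The result in the second line comes from assuming that the variances of the two groups are the same (i.e. homoskedasticity).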
Since we are running an experiment, we have a fixed total sample size to recruit:
$$n = n_A + n_B$$
Taking a step back, let’s say that we don’t necessarily want to go for a 50-50 split yet. Instead, we want to figure out directly what proportion of the sample size should be dedicated to the treatment group. We’ll designate some notation for this:
$$\pi = \frac{n_A}{n}, \qquad n_A = \pi n, \qquad n_B = (1 - \pi) n$$
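We can substitute these expressions back into the variance equation:
$$\mathrm{Var}(\bar{X}_A - \bar{X}_B) = \sigma^2 \left( \frac{1}{\pi n} + \frac{1}{(1 - \pi) n} \right) = \frac{\sigma^2}{n \, \pi (1 - \pi)}$$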
What’s relevant here is that the variance of the difference in sample means is a function of the proportion of people assigned to the treatment group.
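To make this dependence concrete, here is a minimal Python sketch that evaluates the variance for a few allocation proportions. It assumes equal variance σ² in both groups, so the variance is σ²/(πn) + σ²/((1 − π)n). The function name `variance_of_difference` and the values n = 1000 and σ² = 1 are illustrative choices, not something taken from a real trial.

```python
def variance_of_difference(pi, n, sigma2=1.0):
    """Variance of (mean_A - mean_B) when a fraction pi of n units is treated.

    Assumes both groups share the same variance sigma2 (homoskedasticity).
    """
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must be strictly between 0 and 1")
    n_a = pi * n          # treatment group size
    n_b = (1.0 - pi) * n  # control group size
    return sigma2 / n_a + sigma2 / n_b


# Tabulate the variance over a grid of allocation proportions.
for pi in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"pi = {pi:.1f}  variance = {variance_of_difference(pi, n=1000):.6f}")
```

The printed values are symmetric around 0.5 and grow as the split becomes more lopsided, which is the pattern the algebra predicts.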
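Our goal is to minimize the variance. Since n and σ² are fixed, an equivalent goal here is to maximize the part of the denominator that depends on π:
$$f(\pi) = \pi (1 - \pi) = \pi - \pi^2$$
Setting the derivative to zero gives the candidate value:
$$f'(\pi) = 1 - 2\pi = 0 \implies \pi = \frac{1}{2}$$
The second derivative is f''(π) = −2, which is negative, so this point is a maximum rather than a minimum.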
Hence, the value of pi that maximizes the denominator and minimizes the variance is 0.5. This is why a 50-50 split in an RCT is in a sense optimal.
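If you would rather check this empirically than trust the algebra, a small simulation can help. The sketch below uses only the Python standard library and assumes normally distributed outcomes with equal variance in both groups and no treatment effect (the variance of the difference does not depend on the effect size). The function name `simulate_variance` and all parameter values are illustrative.

```python
import random
import statistics


def simulate_variance(pi, n=100, reps=2000, sigma=1.0, seed=0):
    """Monte Carlo estimate of Var(mean_A - mean_B) for allocation proportion pi."""
    rng = random.Random(seed)
    n_a = round(pi * n)  # treatment group size
    n_b = n - n_a        # control group size
    diffs = []
    for _ in range(reps):
        # Draw one simulated trial and record the difference in sample means.
        mean_a = statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n_a))
        mean_b = statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n_b))
        diffs.append(mean_a - mean_b)
    return statistics.variance(diffs)


for pi in (0.1, 0.3, 0.5):
    theory = 1.0 / (100 * pi * (1.0 - pi))  # sigma^2 / (n * pi * (1 - pi))
    print(f"pi = {pi:.1f}  simulated = {simulate_variance(pi):.4f}  theory = {theory:.4f}")
```

With these settings the simulated variance should land close to the theoretical value for each proportion, and the 50-50 split should come out smallest.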
Were you able to answer it correctly? Were you able to learn anything new? Let me know in the comments!
See you in the next question.
📦 Check out my other stuff!
Read through my Statistical Garden on the Very Normal website! This digital garden houses all the knowledge I gained as a biostatistics graduate student. It’ll grow as I learn more, and it’s free for you to look through.
You can support me on Ko-fi! YouTube and Substack are the best and easiest ways to support me, but if you feel like going the extra mile, this would be the place. Always appreciated!



I could really follow your explanation. The only part I had trouble with was where you maximize the expression in the denominator.
I guess you can instinctively tell that the point where the derivative of a function equals 0 is the maximum, but if you are not very strong in maths, this conceptual step is not obvious. It actually took me 10 mins... I guess it is the hardest math part here...
Anyway, thanks for the article! :) I am very interested in these topics.
Really liked the mix of algebra + concepts you wove into this post!