How I Learned to Stop Worrying and Love the Bell
Part 2 on why you need more data
This post is a follow up to my previous post, “Why do I need more data?” Reading that will help give some context here. Enjoy!
One of the hardest parts of starting anything is giving it a name. I chose Very Normal because it was a joke I’ve liked for a while, and it references an important concept in statistics in general. A quarter and a half into my Ph.D. studies, the name has only become more appropriate as I’ve seen just how ubiquitous the concept of normality is in the field.
In this post, I hope to give some context on exactly why statisticians give so much attention to the normal distribution. Even if you’re not in a role that involves data collection or research, normality still finds a way to sneak into our daily lives. After you’re done reading, I hope that you will learn to love the bell too.
Reintroducing Very Normal Land
Before I can tell you why the normal distribution matters, I need to set the scene for this week’s article. We go back to Very Normal Land.
A certain city in the hypothetical state of Very Normal Land has a population of 100 people. Everyone in this city has particular internet viewing habits. Some people are only on the internet for 1 hour a day, others are on for 5 hours. (In the last article, the blue people were on for 10, but I’ve changed it to 5 here for visualization purposes.) There are only 5 types of internet users (shown above), and there are 20 of each in the city.
As a researcher, I want to understand the average amount of time that the people in the city use the internet. While we could theoretically calculate the average from the information above, we wouldn’t learn the moral of the story that way. We’ll assume that I don’t have the ability to interview all 100 people in the city because I’m too lazy. In general, researchers do not have this level of knowledge about the populations they want to study, so we will pretend we don’t have access to this information either.
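For readers who want to follow along in code, here is a minimal Python sketch of the city. It assumes the five user types spend 1, 2, 3, 4 and 5 hours online per day; the post only names the 1-hour and 5-hour groups, so the middle values are a guess. The variable names are made up for illustration.

```python
# Assumption: the five user types spend 1, 2, 3, 4 and 5 hours online per day.
hours_by_type = [1, 2, 3, 4, 5]
people_per_type = 20

# Build the whole city: 20 residents of each type, 100 in total.
population = [hours for hours in hours_by_type for _ in range(people_per_type)]

# The "true" population average that a real researcher would not get to see.
population_mean = sum(population) / len(population)
print(len(population), population_mean)
```

Under that assumption the city averages 3 hours a day, which is the number the data collectors will be trying to recover.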
I don’t have the time and energy to interview 100 people, but what I do have is an understanding of statistics and access to…
The Infinite Army of Data Collectors
Here I’ll introduce another helpful metaphorical concept that will really nail down why normality is so important to us. I myself am not willing to interview people, but I do have access to a vast army of data collectors who are willing to interview a few people. Not 100, but they have enough willpower to talk to maybe… 3 people. That’s not a lot, but we do have access to a lot of labor.
In part 1, we noted that when possible, you should try to gather as much data as possible. Thanks to the Law of Large Numbers, the more data we gather, the better our sample average (the average of the people you interview) will represent the population average (the average of the entire city). Conversely, if you are only able to gather a small sample, you should also expect to see some extreme averages that are either really low or really high. These extreme averages happen when only people with 1 or 5 hours of internet time are interviewed.
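To put a number on how rare the most extreme averages are, here is a small Python sketch. It assumes a data collector never interviews the same person twice, and it counts only the two most extreme outcomes: everyone interviewed is a 1-hour user, or everyone is a 5-hour user. The function name is hypothetical.

```python
from math import comb

def prob_all_same_type(sample_size, people_per_type=20, city_size=100):
    """Chance that every interviewee belongs to one specific user type,
    assuming people are picked at random with no repeat interviews."""
    return comb(people_per_type, sample_size) / comb(city_size, sample_size)

# Either all 1-hour users or all 5-hour users, so the chance doubles.
p_extreme_3 = 2 * prob_all_same_type(3)
p_extreme_20 = 2 * prob_all_same_type(20)

print(f"3 interviews:  {p_extreme_3:.4f}")
print(f"20 interviews: {p_extreme_20:.2e}")
```

With 3 interviews the chance is roughly 1.4%, so a few of these extreme averages should show up among hundreds of small samples. With 20 interviews it is vanishingly small.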
In real life, it gets more complicated. We are often restricted to just collecting a single sample. This is all we see, so it’s hard to know if it’s low or high or “regular”. However, the sample average has a very special property that we will take advantage of thanks to our hypothetical army of data collectors.
The Sampling Distribution
Now, for our first experiment.
I have gathered an army of 300 data collectors. This is a far cry from an “infinite” amount, but it’s enough to demonstrate the key insight of the article. Absolutely none of these data collectors know anything about the city we’re trying to study. All of them are given the same instructions:
Go out and talk to 3 random people.
Learn how much time they spend on the internet.
Calculate the average from these 3 people.
Come back to me and tell me what the average was.
In this way, I effectively get 300 samples, which means I get 300 sample averages. As each of the data collectors comes back to me, I will add that sample average to a tiny histogram. The key here is that I don’t worry about whether a sample average is extreme. My only job as the leading statistician is to make a histogram.
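The experiment can be sketched in a few lines of Python. This is not the code behind the GIFs, just a minimal stand-in that assumes the five user types are 1, 2, 3, 4 and 5 hours and that nobody is interviewed twice by the same collector. The function name is made up.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

# Assumption: the five user types are 1, 2, 3, 4 and 5 hours per day.
population = [hours for hours in (1, 2, 3, 4, 5) for _ in range(20)]

def collect_sample_average(sample_size=3):
    """One data collector: interview random people, report their average."""
    interviewed = random.sample(population, sample_size)  # no repeat interviews
    return sum(interviewed) / sample_size

# 300 data collectors, each reporting a single sample average.
sample_averages = [collect_sample_average() for _ in range(300)]

# Text histogram: one row per distinct average, one '#' per collector.
counts = Counter(round(avg, 2) for avg in sample_averages)
for value in sorted(counts):
    print(f"{value:4.2f} {'#' * counts[value]}")
```

Changing the seed gives a different histogram each time, which is the point: each collector's answer is random, but the pile they build together is not.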
Now we watch the histogram evolve as the data collectors finish their interviews:
While extreme averages are possible, they are rarer than more moderate averages. We see this in the buildup of the large clump of sample averages around 3, which is where we know the true population average to be. Even though all of the sample averages are random, as a whole they have a crucial relationship with the population average: the sample averages form an approximately normal distribution that is centered at the true population average. With 300 samples, it’s harder to see the bell shape, but rest assured that it only becomes more and more apparent with more samples. This distribution is known as the sampling distribution, where “sampling” refers to the fact that it represents what values the sample average can take.
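One way to check the "centered at the true population average" claim is to compare the simulated averages against textbook values. The sketch below again assumes user types of 1 to 5 hours and no repeat interviews. The spread formula it uses is the standard error of the mean with a finite-population correction; that formula is a standard result and does not come from this post.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Assumption: the five user types are 1, 2, 3, 4 and 5 hours per day.
population = [hours for hours in (1, 2, 3, 4, 5) for _ in range(20)]
city_size = len(population)
sample_size = 3

# 300 collectors, each averaging 3 random residents (no repeats).
averages = [
    statistics.fmean(random.sample(population, sample_size))
    for _ in range(300)
]

# Textbook values for the center and spread of the sampling distribution.
theory_center = statistics.fmean(population)
theory_spread = (
    statistics.pvariance(population) / sample_size
    * (city_size - sample_size) / (city_size - 1)
) ** 0.5

print("center:", statistics.fmean(averages), "vs", theory_center)
print("spread:", statistics.stdev(averages), "vs", theory_spread)
```

The simulated center should land close to 3 and the simulated spread close to the textbook value of about 0.81 hours, though neither will match exactly with only 300 collectors.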
I remember this was the first major mental hurdle in my statistics learning. The sample average has a distribution? I thought it was just a number. As mentioned earlier, we often only see a single average that we ourselves calculate. The key idea here is that if we were to repeat the experiment many, many times, then we would reliably know that the (unknowable) population average is at least somewhat close to the single sample average that we calculated. Statisticians know that the “true” average is unknowable, but we have tools that let us understand a range of values that it might be. That’s an entire article for another week though.
High schoolers often get a variation of this experiment in their AP Statistics class. The data collectors are the students themselves, and they are told to collect data and calculate their own sample average. Then their teacher creates the histogram.
We can use this same simulation to demonstrate the power of the Law of Large Numbers. Let’s say that I give the same instructions to another set of 300 data collectors, except that these unlucky fools have to talk to 20 people instead. Talking to 20 people makes extreme sample averages much, much less likely, so we would expect the sampling distribution to “clump” up more around the population average.
We can see this below:
When in doubt, always be collecting more data.
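The clumping effect can also be checked numerically. Here is a rough Python sketch that runs both armies, one interviewing 3 people each and one interviewing 20, and compares how spread out their averages are. It keeps the same assumptions as before (user types of 1 to 5 hours, no repeat interviews), and the function name is hypothetical.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

# Assumption: the five user types are 1, 2, 3, 4 and 5 hours per day.
population = [hours for hours in (1, 2, 3, 4, 5) for _ in range(20)]

def simulate_averages(sample_size, collectors=300):
    """Each collector interviews `sample_size` residents and reports the average."""
    return [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(collectors)
    ]

small_samples = simulate_averages(3)
large_samples = simulate_averages(20)

# Spread of each sampling distribution, plus its lowest and highest average.
for label, averages in (("3 people", small_samples), ("20 people", large_samples)):
    spread = statistics.stdev(averages)
    print(f"{label:>9}: spread={spread:.2f} min={min(averages):.2f} max={max(averages):.2f}")
```

The 20-person averages should show a noticeably smaller spread and a narrower min-to-max range than the 3-person averages.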
Why normality matters
Our little simulation was intended to demonstrate a situation where a normal distribution comes up. This situation happens every time we are trying to sample a bigger population, which essentially covers the idea of research itself. If you take anything away from this article, it’s that the sample average has a distribution, and that distribution is bell-shaped.
To wrap this up, we always want to ask ourselves, “So what? It’s normally distributed, big deal.” I’ve only lightly touched on the role of probability in statistics here, but the answer lies there. If you know or assume that something has a normal distribution, it gives you access to some handy information:
You know what values the sample average can realistically take
You know that the population average is somewhere in this range of values, which is a lot better than not knowing anything about it at all
Conversely, you also know what values are considered “improbable” or even virtually impossible for the sample mean
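As a rough illustration of the points above, the sketch below treats the sampling distribution for 20-person samples as exactly normal and asks it two questions: which range holds about 95% of sample averages, and how likely is an average below 2 hours? It cheats by using the full city, which a real researcher would not have, and it assumes user types of 1 to 5 hours. The normal curve here is an approximation, not an exact description.

```python
from statistics import NormalDist, fmean, pvariance

# Assumption: the five user types are 1, 2, 3, 4 and 5 hours per day.
population = [hours for hours in (1, 2, 3, 4, 5) for _ in range(20)]
city_size = len(population)
sample_size = 20

# Standard error of the sample average, with a finite-population correction
# because collectors do not interview the same person twice.
standard_error = (
    pvariance(population) / sample_size
    * (city_size - sample_size) / (city_size - 1)
) ** 0.5

# Normal approximation to the sampling distribution of the sample average.
sampling_dist = NormalDist(mu=fmean(population), sigma=standard_error)

# Range that should contain about 95% of sample averages.
low = sampling_dist.inv_cdf(0.025)
high = sampling_dist.inv_cdf(0.975)

# How "improbable" is a sample average below 2 hours?
p_below_2 = sampling_dist.cdf(2.0)

print(f"realistic range: {low:.2f} to {high:.2f}")
print(f"chance of an average below 2: {p_below_2:.5f}")
```

Under these assumptions, most 20-person averages fall between roughly 2.4 and 3.6 hours, and an average below 2 hours would be a very rare event.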
There’s definitely more to be said here, but that’s a separate article. You can find my code to produce the GIFs here. Feel free to download and play around with it. Until next week!
Very Normal comes out weekly every Friday at 5pm. Please subscribe if you’d like to see more!