# Why do I need more data?

### Why bigger is better in the statistical world (and the world in general)

A few days ago, I was in bed going down a rabbit hole of YouTube videos. It started with an innocent King of the Hill scene and ended with a spree of clips from various cartoons. One gem in particular stuck with me.

The focus of the clip was really the Green Lantern, but Superman gets in a few frames of screen time. He got me thinking about the statistical consequences of having Superman even exist in our world. This article is the fruit of those late-night thoughts.

## Super… statistician?

In a way, Superman is doing the world a great disservice by not offering himself up to help statisticians. The research world might be in a much better place if we sent Superman off to collect data for us instead of us meek researchers gathering small samples. With his speed, stamina, and bevy of superpowers (canon or not), Superman actually has the ability to gather data on *entire populations*. And he doesn’t do it! What a waste!

Why does it even matter if Superman’s able to look at whole populations on a whim? If anything, this ability possibly invalidates the need for statistics in the first place. Take, for example, the pretend conversation I had with an imaginary Superman:

Me: Hey Superman, how many people currently have COVID-19 in the United States right now?

Superman: oh idk, let me check really quick

-a few minutes later-

Superman: 5,119,347

Me: Okay cool, Google says it’s currently 3.22 million, check again for me tomorrow

In the above scenario, we treat Superman’s count as the **exact number** of people who currently have COVID-19 (we’re assuming that Superman is correct). His number covers every case in the United States, so we can state the prevalence of COVID-19 with exact precision. In other words, **there is absolutely no uncertainty in his answer.**

Compare this to the 3.22 million that I mention in the conversation. This number is compiled from records across multiple sources, and some people are missed for a variety of reasons: unreported positive tests, false positives, and many others. There’s uncertainty in this number due to all these possible sources of error. Despite this, there’s still a way for us to at least *home in* on the true prevalence (the number Superman would report). This is what statistics is all about! Statisticians over the ages have devised ways to help us **quantify** uncertainty and know how close we are to the truth (the population answer).

One of the central problems of statistics can be framed as a *scarcity* problem. As humans, we don’t have the time, energy, or resources to gather data on absolutely everyone in a population (e.g., all Californians). Instead, we gather a smaller **sample**, assume that it accurately represents the population, and then quantify our uncertainty via confidence intervals. The idea here is that your quantity of interest (a population prevalence, for example) will very likely be contained in this interval.
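A confidence interval is easy to sketch in code. Below is a minimal illustration in Python; the made-up data and the normal approximation are my assumptions, not part of the original example. We compute a sample mean and a 95% interval that, over repeated sampling, would contain the true mean about 95% of the time.

```python
import math
import random

random.seed(0)  # reproducible illustration

# A hypothetical sample of 50 measurements (hours online per day);
# the "true" mean behind this made-up generator is 4.
sample = [random.gauss(4, 3) for _ in range(50)]

n = len(sample)
mean = sum(sample) / n
# Sample standard deviation (n - 1 in the denominator).
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

# Normal-approximation 95% confidence interval for the population mean.
half_width = 1.96 * sd / math.sqrt(n)
lower, upper = mean - half_width, mean + half_width
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

With more data, `half_width` shrinks, which is exactly the "more data, less uncertainty" idea of this article.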

In consulting, a common suggestion to worried researchers is to get more data! The more data you have, the better chance you have of showing the result you want (*assuming it’s correct in the first place*). Intuitively, this makes sense in our heads, but why?

We’ll demonstrate this below.

## A thought experiment

Let’s introduce the hypothetical world of **Very Normal Land**, otherwise known as one of the worst amusement park names. Exactly **100 people** live here, and these 100 people are evenly split into **5 groups of 20**.

The people of Very Normal Land have very particular internet browsing habits. The first group of 20 browses the internet for only 1 hour a day. The second group browses for 2 hours, the third for 3, and the fourth for 4. The last group, though, is constantly on the internet… for 10 hours a day. This group would most resemble people like you and me.

We’ll treat the people of Very Normal Land as our population of interest. Let’s say that we were interested in calculating the **average amount of time spent on the internet** among the Very Normal people. We would just demand that Superman talk to everyone in this group and perform the calculation for us:

`(20 * 1 + 20 * 2 + 20 * 3 + 20 * 4 + 20 * 10) / 100`

> 4

So, the population of Very Normal Land browses the internet for an **average of 4 hours.** Keep this number in mind. It really does represent the true average of the population since we’ve asked absolutely everyone in the population.
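Superman’s arithmetic is easy to check. Here’s the same population calculation as a quick sketch in Python (my choice of language; the original just does the arithmetic by hand):

```python
# The population of Very Normal Land: five groups of 20 people,
# browsing 1, 2, 3, 4, and 10 hours a day respectively.
population = [1] * 20 + [2] * 20 + [3] * 20 + [4] * 20 + [10] * 20

# The true population average -- exactly what Superman reported.
true_mean = sum(population) / len(population)
print(true_mean)  # → 4.0
```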

It’s really very convenient (almost unrealistic) that we are able to calculate this population amount. We’ll see what happens when disaster strikes.

## Law of Large Numbers

After being forced to talk to 100 hypothetical people on his hypothetical day off, Superman is pissed. He refuses to do any more work for us and flies off into the distance, leaving us — the very human statisticians — to redo the calculation. Fearing that Superman may have lied about the 4-hour figure, we try doing it ourselves.

Having limited time, energy, and desire to talk to strangers, we decide that we’ll just **sample** a small portion of Very Normal Land and calculate the average internet browsing time of this sample. We decide that we’ll talk to 10 **random** people and ask them. By pure chance, we end up with three people who browse for 1 hour, two for 2, one for 3, three for 4, and one for 10, and then calculate the average of these 10:

`(3 * 1 + 2 * 2 + 1 * 3 + 3 * 4 + 1 * 10) / 10`

> 3.2

So, in our sample of 10, we calculate an average of 3.2 hours. This sample average is our best guess at the average internet browsing time of the population. We use the sample (which we observe) to *infer* information about the population (which we cannot observe). If we were to stop here and report 3.2, we really wouldn’t be that far off.

This 3.2 is… *somewhat* close to the original 4-hour calculation, but why is it different? Almost purely because of chance. In this case, the people who browse for 10 hours are underrepresented: they make up exactly 20% of the population but only 10% of the sample. These constantly-on-the-internet people help drive up the population average, so their underrepresentation pulls the sample average down.
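We can mimic this sampling in a few lines of Python (a sketch; the seed and the use of `random.sample` are my choices, not part of the original story). Each run draws 10 people at random, and the sample mean wobbles around the true value of 4 from run to run:

```python
import random

random.seed(0)  # change the seed to see different samples

population = [1] * 20 + [2] * 20 + [3] * 20 + [4] * 20 + [10] * 20

# Draw 10 people at random, without replacement, and average their hours.
sample = random.sample(population, 10)
sample_mean = sum(sample) / len(sample)
print(sample_mean)
```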

How do we get our sample average to look more like the population average? **Gather as much data as you possibly can. The more, the better.** Remember that the driving factor behind the smaller sample average was *proportions*. To best match the population, you want the *proportions* of each group in the sample to be **as close as possible** to the proportions in the entire 100-person population. We can’t speak to everyone in the population, so the next best option is to get the sample to look as close to it as possible.

In our small toy universe where we know absolutely everything about the population, this calculation is almost trivial. In the real world, we often don’t know either of these things:

- how many people are truly part of a population of interest (e.g., the number of people with a rare type of cancer, or the number of people who will be saved by a new vaccine)

- the proportions of people in the population who will respond differently to your questions (e.g., knowing what kind of internet user we are speaking to)

The beautiful thing about the **Law of Large Numbers** is that *it doesn’t care that we don’t know either of these things.* The Law of Large Numbers is a famous result in probability and statistics which tells us that with more data, our sample will resemble the population more and more, until eventually it matches it exactly. There’s an element of infinity here that we’re glossing over, but in practice it’s best to think of it as: **get as much data as you can**.
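A quick simulation makes the law concrete (a sketch under my own assumptions: I sample *with* replacement so the sample can be larger than the 100-person population). As the sample size grows, the sample mean settles down near the true value of 4:

```python
import random

random.seed(42)

population = [1] * 20 + [2] * 20 + [3] * 20 + [4] * 20 + [10] * 20
true_mean = sum(population) / len(population)  # 4.0

# Sampling with replacement, so n can exceed the population size.
for n in (10, 100, 10_000):
    sample = random.choices(population, k=n)
    sample_mean = sum(sample) / n
    print(n, round(sample_mean, 2))
```

The printed means wander for small `n` and hug 4 for large `n` — the Law of Large Numbers in action.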

We could stop here after learning about this essential law in statistics, but this is a perfect opportunity to turn the law on its head and explore the consequences. Very real, unnerving consequences.

## Law of the Small Numbers

You just learned it, but we’ll make it loud and clear what the Law of Large Numbers says:

> **The Law of Large Numbers:** The more data you collect, the more your sample will represent the population.

Alternatively: the more data you collect, the more precise your sample calculation will be, and the closer it will be to the population value.

We’ve established that it’s all well and good when we gather a lot of data, but not everybody’s got time for that. What do I *really* have to lose by not gathering a lot of data? Keep that second formulation of the Law of Large Numbers in mind.

A lot, actually.

In our original sample of 10, we were lucky to find even a single person who spends 10 hours a day on the internet. Let’s say that we were *even lazier* and decided that we would just sample 3 people. Consider three samples, all equally likely to come from Very Normal Land at that sample size: one of three 1-hour browsers (average 1), one whose browsing times happen to average exactly 4, and one of three 10-hour browsers (average 10).

With sample 2, you get very lucky. You happened to find people whose internet browsing times *just happened* to coincide with the population average. We are not so lucky with the other two samples, which give us **wildly** different pictures of the average internet browsing time in Very Normal Land.

When we first calculated an average from the 10 people above, we applied what we knew about the sample to the greater population. Doing that here would give you absurd conclusions about the population, making them out to be people who either hate the internet or cannot get away from it.

Small sample sizes should worry you because they make *extreme* results more likely. Small samples don’t *cause* extreme results; the extremes become more likely purely as a function of how little data was collected.
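We can put a number on this with a small simulation (again a sketch; the threshold for “extreme” and the seed are my choices). Call a sample mean extreme if it lands more than 2 hours away from the true average of 4, and compare samples of 3 people against samples of 30:

```python
import random

random.seed(1)

population = [1] * 20 + [2] * 20 + [3] * 20 + [4] * 20 + [10] * 20

def extreme_rate(n, trials=10_000):
    """Fraction of size-n samples whose mean is more than 2 hours off."""
    extreme = 0
    for _ in range(trials):
        sample_mean = sum(random.sample(population, n)) / n
        if abs(sample_mean - 4) > 2:
            extreme += 1
    return extreme / trials

print(extreme_rate(3))   # extreme means are common with 3 people
print(extreme_rate(30))  # and rare with 30
```

Same population, same definition of extreme — only the sample size changes, and the rate of extreme results changes dramatically.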

It’s tempting to frame this as a “Law of Small Numbers,” but it’s really just a consequence of the Law of Large Numbers. If gathering more data *increases* the precision of our estimates, gathering less data decreases it. Decreased precision means that sample averages will vary a lot from sample to sample. Above, we looked at 3 different samples, each with a different average browsing time, ranging from 1 to 10. Daniel Kahneman covers this topic more comprehensively in *Thinking, Fast and Slow*, which I highly recommend.

> **“Law of Small Numbers”:** Less data makes extreme results more likely.

## What does it mean for you?

In your general life, you should always gather as much data as you can before deciding on anything. More data means you can get closer to the (population) truth and you’re better protected from extreme results that are entirely divorced from reality. Gather more data, more observations, more opinions, so that you are more informed.

To wrap up this issue, it’s interesting to ponder where the consequences of small samples may rear their ugly heads. This is speculation on my part, but I have an inkling that they play a role in the reproducibility crisis, a.k.a. the problem of “we found a provocative, interesting result, but for some reason, no one else can recreate what we did.”

With a lucky enough sample, you might just find a group of 10-hour or 1-hour browsers that help support your hypothesis, when in reality, your hypothesis was false the entire time. The “extreme value” in this case would be the low p-value which enables the spurious result to sneak in as a paper. Other teams that try to recreate the result may not be so lucky with their sample and won’t be able to come to the same conclusions. Uh oh.

I hope that this has been instructive for you! Remember to get that data.
