Curious Consulting: On Estimating Prevalence
Make sure you match the population that you want to investigate!
One of the requirements for getting my Master’s degree was to spend a semester shadowing a biostatistician for a consulting session. Consulting sessions involve bringing in researchers, often medical doctors, and providing them with advice and support on their research efforts. This support can take the form of advice on how to design a study to best answer a research question, or even performing analyses on data that’s already been collected. As a budding biostatistician, I attended a session and tried to offer my own advice, and then I listened to the actual Ph.D. give his two cents.
Consulting’s really fun; it gives you a lot of exposure to different research problems. You quickly learn that real data collection and analysis is dirty and complicated, so you have to figure out how to adapt the sterile statistical techniques you learned in class to real life. It’s a very different muscle to flex when any statistical question is fair game, and it’s doubly hard when your collaborator doesn’t precisely understand all of your statistical jargon.
This article will be a (slightly) fictional account of my own consulting experience. Any names and study details have been altered, but I’ve kept all of the details relevant to the core statistical problem.
Dr. Doe came into the small office with his associate, a fellow doctor. I remember being nervous because I accidentally choked on my coffee when he extended his hand out to greet me. After the formalities were done, we started to learn about the context of the study.
Dr. Doe was interested in investigating a peculiar medical practice in a far away country. This practice was a somewhat unsettling mix of medical practice and ritual: the removal of the uvula, known professionally as the little punching bag in the back of your mouth. In modern nations, this practice has demonstrated little medical benefit, but it still sees use in some developing countries. Uvula removal in this country was often done in non-sterile settings with rudimentary tools, which often resulted in dangerous infection.
The country where Dr. Doe and his team were researching the procedure is among the poorest of nations and is known to be dangerous to travel in. Because of this, Dr. Doe and his team set up clinics in a relatively safe region of the country. The prevalence of uvula removal has been documented in other countries, but it had not yet been documented here.
📗 Definition 📗
Prevalence: how common a condition/event is in a given population. Defined as the number of people who have the condition (i.e., uvula removal) divided by the total number of people in the population.
Prevalence is important because it gives you an idea of how rare or common a disease is in a city/country/group.
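To make the definition concrete, here is a minimal Python sketch of the calculation. The function name `prevalence` and the counts (30 people reporting uvula removal out of 200 surveyed) are made up for illustration; they are not Dr. Doe’s data.

```python
def prevalence(cases: int, population: int) -> float:
    """Number of people with the condition divided by the total number of people."""
    if population <= 0:
        raise ValueError("population must be positive")
    if not 0 <= cases <= population:
        raise ValueError("cases must be between 0 and population")
    return cases / population


# Hypothetical counts, for illustration only (not the study's data).
clinic_prevalence = prevalence(cases=30, population=200)
print(f"Prevalence: {clinic_prevalence:.1%}")
```

Note that this number only describes the people who were counted. Whether it also describes the wider population is a separate question.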
In order to collect data, the medical team surveyed all of the people that attended the clinic, asking them “Have you ever had your uvula removed?” The team also recorded the age at which the uvula was removed, as best recalled by the participant. These were the only two bits of information recorded.
One thing that was dug up during consultation was that the team also wanted to get a better idea of how common uvula removal was in children. To answer this question, the team sampled more kids than they initially wanted to.
The data had already been collected, so Dr. Doe and his associate really just came in with a quick clarifying question on how to calculate the prevalence. However… after hearing their story, there were a lot of things that needed to be addressed before that calculation could be made.
Think like a biostatistician!
After reading through the described scenario, what do you think might have been wrong with the design of Dr. Doe’s study? Think about it a bit and try to figure out what we discussed during that consultation.
After you’ve given it some thought, see what we told Dr. Doe.
Lessons in representation
Dr. Doe’s troubles actually start with the location of his clinics. His goal was to estimate the prevalence of uvula removal in the entire country, but unfortunately he could only set up clinics in one relatively safe region. Dr. Doe has to ask himself a question: are the people his team interviewed in the clinics representative of the country’s people in general? If so, then he can probably move forward with the calculation. If not, he’ll need to perform a correction.
Proper representation matters because you don’t want to run into the problem of an “unlucky” sample, where the people you interviewed are sicker/healthier than the average person in the population. For example, if you happen to have a clinic in an area where uvula removal is especially common, then you’ll encounter people who have had the procedure at a higher rate than you would in another town.
Dr. Doe has another representation problem, similar to what we’ve discussed above with clinic locations. Recall that Dr. Doe’s team took a sudden interest in investigating the prevalence specifically in children. This happens a lot: sometimes you’ll get an exciting idea branching out from your original research plan.
While Dr. Doe certainly collected a lot of data on children, he inadvertently messed up the age distribution of his data. Now, children are overrepresented in the sample. Whatever the prevalence of uvula removal in children is, Dr. Doe’s final calculation will be biased towards that rate.
This idea of matching age distribution ties closely to the problem of the unlucky sample. In the ultimate ideal world, Dr. Doe and his team literally interview everyone in the country in a short time frame, which lets them calculate the population prevalence that they were interested in. But that’s almost never the case. In the less ideal, but still pretty ideal world, the sample that Dr. Doe collects has the same age distribution as the overall country, which he can verify via a census. In this case, his calculation will be closer to the true prevalence. This is true of any study that you want to run:
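A small Python sketch can show why oversampling children pulls the estimate around. The overall prevalence is just an average of the group prevalences, weighted by each group’s share of whoever you counted. Everything below is hypothetical: the two age groups, the prevalence in each group, and both sets of shares are invented numbers, and `pooled_prevalence` is a name chosen for this example.

```python
def pooled_prevalence(group_prevalence, group_share):
    """Overall prevalence as a share-weighted average of group prevalences."""
    if abs(sum(group_share.values()) - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")
    return sum(group_prevalence[g] * group_share[g] for g in group_prevalence)


# Hypothetical values, for illustration only.
group_prevalence = {"children": 0.30, "adults": 0.10}
country_share = {"children": 0.40, "adults": 0.60}  # e.g. from a census
sample_share = {"children": 0.70, "adults": 0.30}   # children oversampled

true_value = pooled_prevalence(group_prevalence, country_share)
naive_estimate = pooled_prevalence(group_prevalence, sample_share)

print(f"Country-wide prevalence:    {true_value:.2f}")
print(f"Estimate from clinic sample: {naive_estimate:.2f}")
```

With these made-up numbers the country-wide prevalence is 0.18, but the sample with too many children gives 0.24, which sits closer to the children’s rate of 0.30.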
Moral of the story:
Make sure your sample represents the population you’re trying to study!
How to fix the problem
Dr. Doe thought he was coming in for a quick calculation question, but he got a lecture on proper sampling from the presiding biostatistician. There was nothing to be done about the clinics; that was a reality of the data collection that couldn’t be avoided. You address that type of generalization weakness in the Limitations section of your manuscript.
For the age distortion, Dr. Doe needs to perform some weighting on his data. He has too many children and too few people from the other age brackets, so he has to change how much each observation influences the calculation. In a regular prevalence calculation, everyone has the same weight. In this case, he needs to downweight the observations from children and upweight the other observations, ideally to match the actual age distribution of the country.
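Here is a rough Python sketch of what that reweighting could look like, under some loud assumptions: each participant’s age bracket is known, census shares are available for the same brackets, and only two brackets are used to keep things short. The counts, the census shares, and the function names are all hypothetical and are not taken from Dr. Doe’s study. Each person’s weight is their bracket’s share in the census divided by that bracket’s share in the sample.

```python
# All numbers below are hypothetical, chosen only to illustrate the arithmetic.
sample = {
    "children": {"n": 700, "cases": 210},  # oversampled bracket
    "adults": {"n": 300, "cases": 30},
}
census_share = {"children": 0.40, "adults": 0.60}  # assumed census age distribution


def crude_prevalence(sample):
    """Regular prevalence: every observation has the same weight."""
    total_cases = sum(s["cases"] for s in sample.values())
    total_n = sum(s["n"] for s in sample.values())
    return total_cases / total_n


def weighted_prevalence(sample, census_share):
    """Prevalence after reweighting each bracket to match the census shares."""
    total_n = sum(s["n"] for s in sample.values())
    weighted_cases = 0.0
    weighted_n = 0.0
    for bracket, s in sample.items():
        sample_share = s["n"] / total_n
        # Below 1 for overrepresented brackets, above 1 for underrepresented ones.
        weight = census_share[bracket] / sample_share
        weighted_cases += weight * s["cases"]
        weighted_n += weight * s["n"]
    return weighted_cases / weighted_n


print(f"Crude prevalence:    {crude_prevalence(sample):.3f}")
print(f"Weighted prevalence: {weighted_prevalence(sample, census_share):.3f}")
```

With these invented counts, the crude prevalence is 0.240 and the weighted prevalence is 0.180. Children get a weight of about 0.57 and adults a weight of 2, so the adults’ lower rate counts for more. Note that weighting only corrects the age imbalance; it does nothing for the clinic location problem.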