Issue 11: Developing a Biostatistics Paper

Some insights into how a biostatistics paper might be planned

Mar 15, 2024

Made with Midjourney. “research manuscript on a desk, sunny and calm background, painted”

In this issue, I’d like to talk about what goes into a biostatistics research paper. Some of you seemed to like it when I discussed my own research, so I’ll continue along that thread today.

Statisticians do research?

When I started my first graduate-level biostatistics class, I had no idea that statisticians did research. I thought that at the most advanced levels, you just learned how to use the most advanced models.

Five years later and still in school, I now have a good idea of the basic ingredients of a standard statistics research manuscript. In retrospect, it’s not so different from a paper in another field, but it’s also unique because statistics papers solve problems inherent in a general kind of data rather than questions about one specific dataset.

Using my third paper as an example, I’ll walk you through the basic structure and content.

1. The Problem

The introduction tells the reader about the specific data problem you are considering.

A problem I’ve noticed is that clinical trials primarily use one endpoint to plan for sample size, type-I error, power, and perhaps other characteristics.

Yet, diseases rarely affect just one aspect of our lives. For example, depression may manifest itself as anxiety or suicidal tendencies. Furthermore, since these symptoms come from the same disease, they are plausibly correlated.

These between-endpoint correlations are valuable information we should take advantage of, but the one-endpoint paradigm doesn’t allow us to do this.
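To make the one-endpoint paradigm concrete, here is a minimal sketch of the textbook sample size calculation for a two-arm trial with a single continuous endpoint, using the normal approximation. The function name and the planning values (a difference of 5 with a standard deviation of 10) are made up for illustration and do not come from any real trial.

```python
import math
from statistics import NormalDist


def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided, two-sample z-test on one endpoint."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the type-I error rate
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    return math.ceil(n)                            # round up to whole patients


# Hypothetical planning values: detect a difference of 5 when the SD is 10.
print(n_per_arm(delta=5, sd=10))
```

Notice that nothing in this calculation uses a second outcome or its correlation with the first. Everything is planned around the one endpoint.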

1.1 Why does the problem still exist?

The introduction should also outline what solutions currently exist for your problem, and why they are insufficient.

Multivariate (multiple outcome) analysis is hard, especially if the outcomes are not the same type. There are no “easy” distributions that a statistician can use to represent the joint probability distribution of multiple outcomes. There are solutions, but they are highly technical. There are also increased computational demands (i.e. my computer is taking too long to do this).

2. Simulating data

In my opinion, the next step is to figure out how to generate data with properties like those of the data you want to work with. In this case:

Multiple outcomes…
1. … that are mixed in type, possibly discrete and continuous
2. … and correlated
Possibly many treatments that alter the joint probability distribution of these outcomes
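One common way to get data with these properties is to draw correlated normal variables and then cut one of them at a threshold to make a binary outcome. The sketch below does this for one continuous and one binary outcome. It is only an illustration of the idea, not necessarily the approach the paper will use, and every number in it (baseline of 50, SD of 10, latent correlation of 0.6, the cutoffs) is an assumption.

```python
import math
import random
from statistics import mean


def simulate_patient(rng, treated, effect=5.0, rho=0.6):
    """Return (continuous_score, binary_event) for one hypothetical patient."""
    # Two standard normal latent variables with correlation rho.
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    # Continuous outcome: baseline 50, SD 10, shifted by the treatment effect.
    score = 50 + effect * treated + 10 * z1
    # Binary outcome: the event occurs when the latent variable exceeds a cutoff.
    # Treatment raises the cutoff, so the event becomes less likely.
    event = 1 if z2 > 0.5 * treated else 0
    return score, event


def pearson(xs, ys):
    """Sample Pearson correlation, written out to avoid extra dependencies."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(2024)
control = [simulate_patient(rng, treated=0) for _ in range(5000)]
treated = [simulate_patient(rng, treated=1) for _ in range(5000)]

control_scores, control_events = zip(*control)
treated_scores, treated_events = zip(*treated)

r_control = pearson(control_scores, control_events)
print(f"correlation between outcomes (control): {r_control:.2f}")
print(f"mean score: control {mean(control_scores):.1f}, treated {mean(treated_scores):.1f}")
print(f"event rate: control {mean(control_events):.2f}, treated {mean(treated_events):.2f}")
```

The two outcomes are different types but still correlated, and the treatment changes both of them at once, which is the kind of joint behavior described in the list above.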

In statistical papers, you must demonstrate that your new model (or in my case, an experimental design) can work with data. Since simulating data allows us to control exactly how the data is made, we can figure out how well this new model performs on it. For instance, if we simulate the data with a treatment effect of 5, then our model had better estimate something close to 5.
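Here is a minimal sketch of that check. The "model" is just a difference in means standing in for whatever new method is being tested, and the baseline of 50, the SD of 10, and the sample size are made-up values.

```python
import random
from statistics import mean

rng = random.Random(11)
TRUE_EFFECT = 5.0  # the treatment effect we build into the simulated data


def simulate_arm(n, treated):
    """Simulate n continuous outcomes for one arm (baseline 50, SD 10)."""
    return [50 + TRUE_EFFECT * treated + rng.gauss(0, 10) for _ in range(n)]


control = simulate_arm(2000, treated=0)
treatment = simulate_arm(2000, treated=1)

# Stand-in "model": estimate the effect as the difference in arm means.
estimate = mean(treatment) - mean(control)
print(f"true effect = {TRUE_EFFECT}, estimated effect = {estimate:.2f}")
```

If the estimate lands far from 5, the problem is in the method (or the code), because we know exactly what went into the data.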

In my case, I think that modeling multiple outcomes offers a large enough benefit to offset the increased technical cost. I might try to show this by demonstrating that my new approach is more statistically efficient (needs a smaller sample size) or has better operating characteristics.
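Operating characteristics such as type-I error and power are commonly estimated by simulating many trials and counting how often the test rejects. The sketch below does this for a plain one-endpoint z-test with a known SD, not for the multi-outcome design itself. The sample size, SD, and effect are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist, mean


def rejection_rate(rng, effect, n=50, sd=10.0, alpha=0.05, n_trials=2000):
    """Share of simulated two-arm trials in which a two-sided z-test rejects."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    se = sd * math.sqrt(2 / n)  # standard error of the difference in means (SD known)
    rejections = 0
    for _ in range(n_trials):
        control = [rng.gauss(0, sd) for _ in range(n)]
        treated = [rng.gauss(effect, sd) for _ in range(n)]
        z = (mean(treated) - mean(control)) / se
        if abs(z) > critical:
            rejections += 1
    return rejections / n_trials


rng = random.Random(15)
type_one_error = rejection_rate(rng, effect=0.0)  # no true effect: rejections are false positives
power = rejection_rate(rng, effect=5.0)           # true effect of 5: rejections are correct

print(f"estimated type-I error: {type_one_error:.3f}")
print(f"estimated power: {power:.3f}")
```

A new design can be run through the same loop, so that its rejection rates (or the sample size it needs to reach a given power) can be compared against the one-endpoint baseline.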

3. Crafting and refining a new statistical solution

Now that we can generate data, we need to do the actual research and put together a solution (i.e. a statistical model or algorithm) that solves our data problem. This is the hard part. You may need new mathematical theorems to show your method has nice theoretical properties (e.g. a nice asymptotic distribution, consistent estimation).

As you work towards a solution, you can test it on simulated data to see if your approach is actually working. If you work on the solution first, without a way to simulate data, then you’ll have no way to check whether your idea works. When you run into stumbling blocks, you return to the literature to see how other people have approached similar problems, and you try to adapt their solutions to your specific data problem.

In my case, I need a procedure. I need to develop a plan that a non-statistician can follow and use to gather data and run a statistical analysis.

4. A case study

Once you get your new statistical method to work well with the training wheels on (simulated data), it’s good to demonstrate that it can work well on real-world data. Usually this real-world data will have been used and published elsewhere, so you can compare your new model’s results to the old and point out your improvements.

Sometimes, the case study comes first: the real-world data inspires the work and gives you the problem you want to solve.

Unfortunately for me, there are no precedents for N-of-1 trials with multiple primary outcomes, but there are certainly trials with lots of secondary outcomes. Hopefully, I can find one and get their data to test my ideas on.

5. Future work

Journals have page limits. It is likely that you’ll produce more results than you can fit within these limits. You’ll have to decide which of your results are the most impactful and demonstrate your point, which results are okay but can go to the Supplementary Material, and which results will never see the light of day.

But there will also be work that you didn’t do. Maybe you didn’t do the work because you needed to make a simplifying assumption. Maybe you didn’t consider all the possibilities with simulated data. But either way, there will always be work to go around, and you can figure out what will go into the next paper.

Currently, I’m at the stage where I’m trying to figure out how to simulate multivariate, mixed, correlated data. I certainly have my work cut out for me.

Hope this was interesting to you, see you in the next one.

Christian

😵‍💫 What am I working on right now?

Working on a video about power and sample size calculations

🧐 What am I enjoying right now?

Books — Taking a pause from reading because I am sick of listening to audiobooks right now.
Shows — Just learned that Physical 100 is getting a new season, so I’ll be enjoying that

📺 Recent videos

What haunts statisticians at night — a video explaining the problem that plagues statistical analysis: the confounder. Confounders get in the way of figuring out causal effects, and I explain why they’re a problem. Also, it’s my first ever sponsored video!