What is functional data analysis?

A small look at a cool branch of statistics

Oct 21, 2023

This blog post is a lightly edited version of a video I posted on YouTube. If you’d like to watch instead, go for it! Otherwise, feel free to continue reading.

Technology is evolving and improving at an incredible pace in our modern times. One year ago, I was a caveman who needed to search for programming advice on Google, but now, I can just ask a tiny little robot in my browser.

But it’s not just with AI, it’s with data as well. Innovations in brain imaging have given us ways to visualize the brain in incredible detail. Our smartphones and wearable devices have given us the ability to track our activity and sleep at the minute level. We're dealing with entire images of the brain or physical activity recorded over many days, which produce extremely large datasets. In response to this, statisticians have started to devise ways for us to analyze these new complex forms of data.

In this video, we'll talk about one of the branches of statistics that deals with this new breed of data: functional data analysis.

Functional data analysis

Functional data analysis, or FDA, is a term first coined by James Ramsay, who wrote one of the foundational textbooks on this topic.

For reference, let's look at the simple linear regression model:

\(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)

This model tells us how changes in the independent variable X are associated with changes in the outcome Y. In this case, both X and Y are numbers, which we'll call scalar values.
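As a point of comparison for what comes next, here's a minimal sketch of fitting this scalar model in Python with NumPy. The data are simulated, and the "true" values of the intercept and slope (2 and 0.5) are numbers I made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 100
x = rng.uniform(0.0, 10.0, size=n)                 # scalar covariate: one number per subject
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)  # made-up beta_0 = 2, beta_1 = 0.5

# Ordinary least squares with an intercept column plus the covariate.
design = np.column_stack([np.ones(n), x])
(beta0_hat, beta1_hat), *_ = np.linalg.lstsq(design, y, rcond=None)
print(f"beta_0: {beta0_hat:.2f}, beta_1: {beta1_hat:.2f}")
```

Each subject contributes exactly one number for X and one number for Y here, which is the part that changes once functions enter the picture.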

In FDA, one or both of these scalars are replaced with a function. Depending on which one is the function, we get three different flavors of functional regression.

When the covariate is a function and the outcome is a scalar, we call this scalar-on-function regression. When this relationship is flipped, we get function-on-scalar regression. And finally, when they're both functions, we get function-on-function regression. If you've never heard of it, the idea of having a functional covariate or outcome might feel a little crazy, so let's take a closer look at each model through an example from the literature.

Scalar-on-function regression

In linear regression, the coefficient represents the change in the outcome for a unit change in the covariate. In scalar-on-function regression, we're interested in estimating a coefficient function instead.

\(Y_i = \beta_0 + \int \beta(t) X_i(t) \, dt+ \epsilon_i\)

For a given t, this product indicates a small contribution to the change in the outcome. Then, this integral indicates that we need to sum over all these little contributions over the values of t. So overall, this coefficient function will describe which regions of the covariate function contribute to reducing the outcome and which contribute to increasing it.
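To make the integral less abstract, here's a minimal sketch in Python with NumPy, using simulated data rather than anything from a real study. The grid, the small Fourier basis, the noise level, and the "true" coefficient function are all assumptions I made up for illustration. The idea is that once the curves are observed on a grid, the integral turns into a weighted sum, and writing β(t) as a combination of a few basis functions turns the problem back into ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 200, 101                       # number of curves and grid points (made up)
t = np.linspace(0.0, 1.0, m)          # shared observation grid on [0, 1]
w = np.full(m, t[1] - t[0])           # trapezoid quadrature weights
w[[0, -1]] /= 2

# Small Fourier basis, used both to simulate curves and to represent beta(t).
basis = np.vstack([
    np.ones(m),
    np.sin(2 * np.pi * t), np.cos(2 * np.pi * t),
    np.sin(4 * np.pi * t), np.cos(4 * np.pi * t),
])                                    # shape (K, m)

X = rng.normal(size=(n, basis.shape[0])) @ basis   # simulated X_i(t), shape (n, m)
beta_true = np.sin(2 * np.pi * t)     # made-up coefficient function
beta0_true = 1.0

# Integral of beta(t) * X_i(t) dt as a weighted sum of pointwise products.
integral = (X * beta_true) @ w
y = beta0_true + integral + rng.normal(scale=0.05, size=n)

# Write beta(t) = sum_k c_k * basis_k(t); the integral then becomes Z @ c.
Z = (X * w) @ basis.T                 # Z[i, k] = integral of X_i(t) * basis_k(t) dt
design = np.column_stack([np.ones(n), Z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

beta0_hat = coef[0]
beta_hat = coef[1:] @ basis           # estimated coefficient function on the grid
max_err = np.max(np.abs(beta_hat - beta_true))
print(f"intercept: {beta0_hat:.2f}, max |beta_hat - beta_true|: {max_err:.3f}")
```

Where `beta_hat` is positive, larger values of the curve in that region push the outcome up, and where it is negative they pull it down. Real analyses typically use richer bases such as splines together with a smoothness penalty, which this sketch leaves out to keep the idea visible.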

Let's look at a real example from the literature. In this paper, the functional covariate was derived from functional magnetic resonance imaging, or fMRI, data. fMRI was used to measure blood-oxygen-level-dependent activity, abbreviated as BOLD, in a particular region of the brain. A hot or warm stimulus was applied to a subject's arm, and this brain signal was tracked over time. The outcome in this case is pain intensity, so the researchers were interested in seeing how changes in this brain signal over time could be used to predict pain intensity.

This function here represents the estimated coefficient function, along with pointwise confidence intervals. It suggests that brain activity in the period after the stimulus is associated with higher pain intensity.

Function-on-scalar regression

Function-on-scalar regression takes this form, and we can see that the form of the model changes to fit the new form of the outcome.

\(Y_i(t) = \beta_0(t) + \beta(t) X_i + \epsilon_i(t)\)

Here, both the intercept and the error are now functions themselves. Like in scalar-on-function regression, we're interested in estimating and examining this coefficient function.

We can think of this intercept function as a kind of baseline, since it's what's left when the covariate equals zero. When the covariate changes value, the coefficient function represents the resulting change to that baseline function. Here's an example from the literature.

In this study, the functional outcome is physical activity in a given day. There are actually several predictors that are used in the study, including season, TV use, having an American mother, having asthma, and gender. These clouds represent the actual functional data, while these bold lines represent the stratified averages by each variable here.

In the raw data, you can see that cold seasons produce an average activity profile with slightly less activity from noon to 6pm. When we look at the associated coefficient function with colder seasons, we can see a drop around that time and that the magnitude of this drop roughly matches what we see in the data.
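To see why the coefficient function lines up with the difference between the stratified averages, here's a minimal sketch in Python with NumPy. The data are simulated and are not the data from the study; the binary covariate (think of it as a hypothetical "cold season" indicator), the baseline profile, and the size of the afternoon dip are all assumptions I made up. It fits a separate least-squares regression at every time point, which is the simplest possible way to estimate the intercept and coefficient functions.

```python
import numpy as np

rng = np.random.default_rng(1)

n, m = 300, 48                         # curves and time points per day (made up)
t = np.linspace(0.0, 24.0, m)          # hour of the day

# Hypothetical binary covariate, e.g. 1 = cold season, 0 = warm season.
x = rng.integers(0, 2, size=n).astype(float)

beta0_true = 5.0 + 3.0 * np.exp(-0.5 * ((t - 14.0) / 4.0) ** 2)   # baseline activity profile
beta_true = -1.0 * np.exp(-0.5 * ((t - 15.0) / 2.0) ** 2)         # afternoon dip when x = 1

noise = rng.normal(scale=0.5, size=(n, m))
Y = beta0_true + np.outer(x, beta_true) + noise    # Y[i, j] = Y_i(t_j)

# Pointwise least squares: one regression per time point, solved in one call.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)  # shape (2, m)
beta0_hat, beta_hat = coef

# With a single binary covariate, beta_hat(t) is the difference in group mean curves.
diff_means = Y[x == 1].mean(axis=0) - Y[x == 0].mean(axis=0)
print(np.allclose(beta_hat, diff_means))
print(f"largest estimated drop occurs around hour {t[np.argmin(beta_hat)]:.1f}")
```

Fitting each time point separately ignores the fact that neighboring time points should have similar coefficients. Methods used in practice usually smooth the coefficient function, for example with splines and a penalty, so treat this as an illustration of the model and not of any particular estimation method.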

Function-on-function regression

Finally, we get to function-on-function regression:

\(Y_i(t) = \beta_0(t) + \int_S \beta(s,t) X_i(s) \, ds + \epsilon_i(t)\)

Like with function-on-scalar regression, both the intercept and error are functions. But now, the coefficient function becomes a kind of coefficient surface. I'll be honest and say that I don't really know how to interpret this surface. If I were to give a guess, I would read it one slice at a time: for a fixed t, \(\beta(s,t)\) weights how much the covariate at each point s contributes to the outcome at t, and the integral sums those contributions over the s domain.
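One way to get a feel for the surface is to discretize it. Here's a minimal sketch in Python with NumPy, again with simulated data; the grids, the three-function Fourier basis in s, and the "true" surface are assumptions I made up. On a grid, the surface is just a matrix with one row per value of s and one column per value of t, and the integral over s becomes a weighted matrix product.

```python
import numpy as np

rng = np.random.default_rng(2)

n, ms, mt = 300, 101, 40               # curves, grid sizes in s and t (made up)
s = np.linspace(0.0, 1.0, ms)
t = np.linspace(0.0, 1.0, mt)
w = np.full(ms, s[1] - s[0])           # trapezoid weights for the integral over s
w[[0, -1]] /= 2

# Small Fourier basis in s, used to simulate X_i(s) and to represent beta(., t).
basis = np.vstack([np.ones(ms), np.sin(2 * np.pi * s), np.cos(2 * np.pi * s)])

X = rng.normal(size=(n, basis.shape[0])) @ basis   # X[i, a] = X_i(s_a)
beta0_true = np.cos(np.pi * t)                     # made-up intercept function
beta_true = np.outer(np.sin(2 * np.pi * s), t)     # made-up surface, shape (ms, mt)

# For each t, integrate beta(s, t) * X_i(s) over s: a weighted matrix product.
Y = beta0_true + (X * w) @ beta_true + rng.normal(scale=0.05, size=(n, mt))

# Expand beta(s, t) in the s-basis, then fit one least-squares problem per t.
Z = (X * w) @ basis.T
design = np.column_stack([np.ones(n), Z])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)  # shape (1 + K, mt)

beta0_hat = coef[0]
beta_hat = basis.T @ coef[1:]                      # estimated surface, shape (ms, mt)
print(beta_hat.shape, np.max(np.abs(beta_hat - beta_true)))
```

Each column `beta_hat[:, j]` is a coefficient function in s for the outcome at time `t[j]`, so it can be read the same way as the coefficient function in scalar-on-function regression.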

I tried to find actual use cases published in the literature, but I could only find case studies in statistical journals. In an exploratory analysis in this paper, the authors look at functional magnetic resonance imaging data, or fMRI data. I'm not well-versed in this field, so just a heads up. If you know about this technology a bit more, let me know in the comments! The functional covariate in this case is a rough measure of water diffusion in the brain along tracts in two parts of the brain, which we'll call area A and area B.

This diffusion is used as a measure of the demyelination in the brain, which is used as a proxy for how well the brain's connections are working.

The functional outcome is actually the same as the functional covariate, but it was measured in a third area of the brain, which I'll call area C. In this case, the coefficient surface represents a kind of spatial association between two regions of the brain; for example, between area A and area C.

Here are the estimated coefficient surfaces found by the researchers.

These regions with dark red and blue are regions that are significantly associated with changes to the baseline function.

Functional data analysis is an exciting branch of statistics, and we're only just starting to see it being used in contexts outside of statistics papers. As of the making of this video, I'm helping out with a functional data analysis that looks at the effect of an exercise intervention on physical activity in elderly people. It's really exciting stuff, and I hope I'll be able to share the results with you all in the future.

Until then, like this video and subscribe to the channel or newsletter if you'd like to see more topics like this. I love showcasing these bleeding-edge branches of statistics, even if I don't have the most solid grasp of them. See you in the next one.