### 6. Statistics of Sparsely Sampled Curves

**Side:**95-109**DOI:**https://doi.org/10.18261/9788215041193-2020-06**Publisert på Idunn:**2020-12-08**Publisert:**2020-12-08

We develop new statistical methods for analyzing sparsely sampled curves that vary in time. The typical dataset is differences in log gene expressions from case-control pairs for a large number of genes sampled relative to time of diagnosis. We focus on weak signals in the gene expression in many genes instead of strong signals in a few genes. The methods are based on moving windows in time, hypothesis testing, dimension reductions and randomization of the time to observation.

**Keywords:**Sparsely samples curves, case control differences, time to diagnosis, hypothesis testing, randomization

## Introduction

In this chapter, we describe statistical methods for analyzing the data described in the previous paragraphs. The methods may, however, be applied more generally. We only need observations on irregular points in time from one or several strata from many different response variables, and we test whether the response variables are stationary and whether there are differences between the strata. The approaches are based on using a moving window in time and hypothesis testing.

The lack of repeated samples, in addition to irregularly sampled data and the quite small number of case-control pairs (around 400–500 or less), is a challenge, especially in analyses stratified on important clinical parameters. The aim of some of the analyses is therefore to show that there are changes over time or between strata, but without identifying single differentially expressed genes or gene sets. Normally, when identifying such genes or gene sets, we test multiple hypotheses and consequently need to adjust for this, for example by controlling the false discovery rate (FDR). For some of our datasets, we would then find no or very few genes. We therefore need approaches with fewer tests that include many genes in order to avoid the problem of low power, noisy data, and multiple testing. Such approaches can be viewed as an effective method for dimension reduction in studies of functional genomics.

In such approaches, we need to decide how many genes to include
when testing different kinds of hypotheses. Assume the genes have
been ranked so that those that are varying most significantly have
the highest rank. In (Holden 2015a, Holden 2015b), we showed that
if there is a difference in average value of *X*
_{
g,c
} between
the strata for some of the genes, but we do not know which genes,
and the difference is normally distributed, then the statistical
tests are strongest for a small rank. We concluded that if the distribution
has heavier tails than the normal distribution, we should focus
on the few genes with the strongest signal. On the other hand, if
the distribution has a less heavy tail, for example a constant difference
in the average value, then the statistical test is strongest for
a larger rank, often larger than (closer to) the number of genes
with a difference in average value between the strata.

This chapter discusses the analysis of functions *f*
_{
a,g
}(*t*)
for different strata *a*,*b*, ... and
many genes *g* where the function varies with time *t*.
These functions may be denoted as trajectories. The challenge in
our case is that the functions are sparsely sampled, the different
strata are sampled at different points in time, and there is considerable
noise in the sampling. We wish to identify the time dependency in
the data and the differences between the strata. We expect to find
small differences in many genes instead of larger differences in
one or a few genes. This chapter summarizes some of the applied
work.

In order to study how gene expression profiles vary with time
in the years before or after a cancer diagnosis, several datasets
showing gene expression in blood have been collected. Each such
dataset consists of cases diagnosed with cancer and healthy controls.
Each case and control belong to one case-control pair. The case and
control of a case-control pair each gave one blood sample, and they
are matched by time of blood sampling and year of birth. In the
statistical analyses, we use the differences of the log2 gene expression
levels, *X*
_{
g,c
}, for each
case-control pair *c* and gene *g*.
Here, *c* = 1, ... , *M*, where *M* is
the number of case-control pairs, and *g* = 1, ... *N*
_{
g
},
where *N*
_{
g
} is the number
of genes. As the time intervals between blood sampling and cancer
diagnosis vary from case to case, the case-control pairs provide
information on the sparsely sampled curves describing gene expression over
time some years before or after diagnosis.

Let *t*
_{
c
} be the time
interval between blood sampling and cancer diagnosis for case-control
pair *c*, where *t*
_{1} ≤ *t*
_{2} ≤ ... ≤ *t*
_{
M
}.
We assume the log2 gene expressions *X*
_{
g,c
} follow
a smooth function in time *f*
_{
s
}
_{(}
_{
c
}
_{)}
_{
,g
}(*t*
_{
c
}) = *E*{*X*
_{
g,c
}},
where *s*(*c*) is the stratum of case
c. We estimate the function *f*
_{
s
}
_{(}
_{
c
}
_{)}
_{
,g
}(*t*)
by taking an average of the observations *X*
_{
g,c
} from
stratum *s*(*c*) in an interval in
time. The variance of *X*
_{
g,c
} is
estimated from the variance of the observations in the same interval.

## Methods for analyzing complex curves

### Moving window in time

A central part of the statistical methodology is to examine how
gene expression varies with time. By dividing the entire time period
into shorter time periods and computing different kind of statistics
for each such time period, we simplify the problem to examining
how these statistics vary with time. We use overlapping time periods
by using a moving window in time. The statistics computed for a
time period are independent of time. As the distribution of the
gene expressions may vary with time, the lengths of the time periods
should be chosen such that we obtain as short time periods as possible.
However, to obtain as good estimates as possible for each time period,
there should be as many case-control pairs as possible within each
time period. There is a trade-off between these two wishes. To conclude,
we define *M* – *L* + 1 time periods
[*t*
_{1}, *t*
_{
L
}],
[*t*
_{2}, *t*
_{
L
}
_{+1}],
... , [*t*
_{
M
}
_{–}
_{
L
}
_{+1}, *t*
_{
M
}]
where *L* is chosen such that we obtain short time
periods with many case-control pairs. Typically, we let *L* ≈ *M*/4.

Compared to an approach where the time periods are not overlapping, an advantage with the moving window approach is that we are better able to identify the points in time relative to diagnosis where changes in gene expression occur.

### Randomization for estimating null distributions and p-values

In most hypothesis tests, we compute p-values by estimating the null distribution for the statistic of the hypothesis test by randomizing the data, i.e. interchanging covariates (time to diagnosis, case/control, etc.) between the patients. In the randomization we preserve critical properties of the genes (level of expression, complex correlation between genes, etc.) and randomize only what is connected to the changes in time, stratum or case/control status. This randomization defines the null distribution for the test statistic that is used when finding the p-value.

We can randomize the data either by randomizing the case and control in each case-control pair, by randomizing the case-control pairs between the periods, or by randomizing between the two strata within the time period. Note that all three randomization strategies maintain the correlation structure between the genes for each case-control pair. Also note that each randomization of the data leads to a different ordering of the genes if the genes are ordered according to the statistic of the hypothesis test.

The p-value of the test is set to
,
where *N*
_{
S
} is the total
number of randomizations and K is the number of randomizations out
of *N*
_{
S
} with a more extreme statistic
than the statistic for the real data. When we test one hypothesis
for each gene, we take multiple testing into account by using the
Benjamin-Hochberg procedure for controlling the false discovery
rate, FDR (Reiner et al. 2003).

### Extracting information from a time period

When extracting information for a time period, we use the gene expression data in that time period and ignore information about time when we compute different kinds of statistics, identify differentially expressed genes etc. We have used the following statistics for a time period:

*Ordered standard deviation, mean and weight for a time
period*

Let *m*
_{
p,g
} be the
sample mean and *s*
_{
p,g
} be
the sample standard deviations for the differences in log_{2} gene
expressions for gene *g* in time period p. The variable *m*
_{
p,g
} is an
approximation to *f*
_{
s
}
_{(}
_{
c
}
_{)}
_{
,g
}(*t*
_{
c
}) = *E*{*X*
_{
g,c
}}
and *s*
_{
p,g
} an approximation
to the standard deviation of *X*
_{
gc
} in
the interval. Let *m*
_{
p,g,a
}(*m*
_{
p,g,b
})
be the sample mean and *s*
_{
p,g,a
}
_{ }(*s*
_{
p,g,b
})
be the sample standard deviations for the differences in log_{2} gene
expression for gene *g* in time period *p* for
stratum *a* (*b*). We define the statistics *s*
_{
p
}
_{,(}
_{
g
}
_{)}, *m*
_{
p
}
_{,(}
_{
g
}
_{)},
and *w*
_{
p
}
_{,(}
_{
g
}
_{)} from
these sample means and standard deviations as follows:

*s*_{ p }_{,(}_{ g }_{)}=*s*_{ p,g }_{ʹ}, where*s*_{ p,g }_{ʹ}has rank*g*when the*s*_{ p,g }’s for period*p*are sorted in increasing order. Rank 1 corresponds to the smallest of the*s*_{ p,g }’s for period*p*.*m*_{ p }_{,(}_{ g }_{)}= |*m*_{ p }_{,}_{ g }_{ʹ}|, where |*m*_{ p }_{,}_{ g }_{ʹ}| has rank*g*when the |*m*_{ p }_{,}_{ g }|’s for period p are sorted in decreasing order. Rank 1 corresponds to the largest of the |*m*_{ p }_{,}_{ g }|’s for period p.Let be the weight for gene

*g*in time period*p*, i.e. a measure of the difference between the two strata.*w*_{ p }_{,(}_{ g }_{)}= |*w*_{ p }_{,}_{ g }| where |*w*_{ p }_{,}_{ g }_{ʹ}| has rank*g*when the |*w*_{ p }_{,}_{ g }|’s for period p are sorted in decreasing order. Rank 1 corresponds to the largest of the |*w*_{ p }_{,}_{ g }|’s for period*p*.

In some approaches, we use the sample mean *m*
_{
p
}
_{,}
_{
g
} directly,
without ranking.

### The number of differentially expressed genes for a time period

We identify the significantly differentially expressed genes
in the time period using the Bioconductor R-package limma, linear
models for microarrays (Ritchie et al. 2015), where the response
is *X*
_{
g,c
}, i.e. the difference
in log2 gene expression between the case and the control of a case-control
pair.

In the next section, we describe how these statistics can be used for examining how gene expression varies over time, between strata, and between cases and controls

### Using information extracted for time periods

*Finding signal in the data*

The objective is to be able to identify small changes that vary slowly in time and/or between strata, by using a large number of genes in each hypothesis test and predictor.

For examining whether there are differences between cases and controls, between strata or in time, we test following hypotheses using the three statistics’ standard deviation, mean and weight defined in the previous section:

**H0-case-ctrl:** The expectation of *X*
_{
g,c
} is
zero. This means that there is no difference between the expectations
of the log_{2 }gene expression values for the cases and
controls. It implies that *f*
_{
s
}
_{(}
_{
c
}
_{)}
_{
,g
}(*t*
_{
c
}) = 0.
If the null hypothesis is false, the expectation will be different
from zero for some periods and genes. We test the hypothesis by
using the statistic *m*
_{
p
}
_{,(}
_{
g
}
_{)}.

**H0-time:** The distribution of *X*
_{
g,c
} is
not associated with the time to diagnosis. This means that the expectation
and standard deviation of *X*
_{
g,c
} are
the same in all time periods. It implies that *f*
_{
s
}
_{(}
_{
c
}
_{)}
_{
,g
}(*t*
_{
c
})
is constant in time. If the null hypothesis is false, the standard
deviation for some periods will be lower than the standard deviations for
the entire time period for some genes. Also, the absolute value
of the expectation for some periods will be higher than the absolute
value of the expectation for the entire time period for some genes.
We test the hypothesis first by using the statistic *s*
_{
p
}
_{,(}
_{
g
}
_{)},
and then by using the statistic *m*
_{
p
}
_{,(}
_{
g
}
_{)}.

**H0-node:** The expectation of *X*
_{
g,c
} is
not associated with stratum. This means that the expectations for
the two strata are equal for all genes *g* and time
to diagnosis *t*. It implies that *f*
_{
s
}
_{(}
_{
c
}
_{)}
_{
,g
}(*t*
_{
c
})
is the same for all strata. If the null hypothesis is false, the difference
in expectation will be different from zero for some periods and
genes. We test the hypothesis by using the statistic *w*
_{
p
}
_{,(}
_{
g
}
_{)}.

The null distribution of each statistic will be estimated by randomizing the data, and we compute p-values by comparing the statistic for the data to the estimated null distribution.

We will reject the H0-case-ctrl hypothesis if the hypothesis *m*
_{
p
}
_{,(}
_{
g
}
_{)}>0
is rejected for at least one time period *p* and
rank *g*, where *g* belongs to a subset
of the *N*
_{
g
} ranks. In
practice, we have chosen to let the subset of ranks consist of ranks
between approximately 1% and 25% of the number of genes, so that
the subset contains both relatively low and high ranks. This means
that H0-case-ctrl is rejected based on a very large number of hypotheses,
that are also highly positively correlated, and we therefore needed
to adjust for multiple testing. The approaches for rejecting the
H0-time and H0-node hypotheses are similar. Besides rejecting the three
null hypotheses, the hypothesis tests for the statistics for each
time period and rank can be used for illustrating how the p-values
are associated with the time to diagnosis.

### Changes in the number of differentially expressed genes over time

In this approach, we examine changes in gene expression over time by examining how the number of differentially expressed genes between cases and controls varies with time. The time curve in this case consists of the number of differentially expressed genes in each time period. Such time curves give an indication of when there is a large difference between cases and controls before or after diagnosis. For comparing different strata, we can compare the time curves of the strata.

When testing whether the number of differentially expressed genes
are different for two strata, we use *n*
_{
s
},
the number of genes that are differentially expressed between cases
and controls in at least one time period, as test statistic for
stratum *s*.

When comparing stratum *a* and stratum *b*,
we cannot directly compare *n*
_{
a
} and *n*
_{
b
} if *M*
_{
a
},
the number of case-control pairs in stratum *a*,
is much larger than *M*
_{
b
},
the number of case-control pairs in stratum *b*.
To use the same number of case-controls pairs when comparing the
strata, we test the following hypothesis.

**H0-strata1:** The number of differentially expressed
genes between cases and controls is different for stratum *a* and
b. Assume *M*
_{
b
} < *M*
_{
a
}.
We want to estimate the null distribution for *n*
_{
a
} when
the sample size of stratum *a* is *M*
_{
b
},
and then compare this distribution to *n*
_{
b
}.
The null distribution is found using simulation by repeatedly sample *M*
_{
b
} case-control
pairs from stratum *a*, and then compute the number
of differentially expressed genes for each sampled dataset. The
p-value of the hypothesis test is computed by comparing the samples
of the null distribution to *n*
_{
b
}.

### Identifying significant genes based on area between curves

In this section we also estimate the function *f*
_{
s
}
_{(}
_{
c
}
_{)}
_{
,g
}(*t*
_{
c
}) = *E*{*X*
_{
g,c
}}
by taking the mean of observations of *X*
_{
g,c
} in
an interval of time. However, here we analyze the properties of
the function in the entire time period at the same time instead
of for each interval in time separately. Then we are able to analyze
all the data from a gene instead of focusing on properties in an
interval. This may be used to identify genes that are significantly
different between strata and predict strata from the gene observations.

We estimate the area between two curves *f*
_{
a
}
_{,}
_{
g
}(*t*)
and *f*
_{
b
}
_{,}
_{
g
}(*t*)
by the test statistics *V*
_{
g
} = |*f*
_{
a
}
_{,}
_{
g
} – *f*
_{
b
}
_{,}
_{
g
}| = ∫|*f*
_{
a
}
_{,}
_{
g
}(*t*) – *f*
_{
b
}
_{,}
_{
g
}(*t*)|dt
where the two strata are denoted *a* and *b*, respectively.
This is equal to the weighted sum of the absolute value of the differences
in average gene expression between the two strata in each time interval, where
the weight depends on the length of the time interval. The area
is large if there is a large difference between the curves and we
neglect whether this is due to different average value or one is
increasing and the other is decreasing in time.

We test the following hypothesis:

**H0-strata2:** The functions *f*
_{
a
}
_{,}
_{
g
}(*t*)
and *f*
_{
b
}
_{,}
_{
g
}(*t*)
are equal. For each gene g, we compare the observed *V*
_{
g
} with
the same variable from a simulated distribution where we resample
the variables *X*
_{
g,c
} for
all the genes simultaneously by randomizing the stratum *s*(*c*)
for the case-control pairs. We maintain the observations for each
gene and the number of observations from each stratum. We have made *N*
_{
g
} simultaneous
tests and need to use the methods for adjusting for multiple testing.

### Prediction of stratum based on local statistics

The weights *w*
_{
p
}
_{,(}
_{
g
}
_{)} can
be estimated for each rank *g* from data in period
p for a training dataset. The stratum of the case of a new case-control
pair, i.e. a case-control pair that does not belong to the training
set, can then be predicted based on the score

where *x*
_{(g)} is the difference in log_{2 }gene
expressions of the new case-control pair and *δ*
_{
p
}
_{,(}
_{
g
}
_{)} is
1 if the weight *w*
_{
p
}
_{,}
_{
g
}
_{ʹ} is
positive, and –1 otherwise, where |*w*
_{
p
}
_{,}
_{
g
}
_{ʹ}| = *w*
_{
p
}
_{,(}
_{
g
}
_{)}. The
n genes with highest absolute value of the weights are used for
computing the score, where n is a number less than or equal to the
number of genes, *N*
_{
g
}.
Large values of *z* indicate that the new case belongs
to stratum *a*. If *z* > *c*,
for some arbitrary threshold *c* we conclude that
the new case belongs to stratum *a*, otherwise we conclude
that the new case belongs to stratum *b*. We may
set *c* = 0 if it is not more important to avoid
false classification in one stratum relative to the other and if

where *m*
_{
p
}
_{,(}
_{
g
}
_{),}
_{
a
} and *m*
_{
p
}
_{,(}
_{
g
}
_{),}
_{
b
} are
the sample means that are used when computing *w*
_{
p
}
_{,(}
_{
g
}
_{)}.
Increasing (decreasing) *c* results in fewer false
positives (negatives) at the cost of more false negatives (positives).

The available datasets are too small to be divided into a training
and validation set. When predicting the stratum of the cases in
a dataset, we should therefore use a leave-one-out or *k*-fold
cross validation approach. When using the leave-one-out approach,
we predict the stratum of case *j* using weights *w*
_{
p
}
_{,(}
_{
g
}
_{)} that
have been estimated using the dataset where case-control pair *j* has
been excluded. The *k*-fold cross validation approach
is similar, except that we divide the dataset into *k* folds and
predict the stratum of the cases in fold *f* using
weights *w*
_{
p
}
_{,(}
_{
g
}
_{)} that
have been estimated using the dataset where the case-control pairs
in fold *f* have been excluded.

## Examples of use of the methods

In this section, we give examples from papers where the methods described above have been used.

### Finding signal in the data

In a previous methodological study, time was categorized in three non-overlapping periods (Lund et al. 2016); see Chapter 8. The aim in that study was to show that there is signal in the data, but without showing where in time the changes in gene expression occurred or which genes were involved. The main idea used in the paper is that genes can be grouped into curve groups, each curve group corresponding to genes with a similar development over time (Figure 6.1). Based on these curve groups, we tested a set of hypotheses that determined whether there is development in gene expression levels over time, and whether this development varies among different strata. For a breast cancer dataset in the Norwegian Women and Cancer (NOWAC) postgenome cohort, the curve group analysis revealed that development of gene expression levels varied in the last years before breast cancer diagnosis, and that this development differed by lymph node status and participation in the Norwegian Breast Cancer Screening Program. The effect of the participation may be due to different treatment for the participating women representing the majority of the population.

Partly to be able to better identify the points in time relative to cancer diagnosis where changes in gene expression occur, we developed methods based on a moving window in time. This approach includes time in a more continuous manner than the approach based on curve groups. In Holden et al. 2017 we used moving windows and randomization for the same dataset as we used in Lund et al. 2016, Chapter 8. The null hypotheses of no differences between cases and controls, no time-dependent changes, and no differences between different strata were all rejected. The main conclusion of the analyses was that there are time-dependent changes of the blood transcriptome up to eight years before breast cancer diagnosis.

### Changes in the number of differentially expressed genes over time

In Chapter 9 we examined the changes in gene expression after
diagnosis for a breast cancer dataset in the Norwegian Women and
Cancer (NOWAC) postgenome cohort. We stratified stage in invasive
or metastatic breast cancer, and vital status in dead or alive at
the end of follow-up, and observed a significant increase of differentially
expressed genes among women with metastatic disease who later died
both compared to invasive cases that survived (*p* = 0.001)
and to metastatic cases that survived (*p* = 0.024).
To illustrate the difference between strata, we made heatmaps for
the most differentially expressed genes over time; see Figure 6.2 below.

We also observed a second transient increase in blood gene expression a few years after diagnosis in metastatic cases, hypothetically representing a capitulation of the immune system.

### Identifying significant genes based on area between curves

The analysis of complex curves was extended in order to identify genes that are significantly different between two strata The method tests differences in a non-parametric time development relative to time of diagnosis of the gene expressions from different strata using the area between the curves in a long time period.

The method was tested on case-control differences in log2 gene
expressions in a post diagnostic time period separating between
the women who survived and the women who died of breast cancer.
The method clearly showed non-linear changes, with rapid transient
mostly increasing fold changes, in cases who later died. Survivors
had no changes. For cases that died, this transient increase was
followed by a regression towards the gene expression profiles of
survivors. For 9786 genes, the integrated area from 18 months to
8 years was highly significant (*p* < 0.00001) among
women who died. There were indications of a stronger relationship
in metastatic cases alone.

Figure 6.3 below shows the *f*
_{
s
}
_{(}
_{
c
}
_{)}
_{
,g
}(*t*
_{
c
})
curves for the 100 most significant genes in the period after diagnosis
for the women that survived and died of breast cancer.

This systems epidemiology approach provides a proof of concept for the use of gene expression as an individualized biomarker of prognosis related to death or not. Since we had no prior knowledge of the shape of differences in gene expressions as a function of time relative to diagnosis, we needed a non-parametric model that identified all possible changes in trajectories. The aim of the study was to explore single gene expression trajectories from immune cells in blood over the first years after diagnosis as predictors of later vital status, dead or alive.

### Prediction of stratum based on local statistics

In Holden et al. 2017, we described *Prediction of stratum
based on local statistics*, to illustrate how the predictive
power of the test varies with time. In Figure 6.4 below, we observe
that for screening detected cancers the probability of correct prediction
of metastasis status was best in year 1 before diagnosis compared
to year 3 and 4 before diagnosis for clinically detected cancers.

### Identifying significant genes with only one time period

The mortality of breast cancer is strongly associated with parity, i.e. the number of children to whom a woman has given birth (Lund 1990). In Lund 2018, we used one of the breast cancer datasets in the Norwegian Women and Cancer (NOWAC) postgenome cohort mentioned in the previous sections to examine whether there is an association between gene expression and parity. In that study we used only one time period, which means that we ignore information about time and as a consequence will mainly be able to identify genes with expressions that do not vary with time in the years before/after diagnosis. As there is a large body of evidence demonstrating the long-lasting protective effect of each full-term pregnancy (FTP) on the development of breast cancer (BC) later in life, this is reasonable.

We used the Bioconductor R-package Limma, linear models for microarrays (Ritchie
et al. 2015), to identify the genes that were influenced by parity.
In the linear model, the responses were *X*
_{
g,c
},
the differences in the log2 gene expression for each case-control
pair, while we included the parity of the control and the parity
of the case as covariates. In the analyses, we merged parities 1–3
and 4–6 so that the parity data consisted of three different values:
0, 1–3, and 4–6. The merging was done in order to reduce the effect
of the highest parities. We identified gene sets that were influenced
by parity using Limma in the same way as we did for individual genes,
by using enrichment scores for gene sets instead of differences
in the log2 gene expressions as responses in the linear model. The
enrichment scores for gene sets were obtained from the *X*
_{
g,c
}s
using the Bioconductor R-package gene set variation analysis, GSVA
(Hänzelmann et al. 2013).

We found that 756 genes showed linear trends in cancer-free controls, false discovery rate (FDR) 5%, but this was not the case for any of the genes in the breast cancer cases. Gene Set Enrichment Analysis, GSEA of immunologic gene sets, C7 collection in Molecular Signatures Database, MSigDB (GSEA MSigDB, Subramanian et al. 2005) revealed 215 significantly enriched human gene sets (FDR 5%). These marked differences in gene expression and enrichment profiles of immunologic gene sets between breast cancer cases and healthy controls suggest an important protective effect of the immune system on breast cancer risk.