9. Signals of Death—Post- Diagnostic Single Gene Expression Trajectories in Breast Cancer—A Proof of Concept

Using the time-dependent dynamics of gene expression from immune cells in blood, we aimed to explore single gene expression trajectories as biomarkers for death after a diagnosis of breast cancer introducing a new statistical method denoted Difference in Time Development Statistics (DTDS). This shows as proof of principle that the gene expression profiles from immune cells in blood differed in the postdiagnostic period are dependent on later vital status.

The gene expression analyses of 394 breast cancer cases and age-matched controls were obtained from the Norwegian Women and Cancer (NOWAC) postgenome biobank (N = 50 000) performed in blood taken 0-8 years after a breast cancer diagnosis. The tube contained a protective buffer that preserved the mRNA in the blood. Cancer diagnosis and cause of death were based on linkage with the Norwegian Cancer Registry. The new statistical method was designed to test the difference in the time development between two strata using a non-parametric representation of the time development of the gene expression and used the area between the curves, i.e. the integral between the cures, as test statistics.
The time-dependent curves or trajectories exerted clearly non-linear changes with rapid transient mostly increasing fold changes, in cases who later died. Survivors had no changes. For cases who died this transient increase was followed by a

INTRODUCTION
In 2017, the number of cancer deaths in Norway exceeded that of cardiovascular deaths for the first time (Norwegian Institute of Public Health, Norway, 2018). While the number of cancer deaths has remained fairly stable over recent years, the number of cardiovascular deaths has decreased rapidly. This points to the urgent need for further improvements in cancer treatment for an ageing population. For women in Norway, breast cancer is the most common invasive cancer, constituting 23% of all cancers diagnosed among women in 2017 (Cancer Registry of Norway, 2018). Although significantly improved, the majority of breast cancer deaths are due to metastasis, not the tumor. One hundred years ago the survival for women with metastatic cancer was only 5% after five years, while today the ten-year survival rate of metastatic breast cancer is 85% (Reddy et al., 2018). In order to further improve cancer diagnosis, personalized treatment is moving forward (Jeibouei et al., 2019). Individualized treatment should be based on predictors for individual outcome. The potential of immune response has become evident through the recent use of immune therapy (Stroncek et al., 2017). Biomarkers in blood or liquid biopsies could be functional genomics i.e. transcriptomics or methylation, or metabolites or proteins.
We proposed the compilation of time trajectories of gene expression in blood from many independent case-control pairs as a potential liquid biopsy in order to study the impact of the immune system on carcinogenesis (Lund et al., 2016). A gene's trajectory corresponds to a curve that represents the changes in gene expression as a function of time, consisting of differences of gene expression between many case-control pairs. Healthy controls establish the level of expression for genes not involved in carcinogenesis, and is assumed to be constant over time. Genes related to the immune system and/or carcinogenesis (expressed in cases) may change over time. Lack of a priori knowledge of the shape of the trajectories demands an agnostic approach (Spitz & Bondy, 2010) including adjustment for multiple testing (Reiner, Yekutieli, & Benjamini, 2003). Gene expression is analyzed as a potential biomarker of carcinogenesis/metastasis, and the statistical quantity of interest is the distribution of the gene expression as a function of time after diagnosis.
In a recent study of gene expression profiles in the years after diagnosis stratified on clinical stages significant differences in the overall gene expression profiles were found (Lund submitted PLOS).
The aim of this study is to explore single gene expression trajectories from immune cells in blood over the first years after diagnosis as predictors of later vital status, dead or alive. In order to use the cumulated evidence over time for clinical follow-up a new statistical method, denoted Difference in Time Development Statistics, was developed; see below.

METHODS
This new statistical method, denoted Difference in Time Development Statistics (DTDS), is designed to test differences in time development in a non-parametric manner of two variables or the same variable for two different strata. In this paper, the method is used in order to identify genes where the gene expressions in blood samples have a different time development after diagnosis of breast cancer. The dataset consists of case-control pairs in which the case is diagnosed with the disease and the control is healthy. The data is the difference in log2 gene expression in blood samples between the case and the control. The gene expression profiles that are measured represent an aggregate of the transcriptional activity of all the blood cells at the time of blood collection. The DTDS method will be used on the postdiagnostic or clinical follow-up in the NOWAC postgenome cohort, where each blood sample, regardless of disease status, was collected at a random followup time. We will first describe the epidemiological design necessary for studies of the postdiagnostic trajectories, and then describe the statistical concepts.

MATERIAL
The overall NOWAC postgenome biobank Recruitment for the prospective Norwegian Women and Cancer (NOWAC) study started in 1991 . Women were randomly sampled from the Norwegian population register in Statistics Norway. The women were mailed a letter of invitation and a questionnaire. Follow-up was based on linkage to the Cancer Registry of Norway and the register of deaths were based on the unique national birth number given to all Norwegian inhabitants. Repeat questionnaires were mailed with intervals of some years. In the years 2002-2006, women were invited to participate in a subcohort, the NOWAC Postgenome cohort study; for further details see Dumeaux et al., 2008. The main purpose was to establish a biobank suitable for analyses of functional genomics, in particular transcriptomics. Random samples of NOWAC women were drawn in weekly batches of 500 women until 50 000 women had responded positively. Women were invited to fill in another questionnaire and donate a blood sample at a health-care institution such as a GP's office. The blood samples were sent overnight to the institute by special post for biological samples. The tube contained a protective buffer that preserved the mRNA in the blood (PAX gene blood RNA system), allowing frozen storage over time and optimizing sensitivity of the analysis.
The present analysis used a subsample of the NOWAC postgenome biobank participants. Women who had both filled in a questionnaire in 1996-1998 and had given a Postgenome blood sample were eligible, a total of 31 101 women. Since collection of blood was at random without knowledge of disease status, the procedure gave a uniform distribution of gene expression measurement over time.
In 2010, breast cancer cases diagnosed between 1996 and 2006 were identified through a linkage to the national cancer registry. An age-matched control was drawn at random from the same batch of 500 women. A total of 394 incident breast cancer cases were identified. Those rendered non-eligible were six technical outliers, seven cases with unknown metastases, seven cases with another incidence of breast cancer before blood collection, ten controls diagnosed with cancer before blood donation, and one who emigrated, leaving 363 case-control pairs for the present analyses.
A linkage to the register of vital status in Norway gave a complete follow-up after blood donation until the end of the study on 31.12.2014, or death or emigration. Causes of death according to different strata of metastatic/invasive cancer at time of diagnoses are given below.
In order to update changes in clinical stage or a second breast cancer and to remove controls with an incidence of cancer, another linkage was performed in 2018 with the Cancer Registry of Norway. For six women with metastases and ten cases with another incidence of breast cancer, the updated information was used to change the start of follow-up.
Of the 363 case-control pairs, 85 were omitted since the follow-up time for the cases that are observed before 18 months from diagnosis are heavily influenced by the treatment. We therefore first analyze a dataset of 39 cases who died from cancers and compare them with 239 cases who did not die of cancer, i.e. a total dataset of 278 case-controls. Later, we reduce this to a dataset consisting of 23 cases with metastatic breast cancer who died of breast cancer and 79 cases with metastatic breast cancer who did not die of cancer; see Table 9.1.

STATISTICAL METHODS
The dataset consists of two strata of women with breast cancer in which the cases died or did not die of cancer and the observation time is the time after the last diagnosis. For each gene and stratum, we estimate the differences between cases and controls in gene expression as a smooth function using a moving window in time. We then estimate the differences in the time development between the two strata by calculating the area between the two estimated curves for the smoothed gene expression for the two strata. If there is a systematic difference in the level or the time development of the gene expression between the two strata, this area is large. We will test three hypotheses. The first hypothesis, H0A, concerns individual gene trajectories, while H0B looks at all genes together. We also predict the vital stage, dead or alive, of each case using cross-validation. H0C states that this prediction is independent of vital stage.

H0A: Identify genes with different time development
We first focus on identifying genes with a different time development. Let X c,g be the difference in log2 gene expression for case-control pair c = 1,2, ... , M for gene g = 1,2, ... N g . Further, let t c be the time of observation relative to diagnosis for the case in the case-control pair c. We assume X c,g~N (f s(c),g (t c ), σ g ) where σ g is the standard deviation and s(c) is the stratum of case c. We estimate the function f s,g (t) by taking an average of the observations X c,g from stratum s(c) in an interval that includes the n nearest observations in time, i.e. the n/2 observations with largest t c but t c < t and the n/2 observations with smallest t c but t c > t. The number n is a tradeoff between precision and resolution. It should be large enough that the estimate in an interval should not depend on a single data point and at least smaller then M/4 in order to get resolution in time. If there is a large difference in the time development between the two strata, the test statistic or area V g = |f a,g - |dt describing the area between the curves, will be large where the two strata are denoted a and b, respectively. This estimate is the sum of the absolute value of the differences in average gene expression between the two strata in equally spaced time points assuming the controls have similar values. The integral is evaluated in a time interval where there are observations from both strata. We make N g , independent hypotheses, i.e. one hypothesis for each gene: For each gene g, we compare the observed V g,o with the variable V from a simulated distribution where we use a standardization of the same variables X c,g for all the genes simultaneously, but where we randomize the strata s(c) of the cases. We maintain the observations for each gene and the number of observations from each stratum. From the N u simulations, we estimate the probability distribution g(v) = P(V > v) that is independent of the genes. Based on this, we find a p-value We correct for multiple testing using the (Benjamini & Hochberg, 1995).
We estimated the functions f s,g with a moving average, where the window size is one-quarter of the respective datasets, i.e. 9 and 59 points, respectively. These functions were evaluated in regularly spaced points, making it easy to evaluate the functions when the observations for each stratum changes position in time. The integral was evaluated in the largest interval such that there were data points from the two strata before and after the interval making the interval equal to (547, 2255) days after diagnosis. The method was applied on a dataset with N g = 8400 genes. The analysis is performed for standardized gene expressions for each gene where the standard deviation σ g is taken over the case-controls pairs for each gene. This normalization is necessary in order to compare area between the curves since we want to focus on the differences in time development and not in the mean value and the variance. The results were based on simulations with N u = 1000 realizations. We will also make a weaker hypothesis where we analyze all the genes simultaneously: H0B: For all genes, the time development of X c,g is independent of the stratum s(c), i.e. for all genes f a,g = f b,g .
Note that we only make one hypothesis here. We perform the same N u simulations as in the hypothesis test for H0A, but we use the test statistics V (1),o > V (2),o > ··· which is the V g,o variables for all the genes that are sorted in decreasing order. From the simulation, we find the probability for the ordered variables g m (v) = P(V (m) > v) for m = 1,2, ..., and the p-value for the hypothesis test Here, we have many highly correlated test statistics V (m),o for m = 1,2, ..., for testing the same hypothesis H0B. The integer m is chosen by the user. m = 1 means that we are only interested in the most extreme gene and m = 10 means that we are interested in the 10 most extreme genes. This method is most interesting for 3 < m < 50, i.e. where no single gene is significant, but several/many genes have deviating values and where we avoid the multiple testing problem. Ideally, m should be decided before the data is analyzed, but this is not as critical as when alternative test statistics are independent of each other.

H0C: Prediction of strata
It is also possible to use the same technique in order to predict the stratum of a case. The idea is to find out if the observations X c,g for g = 1,2, ..., N g is closest to f a,g (t c ) or f b,g (t c ) for the genes with lowest p-values in the hypothesis test H0A above. Our ambition is only to find the quality of the prediction, not to make a diagnosis for each case. Hence, we make the following hypothesis: H0C: The prediction P a,c , that the case c belongs to stratum a, is independent of the stratum s(c).
We test the hypothesis using cross-validation. The case-control pairs are divided into the D 1 , D 2 , ... D Nd groups, which are described in more detail further down. For each of the pairs c ϵ D k we find where f a,g,k (t c ) is the estimated gene expression for gene g and stratum a at time t c , i.e. the time of observation X c,g based on all the data except the data in D k . This is based on the assumption that X c,g~N (f s(c),g,k (t c ), σ g ). The weight w g may be set equal to , possibly modified based on the correlation between the gene expressions for different genes and how significant this gene is for the prediction. How important gene g is for the prediction is estimated from p g,k , the p-value for hypothesis test HOA estimated from all the data except D k . The prediction that the observation X c,g is from stratum a is then This model gives probabilities that are approximately uniform in (0,1), see Figure  9.1. If we had assumed X c,g~N (f s(c),g,k (t c ), σ g ) independently for each gene g, then most P c,a would be close to 0 or 1, which does not correspond to our ignorance in the classification. We use the test statistics which is the L 1 distance between the prediction for stratum a and the indicator for stratum a. We may randomize P c,a between the observations and find a distribution for S. The p-value for the hypothesis test H0C is found from the distribution for S, i.e. p = P(S < S o ).
We used cross-validation and therefore needed to divide the dataset into separate groups. The 39 case-control pairs where the case died of cancer and 239 casecontrol pairs where the case did not die of cancer were divided into 13 groups, D 1 , D 2 , ... D 13 . The data in each stratum was divided into three time periods for each of the two strata with an (almost) equal number of observations. Each of the 13 groups had (almost) the same number of observations from each stratum in each of the three time periods. For each group k we find the values p g,k from the hypothesis test H0A based on all the data except the data in D k based on 1000 randomizations of the strata s(c).

RESULTS
The data used in all the analyses are the differences in log2 gene expression between cases and controls in the period after diagnosis that are shown in Table  9.1, i.e. 278 case-control pairs.

Testing H0A
Results from testing the H0A hypothesis are shown in Table 9.2. The function of the top 10 is shown in Chapter 10.   As shown for most genes, the gene expression increases. Noticeably, f survived,g (t) is almost constant and close to 0 in the entire period while f died,g (t) is closer to 1 or -1 in the period (1000,1500) days and then for many genes closer to 0 after 1500 days. The normalization (1) implies that the data for each gene have average value 0 and standard deviation 1 in order to compare data between genes. Since the stratum that survived is much larger, it is natural that the average of these curves is smoother and close to 0. The statistical test shows that deviation in averages value between the strata is significant for many genes. The p-value depends on whether there is a systematic difference in level or time variation of the gene expression, not the size of the difference in average value between the strata since this is removed in the standardization. We also want to test all the genes simultaneously. Since we only make one test, it is easier to reject the hypothesis for a smaller dataset. First, we test hypothesis H0B on the dataset with 39 and 239 case-control pairs. There is only one hypothesis, but we have many different test statistics, one for each of m ordered V (m) test statistics for the area between the two curves. The different test statistics indicate whether there is a strong difference in the time development in one or a few genes compared to a smaller difference in many genes. The test for each of the ordered variables is highly correlated. Table 9.3 shows the p-values from the H0B. Notice that we get very significant results and that many genes have a different time development between the two strata. This test is also performed on the reduced dataset with only metastatic breast cancers. There are only 23 and 79 case-control pairs in the two strata (Table 9.1), those with metastases who died of breast cancer and those who did not die of cancer, respectively. We still get significant p-values, but much larger values than in the larger data set with both metastatic and invasive cancer; see Table 9.4. The differently ordered variables are highly correlated and give typically p-values between 3-6%.

H0C: Prediction of strata
We also want to test whether it is possible to predict the stratum of each case by testing hypothesis H0C. The 13 different datasets leaving out one of the groups D k give a slightly different rankings of the importance of the different genes. The mean correlation between the rankings of the genes for these 13 datasets is 0.85.  Table 9.4 shows that there is a large overlap in the most important genes in the 14 different datasets when we include the ranking using all the data. On average, four of the five genes with the lowest p-value when using the entire dataset were among the 10 smallest p-values in the reduced datasets. We have marked the five genes with the smallest p-values when using all the genes with colors. Notice that many of the same genes have small p-values for the different subsets. Table 9.4. Ranking of the 10 most important genes when we leave out D k from the dataset. The lowest line is the most important genes when we use all the data.
We have tested different predictions methods, i.e. different choices of the weights w g,k . The different choices give highly correlated probabilities. We have found out that w g,k = 1/p g,k for the 50 genes g with smallest p g,k value for each group is a quite robust choice. Figure 9.3 shows the predicted probabilities for each of the 278 casecontrol pairs after time of follow-up. Ideally, we wanted all the 39 red and yellow circles to be equal to 1 and the remaining circles equal to 0. Based on these variables, we find P c,a , S o and the p-value p = P(S < S o ) based on the 10 000 randomization of P c,a We find the p-value less than 0.004 indicating that the prediction is far from random. Table 9.5 gives another presentation of the prediction based on whether P c,a > 0.3 or not.  Notice that the invasive cases who died and the metastatic cases who survived have a relatively similar prediction which is between the prediction of the metastatic cases who died and the invasive cases who survived.

DISCUSSION
We have shown that the trajectories of gene expression after diagnosis of breast cancer were mostly significantly upregulated for hundreds of genes in the years after a diagnosis of metastatic breast cancer compared to invasive cancer, as shown in Figure 9.4. These signals may be considered as signals of an upcoming death due to cancer. Fewer genes were downregulated. After some years, most upregulated genes levelled off while downregulated genes slowly returned to the normal expression level. Among women with invasive breast cancer, no significant trajectories were found. These results were based on a new statistical approach using the differences in the area between the trajectories of gene expression between diseased and healthy women. For practical and economic reasons, only one single measurement at time of inclusion was available for each individual in the NOWAC postgenome cohort. Hence, the processual approach relies on the assumption that the gene expression in distinct individuals at different times before or after diagnosis is a consequence of the same carcinogenic process (Lund & Plancade, 2012). Semi-parametric models with time-varying covariates, e.g. the Cox model (Cox, 1972), cannot be estimated from a prospective design including only one unique measurement at time of inclusion, unless covariates are assumed to be constant over time. Consequently, this assumption would not allow us to address changes in gene expression over time.
The DTDS is a further development of the LITS method (Holden, Holden, Olsen, & Lund, 2017), where we used a moving window and summary statistics for all genes for each of the stratum and time period. The genes that were significant in each time interval varied between the intervals, making the LITS method not suitable for identifying genes with different time development. In contrast, the DTDS method is able to identify genes with different time development. Both methods use the same method for simulation and randomization of gene expressions between the case-control pairs with cases from the different strata.
The distribution of measurements of gene expression must follow a constant function, i.e. with measurements spaced over the time interval. Most cohort studies have repeated measurements, but usually they are collected for all participants with several years of spacing and can be used as repeated measurements only.
We cannot predict the outcome for single individuals, only on a group level. The results can be looked upon as a proof of concept for the idea that gene expression measured repeatedly over time after diagnosis can be used as a predictive test for the vital outcome.
Little is known about the changes in gene expression in the blood in the period after a breast cancer diagnosis, i.e. the time period after the primary treatment (Lund & Plancade, 2012). In the stratified analysis, both invasive and metastatic cases were compared to healthy women without known cancers. The consistent and highly significant differences between the two strata adds information that can be used toward a new hypothesis of metastatic breast cancer and its high lethality. For hundreds of genes, the integrated area between the two curves for each stratum accumulates during follow-up, indicating ongoing dysregulation of important genes. These strong changes in gene expression from the immune cells can be viewed as signals of upcoming death. The intention here was to explore the unknown trajectories of gene expression after diagnosis of breast cancer. The interpretation of each gene was outside the scope of our exploration. Still, some hypotheses can be put forward.

Human model of carcinogenesis-interpretation of highly expressed genes
No unifying theory exists for human carcinogenesis; the number of proposals is many (Vineis, Schatzkin, & Potter, 2010). To date, most mechanistic or pathways analyses have been experimental in-vitro or animal studies. With the increasing knowledge about human carcinogenesis in tumor tissues or in blood at time of diagnosis, some disturbing facts about the validity of the animal models for human carcinogenesis have been brought up. First, the biology of mice and men is comparatively different (Mak, Evaniew, & Ghert, 2014;Anisimov, Ukraintseva, & Yashin, 2005), and a controversial Nature editorial ("Of men, not mice", 2013) advocated the need for human functional studies. Similarly, the translational value of mouse models in oncology drug development was recently questioned (Gould, Junittila, & de Sauvage, 2015). While cancer can be manufactured in mice quite easily, these models do not necessarily apply to humans (Mak et al., 2014). Consequently, an increasing number of studies use functional genomics as biomarkers, looking both at the exposure relationship and the outcome. While interesting, this approach lacks the distinct focus on the time-dependent process of carcinogenesis. Few, if any, prospective studies have been designed for longitudinal analyses of functional genomics related to the processes of carcinogenesis and metastasis.

CCM2
Regulate angiogenesis and formation of new blood vessels

C14orf45
Gene responsible for cilia orientation. One paper shows as low-expressed gene associated with poor survival in BC (higher number of cylia is necessary for improved migration of breast cancer cells)

ARL4A
Increase cell migration CBX3 Shown to be overexpressed in BC and associated with low survival, might block differentiation and promote self-renewal of cancer stem cells The interpretation of these genes points towards important changes in genes known to be affected at breast cancer, and in addition some more general ones. During the different laboratory steps, several decisions had to be taken on level of noise and the use of specific distribution of noise. Further, since a gene maybe not expressed in all individuals, the percentage of cases or controls with sufficient signals had to be decided. The stronger the criteria moving towards hundred percent, the harder the exclusion.
The strength of the study is the unique biobank created with the purpose of gene expression analysis in peripheral blood. This gave a unique opportunity to study the immune response since the mRNA in blood came from immune cells. This opened for the view that the carcinogenic process not only included exposures to carcinogens, but also has an important counterforce in the immune system. This has been known for more than a hundred years, and today documented through the new immune therapies.
The design has been population-based with a complete follow-up on cancer incidence, emigration and death based on linkage to national registers using the unique national birth number given to all residents in Norway from 1960. In addition, we had access to updated information on metastases and second breast cancers in the time between inclusion and blood donation. This somewhat reduced the noise from carcinogenic processes hidden at the time of diagnosis.

CONCLUSION
In this systems epidemiology approach, we have given a proof of concept for the use of gene expression as an individualized biomarker of prognosis related to death or not. The design of NOWAC is population-based and the results should be validated in a more specific clinical setting. With improved technology and individual repeated measurements gene expression followed over time could offer a unique opportunity for personalized treatment of metastatic breast cancer.

FSTL4
Shown to be involved in BC cell migration in mice. Was discussed in relation to late distant metastases in BC here without any conclusions (Mittempergher et al., 2013) C5orf30 Known to be expressed in BC and especially in lymph-node metastases. Promote inflammation and hypothesized to reduce immune response against cancer cells

RBM4
Known tumor suppressor in BC