
Threshold Regression and First Hitting Time Models

Calvin L. Williams* and Chelsea Law

Department of Mathematical Sciences, Clemson University, Clemson, SC 29634-1907, USA.

*Corresponding Author:
Calvin L. Williams
Department of Mathematical Sciences, Clemson University, Clemson, SC 29634-1907, USA.
E-mail: the@iwu.edu

Received: 22/08/2015 Accepted: 13/10/2015 Published: 23/10/2015

Research & Reviews: Journal of Statistics and Mathematical Sciences

Abstract

First hitting time models are a technique for modeling a stochastic process as it approaches or avoids a boundary, also known as a threshold. The process itself may be unobservable, making this a difficult problem. Regression techniques, however, can be employed to model the data relative to the threshold, creating a class of first hitting time models called threshold regression models. Survival data, measuring the amount of time before an event occurs, are widely used in modeling medical and manufacturing data. One commonly used method of analysis is the proportional hazards model, but it requires a strong proportional hazards assumption, one that is often lacking in practice. In place of the proportional hazards model, first hitting time models can be employed. First hitting time models do not require such strong assumptions and can be extended to become threshold regression models. Threshold regression has many advantages over the proportional hazards model, including its flexibility in both its assumptions and utilization and its application to the stochastic processes so often evident in measuring survival. This paper describes the process of threshold regression modeling and compares its results and utility against those of the proportional hazards model. The approach is illustrated with some interesting applications.

Keywords

Threshold regression; First hitting time models; Exact solution; Markov decomposition.

Introduction

First hitting time models are a technique for modeling a stochastic process as it approaches or avoids a boundary, also known as a threshold. The process itself may be unobservable, making this a difficult problem. Regression techniques, however, can be employed to model the data relative to the threshold, creating a class of first hitting time models called threshold regression models. Survival data, measuring the amount of time before an event occurs, are widely used in modeling medical and manufacturing data. One commonly used method of analysis is the proportional hazards model, but it requires a strong proportional hazards assumption, one that is often lacking in practice. In place of the proportional hazards model, first hitting time models can be employed. First hitting time models do not require such strong assumptions and can be extended to become threshold regression models. Threshold regression has many advantages over the proportional hazards model, including its flexibility in both its assumptions and utilization and its application to the stochastic processes so often evident in measuring survival. This paper describes the process of threshold regression modeling and compares its results and utility against those of the proportional hazards model.

Section 2 gives a brief overview of survival data and some important vocabulary associated with the area of study. Section 3 outlines the basics of the proportional hazards model and describes its advantages and shortcomings. Section 4 describes the first hitting time model and its application to longitudinal and other types of data. Section 5 considers the specific case of the threshold regression model and the different ways it can be utilized for different data. Section 6 provides examples of threshold regression and compares the results to the proportional hazards model. Finally, Section 7 provides a conclusion and a short discussion of the results.

Survival Data

Survival data, or time-to-event data, measures the amount of time before some event, usually referred to as a failure, occurs. Survival analysis, encompassing the various methods used to model and analyze survival data, is useful in a multitude of areas, especially in modeling medical and engineering data.

Failure and Censoring in Survival Data

A failure can be any event of interest to the researcher, in many cases referring to a biological death or mechanical breakdown. In medical applications, a failure is usually the death of a patient from a disease or injury of interest, the time of diagnosis or the time when a patient is cured (i.e., the event of interest is full recovery from an illness). In engineering, failure most commonly refers to a piece of machinery no longer being able to adequately perform its function.

In most studies, a failure does not occur for one or more subjects. For example, a patient may remain in good health or a machine may continue to function properly at the end of the study. In these cases, the subject’s status is measured for the final time at a censoring time, usually when the study is completed, though individual subjects can have personal censoring times as well. These censored observations must be considered separately from those that meet the failure criteria, as will be reflected in the partial likelihood function for the process, as discussed later.

Observable and Latent Survival Data

Observable data, that which can be measured and recorded, are by far the most convenient to implement. For many studies involving human subjects, however, measurements may be missing or impossible to determine. Unobservable, or latent, data are the result of conditions that are too complex, subtle or difficult to record [1]. In these cases, other, observable variables must be used to obtain as much relevant information as possible about the missing measurements.

Proportional Hazards Model

Proportional hazards models are often used for prediction in survival analysis. These models require an appropriate hazard function as well as a strong proportional hazards assumption. This assumption states that the covariates are multiplicatively related to the hazard function [2]. The hazard function, h(t), gives the instantaneous rate of failure as [3]:

h(t) = lim_{Δt→0} P(t ≤ T < t + Δt | T ≥ t) / Δt = f(t) / S(t),

where t is the time in question, T is the failure time, f(t) is the failure density function and S(t) is the survival function, i.e., the probability of surviving until at least time t.
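As a concrete illustration, the ratio f(t)/S(t) can be evaluated directly for any parametric failure distribution. The short Python sketch below (Python rather than the SAS used later in the paper, purely for illustration) computes the hazard of a Weibull failure time; the shape parameter values are arbitrary choices:

```python
import numpy as np
from scipy.stats import weibull_min

# Hazard h(t) = f(t) / S(t) for a Weibull(shape k, scale 1) failure time.
def hazard(t, shape, scale=1.0):
    dist = weibull_min(shape, scale=scale)
    return dist.pdf(t) / dist.sf(t)          # density over survival function

t = np.array([0.5, 1.0, 2.0])
print(hazard(t, shape=1.0))   # exponential case: hazard is constant (= 1)
print(hazard(t, shape=2.0))   # increasing hazard, equal to 2*t
```

For shape 1 (the exponential case) the hazard is constant, while for shape 2 it grows linearly in t, matching the closed-form Weibull hazard k·t^(k−1).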

The hazard functions in proportional hazards models are proportional to a baseline hazard h0(t), which yields a useful property. Suppose we have subjects with covariate vectors z1 and z2. Then [3]:

h(t | z1) / h(t | z2) = [h0(t) exp(β'z1)] / [h0(t) exp(β'z2)] = exp(β'(z1 − z2)).

Therefore the hazard ratio at any time t does not depend on t [2]. Under the proportional hazards assumption, the survival functions for observations with different covariate values do not intersect [3].
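This time-invariance is easy to verify numerically. The sketch below (illustrative Python; the Weibull baseline and coefficient β = 0.7 are arbitrary assumptions, not from the paper) computes the hazard ratio on a grid of times and confirms it is free of t:

```python
import numpy as np
from scipy.stats import weibull_min

# Proportional hazards: h(t | z) = h0(t) * exp(beta * z), with a Weibull
# baseline h0 and an arbitrary illustrative coefficient beta = 0.7.
def hazard(t, z, beta=0.7, shape=1.5):
    base = weibull_min(shape)
    h0 = base.pdf(t) / base.sf(t)            # baseline hazard
    return h0 * np.exp(beta * z)

t = np.linspace(0.1, 5.0, 50)
ratio = hazard(t, z=1.0) / hazard(t, z=0.0)
print(np.allclose(ratio, np.exp(0.7)))       # the ratio is free of t
```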

Due to the strong nature of this proportional hazards assumption, though, many proportional hazards models fail. Nonproportional hazards can be caused by unexplained similarities in the risk or covariates that change over time within subjects [2]. Using proportional hazards also limits the outcome of the model. Most notably, survival curves for a proportional hazards model cannot cross, so that if one group is outperforming another, that trend will remain intact for the remainder of the study, according to the model.

First Hitting Time Model

A first hitting time model explains event times using a stochastic process {Y(t)} that reaches a boundary B, also known as a threshold [3]. The first hitting time S of the data is given by

S = inf{t ≥ 0 : Y(t) ∈ B}.

Note that {Y(t)} can be one- or multi-dimensional, could have different properties such as stationarity, independence of increments or a continuous sample path, and has no guarantee of ever hitting the threshold, that is,

P(S = ∞) = P(Y(t) ∉ B for all t ≥ 0) may be greater than 0.

This stipulation accounts for censored subjects, for whom the probability of never reaching the threshold, P(S = ∞), is greater than 0. First hitting time models do not require the proportional hazards assumption, giving them more freedom of application than corresponding proportional hazards models.

For a latent process {Y(t)}, we have both process parameters and a baseline level. The process parameters, given as the vector θ = (μ, σ²), outline the mean drift μ and the variance σ². For simplicity, σ² is usually set to equal 1. The baseline level of the process, y0, is simply the value of Y(0).
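The setup can be made concrete with a small simulation. The sketch below (illustrative Python with arbitrary parameter values) draws a Wiener path Y(t) = y0 + μt + σW(t) on a fine grid, records the first time it crosses the zero threshold, and censors the path at a finite horizon if it never does; with negative drift the threshold is hit almost surely and the mean hitting time is −y0/μ:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_fht(y0, mu, sigma=1.0, dt=0.01, horizon=50.0):
    """Simulate Y(t) = y0 + mu*t + sigma*W(t) on a grid until it first
    crosses the zero threshold; censor at the horizon if it never does.
    Returns (observed time, failure indicator)."""
    n = int(horizon / dt)
    steps = mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
    path = y0 + np.cumsum(steps)
    hit = np.nonzero(path <= 0.0)[0]
    if hit.size:
        return (hit[0] + 1) * dt, 1          # failure at the first crossing
    return horizon, 0                        # censored observation

# Negative drift: the threshold is hit almost surely and E[S] = -y0/mu.
pairs = [simulate_fht(y0=2.0, mu=-0.5) for _ in range(2000)]
times = np.array([t for t, d in pairs if d == 1])
print(round(times.mean(), 2))                # close to -y0/mu = 4.0
```

The sample mean of the simulated hitting times is close to 4.0, with a small upward bias from monitoring the path only at grid points.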

Applicable Data for First Hitting Time Models

The data used usually consist of n subjects with independent health processes {Y_j(t)}, j = 1, …, n, where t_j is the final time recorded for subject j. The boundary sets of individual subjects may also vary, but for the purposes of this paper, we will assume a common threshold. For each of the n subjects, the data provide observation pairs

(t_jk, y_jk), k = 1, …, m_j,

the k-th longitudinal reading y_jk on the level of Y_j at time t_jk, accompanied by an overall failure indicator δ_j which is 1 if S_j ≤ t_j and 0 if S_j > t_j. After the boundary set is reached, the level of the process is no longer read; the final level Y_j(t_j) is only read if t_j = S_j, that is, if it is the first hitting time of the subject [4].

Restricted Data in First Hitting Time Models

In many cases, longitudinal data are not available, so m_j = 1 for all j, and the only information available is the observation pair (t_j, y_j) and the failure indicator δ_j, as above. In more extreme cases of restricted data, the only information available is (t_j, δ_j), with δ_j conveying whether the subject failed or was censored and t_j conveying when [4].

Likelihood Function for First Hitting Time Models

Each subject in a study contributes to part of the overall likelihood for the data, depending on whether the subject experienced a failure or was censored. If the subject fails at time S = s, then its contribution is given by the density

f(s | θ).

If the subject survives beyond the censoring time C, its contribution is given by the survival probability

P(S > C | θ) = 1 − F(C | θ),

where F is the cumulative distribution function of S. So, with a failure indicator δ_j, we have the likelihood function

L(θ) = ∏_{j=1}^{n} f(t_j | θ)^{δ_j} [1 − F(t_j | θ)]^{1−δ_j}.
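For the Wiener process with σ = 1 and threshold 0, the density f and distribution function F of S have closed inverse Gaussian forms, so this likelihood can be evaluated directly. The Python sketch below codes those expressions and checks their internal consistency by comparing F with the numerical integral of f (the parameter values are arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# FHT density f and cdf F for Y(t) = y0 + mu*t + W(t) (sigma = 1)
# crossing the threshold 0 -- the inverse Gaussian closed forms.
def fht_pdf(t, y0, mu):
    return y0 / np.sqrt(2 * np.pi * t**3) * np.exp(-(y0 + mu * t)**2 / (2 * t))

def fht_cdf(t, y0, mu):
    return (norm.cdf(-(y0 + mu * t) / np.sqrt(t))
            + np.exp(-2 * y0 * mu) * norm.cdf((mu * t - y0) / np.sqrt(t)))

def log_likelihood(times, delta, y0, mu):
    f = fht_pdf(times, y0, mu)
    surv = 1.0 - fht_cdf(times, y0, mu)      # contribution of censored cases
    return np.sum(delta * np.log(f) + (1 - delta) * np.log(surv))

# Internal consistency: F(30) should equal the integral of f over (0, 30).
F_num, _ = quad(lambda t: fht_pdf(t, 2.0, -0.5), 0, 30)
print(abs(F_num - fht_cdf(30.0, 2.0, -0.5)) < 1e-6)   # True

# A toy likelihood: one failure at t = 3 and one censoring at t = 10.
ll = log_likelihood(np.array([3.0, 10.0]), np.array([1, 0]), y0=2.0, mu=-0.5)
```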

Markov Decomposition of Observations

Multiple observations in survival data for a single subject are often decomposed into a series of smaller entries covering a set amount of time with initial and final data entries, creating a longitudinal sequence that can be used in first hitting time models [1]. When longitudinal data, sequential observations on the same measurable time-varying covariates, are used, some useful properties arise that simplify the process of predicting survival. The sequence of time points at which observations are obtained is denoted t_0 < t_1 < t_2 < ⋯. This sequence ends when one of two events occurs: the subject either fails or is censored.

The censoring time C will coincide with one of the time points in our sequence, so we can create both a failure code sequence

{d_0, d_1, d_2, …}

and a censoring code sequence

{c_0, c_1, c_2, …},

with d_k, c_k ∈ {0, 1}.

These sequences lead to three possible code configurations at each time point. If the subject is surviving and has not been censored at time t_k, then d_k = c_k = 0. If the subject fails on or before time t_k, then d_k = 1. If the subject has not failed by the censoring time, then c_k = 1 for t_k ≥ C. These sequences are used in conjunction with the observable process sequence for the threshold-crossing variable, given by {y(t_0), y(t_1), …}, and with the sequence of vectors of additional observed covariates, denoted {Z(t_0), Z(t_1), …}, for each subject j. Our assumptions require that these values are observed at the censoring time, but not after a failure.

Under a Markov assumption, we can assign baseline covariate values y(t_{k−1}) and Z(t_{k−1}) to the time increment (t_{k−1}, t_k] to find the initial health status for the interval at time t_{k−1}. Let K* be the index of the last observation for a subject, that is, the first observation for which either d_k or c_k is 1. With events A_k = {d_k, c_k, y(t_k), Z(t_k)}, we have

P(A_1, A_2, …, A_{K*} | A_0) = ∏_{k=1}^{K*} P(A_k | A_{k−1})

by the Markov assumption. By construction, the longitudinal record is broken into a series of one-step records, with initial conditions for the k-th entry given by A_{k−1} and A_k as the final conditions on the interval. Explicitly, this is written as the product of one-step terms

P(A_k | A_{k−1}), k = 1, …, K*,

with P(A_0) = 1 unless the initial conditions are random [1].
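Operationally, the decomposition is a simple data-restructuring step. The following Python sketch (field names are hypothetical, not from the paper) breaks one subject's longitudinal record into one-step records, stopping at the first observation whose failure or censoring code is 1:

```python
# Decompose one subject's longitudinal record into one-step records under
# the Markov assumption: each interval carries its own initial conditions,
# so the intervals contribute independent one-step terms to the likelihood.
def decompose(times, readings, fail_code, censor_code):
    """times[k], readings[k]: observation time and process level at t_k;
    fail_code[k]/censor_code[k]: 1 once the subject has failed/been censored."""
    records = []
    for k in range(1, len(times)):
        records.append({
            "t_start": times[k - 1], "y_start": readings[k - 1],
            "t_end": times[k],       "y_end": readings[k],
            "fail": fail_code[k],    "censored": censor_code[k],
        })
        if fail_code[k] or censor_code[k]:   # K*: last usable observation
            break
    return records

recs = decompose([0, 1, 2, 3], [5.0, 4.2, 3.1, 0.0], [0, 0, 0, 1], [0, 0, 0, 0])
print(len(recs))   # 3 one-step records; the last one ends in failure
```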

Types of First Hitting Time Models

Although many types of first hitting time models exist, some processes are used more often than others. One of the most common is the one-dimensional Wiener process with an inverse Gaussian first hitting time. The Wiener process has parameters μ and σ and begins with an initial value of Y(0) = 0, with the threshold set at a level b > 0. When μ, the mean drift rate, is nonnegative, the threshold is reached with certainty and the first hitting time S has an inverse Gaussian distribution. When μ is negative, the threshold may never be reached, since P(S < ∞) = exp(2μb/σ²) < 1, and S conditioned on {S < ∞} has an inverse Gaussian distribution with parameters |μ| and σ [4].
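The possibility of never hitting the threshold can be checked numerically. In the sketch below (illustrative Python with arbitrary values), a Wiener process starts at 0 with negative drift and an upper threshold b > 0, so the hitting probability is exp(2μb/σ²) < 1; a path simulation over a long horizon tracks this value, up to a small discretization bias:

```python
import numpy as np

rng = np.random.default_rng(1)

# Y(t) = mu*t + sigma*W(t) starts at 0; with negative drift the upper
# threshold b > 0 may never be reached: P(S < inf) = exp(2*mu*b/sigma^2).
b, mu, sigma = 1.0, -0.4, 1.0
exact = np.exp(2 * mu * b / sigma**2)

dt, horizon, n_paths = 0.01, 60.0, 2000
n = int(horizon / dt)
incr = mu * dt + sigma * np.sqrt(dt) * rng.standard_normal((n_paths, n))
hit_frac = np.mean(np.cumsum(incr, axis=1).max(axis=1) >= b)
print(round(exact, 3), round(hit_frac, 3))   # simulation tracks the formula
```

The simulated hitting fraction sits slightly below the exact value because the path is only monitored at grid points.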

Another common process is the two-dimensional Wiener model for a marker and first hitting time. Here, the process {Y(t)} is latent, but is accompanied by a marker process {M(t)} that covaries with {Y(t)} in time. Jointly, the two processes form the two-dimensional Wiener diffusion process {Y(t), M(t)}, which has initial values (Y(0), M(0)) = (y0, m0) and is distributed multivariate normally as

(Y(t), M(t))' ~ N((y0 + μ_Y t, m0 + μ_M t)', tΣ),

where (μ_Y, μ_M)' is the drift vector and Σ is the covariance matrix of the process. The threshold is set at 0, and the marker change process is given by ΔM(t) = M(t) − m0 [4].

The process is observed from the initial time 0 until the set censoring time C, so either the subject is surviving at C and the marker level M(C) is recorded, or the subject fails at some time S and M(S) is recorded.

Threshold Regression

A class of first hitting time models known as threshold regression models incorporates covariate data into the original first hitting time model using regression structures for the parameters of the process used [2]. In threshold regression, the effects on the hazard vary with time, so the proportional hazards assumption is not necessary [2]. Threshold regression institutes a regression structure that utilizes the same process parameters, with mean drift parameter μ = g2(z'γ) and baseline level y0 = g1(z'β). Here, z are the values of the covariates Z for the subject and g1 and g2 are selected link functions for the data at hand. Many different link functions can be used, including linear and polynomial combinations of the covariates, semiparametric regression splines, penalized regression splines, random effects models and Bayesian models [3].

The three building blocks of threshold regression, then, are: 1) the process {Y(t)}, which could be a Wiener process, a gamma process or some other applicable process; 2) the boundary, which could be a straight line or some curve; and 3) the time scale, which could be based on calendar time, running time or analytical time.

Log-Likelihood Function for Threshold Regression

Threshold regression uses the log-likelihood function to determine the survival time of a subject as follows:

ln L = Σ_{j=1}^{n} [ δ_j ln f(t_j) + (1 − δ_j) ln(1 − F(t_j)) ],

where δ_j is the failure indicator for subject j, t_j is the censored survival time with δ_j = 1 if the subject failed, and f and F are the first hitting time probability density function and cumulative distribution function, respectively [3].

Let A be the event that the threshold B is eventually reached. Then the mean survival time for subject j can be found from the conditional distribution of S_j given A; for the Wiener process, this yields

E(S_j | A) = y0j / |μj|.
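For the Wiener case with negative drift, the threshold is reached with certainty and the mean survival time has the closed form −y0/μ (the inverse Gaussian mean). A quick numerical check in Python, with arbitrary parameter values:

```python
import numpy as np
from scipy.integrate import quad

# Inverse Gaussian FHT density for a Wiener process (sigma = 1) started
# at y0 > 0 with negative drift mu, hitting the zero threshold.
def fht_pdf(t, y0, mu):
    return y0 / np.sqrt(2 * np.pi * t**3) * np.exp(-(y0 + mu * t)**2 / (2 * t))

y0, mu = 2.0, -0.5
mean_num, _ = quad(lambda t: t * fht_pdf(t, y0, mu), 0, np.inf)
print(round(mean_num, 3))   # matches -y0/mu = 4.0
```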

In Bayesian analysis, the nonidentifiability of these parameters can lead to either noninformative priors, which allow a parameter to drift to extremes, causing unstable estimates, or overinformative priors that do not allow for any Bayesian updating or learning [2].

Implementing Threshold Regression

To avoid these problems, fixing σ² = 1 allows us to model the remaining two parameters, y0 and μ, using regression. Here, threshold regression reveals one of its greatest advantages: distinguishing between the two types of covariate effects [2]. Threshold regression designates its covariates as having an effect prior to the study, that is, on y0, or as having an effect on the degradation of the subject, that is, on μ. A common regression structure uses the following equations:

y0_j = z_j'β

μ_j = z_j'γ,

where z_j is the covariate vector and β and γ are the respective covariate effects [2].
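A small end-to-end sketch of this idea (illustrative Python, not the SAS program used in the examples below; the single covariate, link choices, true coefficients and sample size are all invented): subject-level parameters are generated as ln(y0) = β0 + β1·z and μ = γ0 + γ1·z with every drift negative, hitting times are drawn from the corresponding inverse Gaussian distribution, and the coefficients are recovered by maximizing the first hitting time log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import invgauss

rng = np.random.default_rng(2)

# Generate subjects with one covariate z and the (invented) true links
# ln(y0) = 0.5 + 0.4*z  and  mu = -0.6 + 0.3*z (drift negative for all z).
n = 3000
z = rng.uniform(-1, 1, n)
true = np.array([0.5, 0.4, -0.6, 0.3])
y0s = np.exp(true[0] + true[1] * z)
mus = true[2] + true[3] * z
# S ~ inverse Gaussian with mean y0/|mu| and shape y0^2 (sigma = 1);
# scipy's invgauss(mu=m/lam, scale=lam) is IG with mean m, shape lam.
S = invgauss.rvs(mu=(y0s / -mus) / y0s**2, scale=y0s**2, random_state=rng)

def neg_loglik(theta):
    b0, b1, g0, g1 = theta
    y0 = np.exp(b0 + b1 * z)
    mu = g0 + g1 * z
    logf = (np.log(y0) - 0.5 * np.log(2 * np.pi * S**3)
            - (y0 + mu * S)**2 / (2 * S))    # log inverse Gaussian density
    return -np.sum(logf)

fit = minimize(neg_loglik, x0=np.zeros(4), method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
print(np.round(fit.x, 2))   # close to the true values [0.5, 0.4, -0.6, 0.3]
```

The log link for y0 keeps the baseline positive; with no censoring, every subject contributes a density term, so the likelihood simplifies to the product of inverse Gaussian densities.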

Examples

Using the SAS code from [5], we will investigate first a simple data set concerning myelomas, then a larger, similar data set regarding melanomas and finally an expansive data set dealing with ventilator-associated pneumonia in hospitals.

Myeloma Data

A hypothetical illustration of threshold regression involves 49 patients who have been diagnosed with myeloma, an accumulation of cancerous plasma cells. In this simple example, only three covariates accompany the time variable, measuring the time until death or censoring, and the failure indicator: the age of the patient at enrollment, the gender of the patient and the treatment applied to the patient [5]. These data are presented as highly restricted, so we are only given the value of these covariates at the end of the study and the time of failure or censoring. There is no opportunity to use longitudinal data or any Markov decomposition.

Using linear link functions, the program finds the coefficients for the expressions

y0 = β0 + β1·age + β2·gender + β3·treatment

μ = γ0 + γ1·age + γ2·gender + γ3·treatment.

The Newton-Raphson method is used to maximize the log-likelihood function

ln L = Σ_j [ δ_j ln f(t_j | y0_j, μ_j) + (1 − δ_j) ln(1 − F(t_j | y0_j, μ_j)) ],

where, for the Wiener process with σ = 1 and Φ the standard normal distribution function,

f(t | y0, μ) = y0 (2πt³)^(−1/2) exp(−(y0 + μt)² / (2t)),

F(t | y0, μ) = Φ(−(y0 + μt)/√t) + exp(−2y0μ) Φ((μt − y0)/√t),

to retrieve the above coefficients. The nonlinear maximization procedure uses t-tests with α = 0.05 to determine which coefficients are significant, and we obtain the results shown in Table 1.


Table 1: Regression coefficient estimates with t-test significance at α = 0.05.

Thus, we obtain the necessary regression coefficient estimates, yielding the fitted expressions

ŷ0 = β̂0 + β̂1·age + β̂2·gender + β̂3·treatment

μ̂ = γ̂0 + γ̂1·age + γ̂2·gender + γ̂3·treatment,

with the coefficient estimates as given in Table 1.

Using the lifetest procedure in SAS, we can obtain survival plots. We now compare these plots (Figure 1) with those generated by our regression.


Figure 1: Survival Curves by Gender.

According to our regression, gender is significant to both the initial health status y0 and the mean drift of the health process μ. Using the lifetest procedure in SAS, we see from the Kaplan-Meier survival estimates that males have a higher initial health status than females, but also that just before the median survival time the two curves cross twice. This means that near 3.3 years into the study, the males lost their health advantage, but quickly regained it. Males continued to out-survive females for the remainder of the study, with both groups having similar drifts, as evidenced by the similar steepness of the two curves.

In our regression model, however, the females begin with a very slight advantage in initial health, but it is quickly lost when the female curve takes a steep turn downward. Both curves approach the threshold steadily, as in the lifetest procedure, and the median survival times are well predicted by the regression.

In our next set of curves (Figure 2), separated by treatment, we again focus on a variable that is significant to both y0 and μ. Using the lifetest procedure, we see that treatment 2 has a considerable disadvantage in initial health status. We also observe that all three curves cross each other multiple times, but are closest around 3.5 years after the study begins.


Figure 2: Survival Curves by Treatment.

This behavior is also found in our regression model, where the three curves reverse their order: treatment 0 goes from being the strongest to the weakest all within year three, while treatment 2 does the opposite. In our model, treatments 0 and 1 have approximately the same curve until the end of the third year. In all cases, the differences in the treatments are not evident until at least three years into the study, and the threshold is reached for both treatments 0 and 1.

Melanoma Data

Our second data set, from a true study of melanoma patients, is observed in the same way as the previous data, that is, with only one set of observations per subject. Here, the time until death or censoring is measured along with several covariates: sex of the patient, amount of inflammatory cell infiltrate (ICI), an epithelioid cell type indicator (ecells), an ulceration indicator, thickness of the tumor (mm) and age of the patient.

Again, linear link functions were used to solve for the coefficients of

y0 = β0 + β1·sex + β2·ici + β3·ecells + β4·ulceration + β5·thickness + β6·age

μ = γ0 + γ1·sex + γ2·ici + γ3·ecells + γ4·ulceration + γ5·thickness + γ6·age.

The same method is used to maximize the same likelihood function, and the coefficients are found and presented in the following Table 2, with α = 0.05:


Table 2: Regression coefficient estimates with t-test significance at α = 0.05.

These coefficients yield the fitted functions

ŷ0 = β̂0 + β̂1·sex + β̂2·ici + β̂3·ecells + β̂4·ulceration + β̂5·thickness + β̂6·age

μ̂ = γ̂0 + γ̂1·sex + γ̂2·ici + γ̂3·ecells + γ̂4·ulceration + γ̂5·thickness + γ̂6·age,

with the coefficient estimates as given in Table 2.

Our regression tells us that ulceration is significant only for y0, the initial state of the health process of the patients. That difference is evident in the curves (Figure 3) provided by the lifetest procedure in SAS, but there also seems to be a rather large discrepancy between the drift of the two curves, which is not the case in our regression. In the Kaplan-Meier estimates, the two curves do not cross, while in our regression model, the two curves are exactly the same. This is consistent with the insignificance of drift in our model, but does not account for the differences in initial health. The regression model does, however, accurately predict the median survival time for ulcerated patients and demonstrates that neither group reaches the threshold.


Figure 3: Survival Curves by Ulcerations.

The regression model states that the age of the patient is significant to μ only. The age variable breaks the data into smaller groups, making some of the curves less informative. Groups with small sample sizes, such as those over 80 and those under 20, have curves that react strongly to a small number of failures, so these curves are often misleading in the Kaplan-Meier estimates. Accounting for this, we see that the differences in initial health are mostly due to these smaller groups and that there is otherwise a rather straightforward correlation between age and drift: the older the patient, the steeper the survival curve. It is also evident that earlier in the trial, before 1000 days, the survival curves of most of the age groups are rather tight, with much crossing and overlap, but after the 1000-day mark, the curves are considerably more separated, demonstrating that as time progresses, age plays a greater role in recovery and failure. It should also be noted that around 3000 days, the 40-80 year-olds have similar curves, but return to their previous pattern as the study progresses.

Our regression does not demonstrate any differences in initial health status, but is consistent with the major pattern found in the Kaplan-Meier estimates. Again, due to smaller sample sizes, the median survival times are not well-predicted for our data, but all the curves (Figure 4) stay well above the threshold, as is consistent with the data.


Figure 4: Survival Curves by Age.

Ventilator-associated Pneumonia Data

In a recent study of ventilated patients taken to the intensive care unit (ICU) at a local hospital, 246 patients were observed to measure the time until ventilator-associated pneumonia occurred. Again, no longitudinal data were used, but one observation per subject was recorded with the time variable observed being the length of time on the ventilator and an indicator variable stating whether or not the patient contracted ventilator-associated pneumonia. The other covariates of note include: a chlorohexidine indicator (chx), the length of stay in the hospital in days (hstay), the length of stay in the ICU in days (icustay), the intubation site (intsite), an unplanned extubation indicator (uext), a reintubation indicator (reint), the age of the patient, the sex of the patient, the race of the patient, the admission diagnosis of the patient (adx) and a co-morbidity COPD indicator (copd). Note that a failure in this example is not a death due to a disease, but simply being diagnosed with ventilator-associated pneumonia, demonstrating that “failure" can have a number of meanings in first hitting time models.

Once more, linear link functions were used to solve for the regression coefficients given in

y0 = β0 + β1·chx + β2·hstay + β3·icustay + β4·intsite + β5·uext + β6·reint + β7·age + β8·sex + β9·race + β10·adx + β11·copd

μ = γ0 + γ1·chx + γ2·hstay + γ3·icustay + γ4·intsite + γ5·uext + γ6·reint + γ7·age + γ8·sex + γ9·race + γ10·adx + γ11·copd.

These link functions yielded the results in the following Table 3, with α = 0.05:


Table 3: Regression coefficient estimates from the linear link functions, with t-test significance at α = 0.05.

So we have the fitted functions

ŷ0 = β̂0 + β̂1·chx + β̂2·hstay + β̂3·icustay + β̂4·intsite + β̂5·uext + β̂6·reint + β̂7·age + β̂8·sex + β̂9·race + β̂10·adx + β̂11·copd

μ̂ = γ̂0 + γ̂1·chx + γ̂2·hstay + γ̂3·icustay + γ̂4·intsite + γ̂5·uext + γ̂6·reint + γ̂7·age + γ̂8·sex + γ̂9·race + γ̂10·adx + γ̂11·copd,

with the coefficient estimates as given in Table 3.

We will again compare the survival curves (Figure 5) for our regression with the Kaplan-Meier estimates generated by the lifetest procedure in SAS.


Figure 5: Survival Curves by Length of Stay in the ICU.

The regression output states that a patient’s length of stay in the ICU is significant only for the initial health status y0. This variable is too varied to accurately depict all of the data, but with these limited sample sizes, we can see that the Kaplan-Meier estimates all have similar slopes for the first two weeks of the study. Furthermore, we can see that the initial health statuses do indeed differ, with those staying in the ICU for approximately two weeks outperforming those staying for three weeks by about three days, those staying for one week by about nine days and those staying for less than a week by more than ten days. In the Kaplan-Meier estimates, the only group reaching the threshold is those staying in the ICU for approximately one week.

In our regression model, all three curves have nearly the exact same drift, with very similar steepness of the curves. The differences in initial health are evident from the length of time each curve takes to first drop from the 100% survival rate. As in the Kaplan-Meier estimates, the curves for one week and three weeks are nearly identical, while that for patients staying in the ICU for approximately two weeks has the highest median survival time, around 13.5 days. It is noteworthy that a patient staying in the ICU for three weeks can still have a failure time prior to 21 days into the study, as a failure in this case does not indicate a death, simply the contraction of a disease.

The admission diagnosis of a patient is perhaps the most informative of the variables in this study, as it accounts for much of what happens to the patient after arriving at the ICU. The admission diagnosis is considered by our regression to be significant for both y0 and μ. The Kaplan-Meier estimates provided by the lifetest procedure display clear differences in the initial health statuses of patients from all eight groups, but it should again be noted that small group sizes for some diagnoses, especially 6, 7 and 8 (post-operative, sepsis and miscellaneous, respectively), make for dramatic turns or a complete lack thereof in some curves. Even with a large sample size, though, patients admitted with cardiovascular diagnoses were then diagnosed with ventilator-associated pneumonia very quickly and often. The cardiovascular group reaches the threshold by day 11.

The regression model shows that the three most common diagnoses, neurology, pulmonary and cardiovascular, have close but distinct curves (Figure 6), all of which reach the threshold by day 18. Patients initially diagnosed with neurological problems have an advantage over the other two in initial health, but have a steeper curve overall, showing the differences in mean drift of the processes. The differences in the drift for all the groups increase with time, as demonstrated by the drastic spread of the curves after day 8.


Figure 6: Survival Curves by Admission Diagnosis.

Finally, our regression marks reintubation of the patient as a significant variable for μ only. In the Kaplan-Meier estimates, reintubated patients have some advantage in initial health status, but the curve drops quickly, reaching the threshold by day 18. Patients for whom the variable was not applicable, meaning there was no unplanned extubation to reverse, decreased the most slowly, indicating that avoiding reintubation and unplanned extubation altogether will result in the strongest survival curve (Figure 7).


Figure 7: Survival Curves by Reintubation.

The regression model seems to display three different initial health statuses, but the significant differences lie in the drift of the processes. As expected, once the reintubated patient’s curve begins to drop, it drops quickly and almost straight down to the threshold. The small size of this group could lead to this dramatic drop, which is almost exactly mirrored by the non-reintubated patient’s curve approximately three days earlier. The large group, for whom reintubation was not an issue, has a less steep slope to its curve, overtaking non-reintubated patients by day 8 and reintubated patients by day 11, demonstrating its significantly different mean drift.

Conclusion

Although the proportional hazards model has its uses, it relies too heavily on the proportional hazards assumption, which can fail quickly in application. In implementing threshold regression, a specialized case of the first hitting time model, this assumption is avoided, creating more opportunities for practice and function. Threshold regression models can also help to emphasize specific components that lead to changes in the health status of a subject by identifying significant and insignificant covariates as they are applied in regression [3].

References