19 June 2018

Defensive coding is an important part of Good Programming Practice (GPP) and has been defined as "an approach to programming intended to anticipate future changes of the data that might influence the coding algorithms. Ideally programs should be written in such a way that they will continue to work correctly in case of new or unexpected data values which did not exist at the time the code was developed" [ref]. This is especially important if we develop the code on a subset of data, so the analyses can be generated as soon as the final data become available.

For example, when selecting a lab parameter we may write the code as "if upcase(param)='RBC'" (RBC = red blood cells). We expect the text variable param to be in upper case, but because of the upcase function the code also works if the text is in lower case or mixed case. Or, if we want to select patients who experience adverse events we may write "if upcase(AE)=:'Y'". The use of "=:" ensures that we catch 'Y', 'y', 'yes', 'Yes' and 'YES'. Another very common mistake made by programmers new to SAS is to write e.g. "if age lt 20 then agecat='<20'". This ought to be "if age ne . and age lt 20 then agecat='<20'" because SAS treats missing values as smaller than any number (in effect, negative infinity), and we should always allow for the possibility of missing data, even if it is not expected.
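The three checks above can be combined in a single data step. This is only a minimal sketch: the datasets `labs` and `checked`, the derived variables `is_rbc` and `has_ae`, and the test values are hypothetical and chosen purely for illustration.

```sas
/* Hypothetical test data: names and values are illustrative only */
data labs;
  length param $8 ae $3;
  input param $ ae $ age;
  datalines;
rbc Yes 18
RBC n .
Wbc y 45
;
run;

data checked;
  set labs;
  length agecat $4;
  /* upcase() makes the comparison insensitive to case */
  is_rbc = (upcase(param) = 'RBC');
  /* =: compares only as many characters as the shorter operand */
  has_ae = (upcase(ae) =: 'Y');
  /* a missing age sorts below every number, so exclude it explicitly */
  if not missing(age) and age lt 20 then agecat = '<20';
  else if age ge 20 then agecat = '>=20';
run;
```

With the explicit check, the record with a missing age is left with a blank agecat rather than being counted as under 20.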

Defensive coding requires us to anticipate unexpected results (an oxymoron perhaps). Consider the following code and resulting output:

The code has not worked and the result is unexpected. We should instead merge the data sets first and then in a second data step add the asterisk to sex. Consider another example:

Again, the result is not what we desire. If the variable being amended (in this case var2) exists on the first data set, then we do not have this erroneous result.
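One situation that produces the first kind of surprise described above (an asterisk added to sex in the same data step as a merge) is a one-to-many match-merge. The sketch below is an assumption about the sort of code involved, not the original listing; the datasets `demo` and `visits` and their contents are hypothetical.

```sas
/* Hypothetical data: one row per patient in demo, several in visits */
data demo;
  length sex $3;
  input patid sex $;
  datalines;
1 M
2 F
;
run;

data visits;
  input patid visit;
  datalines;
1 1
1 2
2 1
2 2
;
run;

/* Risky: sex is read from demo only once per by-group and then retained, */
/* so the amended value is amended again on each later row of the group   */
data risky;
  merge visits demo;
  by patid;
  sex = cats(sex, '*');
run;

/* Safer: merge first, then amend the variable in a second data step */
data merged;
  merge visits demo;
  by patid;
run;

data safe;
  set merged;
  sex = cats(sex, '*');
run;
```

In `risky` the second row of each patient carries two asterisks, because the value already amended on the first row is still in the program data vector when the second row is built. In `safe` every row carries exactly one.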

Finally, it is important to search the log for the right terms when confirming the program has run without a hitch. I guess most programmers maintain a list of such terms. Personally I search for: error, warning, repeat ("merge statement has more than one data set with repeats of by values"), multiple ("multiple lengths were specified for the variable"), gener ("missing values were generated as a result of performing an operation on missing values"), initial (variable is "uninitialized"; note the American spelling).
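The search itself can be automated once the log has been saved to a file. The following is a rough sketch, assuming the log was written to a hypothetical file `myprogram.log` (for example via proc printto or a batch run); the dataset and variable names are likewise illustrative.

```sas
/* Hypothetical path: point this at the saved log of the program being checked */
filename runlog 'myprogram.log';

data log_hits;
  infile runlog truncover;
  input line $char256.;
  length term $10;
  lineno = _n_;
  /* search terms taken from the list above; lowcase() makes the match case-insensitive */
  do term = 'error', 'warning', 'repeat', 'multiple', 'gener', 'initial';
    if index(lowcase(line), strip(term)) > 0 then output;
  end;
run;

proc print data=log_hits noobs;
  var lineno term line;
run;
```

A line that matches several terms is listed once per term, which errs on the side of showing too much rather than too little.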

# Papers and Programs

Statistical programming in medical research

### Unexpected results and defensive coding

17 June 2018

Last-observation-carried-forward is no longer a preferred method of imputation. It is considered quite crude and susceptible to bias, especially in e.g. Alzheimer's disease (EMA points to consider on missing data), and more sophisticated methods are favoured such as multiple imputation and mixed modelling when there are repeated measures over time. However, the code once used routinely to carry the last observation forward may be useful in analogous situations. Barbalau has described different ways of coding LOCF [ref]. I prefer to use the retain statement as follows:
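A minimal sketch of LOCF with the retain statement, assuming a hypothetical dataset `visits` sorted by patient and visit, with a numeric variable `value` that is sometimes missing (all names are illustrative):

```sas
/* Hypothetical data: missing values of value are to be filled forward within patient */
data visits;
  input patid visit value;
  datalines;
1 1 10
1 2 .
1 3 12
2 1 .
2 2 7
2 3 .
;
run;

data locf;
  set visits;
  by patid;
  retain last_value;
  /* reset at the start of each patient so values never cross patients */
  if first.patid then last_value = .;
  /* remember the most recent observed value */
  if not missing(value) then last_value = value;
  value_locf = last_value;
  drop last_value;
run;
```

Resetting at first.patid is the defensive part: without it, a patient whose first visit is missing would inherit the last value of the previous patient.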

Incidentally, the retain statement can also be used to order variables in a dataset, but care must be taken (see SAS help).

19 May 2018

Papers: Multitype events and the analysis of heart failure readmissions

Frailty modelling for multitype recurrent events in clinical trials

Programs: PaulBrownPhD/mtre

### Introduction

In other blog posts I have described composite endpoints. A statistician may feel uneasy about composites that whittle down variables into a single value that quantifies the totality of treatment benefit. A loss of power is likely and tied to arbitrary decisions embedded within the definition of the composite. There seems to be a necessary trade-off in this regard with an increase in statistical efficiency coinciding with a decrease in clinical relevance, and vice versa (roughly speaking). In other words, at one end of a spectrum, opposing clinical composites, is multivariate modelling, the more natural choice as far as the statistician is concerned.

That composites are leaking statistical power may not be readily discerned because power estimates based on composites can be crude and unreliable; data simulations are required, especially when the number of component outcomes is large (>3 say) [ref]. Guesstimates for the correlations among outcomes will be assumed for the simulation and are obtainable from previous study data or registry data. Power estimates should also consider a range of plausible effect sizes; however, it can be difficult to anticipate discordant effects (qualitative heterogeneity), and the more outcomes included in the composite, the greater the risk of discordant effects and a loss of power[ref] or ambivalent results. In any case, statistical power should not be the driving factor when selecting a composite[ref]; the ultimate justification has to be a clinical one. Thus, it feels disingenuous to claim a composite has been employed to enhance power (a claim often made), especially when the alternatives may offer superior power and are paid little heed.

### Neglected alternatives

One might think our impulse would be to use multivariate methods on multivariate data but instead we find ourselves discoursing on the best way to compress multivariate data into a nonparametric univariate analysis. However, alongside the expanding literature that simultaneously promotes and condemns composites, we are beginning to see some researchers explicitly rejecting the use of a composite in favour of modelling in their clinical trials[ref1, ref2]. And, at this moment, when composites have become the default thinking, we are encouraged to consider whether joint modelling remedies those issues tied to composites or whether it merely introduces problems of its own. After all, when we choose a composite for the primary analysis we are implicitly dismissing the alternative.

The characteristic feature of this alternative approach is the simultaneous modelling of correlated outcomes that collectively measure disease progression (assuming for the moment that outcomes are of the same type). There is no intermingling or prepping of outcomes and a subsequent loss of information as with composites. Thus, separate estimates of the treatment effect, and an assessment of heterogeneity, are a consequence of the model (reporting these statistics has been widely recommended as essential for the interpretation of composites[ref1, ref2, ref3, ref4], although they may often be absent[ref]). The model could assume a common effect across outcomes if this was deemed plausible. Otherwise an estimate of the overall effect could be calculated as a contrast of the individual estimates and thus incorporate weights. Unlike composites, the weighting is not inherent, i.e. a consequence of the algorithm for deriving the composite; it is instead applied after the model has been fitted and is therefore made explicit. This is important given the subjectivity of weighting outcomes, e.g. patients and clinicians may prioritise outcomes differently[ref].

Since both approaches yield an estimate of overall benefit it is instructive to compare the results. For example, Mascha et al. contrasted a population average (generalised estimating equation (GEE)) model with the any-versus-none composite for multiple binary outcomes: complications classified by organ system for patients undergoing surgery[ref]. An odds ratio was estimated for the composite and a weighted average odds ratio was derived from the GEE model (see example SAS code below). The latter was more extreme, i.e. further from 1, and statistically significant, while the composite odds ratio was not significant (the p-value shifted from 0.169 to 0.023). Advantages of the modelling approach noted by the authors include "use of more information per subject, ability to apply clinical importance weights, and in most cases greater statistical power". Unlike the GEE model, power for the composite was sensitive to baseline frequencies which are difficult to anticipate; hence powering on a GEE model when designing a study may be preferable[ref].
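As a rough illustration of the GEE approach for multiple binary outcomes, the sketch below fits a common treatment effect across outcomes with PROC GENMOD. It is not the analysis of Mascha et al.: it is simpler than their weighted average of outcome-specific odds ratios, and the dataset `complications`, its variables and the simulated values are all hypothetical.

```sas
/* Hypothetical long-format data: one row per patient per outcome.      */
/* Outcomes are simulated independently here purely to make the sketch  */
/* runnable; real complications would be correlated within patient.     */
data complications;
  call streaminit(2018);
  do patid = 1 to 200;
    trt = mod(patid, 2);
    do outcome = 1 to 3;
      y = rand('bernoulli', 0.30 - 0.10*trt);
      output;
    end;
  end;
run;

proc genmod data=complications descending;
  class patid outcome;
  /* outcome-specific intercepts and a common treatment effect */
  model y = outcome trt / dist=binomial link=logit;
  /* unstructured working correlation among a patient's outcomes */
  repeated subject=patid / within=outcome type=un;
  /* the exp option reports the common odds ratio */
  estimate 'Common treatment OR' trt 1 / exp;
run;
```

Adding an outcome*trt interaction to the model statement would give outcome-specific effects and a test of heterogeneity, from which a weighted contrast could then be built.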

Often our data include time to adverse events, rather than merely binary indicator variables. In this case the GEE model could be replaced by a random effects model with individual patient effects (frailties) that follow an assumed distribution. See the link to our paper above. We analysed mortality and heart failure related readmissions classified as emergency department visits and hospitalisations. Random effects for these event-types were assumed to follow a multivariate Normal distribution and the model was implemented in SAS (see code below; in our second paper linked above we give alternative software options for this model). Popular composites were included for comparison, namely time-to-first, the unmatched win-ratio, and days-alive-and-out-of-hospital. By bootstrapping study data, it was shown that the random effects model offered considerably more power. The model also allowed for an assessment of the associations among outcomes which is missing from the composite analysis (i.e. between mortality, emergency department visits and re-hospitalisations). Other authors have discussed composites and modelling for time-to-event data, e.g. Wu & Cook (time-to-first and Wei, Lin & Weissfeld marginal model)[ref] and Rogers et al. (win-ratio and joint frailty model)[ref], who both make a case for the more thorough analysis, with the treatment effect underestimated by the composite.

Sometimes, in addition to time-to-event data we have a longitudinal outcome (typically a biomarker)[ref]. In this scenario a global rank composite[ref] or a random effects joint model[ref] could be used. A group from Utrecht made such a comparison and showed using data simulations, once again, that the joint model offers superior statistical power[ref]. Such joint modelling of outcomes has also been shown to reduce bias and lead to more efficient estimates, which implies a smaller required sample size according to the strength of the association between biomarker and time-to-event outcomes[ref]. Random effects modelling has been described for more eclectic outcomes[ref]; however, we may wish to turn our attention to latent variable models in this case. For example, Teixeira-Pinto & Mauri used a latent variable model to analyse outcomes after coronary stenting[ref] and highlighted the advantages over a composite with attention given to missing data[ref]. However, the modelling approach cannot provide a meaningful estimate of overall benefit across disparate outcomes, and a strong case could be made for composites under these circumstances. Gardiner has described SAS code[ref].

### Final remarks

It is easy to find fault with composites when they appear cobbled together and ad hoc; the criticisms are well-known. Yet composites remain a favoured approach. Advances in software and methodology enjoin statisticians to adopt new and better methods and acknowledge that the demand for simplicity may not be extraneous (e.g. dictated by regulatory authorities or clients or the wider medical community) but self-imposed.

### Notable sources:

Cook, R. and Lawless, J. (2007). The Statistical Analysis of Recurrent Events. http://www.springer.com/gp/book/9780387698090#

05 April 2018

Papers: How do we measure the effect size?

Influence of component outcomes on the composite

Programs: PaulBrownPhD/probindex

My concern is that convention becomes a safeguard for suboptimal methods. This is seen in drug development and the 'slow march to market' where convention often entails the efficient (repeated) use of inefficient methods. Convention and simplicity make statisticians and programmers efficient and their work less prone to error. Interestingly, though, composites have been promoted largely by clinicians because clinical understanding is needed to inform the construction of the composite. Statistics journals have paid little heed[ref]; they belatedly publish data simulations to identify the faults with certain composites. It seems that statisticians may be failing to influence the discussion.

Obtaining the confidence intervals is a little more difficult; see the full code in GitLab (link above).

### Types of composite

There are a number of motivating factors for employing a composite. For example, a composite may be contrived to handle missing data[ref] or competing risks[ref]; or yield phase II results that better predict phase III[ref]; or offer a more succinct clinically meaningful measure[ref]; or capture risk-benefit; or increase statistical power[ref]. Various algorithms for constructing a composite from certain component outcomes have been described and may reveal such an impetus.

Some composites combine endpoints of the same type i.e. time-to-event or binary, while others combine miscellaneous types. The former often simply collapse relevant endpoints e.g. time-to-first[ref], any-versus-none[ref]. The latter, on the other hand, may attempt to rank patients from the most adverse response to the most favourable, while bearing in mind a select group of prioritised outcomes e.g. a clinical score[ref], and the global rank[ref] and unmatched win-ratio[ref] (the unmatched win-ratio was described for time-to-event outcomes but is easily adapted for multiple noncommensurate outcomes[ref]).

Other composites unify outcomes to create a measure that is itself intrinsically meaningful e.g. days-alive-and-out-of-hospital[ref], and a trichotomous clinical composite[ref]. Finally, there are those composites that standardise responses from disparate outcomes before taking a sum or average across components[ref1, ref2]. More than simply being proposed, composites of disparate outcomes are becoming increasingly common primary outcomes in important randomised controlled trials[ref1, ref2, ref3]. These composites have been compared[ref1, ref2].
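To make the last of these constructions concrete, here is a minimal sketch of an average z-score composite. The dataset `trial`, the two components and the simulated values are hypothetical; in practice every component would first need to be oriented so that higher values mean the same thing (here, worse).

```sas
/* Hypothetical data: two continuous outcomes where higher is worse */
data trial;
  call streaminit(2018);
  do patid = 1 to 100;
    trt = mod(patid, 2);
    troponin = rand('normal', 10 - trt, 3);
    ntprobnp = rand('normal', 50 - 5*trt, 20);
    output;
  end;
run;

/* standardise each component to mean 0 and standard deviation 1 */
proc standard data=trial mean=0 std=1 out=ztrial;
  var troponin ntprobnp;
run;

/* composite = straight average of the standardised components */
data ztrial;
  set ztrial;
  zscore = mean(troponin, ntprobnp);
run;
```

Note that mean() ignores missing arguments, so a patient with one missing component is scored on the other alone; whether that is acceptable is a design decision, not a default to rely on.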

### The problems with composites

Because composites are often employed as the primary endpoint in clinical trials, they affect current debates in cardiovascular medicine, such as the benefit of statin therapies. A survey revealed that approximately 50% of cardiovascular clinical trials adopted a composite[ref]. Although the research environment is not entirely similar and new composites have emerged, the conclusions of a literature review conducted over 20 years ago regarding the use of composites ring true today: "There are serious deficiencies in the methodology currently used in the construction of [composites]. First, many authors develop ad hoc arbitrarily constructed [composites] for immediate use (often as a primary outcomes measure) in descriptive or comparative studies. Construction of such ad hoc [composites] without evaluation of their measurement properties is scientifically debatable"[ref].

Papers proposing new composites rarely run data simulations to evaluate their performance; these normally appear in the literature much later [eg ref]. However, both the use of composites and criticism highlighting their limitations have been presented [eg ref1, ref2]. In fact, the European Medicines Agency guideline on research in acute heart failure specifically recommends against the use of composites that comprise disparate outcomes[ref].

Basically, composites are complex constructions that often yield limited ordinal responses. Regarding the weighting of component outcomes, researchers often declare that no weighting has been employed. However, weighting can be implied by the construction of the composite and can also be data-dependent. What is normally meant by 'weighting' is the set of numerical coefficients specified by an investigator to yield a weighted estimate of the treatment effect. However, any time outcomes are prioritised there is a weighting mechanism at play. For example, a global rank may ignore almost completely those outcomes given low priority, or, conversely, it may be dominated by them [ref]. Even a time-to-first composite is favouring those outcomes with the higher incidence rate; a moderate difference on mortality may be drowned out by another less important event-type where no difference is observed. Clearly any masking of effects implies some weighting of outcomes or favouritism. Thus, there is a disproportionate representation of outcomes in the composite, and it is difficult to anticipate, inadvertent and often unknown.

### Influence

To illustrate this point we used data simulations (see link to paper above). The probability index is used as an effect measure with the assumed effect size of the individual component plotted on the horizontal axis and the resulting effect size for the composite on the vertical axis (see figure in paper). Thus we may define the slope of the line as the Influence of the component outcome. The investigator could explore such a plot when designing a trial to get a sense of how the composite is weighting the components e.g. whether some components can overwhelm the composite while others are suppressed. An estimate of the slope which quantifies Influence could be reported, although it is dependent on the assumed effect sizes and not just the definition of the composite.

For example, in a global rank the outcomes will be favoured according to the hierarchy that prioritises them, but the extent to which outcomes with lower priority are ignored depends on the data. We used data simulations to generate the figure below for two composites comprising outcomes: mortality, dyspnea, troponin, creatinine, NT-proBNP (as per Felker & Maisel [ref]). We can see that the global rank composite is insensitive to the biomarker NT-proBNP (in the hierarchy of outcomes it was positioned last) and more influenced by dyspnea, a subjective outcome (large variance) which was prioritised after mortality (a low death rate was assumed). On the other hand, the average z-score shows a more congruent relationship with the individual outcomes because it is a straight average of z-scores. This weighting or Influence is not explicit and barely intentional.
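The probability index used as the effect measure above can be estimated directly from all treated-versus-control pairs. The sketch below assumes a hypothetical dataset `scores` with a simulated composite score where higher is better; it gives the point estimate only and is not the code from the linked repository.

```sas
/* Hypothetical composite scores, higher = better response */
data scores;
  call streaminit(2018);
  do patid = 1 to 100;
    trt = mod(patid, 2);
    score = rand('normal', 0.3*trt, 1);
    output;
  end;
run;

/* compare every treated patient with every control patient:          */
/* a win counts 1, a tie counts 0.5, a loss counts 0; the mean is the */
/* estimated probability index                                        */
proc sql;
  select mean((t.score > c.score) + 0.5*(t.score = c.score)) as prob_index
  from scores(where=(trt=1)) as t,
       scores(where=(trt=0)) as c;
quit;
```

Repeating this over a grid of assumed effect sizes for one component, and plotting the composite's probability index against the component's, would trace out the line whose slope is the Influence described above.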

Since an unmatched win-ratio or global rank of multiple endpoints is attempting to arrange patients according to their overall response, we might say that the relative contribution of each outcome to the composite is beside the point. However, if after statistical analysis we have declared the new treatment to be superior, we would like to know what is driving this result. If the mortality and hospital readmission rates are low, then the result may well be dominated by a biomarker, i.e. the composite would be very highly correlated with an endpoint which is obviously considered tenuous; otherwise it would have been deemed the primary outcome. Since we do not know exactly what the outcome is, i.e. what the composite is made up of, we cannot make sense of the result. We should of course look at the results from separate (under-powered) analyses of the components, but this can give rise to contentious discussion when it becomes clear that effects have been masked or subdued or are counteracted in the composite.

Our results show that the Influence of the individual outcomes that comprise a composite is not well anticipated. This is slightly analogous to data-dependent methods of covariate adjustment e.g. stepwise methods. In the analysis plan or protocol we can describe the algorithm for selecting covariates to retain in the model, but we cannot say what the analysis will ultimately adjust for. This is perhaps one reason why such methods are out of favour. As Senn says: "the wisest course open to the frequentist is to make a list of covariates suspected to be important and to fit these regardless" [ref]. Likewise with composite endpoints: if an analysis plan states that the primary endpoint is a global rank of several outcomes, or time-to-first of a number of adverse events, etc., we are informed of the algorithm only. The degree to which the amalgamated outcome represents the component outcomes is speculation. In other words, we cannot articulate exactly what our primary outcome is. Influence calls into question the belief that a composite measures the "overall" effect of treatment.

It is an obvious sleight of hand: there is a gain in power while appearing to use clinical outcomes. And there is no obligation to specify post hoc what the contribution of the individual outcomes turned out to be. In other words, what the primary endpoint turned out to be. If the audience were informed of this, would it not affect their interpretation of the findings? Will it not affect the reproducibility of the results? It will not be too surprising if such analyses lead to ambiguous and contentious results. It is conceivable that our global rank reduces to a biomarker i.e. is highly correlated with it. Or a time-to-first endpoint neglects mortality. And we learn this only after massive investment of resources in the trial. There ought to be some awareness of this risk, and also the risk of opposing effects. Influence, as we have defined it, could be gauged using data simulations at the design stage to highlight the issues.

My concern is that convention becomes a safeguard for suboptimal methods. This is seen in drug development and the 'slow march to market' where convention often entails the efficient (repeated) use of inefficient methods. Convention and simplicity make statisticians and programmers efficient and their work less prone to error. Interestingly, though, composites have been promoted largely by clinicians because clinical understanding is needed to inform the construction of the composite. Statistics journals have paid little heed [ref]; they belatedly publish data simulations to identify the faults with certain composites. It seems that statisticians may be failing to influence the discussion.

### Appendix

The probability index (PI) is easily obtained from SAS using proc npar1way. It is derived from the Mann-Whitney U statistic (for survival outcomes we would use Gehan's generalised Wilcoxon). U may be thought of as the number of 'wins' resulting if every patient in the Active group were compared to every patient in the Control group. The probability index is this number divided by the total number of such comparisons (i.e. the number of patients in one group multiplied by the number in the other). The SAS code is as follows:

Obtaining the confidence intervals is a little more difficult; see the full code in GitLab (link above).
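The counting definition of the probability index given above can be sketched in a few lines. This is a Python illustration of the arithmetic only, not the SAS code from the post and not what proc npar1way does internally; the function name, the toy data, the assumption that higher values are better, and the convention of counting a tie as half a win are all assumptions of the sketch.

```python
def probability_index(active, control):
    """Share of all Active-vs-Control pairs in which the Active patient 'wins'.

    Assumes a higher value is the better outcome and counts a tie as half a win
    (a common convention, assumed here rather than taken from the post).
    """
    wins = 0.0
    for a in active:
        for c in control:
            if a > c:
                wins += 1.0
            elif a == c:
                wins += 0.5
    # Total comparisons = patients in one group x patients in the other
    return wins / (len(active) * len(control))


# Hypothetical toy data: 4 Active and 3 Control patients, so 12 comparisons
active = [5, 7, 8, 9]
control = [4, 6, 7]
pi = probability_index(active, control)
print(round(pi, 4))  # 9.5 wins out of 12 comparisons -> 0.7917
```

A value of 0.5 corresponds to no difference between the groups; the further the index is from 0.5, the more consistently one group wins.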

### Related post:

Composite endpoints and sample size estimation
08 February 2018

Papers: Composite endpoints in acute heart failure research

Power and sample size estimation for rank-based composite endpoints

Programs: PaulBrownPhD/cepower

### The need for composites

The Medical Research Council guideline on Developing and Evaluating Complex Interventions states: "A single primary outcome, and a small number of secondary outcomes, is the most straightforward from the point of view of statistical analysis. However, this may not represent the best use of the data, and may not provide an adequate assessment of the success or otherwise of an intervention which may have effects across a range of domains" (ref).

Thus, for certain disease states there is a shift away from designating a single endpoint as the primary outcome of a clinical trial. When the disease condition can be represented by multiple endpoints, allowing conclusions to be dictated by a significance test on one of these alone is inadequate. This dilemma is more acute when the statistical power endowed by endpoints is inversely proportional to their importance. For example, in heart failure trials, the clinical outcomes with low incidence (such as mortality) yield impractical sample sizes, yet a sensitive biomarker which provides sufficient power remains a surrogate outcome. Therefore, combining endpoints to form a univariate outcome that measures total benefit has been the trend. Potentially, this 'composite endpoint' offers reasonable statistical power while tracking the treatment response across a constellation of symptoms and obviating the normal issues that arise from multiple testing i.e. an inflated alpha.

The selection of endpoints to form the composite is not restricted and is somewhat arbitrary. The component outcomes will all reflect the underlying clinical condition but they should not be too highly correlated with each other in which case the information gain is minimal. Also, the anticipated effect of treatment should be in the same direction across all component outcomes, but not necessarily of the same magnitude.

In the CJC paper (available at the above web link) we considered four composites: the average Z-score (ZS), win-ratio (WR), global rank (GR) and clinical composite (CC). The construction of WR and GR is illustrated in the following two figures.
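Independently of the figures, the hierarchical logic behind an unmatched win-ratio can be sketched as follows. This is a simplified Python illustration, not the construction used in the CJC paper: it ignores censoring, follow-up time and any clinical thresholds, and the outcome names (`alive`, `dyspnea`), their ordering and the toy patients are hypothetical.

```python
import math


def compare_pair(a, c, hierarchy):
    """Return +1 if Active patient a wins, -1 if a loses, 0 if tied throughout.

    hierarchy is an ordered list of (outcome_name, higher_is_better) pairs;
    a lower-priority outcome is consulted only if all higher ones are tied.
    """
    for name, higher_is_better in hierarchy:
        if a[name] == c[name]:
            continue  # tied on this outcome: move down the hierarchy
        a_is_better = (a[name] > c[name]) == higher_is_better
        return 1 if a_is_better else -1
    return 0


def win_ratio(active, control, hierarchy):
    """Compare every Active patient with every Control patient (unmatched)."""
    wins = losses = 0
    for a in active:
        for c in control:
            result = compare_pair(a, c, hierarchy)
            if result > 0:
                wins += 1
            elif result < 0:
                losses += 1
    ratio = wins / losses if losses else math.inf
    return wins, losses, ratio


# Hypothetical hierarchy: survival first, then a dyspnea score (lower is better)
hierarchy = [("alive", True), ("dyspnea", False)]
active = [{"alive": 1, "dyspnea": 2}, {"alive": 1, "dyspnea": 4}, {"alive": 0, "dyspnea": 1}]
control = [{"alive": 1, "dyspnea": 3}, {"alive": 0, "dyspnea": 1}]

wins, losses, ratio = win_ratio(active, control, hierarchy)
print(wins, losses, ratio)  # 3 wins, 2 losses, 1 tie -> win ratio 1.5
```

Note how the dyspnea score only matters for pairs that are tied on survival, which is the sense in which the hierarchy weights the outcomes.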

The ZS and WR were proposed by statisticians and the GR and CC by clinicians, roughly speaking. Composites vary in their attempt to maximise clinical meaning while retaining statistical power, depending on who is advocating them, and it is unclear whether any achieve the right balance. They are usually pulled too far in one direction or the other. For example, statisticians have argued against the dichotomisation of endpoints for the sake of clinical meaning (ref). A reasonable point. But has this led us to become too fixated on power (ref) at the expense of clinical meaning (ref)?

### Data simulations

The JMASM paper (web link above) considers ZS and GR only (the WR and CC are easily coded and hence not included). The macros described in the paper and made available at the web link use data simulations to estimate power given certain assumptions about effect sizes and correlations among the outcomes. The following figure illustrates how random samples satisfying the pre-specified conditions are generated using iteration:

The macros can be implemented within SAS as follows:

It is necessary to throw the (superfluous) output to an .lst file using proc printto; otherwise SAS will stall when the output window is full and it needs to be cleared intermittently. If memory serves, it took about five days for the programs to run. The code was validated in many and various ways, for example by replicating the sample sizes for the BLAST study, which used ZS for the primary outcome, and FIGHT, which used GR.

Using these data simulations we can evaluate the performance of the composites as the assumed effect size varies. The following figure shows how power is affected when the assumed effect size on a given outcome shifts from pessimistic to optimistic. In this way we can see that the power of the overall composite is more sensitive to some outcomes than others, and this depends on the construction of the composite. For example, the WR does a better job of favouring outcomes higher up in the hierarchy than the GR because the cut-offs employed for the GR restrict the influence of outcomes, yet the WR restricts the influence of BNP. Thus the composites are weighting the individual outcomes.
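The idea of estimating power by simulation while varying the assumed effect on one outcome can be sketched as below. This is a Python illustration under strong simplifying assumptions, not the SAS macros from the JMASM paper: outcomes are unit-variance normal with a single common correlation `rho`, the composite is a plain average z-score, the test is a large-sample two-sided z-test, and every name and default value is hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def simulate_patient(rng, effects, rho):
    """Unit-variance normal outcomes with common pairwise correlation rho,
    shifted by the assumed treatment effects (in SD units)."""
    shared = rng.gauss(0.0, 1.0)
    return [e + rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
            for e in effects]


def average_z(group, pooled):
    """Average z-score per patient, standardising each outcome on the pooled sample."""
    k = len(pooled[0])
    mus = [mean([p[j] for p in pooled]) for j in range(k)]
    sds = [stdev([p[j] for p in pooled]) for j in range(k)]
    return [sum((p[j] - mus[j]) / sds[j] for j in range(k)) / k for p in group]


def estimate_power(effects, rho=0.3, n_per_group=50, n_sims=200, alpha=0.05, seed=1):
    """Proportion of simulated trials in which the composite is significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    no_effect = [0.0] * len(effects)
    rejections = 0
    for _ in range(n_sims):
        active = [simulate_patient(rng, effects, rho) for _ in range(n_per_group)]
        control = [simulate_patient(rng, no_effect, rho) for _ in range(n_per_group)]
        pooled = active + control
        za, zc = average_z(active, pooled), average_z(control, pooled)
        se = ((stdev(za) ** 2 + stdev(zc) ** 2) / n_per_group) ** 0.5
        if abs(mean(za) - mean(zc)) / se > z_crit:
            rejections += 1
    return rejections / n_sims


# Power as the assumed effect on the first outcome moves from pessimistic to
# optimistic, holding the other two outcomes at a fixed effect of 0.2 SD.
for first_effect in (0.0, 0.3, 0.6):
    print(first_effect, estimate_power([first_effect, 0.2, 0.2]))
```

Repeating the loop for each outcome in turn gives one power curve per component, which is the kind of comparison described above; a rank-based composite would need its own construction step in place of `average_z`.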

Weighting and power are inextricably linked and, although a hoped-for increase in power is a common justification for employing a composite, it is hardly persuasive. Sun et al. (ref in our paper) showed a single outcome can yield more power than a CC. This is partly because composites discard data with seeming indifference during their construction, e.g. time-to-first ignores recurrent events, event-types (and hence the correlations among them) and event severity. Adding an outcome does not necessarily compensate for this loss if the additional outcome is not sensitive to treatment (ref). Often clinical windows are imposed on the data and consequently events that exceed the cut-off time are dismissed as irrelevant.

### Related post:

Probability index for composite endpoints
16 June 2018

Papers: Derivation, validation, and refinement of a risk score

Programs: PaulBrownPhD/riskmod

15 June 2018

Papers: The importance of the ECG in patients with acute heart failure

Programs: PaulBrownPhD/comprisk

### Introduction

In this study, the secondary endpoint was time from hospital discharge to first heart failure related readmission. For the analysis, if the proportional hazards assumption is reasonable, we would use standard Cox regression to adjust for factors deemed clinically important (e.g. age, sex, blood pressure, etc.). The proportional hazards assumption was confirmed using a formal statistical test and a visual inspection of the log-cumulative hazard plot [see AFT blog post]. The important issue here is that time to readmission is not independent of mortality, which leads to censored survival times. There have been review papers [ref] calling for more appropriate analyses of such 'competing risks' and thus we can expect non-statistical reviewers to be conversant with the issue i.e. reviewers would certainly ask about censoring of readmission by death. There is a lack of consistency in how death is handled in this context [ref]. Because time to readmission starts at discharge, we lose the in-hospital deaths. However, mortality accounted for an appreciable number of the censoring events observed in this study (>30%).

There are two approaches. The cause-specific hazard is obtained by including the competing risk as censored observations (see first proc phreg below). Alternatively, it is a simple matter in proc phreg to implement Fine & Gray's model for the cumulative incidence function (CIF) using 'eventcode' (see the second proc phreg below, or see this SAS example). This is a new feature in SAS 9.4. As Ying et al. note (see weblink below): "An increasingly common practice of assessing the probability of a failure in competing risks analysis is to estimate the cumulative incidence function, which is the probability subdistribution function of failure from a specific cause".

The only difference in the second proc phreg is in the values listed in 'status' (which indicates censoring), i.e. status=2 (the competing risk) is not included and the event of interest appears in the eventcode statement. It is useful then, when deriving time-to-event endpoints, to include a status variable indicating 0=censored, 1=readmission and 2=patients who die before readmission. For the full analysis program see tableS1.sas at the GitLab link above.

There is a SAS macro for estimating cumulative incidence (CI) based on Gooley et al. (see Gooley weblink below). CI is variably referred to as the cause-specific risk and the crude incidence curve. A key remark in Gooley et al. is the following explanation of the probability calculation: "The contribution to the estimate of the probability of failure from the cause of interest due to failures that occur after patients are censored is increased over the contribution from previous failures. The increase is equal to the potential contribution from the censored patient(s) that is redistributed among patients known to be at risk of failure beyond the time that censoring occurred [so far, so good]. Note that if a patient fails from a competing risk, the potential contribution to the estimate for this patient becomes zero, as failure from the event of interest is no longer possible. Hence, patients who fail from a competing risk are treated differently from patients who are censored due to a lack of follow-up." I have placed the full SAS macro "CumIncid (full macro).sas" in GitLab; in addition, a concise version which I edited down to the essential code "CumIncid (simplified).sas" is made available.
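The calculation that Gooley et al. describe can be sketched with the usual nonparametric estimator: at each event time, the increment in cumulative incidence is the probability of still being free of any event multiplied by the proportion of at-risk patients who fail from the cause of interest. The following is a Python illustration, not the CumIncid macro; it reuses the status coding suggested above (0=censored, 1=readmission, 2=death before readmission), and the toy data are hypothetical.

```python
from itertools import groupby


def cumulative_incidence(times, status, cause=1):
    """Return [(t, CIF(t))] at each distinct time with an event of `cause`.

    status: 0 = censored, any other code = an event type (e.g. 1 or 2).
    """
    n_at_risk = len(times)
    surv = 1.0  # probability of being free of ANY event just before time t
    cif = 0.0
    curve = []
    records = sorted(zip(times, status))
    for t, group in groupby(records, key=lambda r: r[0]):
        codes = [s for _, s in group]
        d_cause = sum(1 for s in codes if s == cause)
        d_any = sum(1 for s in codes if s != 0)
        if d_cause:
            cif += surv * d_cause / n_at_risk
            curve.append((t, cif))
        # A competing event lowers surv, so it shrinks later increments;
        # a censored patient only leaves the risk set.
        surv *= 1.0 - d_any / n_at_risk
        n_at_risk -= len(codes)
    return curve


# Hypothetical data: six patients followed from discharge
times = [1, 2, 3, 4, 5, 6]
status = [1, 2, 0, 1, 0, 1]

readmission = cumulative_incidence(times, status, cause=1)
death = cumulative_incidence(times, status, cause=2)
print(readmission[-1])  # final cumulative incidence of readmission is 5/6
print(death[-1])        # final cumulative incidence of death is 1/6
```

In this toy example, treating the death at time 2 as an ordinary censored observation and taking one minus the Kaplan-Meier estimate would give 1 by time 6 rather than 5/6, which is the overestimation that the quoted passage explains.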

As an aside: we should ideally account for multiple readmissions within patients (see MTRE blog post). This can be handled by a shared frailty model or Andersen-Gill etc. In this case, the competing risk of death could be treated as just 'another event in the event process' i.e. the final event [ref]. This is crude, however, and a joint frailty model should be considered (see the SAS code at the bottom of Liu & Huang).

### Appendix: SAS procs for survival data

- __proc lifetest__: nonparametric; Kaplan-Meier estimates, log-rank and Wilcoxon tests (use the latter if the proportional hazards assumption is not reasonable); in the strata statement, after the forward slash, specify e.g. 'test=logrank' (then obtain separate estimates of S(t) for each stratum); as Dave Collett (p332) noted: "there can be a difference between the results of the log-rank test obtained using the strata statement alone and that obtained using the test statement alone. This is due to the different ways in which tied survival times are handled by the two statements"
- __proc lifereg__: parametric; specify baseline hazard or accelerated failure time model (see AFT blog post); yields acceleration factor
- __proc phreg__: semi-parametric; Cox's proportional hazards model, allows adjustment for covariates; yields hazard ratio; Wald test, likelihood score test (if there are no tied survival times this test in Cox regression is identical to the log-rank test)

### Notable sources:

Analyzing Survival Data with Competing Risks Using SAS Software

Ying, Using the PHREG Procedure to Analyze Competing-Risks Data

Gooley, Estimation of failure probabilities in the presence of competing risks

Fine & Gray, A Proportional Hazards Model for the Subdistribution of a Competing Risk

03 June 2018

### Research question and background

- in patients with acute heart failure (AHF), patients do poorly if they have 1) co-existing renal disease (aka CKD) 2) worsening renal failure (aka WRF)
- current definitions are focused on magnitude of change, but not underlying severity of CKD
- evaluate a new definition that incorporates both baseline renal function and change in renal function while in hospital
- evaluate the association between these definitions and short and long-term clinical outcomes in patients with AHF
- 696 patients with eGFR calculable at admission and discharge
- definitions: preserved (P) > 45 at admission and discharge; reduced (R) < 45 at admission or discharge; worsening (WRF), stable (SRF), improved (IRF) renal function based on 20% change
- 6 'treatment' groups: P-WRF, P-SRF, P-IRF, R-WRF, R-SRF, R-IRF
- hypotheses: R vs P, IRF vs SRF vs WRF

### Comparison of survival among groups

P-WRF shows a high early death rate while P-IRF has a steady death rate causing the survival curves to cross. We queried the validity of the proportional hazards assumption. (Note that we used the Wilcoxon test instead of the more common log-rank test for this reason.)

### Evaluating the proportional hazards assumption

There are a number of ways to test the PH assumption, as follows (note: -logS(T) will have a unit exponential distribution, the cumulative hazard function thus increases linearly against time):

### Accelerated failure time (AFT) model

- the AFT model is an alternative to the common proportional hazards (PH) model
- Cox PH model: in terms of HR (= constant if proportional hazards)
- AFT: in terms of S(t), relates to Kaplan-Meier plot (above)
- S1(t)=S2(θt), θ = acceleration factor, e.g. ratio of medians (see the figure below which illustrates the concept)
- Example: dog years, θ = 7
- HR > 1: exposure harmful to survival; HR < 1: exposure benefits survival
- θ > 1: exposure benefits survival; θ < 1: exposure harmful to survival
- θ = HR = 1: no effect from exposure
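The time-scaling relationship in the list above can be checked numerically. The sketch below is a Python illustration that uses a Weibull survivor function purely for convenience (the analysis in this post used an extended generalised gamma, which is not reproduced here); the shape and scale values are hypothetical, and the acceleration factor of 7 borrows the dog-years example.

```python
import math


def weibull_survival(t, scale, shape):
    """S(t) = exp(-(t / scale) ** shape) for a Weibull distribution."""
    return math.exp(-((t / scale) ** shape))


def weibull_median(scale, shape):
    """Time at which the Weibull survivor function equals 0.5."""
    return scale * math.log(2.0) ** (1.0 / shape)


theta = 7.0               # acceleration factor (dog years)
shape = 1.5               # hypothetical common shape parameter
scale_2 = 70.0            # hypothetical scale for group 2 (e.g. humans)
scale_1 = scale_2 / theta  # group 1 (e.g. dogs) ages theta times faster

# S1(t) equals S2(theta * t) at every time point
for t in (1.0, 5.0, 10.0):
    s1 = weibull_survival(t, scale_1, shape)
    s2 = weibull_survival(theta * t, scale_2, shape)
    print(t, round(s1, 6), round(s2, 6))

# The acceleration factor is also the ratio of the two medians
median_ratio = weibull_median(scale_2, shape) / weibull_median(scale_1, shape)
print(median_ratio)
```

Unlike a hazard ratio, the acceleration factor reads directly off the time axis of the Kaplan-Meier plot: one curve is the other stretched horizontally by that factor.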

We used an AFT model with an extended generalised gamma distribution:

### Model checking

To examine the model fit, the fitted survivor function is plotted against the Kaplan-Meier estimates:

### Notable sources:

Comparing proportional hazards and accelerated failure time models: an application in influenza

Collett, Modelling Survival Data in Medical Research

Modelling survival data with parametric regression models (includes SAS code)
