Handling Missing Values of Continuous Variables in Clinical Data.

Author:  Genpro Statistics Team | Date Posted: 06/July/2021



Missing values are common in clinical data for a variety of reasons. Their main drawbacks are reduced statistical power, because the effective sample size shrinks, and the possibility of biased estimates. In randomized clinical trials in particular, missing values can undermine the validity of randomization, making the comparison between treatment groups meaningless.

ICH E9 provides guidance on handling missing values under the section “DATA ANALYSIS CONSIDERATIONS”. Proper planning and careful conduct can limit the occurrence of missing data, but we must be prepared to deal with it even before the data are collected. From an analysis perspective, the approach is to identify the missingness mechanism (MCAR, MAR, or MNAR) and the missing data pattern, and then apply the appropriate steps (deleting the incomplete records may be acceptable under MCAR, whereas imputation is called for under the other two).

About Missing data

Understanding the cause and pattern of missing data is important in deciding the method for handling it. Missing data in a single variable can be classified into any of the following mechanisms as initially described by Rubin (1976).

1. Missing Completely At Random (MCAR)

The missing values of a data set are a random sub-sample of the complete data set, i.e., missingness is not influenced by any observed or unobserved variables but occurs for purely random reasons. Values that are missing because a subject left the study after an accidental death, or because equipment failed during an examination, can be considered MCAR.

2. Missing At Random (MAR)

Missingness depends on known (observed) factors or variables, but not on the unobserved data. Missing values of subjects who are removed from the study due to protocol deviations or adverse events are examples of MAR.

3. Missing Not At Random (MNAR)

Missingness depends on the unobserved data as well as the observed data. A patient who discontinues the study because of disease recovery, or who intentionally chooses not to report a value, is an example. The reason for the missing value is not always known.

Differentiating these mechanisms in real-world data is not easy, and they do not give us a deterministic way to model the probability of missing data. The choice of method therefore depends less on the properties of the method under each mechanism than on how well accepted it is in each scenario.

The missing data pattern may be arbitrary, where the missing data structure shows no pattern, or structured, as in the monotone missing pattern (once a subject's value is missing at one visit, it is missing at all later visits). Identifying the missing data pattern is necessary for deciding which imputation approaches are applicable.
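As a rough illustration of the distinction, the sketch below checks whether a set of longitudinal records is monotone. It assumes the columns are in visit order and that None marks a missing value; the function name is_monotone and the scores are hypothetical, not taken from any study.

```python
from typing import Optional, Sequence

def is_monotone(rows: Sequence[Sequence[Optional[float]]]) -> bool:
    """Return True if, in every row, a missing value is never followed by an observed one.

    Assumes columns are in visit order and None marks a missing value.
    """
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                return False  # observed value after a gap: arbitrary pattern
    return True

# Hypothetical scores at weeks 1 to 4, one row per subject
dropout_only = [
    [90.0, 86.0, 80.0, 75.0],  # completer
    [85.0, None, None, None],  # dropped out after week 1
]
# Third subject missed week 3 but returned at week 4
with_gap = dropout_only + [[88.0, 84.0, None, 79.0]]

print(is_monotone(dropout_only))  # True
print(is_monotone(with_gap))      # False
```

A single intermittent gap is enough to make the overall pattern arbitrary, which in turn restricts the imputation methods that can be applied directly.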

Another important factor to be considered is the scale level of the variable to be imputed. Imputations for categorical and continuous variables are to be carefully chosen.

Complete Case Analysis

In the complete case analysis approach, only subjects with no missing data points (complete cases) are used in the analysis. Even a small amount of missing data can reduce the number of evaluable cases and thus affect the planned randomization ratio and sample size, leading to reduced power and, unless the data are MCAR, biased estimates. This approach is justifiable under MCAR but not under MAR or MNAR.

It is not recommended as the primary analysis; instead, it is considered in exploratory studies or as a secondary supportive analysis (sensitivity analysis).
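The effect on the evaluable sample can be seen in a minimal sketch like the one below. The records, arm labels, and scores are invented for illustration, and None is assumed to mark a missing value.

```python
from collections import Counter

# Hypothetical records: (subject, arm, baseline score, week 4 score); None = missing
records = [
    (1, "active", 92.0, 70.0),
    (2, "active", 88.0, None),
    (3, "active", 95.0, 78.0),
    (4, "placebo", 90.0, 85.0),
    (5, "placebo", 91.0, None),
    (6, "placebo", 89.0, None),
]

# Complete case analysis keeps only subjects with every analysis variable observed
complete_cases = [r for r in records if None not in r]

randomized = Counter(arm for _, arm, _, _ in records)
evaluable = Counter(arm for _, arm, _, _ in complete_cases)

print(dict(randomized))  # {'active': 3, 'placebo': 3}
print(dict(evaluable))   # {'active': 2, 'placebo': 1}
```

In this toy data set the planned 1:1 ratio becomes 2:1 among evaluable subjects, and half of the placebo information is discarded.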

Single Imputation

Single imputation (SI) methods replace each missing value with a single imputed value, and analyses are then conducted as if all data were observed. Several SI techniques are in common use, such as:

• Last Observation Carried Forward (LOCF)

A widely applied approach for longitudinal data. If a person drops out of the study, the value of their last observation/assessment is used to replace all subsequent missing values. This method assumes that when a repeated measure of an assessment is missing for a person, the value has not changed from the previous examination.

• Baseline Observation Carried Forward (BOCF)

The baseline value replaces the missing post-baseline values. Widely used in cases of early treatment discontinuation. This method can be misleading if the results are gradually improving over the course of the study.

• Worst Observation Carried Forward (WOCF)

This is the most conservative approach compared with the two above. It is commonly employed in studies with laboratory results as endpoints.

• Mean/Median Value Substitution

A quick, simple, and rough imputation technique for continuous variables that replaces all missing values with the mean or median of the observed values. Recommended only where a handful of values are missing.

Though it is an easy approach that maintains the actual sample size, single imputation is criticized for introducing bias and underestimating standard errors. Nevertheless, it is considered acceptable where the missing mechanism is MCAR. Methods like LOCF and mean substitution are widely used in clinical research but are recommended only for exploratory or supportive analyses.
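The single imputation rules listed above can be sketched in a few lines. This is a minimal illustration rather than a validated implementation: the helper names (locf, wocf, mean_substitute) and the scores are hypothetical, None is assumed to mark a missing value, and "worst" is assumed to mean the subject's worst observed value on a scale where higher scores are more severe.

```python
from statistics import mean
from typing import List, Optional

Series = List[Optional[float]]

def locf(values: Series) -> Series:
    """Carry the last observed value forward; leading gaps stay missing."""
    filled: Series = []
    last: Optional[float] = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def wocf(values: Series, higher_is_worse: bool = True) -> Series:
    """Replace each missing value with the subject's worst observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    worst = max(observed) if higher_is_worse else min(observed)
    return [worst if v is None else v for v in values]

def mean_substitute(column: Series) -> Series:
    """Replace missing values in one variable with the mean of its observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)
    fill = mean(observed)
    return [fill if v is None else v for v in column]

# Hypothetical scores for one subject at weeks 1 to 4 (higher = more severe)
subject = [90.0, 84.0, None, None]
print(locf(subject))  # [90.0, 84.0, 84.0, 84.0]
print(wocf(subject))  # [90.0, 84.0, 90.0, 90.0]

# Hypothetical week 4 scores across four subjects
week4 = [75.0, None, 79.0, 83.0]
print(mean_substitute(week4))  # [75.0, 79.0, 79.0, 83.0]
```

Each rule fills the gap with a single fixed number, so the completed data carry no trace of the uncertainty about the imputed values. This is why standard errors computed from such data tend to be too small.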

Maximum likelihood estimation

Maximum likelihood estimation (MLE) does not impute data; instead, it uses all available data to fit the model and, when the data are MAR and the model is correctly specified, provides unbiased parameter estimates and standard errors. MLE is widely used and accepted in clinical research. Its advantage over multiple imputation (MI) is that it is simpler, as it requires fewer decisions to be made. Direct maximum likelihood and the expectation-maximization (EM) algorithm are the main ML methods.

Multiple imputation

As the name indicates, multiple imputation (MI) imputes more than one value for each missing value. An MI analysis consists of the following three steps:

1. Imputation

Missing values are imputed m times (m > 1), resulting in m complete data sets.

2. Analysis

Each of these datasets is analysed using the statistical model chosen to answer the research question.

3. Pooling

The m analysis results (estimates) are combined into one MI result (estimate).

MI produces a more reasonable estimate than SI. It is model based, and using more than one imputation for each missing value introduces variation between the imputed datasets that reflects the uncertainty about the missing values. MI is recommended and widely used in clinical research.
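The pooling step is commonly carried out with Rubin's rules, which combine the average within-imputation variance with the between-imputation variance. The article does not name the pooling rule, so treating it as Rubin's rules is an assumption, and the function name pool_rubin and the input numbers below are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple

def pool_rubin(estimates: Sequence[float], std_errors: Sequence[float]) -> Tuple[float, float]:
    """Combine m per-dataset estimates and standard errors using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                        # pooled point estimate
    within = mean([se ** 2 for se in std_errors])   # average within-imputation variance
    between = variance(estimates)                   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between          # total variance of the pooled estimate
    return pooled, sqrt(total)

# Hypothetical treatment differences and standard errors from m = 5 imputed datasets
estimates = [-6.0, -5.0, -7.0, -6.0, -6.0]
std_errors = [2.0, 2.0, 2.0, 2.0, 2.0]

pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
print(pooled_estimate)      # -6.0
print(round(pooled_se, 3))  # 2.145
```

The pooled standard error (about 2.145) exceeds the per-dataset value of 2.0 because the between-imputation term adds the uncertainty due to the missing values, which single imputation ignores.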

Case study

For the study under consideration, the primary efficacy endpoint is the mean change from baseline at week 4 in the PANSS total score. The PANSS score was measured repeatedly at weeks 1, 2, 3, and 4 following the initial intake of the investigational product.

A mixed model for repeated measures (MMRM), with change from baseline as the dependent variable, treatment, visit, and treatment-by-visit interaction as fixed effects, and baseline value as a covariate, was proposed to compare the treatment groups with placebo. MMRM is a likelihood-based longitudinal analysis that makes use of all the observed data from each subject. It relies on the MAR assumption, that is, that dropouts would have behaved similarly to other subjects in the same treatment group, possibly with similar covariate values, had they not dropped out. We therefore required a strong supporting analysis to confirm the results of the MMRM analysis.

A sensitivity analysis using multiple imputation (MI) methods, assuming non-monotone missing data patterns and data missing not at random (MNAR), was performed to assess the effect of missing data on the analysis of the primary efficacy endpoint. Under the MNAR assumption, a pattern-mixture model (PMM) with control-based pattern imputation was used. Control-based imputation was applied so that observed data from the treatment groups were not used directly in estimating the imputation model; the imputation model is built on the placebo group data only. Every missing value was imputed 15 times, creating 15 complete datasets. The imputed datasets were then analysed with an analysis of covariance (ANCOVA) model with treatment as a fixed categorical effect and Baseline PANSS Total Score as a continuous covariate. The results were then pooled to provide combined estimates for comparing the treatment groups with placebo.
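The idea behind control-based imputation can be illustrated with a deliberately simplified sketch. It is not the procedure used in the study: it handles a single post-baseline visit, fits an ordinary least squares line to placebo completers only, and draws imputed values around that line with the regression parameters held fixed, whereas a proper MI procedure would also draw the parameters from their posterior distribution. All names (fit_line, control_based_impute) and data are hypothetical.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b, residual SD)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sd = (sum(r ** 2 for r in residuals) / (len(x) - 2)) ** 0.5
    return a, b, sd

def control_based_impute(subjects, m, seed=0):
    """Return m completed copies of the data.

    Missing week 4 values in EVERY arm are drawn from a model fitted to
    placebo completers only (simplified: parameters are not redrawn).
    """
    rng = random.Random(seed)
    placebo = [s for s in subjects if s["arm"] == "placebo" and s["week4"] is not None]
    a, b, sd = fit_line([s["baseline"] for s in placebo], [s["week4"] for s in placebo])
    completed = []
    for _ in range(m):
        dataset = []
        for s in subjects:
            value = s["week4"]
            if value is None:
                value = a + b * s["baseline"] + rng.gauss(0.0, sd)
            dataset.append({**s, "week4": value})
        completed.append(dataset)
    return completed

# Hypothetical subjects; None marks a missing week 4 score
subjects = [
    {"id": 1, "arm": "placebo", "baseline": 90.0, "week4": 86.0},
    {"id": 2, "arm": "placebo", "baseline": 95.0, "week4": 92.0},
    {"id": 3, "arm": "placebo", "baseline": 85.0, "week4": 80.0},
    {"id": 4, "arm": "placebo", "baseline": 100.0, "week4": 95.0},
    {"id": 5, "arm": "active", "baseline": 92.0, "week4": 70.0},
    {"id": 6, "arm": "active", "baseline": 88.0, "week4": None},
    {"id": 7, "arm": "placebo", "baseline": 91.0, "week4": None},
]

completed = control_based_impute(subjects, m=15)
print(len(completed))  # 15
```

Because the active-arm dropout (subject 6) is imputed from the placebo model, the imputed values assume no lasting treatment benefit after discontinuation, which is what makes the approach conservative for a superiority comparison.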

In addition, single imputation using LOCF was performed. The LOCF method assumes that when a repeated measure of an assessment is missing for a person, the value has not changed from the previous examination.


Noncompleters or early dropouts can be influential contributors to the difference in mean treatment effects, as early discontinuation can be due to treatment failure or even to disease recovery (indicating treatment response). Since the inferences from the study will influence a larger population, it is important that the imputed values be as close as possible to the unobserved values.

Missing data are always a potential source of bias in clinical research. The planned power of a study is often affected by missing data, and we must plan beforehand to accommodate the resulting loss of power. Proper handling of missing data can limit its effect on the interpretation of results.

The techniques used for handling missing data have evolved over time. Even when there is no definite answer to the best choice of techniques, one can compare different techniques and suggest the one most appropriate to the study scenario. Sensitivity analysis can be used to understand the robustness of results.

