Author: Genpro Statistics Team
Multiple imputation is widely recommended as a valid method for handling missing data. It avoids the loss of statistical power and the underestimation of standard errors associated with single imputation. SAS, an established software package with robust built-in tools, and R, an open-source environment in which users can develop their own packages, differ in several respects. Below, multiple imputation is applied using PROC MI and PROC MIANALYZE in SAS, and the MICE package in R.
Multiple imputation
Step 1: Imputation
Missing values are imputed m times (m > 1), resulting in m complete data sets. The imputed values are drawn from distributions modelled specifically for each missing entry. The classic recommendation is 3 to 5 imputations for relative efficiency, but many recent studies recommend a larger number of imputations.
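The efficiency argument behind the classic recommendation can be made concrete with Rubin's approximation RE = 1 / (1 + γ/m), where γ is the fraction of missing information. The base R sketch below is illustrative only: the function name and the values of γ are assumptions, not quantities estimated from the study data.

```r
# Relative efficiency of m imputations compared with infinitely many,
# using Rubin's approximation RE = 1 / (1 + gamma / m).
# gamma is the fraction of missing information (hypothetical values below).
relative_efficiency <- function(m, gamma) {
  1 / (1 + gamma / m)
}

m_values <- c(3, 5, 15, 50)
gamma_values <- c(0.1, 0.3, 0.5)   # assumed, for illustration only

# Rows: number of imputations; columns: fraction of missing information
re_table <- outer(m_values, gamma_values, relative_efficiency)
dimnames(re_table) <- list(m = m_values, gamma = gamma_values)
print(round(re_table, 3))
```

Efficiency is already high for small m, which is where the classic recommendation comes from. Larger numbers of imputations are usually motivated by other considerations, such as more stable standard errors and p-values.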
Step 2: Analysis
Each of these datasets is analysed using the statistical model chosen to answer the research question, producing m sets of estimates of interest.
Step 3: Pooling
The m analysis results (estimates) are combined into one MI result (estimate).
Rubin's Rules (RR) are designed to pool parameter estimates, such as mean differences and regression coefficients, together with their standard errors, and to derive confidence intervals and p-values. The rules for the pooled estimate and its standard error are as follows.
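For estimates:
$$\bar{\theta} = \frac{1}{m} \sum_{i=1}^{m} \theta_i$$
where $m$ is the number of imputations and $\theta_i$ is the parameter estimate corresponding to the ith imputation. RR assumes that the parameter estimates are normally distributed.
For standard errors:
$$SE_{Pooled} = \sqrt{V_{Total}}, \qquad V_{Total} = V_W + V_B + \frac{V_B}{m}$$
where $V_{Total}$ is the total variance, $V_W$ is the within-imputation variance and $V_B$ is the between-imputation variance:
$$V_W = \frac{1}{m} \sum_{i=1}^{m} SE_i^2, \qquad V_B = \frac{1}{m-1} \sum_{i=1}^{m} \left(\theta_i - \bar{\theta}\right)^2$$
$SE_i$ is the standard error (SE) estimated corresponding to the ith imputation.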
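The pooling rules can be sketched in a few lines of base R. This is a minimal illustration, not the code used for the study: the function name rubin_pool and the toy estimates are hypothetical.

```r
# Pool m estimates and their standard errors with Rubin's Rules.
rubin_pool <- function(estimates, std_errors) {
  stopifnot(length(estimates) == length(std_errors), length(estimates) > 1)
  m       <- length(estimates)
  pooled  <- mean(estimates)        # pooled estimate
  v_w     <- mean(std_errors^2)     # within-imputation variance
  v_b     <- var(estimates)         # between-imputation variance (divisor m - 1)
  v_total <- v_w + v_b + v_b / m    # total variance
  c(estimate = pooled, v_within = v_w, v_between = v_b,
    v_total = v_total, se = sqrt(v_total))
}

# Hypothetical estimates and standard errors from m = 3 imputations
theta <- c(10, 12, 14)
se    <- c(2, 2, 2)
print(rubin_pool(theta, se))
```

For these toy values the pooled estimate is 12, the within- and between-imputation variances are both 4, and the total variance is 4 + 4 + 4/3.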
About data
For the discussion, we consider data from a randomised study with three treatment groups: Treatment A, Treatment B and Placebo. The PANSS (Positive and Negative Syndrome Scale) was used to assess the effectiveness of the active treatments in comparison with Placebo. The assessment was repeated on Days 8, 15, 22 and 29 following the initial intake of the study drug. Multiple imputation was used to reduce the effects of missing data. Every missing value was imputed 15 times, creating 15 complete datasets. Each imputed dataset was then analysed with an analysis of covariance (ANCOVA) model with treatment as a fixed, categorical effect and Baseline PANSS Total Score as a continuous, fixed covariate. The results were then pooled to provide combined estimates for comparing the treatment groups with placebo.
Using SAS
Before starting the imputation procedure, the missing-data mechanism and the missing-data pattern of the dataset should be examined.
/*checking missing data pattern*/
proc mi data = panss_tr nimpute = 0;
var baseline day_8 day_15 day_22 day_29;
run;
Examination of the data showed a non-monotone missing pattern. Under the assumption that the data are missing at random (MAR), the fully conditional specification (FCS) method was used for imputation.
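A pattern is monotone when, for a given ordering of the variables, a value that is missing for a subject implies that all later variables are also missing for that subject (as with dropout). The base R sketch below illustrates that definition on toy data; the function name is_monotone and the values are hypothetical and are not part of the study code.

```r
# TRUE if the missing-data pattern is monotone for the given column order:
# once a value is missing in a row, every later column is missing too.
is_monotone <- function(data) {
  miss <- is.na(data)   # logical matrix, TRUE where a value is missing
  # cummax() marks "missing seen so far"; a monotone row equals that marker
  all(apply(miss, 1, function(r) all(r == as.logical(cummax(r)))))
}

# Toy data: pure dropout, so the pattern is monotone
dropout <- data.frame(
  baseline = c(90, 85, 100),
  day_8    = c(88, 80, NA),
  day_15   = c(85, NA, NA),
  day_22   = c(80, NA, NA),
  day_29   = c(78, NA, NA)
)

# Same data, but the second subject returns on Day 29 after missed visits
intermittent <- dropout
intermittent$day_29[2] <- 70

print(is_monotone(dropout))
print(is_monotone(intermittent))
```

The second call returns FALSE because of the intermittent gap, which is the kind of pattern that calls for a method such as FCS instead of a monotone method.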
Step 1: Imputation
PROC MI is used for imputing 15 complete sets.
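proc mi data = panss_tr out = panssmi seed = 27111433 nimpute = 15 noprint
/*min and max are options of the PROC MI statement: one value per variable in var, . = no restriction*/
min = . 30 30 30 30 30
max = . 210 210 210 210 210;
class trt;
var trt baseline day_8 day_15 day_22 day_29;
fcs reg(/details);
run;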
seed specifies a positive integer used to start the random number generation. Specifying the seed is recommended so that the results can be reproduced later.
nimpute specifies the number of imputations to be performed.
min and max specify, for each variable in the var statement, the range within which imputed values must lie; a missing value (.) means no restriction. The bounds 30 and 210 are the limits of the PANSS Total Score.
var specifies the variables to be used in the imputation.
fcs statement specifies an imputation based on fully conditional specification methods. reg option specifies the regression method. Since we have not specified any model, SAS constructs the model from the variables specified in the var statement. details option displays the regression coefficients in each imputation; note that noprint suppresses displayed output, so it has to be removed to see these details.
Step 2: Analysis
For each of the 15 imputations, an ANCOVA model is fitted using PROC MIXED with treatment as a fixed, categorical effect and Baseline PANSS Total Score as a continuous, fixed covariate. Least squares means with the corresponding standard errors are estimated for each treatment group.
data panss_mi;
set panssmi;
chg = day_29 - baseline;*change from baseline at Day 29;
run;
proc mixed data = panss_mi;
by _imputation_;
class trt;
model chg = trt baseline;
lsmeans trt/ cl;
ods output lsmeans = means;
run;
Step 3: Pooling
Estimates are obtained by pooling the analysis results using PROC MIANALYZE.
proc sort data = means;
by trt;
run;
proc mianalyze data = means;
by trt;
modeleffects estimate;
stderr stderr;
ods output ParameterEstimates = poolestimate;
run;
Using R
Five commonly used R packages for multiple imputation are:
- MICE
- Amelia
- missForest
- Hmisc
- mi
Under the assumption of MAR, the MICE (Multivariate Imputation by Chained Equations) package is used.
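#Load data
library(haven)
panss_tr <- read_sas("C:/Desktop/panss_tr.sas7bdat")
#Load 'mice' package
library(mice)
#Checking missing data pattern
md.pattern(panss_tr)
#Treatment is categorical (assumes TRT is stored as character or numeric codes)
panss_tr$TRT <- factor(panss_tr$TRT)
#Change from baseline at day 29 is to be determined for modelling
panss_tr$CHG <- NA_real_
#Dry run (maxit = 0) to obtain the default imputation methods
ini <- mice(panss_tr, maxit = 0)
meth <- ini$method
#Passive imputation: CHG is recomputed from the imputed DAY_29 and BASELINE
meth["CHG"] <- "~ I(DAY_29 - BASELINE)"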
Step 1: Imputation
#15 imputations using mice function
imputed_Data <- mice(panss_tr, m = 15, maxit = 10, method = meth, seed = 27111433)
Step 2: Analysis
#Model fitting and estimation for each imputed set
#Load 'lsmeans' package
library(lsmeans)
#Imputed data frames dataimp1, dataimp2, ... along with the corresponding estimates means1, means2, ... are created
for (i in 1:15) {
  Data <- complete(imputed_Data, i)
  #ANCOVA: change from baseline on baseline score and treatment
  fit <- lm(CHG ~ BASELINE + TRT, data = Data)
  means <- summary(lsmeans(fit, specs = "TRT"))
  means$imp <- i
  assign(paste0("dataimp", i), Data)
  assign(paste0("means", i), means)
}
Step 3: Pooling
#All 15 estimates in a single dataframe
means <- do.call(rbind, lapply(paste0("means", 1:15), get))
#Combine estimates (LS means and SE) from 15 imputed sets
#Rubin's Rules are employed to find the pooled estimates
#Pooled estimate: mean of the LS means across imputations
pool_mean <- data.frame(pestimate = tapply(means$lsmean, means$TRT, FUN = mean))
#Within-imputation variance: mean of the squared SEs
variance_within <- data.frame(v_w = tapply(means$SE^2, means$TRT, FUN = mean))
library(data.table)
setDT(pool_mean, keep.rownames = "TRT")
setDT(variance_within, keep.rownames = "TRT")
m <- merge(pool_mean, means, by = "TRT")
#Between-imputation variance: squared deviations from the pooled estimate, divided by (15 - 1)
variance_between <- data.frame(v_b = (tapply((m$lsmean - m$pestimate)^2, m$TRT, FUN = sum))/(15 - 1))
setDT(variance_between, keep.rownames = "TRT")
poolvar <- merge(variance_within, variance_between, by = "TRT")
#Total variance and pooled standard error
poolvar$v_total <- poolvar$v_w + poolvar$v_b + poolvar$v_b/15
poolvar$pSE <- sqrt(poolvar$v_total)
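The manual pooling above stops at the pooled standard error, whereas PROC MIANALYZE also reports confidence limits. A confidence interval can be sketched in base R from the same variance components, using the large-sample degrees of freedom from Rubin (1987). The function name pooled_ci and the input values are hypothetical, and small-sample adjustments to the degrees of freedom are not implemented here.

```r
# Pooled confidence interval from the Rubin's Rules variance components.
# Degrees of freedom (Rubin, 1987):
#   df = (m - 1) * (1 + v_w / ((1 + 1/m) * v_b))^2
pooled_ci <- function(estimate, v_w, v_b, m, level = 0.95) {
  v_total <- v_w + v_b + v_b / m
  se      <- sqrt(v_total)
  df      <- (m - 1) * (1 + v_w / ((1 + 1 / m) * v_b))^2
  t_crit  <- qt(1 - (1 - level) / 2, df)
  data.frame(estimate = estimate, se = se, df = df,
             lower = estimate - t_crit * se,
             upper = estimate + t_crit * se)
}

# Hypothetical inputs for one treatment group (not results from the study)
print(pooled_ci(estimate = -12.5, v_w = 4.0, v_b = 0.8, m = 15))
```

The function is vectorised, so after merging pool_mean and poolvar by TRT so that the rows line up, a call such as pooled_ci(pestimate, v_w, v_b, m = 15) on the merged columns would give one interval per treatment group.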
The SAS vs R experience
The MICE package assumes a MAR mechanism for the data. To handle data under an MNAR assumption, a separate package named miceMNAR is available in R, whereas in SAS this can be handled within PROC MI itself using the MNAR statement; the only requirement is that a MONOTONE or FCS statement is also used in PROC MI. Another major difference is in the pooling phase. SAS supports the entire process of imputation, analysis and pooling through built-in procedures and options, while in R, as in the example above, the user decides on the analysis and pooling methods.
The thorough documentation, organised structure and availability of options in SAS add to the convenience and structure of programming, while the open-source nature of R encourages the user to dig deeper into the data, its statistical nature and the appropriate methods. Any statistical analysis leads to a meaningful interpretation only when the data are well understood and the methods are chosen with their assumptions in mind. Knowledge of the statistical methods used by packages and procedures is an added advantage to a programmer who handles data of varied nature.