**Author**: Genpro Statistics Team

Multiple imputation is widely recommended as a valid method for handling missing data. It avoids the loss of statistical power and the underestimation of standard errors associated with single imputation. SAS, an established software suite with robust built-in tools, and R, an open-source language whose users develop its packages, differ in various respects. Below is an attempt to apply multiple imputation using PROC MI and PROC MIANALYZE in SAS, and the mice package in R.

**Multiple imputation**

Step 1: Imputation

Missing values are imputed m times (m > 1), resulting in m complete data sets. The imputed values are drawn from distributions modelled specifically for each missing entry. The classic recommendation is 3 to 5 imputations on grounds of relative efficiency, but many recent studies recommend a larger number of imputations.
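The efficiency argument behind that classic recommendation can be made concrete. Rubin's relative efficiency of using m imputations rather than infinitely many is 1/(1 + γ/m), where γ is the fraction of missing information. A quick sketch (Python used purely for illustration; the γ values below are hypothetical):

```python
# Rubin's relative efficiency of m imputations versus infinitely many:
# RE = 1 / (1 + gamma / m), where gamma is the fraction of missing
# information. The gamma values below are hypothetical.

def relative_efficiency(gamma: float, m: int) -> float:
    return 1.0 / (1.0 + gamma / m)

for gamma in (0.1, 0.3, 0.5):
    for m in (3, 5, 15):
        print(f"gamma={gamma:.1f}  m={m:>2}  RE={relative_efficiency(gamma, m):.3f}")
```

Even with half the information missing (γ = 0.5), m = 5 already achieves roughly 91% efficiency; the gain from a larger m shows up mainly as more stable standard errors and p-values.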

Step 2: Analysis

Each of these data sets is analysed using the statistical model chosen to answer the research question, producing m sets of estimates of interest.

Step 3: Pooling

The m analysis results (estimates) are combined into one MI result (estimate).

Rubin's Rules (RR) are designed to pool parameter estimates, such as mean differences and regression coefficients, together with their standard errors, and to derive confidence intervals and p-values. The rules for the pooled estimate and its standard error are given below.

*For estimates:*

θ̄ = (θ_1 + θ_2 + … + θ_m) / m

where m is the number of imputations and θ_i is the parameter estimate from the i-th imputation. RR assumes that the parameter estimates are normally distributed.

*For standard errors:*

SE_pooled = √V_Total, with

V_Total = V_W + V_B + V_B / m

V_W = (SE_1² + SE_2² + … + SE_m²) / m

V_B = Σ (θ_i − θ̄)² / (m − 1)

where V_Total is the total variance, V_W is the within-imputation variance, and V_B is the between-imputation variance. SE_i is the standard error estimated from the i-th imputation.
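To make the rules concrete, here is a minimal worked example (Python for illustration only; the three estimates and standard errors are made-up numbers, not study results):

```python
# Pool m = 3 hypothetical estimates with Rubin's Rules.
m = 3
estimates = [4.8, 5.2, 5.0]    # theta_i, one per imputed data set
std_errors = [0.9, 1.0, 0.95]  # SE_i, one per imputed data set

# Pooled estimate: mean of the per-imputation estimates
theta_bar = sum(estimates) / m

# Within-imputation variance: mean of the squared SEs
v_w = sum(se ** 2 for se in std_errors) / m

# Between-imputation variance of the estimates
v_b = sum((t - theta_bar) ** 2 for t in estimates) / (m - 1)

# Total variance and pooled SE
v_total = v_w + v_b + v_b / m
pooled_se = v_total ** 0.5

print(round(theta_bar, 4), round(pooled_se, 4))  # pooled estimate 5.0, pooled SE ~0.9785
```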

**About data**

For the discussion, we consider data from a randomised study with three treatment groups: Treatment A, Treatment B and Placebo. The PANSS (Positive and Negative Syndrome Scale) was used to assess the effectiveness of the treatment groups in comparison to placebo. The assessment was repeated on Days 8, 15, 22 and 29 following the initial intake of the study drug. To reduce the impact of missing data, multiple imputation was used: every missing value was imputed 15 times, creating 15 complete data sets. The imputed data sets were then analysed with an analysis of covariance (ANCOVA) model fitted for the fixed, categorical effect of treatment and the continuous, fixed covariate of baseline PANSS total score. The results were then pooled to provide combined estimates comparing the treatment groups with placebo.

**Using SAS**

Before starting the imputation procedure, the missing-data mechanism and the missing pattern of the dataset should be examined.

```sas
/* checking missing data pattern */
proc mi data = panss_tr nimpute = 0;
  var baseline day_8 day_15 day_22 day_29;
run;
```

Examination of the data showed a non-monotone missing pattern. Under the assumption of MAR (missing at random), the fully conditional specification (FCS) method was used for imputation.
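The FCS idea itself is simple to sketch: fill every missing value with a crude starting guess, then cycle through the variables, regressing each one on the others and redrawing its missing entries from the fitted model. The toy example below (Python, purely illustrative, with made-up data; real FCS implementations such as PROC MI and mice also draw the regression parameters from their posterior) shows one such chain for two variables:

```python
import random

# Toy illustration of fully conditional specification: two variables,
# each with missing entries, imputed by alternating univariate
# regressions. Deliberately simplified; data values are made up.
random.seed(27111433)

x = [10.0, 12.0, None, 15.0, 11.0, None, 14.0, 13.0]
y = [None, 24.0, 26.0, None, 22.0, 27.0, 28.0, None]

obs_x = [i for i, v in enumerate(x) if v is not None]  # originally observed
obs_y = [i for i, v in enumerate(y) if v is not None]
mis_x = [i for i in range(len(x)) if i not in obs_x]
mis_y = [i for i in range(len(y)) if i not in obs_y]

def fit(resp, pred, rows):
    """Simple OLS of resp on pred over the given rows -> (intercept, slope)."""
    mp = sum(pred[i] for i in rows) / len(rows)
    mr = sum(resp[i] for i in rows) / len(rows)
    sxx = sum((pred[i] - mp) ** 2 for i in rows)
    sxy = sum((pred[i] - mp) * (resp[i] - mr) for i in rows)
    slope = sxy / sxx
    return mr - slope * mp, slope

# Step 0: crude starting fill (observed means)
for i in mis_x:
    x[i] = sum(x[j] for j in obs_x) / len(obs_x)
for i in mis_y:
    y[i] = sum(y[j] for j in obs_y) / len(obs_y)

# Chained equations: cycle through the variables several times
for _ in range(10):
    a, b = fit(x, y, obs_x)          # model x ~ y on rows where x was observed
    for i in mis_x:
        x[i] = a + b * y[i] + random.gauss(0, 1.0)
    a, b = fit(y, x, obs_y)          # model y ~ x on rows where y was observed
    for i in mis_y:
        y[i] = a + b * x[i] + random.gauss(0, 1.0)

print(x)
print(y)
```

Running the chain m times with different random draws would yield the m completed data sets of Step 1.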

**Step 1: Imputation**

**PROC MI** is used to impute 15 complete data sets.

```sas
proc mi data = panss_tr out = panssmi seed = 27111433 nimpute = 15 noprint
        min = . 30 30 30 30
        max = . 210 210 210 210;
  class trt;
  var trt baseline day_8 day_15 day_22 day_29;
  fcs reg(/details);
run;
```

**seed** specifies a positive integer used to start the random number generation. Specifying the seed is recommended when the results have to be duplicated in similar situations or reproduced in the future.

**nimpute** specifies the number of imputations to be performed.

**min, max** specify the range of values allowed for each imputed variable; a missing value (.) places no restriction on the corresponding variable.

**var** specifies the variables to be imputed.

**fcs** requests imputation based on fully conditional specification methods.

**reg** specifies the regression method. Since we have not specified any model, SAS constructs the model using the variables listed in the var statement.

**details** displays the regression coefficients in each imputation.

**Step 2: Analysis**

For each of the 15 imputations, an ANCOVA model is fitted using **PROC MIXED** with the fixed, categorical effect of treatment and the continuous, fixed covariate of baseline PANSS total score. Least squares means with their corresponding standard errors are estimated for each treatment group.

```sas
data panss_mi;
  set panssmi;
  chg = day_29 - baseline; *change from baseline at Day 29;
run;
```

```sas
proc mixed data = panss_mi;
  by _imputation_;
  class trt;
  model chg = trt baseline;
  lsmeans trt / cl;
  ods output lsmeans = means;
run;
```

**Step 3: Pooling**

Pooled estimates are obtained by combining the analysis results using **PROC MIANALYZE**.

```sas
proc sort data = means;
  by trt;
run;

proc mianalyze data = means;
  by trt;
  modeleffects estimate;
  stderr stderr;
  ods output ParameterEstimates = poolestimate;
run;
```

**Using R**

Five commonly used R packages for multiple imputation are:

- MICE
- Amelia
- missForest
- Hmisc
- mi

Under the assumption of MAR, the mice (Multivariate Imputation by Chained Equations) package is used.

```r
# Load data
library(haven)
panss_tr <- read_sas("C:/Desktop/panss_tr.sas7bdat")

# Load 'mice' package
library(mice)

# Checking missing data pattern
md.pattern(panss_tr)

# Change from baseline at Day 29 is to be determined for modelling
panss_tr$CHG <- NA

# To add a (passive) method for imputing the change
ini <- mice(panss_tr, maxit = 0)
meth <- ini$meth
meth["CHG"] <- "~ I(DAY_29 - BASELINE)"
```

**Step 1: Imputation**

```r
# 15 imputations using the mice function
imputed_Data <- mice(panss_tr, m = 15, maxit = 10, meth = meth, seed = 27111433)
```

**Step 2: Analysis**

```r
# Model fitting and estimation for each imputed set

# Load 'lsmeans' package
library(lsmeans)

# Imputed data frames dataimp1, dataimp2, ... along with the
# corresponding estimates means1, means2, ... are created
for (i in 1:15) {
  Data  <- complete(imputed_Data, i)
  fit   <- with(data = Data, expr = lm(CHG ~ BASELINE + TRT))
  means <- summary(lsmeans(fit, specs = "TRT"))
  means$imp <- i
  assign(paste("dataimp", i, sep = ""), Data)
  assign(paste("means", i, sep = ""), means)
}
```

**Step 3: Pooling**

```r
# All 15 estimates in a single data frame
means <- do.call(rbind, lapply(paste0("means", 1:15), get))

# Combine estimates (LS means and SE) from the 15 imputed sets;
# Rubin's Rules are employed to find the pooled estimates
pool_mean       <- data.frame(pestimate = tapply(means$lsmean, means$TRT, FUN = mean))
variance_within <- data.frame(v_w = tapply(means$SE^2, means$TRT, FUN = mean))

library(data.table)
setDT(pool_mean, keep.rownames = "TRT")
setDT(variance_within, keep.rownames = "TRT")

m <- merge(pool_mean, means, by = "TRT")
variance_between <- data.frame(
  v_b = tapply((m$lsmean - m$pestimate)^2, m$TRT, FUN = sum) / (15 - 1)
)
setDT(variance_between, keep.rownames = "TRT")

poolvar <- merge(variance_within, variance_between, by = "TRT")
poolvar$v_total <- poolvar$v_w + poolvar$v_b + poolvar$v_b / 15
poolvar$pSE     <- sqrt(poolvar$v_total)
```

**The SAS vs R experience**

The mice package assumes an MAR mechanism for the data. To handle data under an MNAR assumption, the miceMNAR package is available in R, whereas in SAS this can be handled within the MI procedure itself using the MNAR statement; the only requirement is that a MONOTONE or FCS statement also be used in PROC MI. Another major difference lies in the pooling phase. SAS facilitates the entire process of imputation, analysis and pooling through built-in procedures and options, while R requires the user to decide on the analysis and pooling methods. The thorough documentation, organised nature and availability of options in SAS make for convenient, structured programming, while the open-source nature of R encourages the user to dig deeper into the data, its statistical nature and the appropriate methods. Only when the data is well understood and the methods are chosen with the assumptions in mind can a statistical analysis lead to meaningful interpretation. Knowledge of the statistical methods used by packages and procedures is an added advantage for a programmer who handles data of varied nature.