Introduction:
A critical step in clinical trial design is accurately determining the sample size. The wrong sample size can doom a study from the beginning. If the sample size is too small, the population being examined will not be adequately represented, which puts the validity of the study results at risk. If it is too large, the study may flag statistical differences that are not clinically relevant. There are economic reasons as well: an undersized study wastes resources because it is underpowered, while an oversized one consumes more resources than necessary. This problem is addressed with power analysis, which combines statistical analysis and subject-matter understanding to determine the ideal sample size, one that assures adequate power to detect statistical significance. So what exactly does the term “power” mean? Power is a probability: the likelihood of correctly rejecting the null hypothesis.
The following are some scenarios where sample size and power of the test are related and calculated using different functions in R.
- Sample size for comparing means
Assume we have two medications for a specific disease. Each is administered to a distinct group of people, and the time required to recover is recorded. The aim is to see whether there is a substantial difference in recovery time between the two medicines. Since the hypothesized population variance is unknown, this is a scenario where a two-sample t-test is applicable.
We wish to know the sample sizes for this study with 80% power, a mean difference of 1, and a pooled standard deviation of 3.
Let’s see how a function from R’s ‘stats’ package can be used to perform this power analysis.
The power.t.test() function can be used to compute the power of a one- or two-sample t-test, or to compute the sample size needed to obtain a specified power. Its arguments are:
- delta: true difference in means
- sd: standard deviation
- n: sample size
- sig.level: Type I error probability (significance level)
- power: power of the test
- type: type of t-test
- alternative: one- or two-sided test
#Generate the sample size for delta of 1, with SD of 3 and 80% power.
(ss1 <- power.t.test(delta = 1, sd = 3, power = 0.8))
##
## Two-sample t test power calculation
##
## n = 142.2466
## delta = 1
## sd = 3
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
ceiling(ss1$n)
## [1] 143
The required sample size is 143 for each group.
For the same scenario, if we have a predetermined sample size of 143, the resulting power of the test can be calculated using the same function.
#Use the sample size from above to show that it provides 80% power
power.t.test(n = 143, delta = 1, sd = 3)
##
## Two-sample t test power calculation
##
## n = 143
## delta = 1
## sd = 3
## sig.level = 0.05
## power = 0.802082
## alternative = two.sided
##
## NOTE: n is number in *each* group
- Scenarios of one-sample and paired t-tests
In the above examples the test type is “two.sample”, which is the default value of the ‘type’ argument. The other values of ‘type’ are “one.sample” and “paired”. The following scenarios illustrate the difference between the types of t-test.
One-sample t-test: Suppose the mean BMI for 30 hypertensive males is 27 kg/m2 with an SD of 4, and we know that the mean BMI for non-hypertensive males is 25 kg/m2. If we want to check whether this sample of 30 hypertensive males could have come from a population with mean BMI 25 kg/m2, a one-sample t-test can be used.
Paired t-test: Assume we have cholesterol levels measured on the same subjects at two different points in time, and we are looking for a difference in cholesterol levels between these time points. Here, a paired t-test is appropriate.
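Both scenarios use the same power.t.test() function with a different ‘type’ argument. A minimal sketch follows, using the BMI figures above; the paired-test inputs are illustrative assumptions, since the cholesterol scenario gives no numbers.
# One-sample test: power to detect a shift from 25 to 27 kg/m2 with n = 30 and SD = 4
power.t.test(n = 30, delta = 27 - 25, sd = 4, type = "one.sample")$power
# Paired test: sample size (number of pairs) for a hypothetical mean change of 5,
# where sd is the SD of the within-subject differences (here assumed to be 12)
power.t.test(delta = 5, sd = 12, power = 0.8, type = "paired")$n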
- Sample Size and Treatment Difference
In power.t.test(), delta is the treatment difference, i.e., the true difference between means. Delta is inversely related to sample size: as the treatment difference increases, the required sample size decreases (roughly in proportion to 1/delta²). The table and plot below make this relationship clear.
# Generate a vector containing values between 0.5 and 2.0, incrementing by 0.25
delta <- seq(from = 0.5, to = 2.0, by = 0.25)
# Specify the standard deviation and power
N <- unlist(lapply(delta, function(x) power.t.test(delta = x, sd = 3, power = 0.80)$n))
# Create a data frame for the deltas and sample sizes
(metaDF <- data.frame(delta, `Sample Size` = ceiling(N)))
## delta Sample.Size
## 1 0.50 567
## 2 0.75 253
## 3 1.00 143
## 4 1.25 92
## 5 1.50 64
## 6 1.75 48
## 7 2.00 37
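To visualize the relationship, here is a minimal base-R plot of the table above (output not shown):
# Plot required sample size per group against the treatment difference
plot(metaDF$delta, metaDF$Sample.Size, type = "b",
     xlab = "Treatment difference (delta)", ylab = "Sample size per group")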
- Sample Size for Comparing Proportions
Suppose we have two independent comparison groups and output is dichotomous i.e. success/ failure. Proportion of success between these groups can be compared using a proportion test.
Let’s look at the formula to calculate the sample size per group for comparing proportions:
n = (Zα/2 + Zβ)² × (p1(1 − p1) + p2(1 − p2)) / (p1 − p2)²
where,
- Zα/2 is the critical value of the Normal distribution at α/2 (e.g. for a confidence level of 95%, α is 0.05 and the critical value is 1.96)
- Zβ is the critical value of the Normal distribution at β (e.g. for a power of 80%, β is 0.2 and the critical value is 0.84)
- p1 and p2 are the expected sample proportions of the two groups.
Use power.prop.test() to calculate the sample size needed for a trial with recovery percentages of 40% and 60% in the placebo and active treatment groups, respectively, and 80% power.
# Use the power.prop.test to generate sample sizes for the proportions
power.prop.test(p1 = 0.4, p2 = 0.6, power = 0.8)
##
## Two-sample comparison of proportions power calculation
##
## n = 96.92364
## p1 = 0.4
## p2 = 0.6
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
The required sample size is 97 in each group.
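As a cross-check, the formula above can be evaluated directly in R. The hand calculation comes out slightly smaller because power.prop.test() additionally uses a pooled-variance term under the null hypothesis:
# Hand calculation of n per group from the formula above
p1 <- 0.4; p2 <- 0.6
z_alpha <- qnorm(1 - 0.05 / 2) # 1.96 for a two-sided 5% level
z_beta <- qnorm(0.80)          # 0.84 for 80% power
(z_alpha + z_beta)^2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2)^2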
Next, we will find the minimum detectable percentage of the above using 150 patients per group.
# Find the minimum detectable percentage of the above using 150 patients per group
power.prop.test(p1 = 0.4, power = 0.8, n = 150)$p2 * 100
## [1] 56.0992
If the placebo recovery percentage is 40%, then with 97 patients per group, we have 80% power to detect a treatment difference of 20% (since 60% recovery in the active arm). With 150 patients per arm, we can detect a smaller difference, ~16% (since 56% in the active arm).
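As a sanity check, plugging the detected proportion back into power.prop.test() should return roughly 80% power:
# Confirm that p2 of about 0.561 with 150 patients per group gives about 80% power
power.prop.test(n = 150, p1 = 0.4, p2 = 0.561)$power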
- Sample Size for Unequal Groups
Unequally sized groups are common in research and may be the result of simple randomization, planned differences in group size or study dropouts. Unequal sample sizes can lead to:
- Unequal variances between samples, which affects the assumption of equal variances in tests like ANOVA. Having both unequal sample sizes and unequal variances dramatically affects statistical power and Type I error rates.
- A general loss of power. Equal-sized groups maximize statistical power.
- Issues with confounding variables.
Let’s see the formula to calculate the sample size for unequal groups:
n = (r + 1)σ²(Zα/2 + Z1−β)² / (rd²)
where,
- Zα/2 is the normal deviate at the chosen significance level (1.96 for a two-sided 5% level of significance and 2.58 for a two-sided 1% level).
- Z1−β is the normal deviate at power 1 − β, where β is the Type II error rate (0.84 at 80% power and 1.28 at 90% power).
- r = n1/n2 is the ratio of the sample sizes required for the two groups; generally r = 1, which keeps the group sizes equal. If r = 0.5, the sample sizes are allocated 1:2 across the two groups.
- σ and d are the pooled standard deviation and difference of means of 2 groups.
These values are obtained either from previous studies of similar hypotheses or from a pilot study.
Calculate the sample size required for a trial where patients will be randomized to active treatment and placebo in a 2:1 ratio for a two-sided t-test.
# Use the *samplesize* library
library(samplesize)
# Use 90% power, delta 1.5, standard deviation of 2.5, fraction of 0.5
unequalgps <- n.ttest(power = 0.9, alpha = 0.05, mean.diff = 1.5, sd1 = 2.5, sd2 = 2.5, k = 0.5, design = "unpaired", fraction = "unbalanced")
unequalgps
## $`Total sample size`
## [1] 135
##
## $`Sample size group 1`
## [1] 90
##
## $`Sample size group 2`
## [1] 45
##
## $Fraction
## [1] 0.5
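As a cross-check, the unequal-groups formula above can be evaluated directly in R with the same inputs (σ = 2.5, d = 1.5, 90% power, r = 2 for the 2:1 allocation). It yields approximately the same allocation; small differences stem from the package's rounding conventions:
# Hand calculation for a 2:1 allocation (r = n1/n2 = 2)
sigma <- 2.5; d <- 1.5; r <- 2
z_alpha <- qnorm(1 - 0.05 / 2) # two-sided 5% level
z_beta <- qnorm(0.90)          # 90% power
n2 <- (r + 1) * sigma^2 * (z_alpha + z_beta)^2 / (r * d^2) # smaller group
c(n1 = ceiling(r * n2), n2 = ceiling(n2))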
- Sample Size for One-sided Tests
Sample size is also affected by the type of alternative hypothesis, so the alternative must be defined at the beginning of the study. The following example shows how the sample size changes when the alternative hypothesis is changed.
# Generate sample sizes comparing the proportions using a two-sided test
(two_sided <- power.prop.test(p1 = 0.1, p2 = 0.3, power = 0.80, alternative = "two.sided"))
##
## Two-sample comparison of proportions power calculation
##
## n = 61.5988
## p1 = 0.1
## p2 = 0.3
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
ceiling(two_sided$n)
## [1] 62
# Repeat using a one-sided test
(one_sided <- power.prop.test(p1 = 0.1, p2 = 0.3, power = 0.80, alternative = "one.sided"))
##
## Two-sample comparison of proportions power calculation
##
## n = 48.40295
## p1 = 0.1
## p2 = 0.3
## sig.level = 0.05
## power = 0.8
## alternative = one.sided
##
## NOTE: n is number in *each* group
ceiling(one_sided$n)
## [1] 49
# Display the reduction per group
ceiling(two_sided$n) - ceiling(one_sided$n)
## [1] 13
With a two-sided alternative the required sample size is 62 per group; switching to a one-sided alternative reduces it to 49 per group, a saving of 13 patients per group. The choice of alternative hypothesis clearly matters and must be fixed before the study begins.
- Stopping Rules
Interim analysis is a reliable, reasonable way to incorporate what is discovered during a clinical study, including how the study is concluded, without compromising its validity or integrity. Interim analyses are usually performed to assess futility, safety, and efficacy. For example, convincing evidence of efficacy at an interim analysis may justify terminating a trial early rather than prolonging the study.
The most conventional approach is to predetermine the number and timing of the interim analyses, known as repeated significance testing. Part of the alpha is spent (or allocated) at each interim analysis, so that the combined alpha of the interim analyses plus the final analysis does not exceed 0.05. The three most commonly used techniques are:
- Pocock
- Haybittle/Peto
- O’Brien/Fleming
The differences among the three techniques mostly concern how much of the alpha is spent and when.
- Pocock uses the same alpha value at each analysis.
- O’Brien/Fleming uses a very low alpha at the initial analysis and gradually increases it at later analyses. With this approach it is therefore difficult to declare statistical significance at an early stage.
Here, we will calculate the p-values required by the Pocock and O’Brien-Fleming spending functions to end a trial early. The gsDesign() function provides sample size and boundaries for a group sequential design based on treatment effect, spending functions for boundary crossing probabilities, and relative timing of each analysis.
# Load the gsDesign package and generate the p-values for four analyses under the Pocock rule
library(gsDesign)
(Pocock <- gsDesign(k = 4, test.type = 2, sfu = "Pocock"))
## Symmetric two-sided group sequential design with
## 90 % power and 2.5 % Type I Error.
## Spending computations assume trial stops
## if a bound is crossed.
##
## Sample
## Size
## Analysis Ratio* Z Nominal p Spend
## 1 0.296 2.36 0.0091 0.0091
## 2 0.592 2.36 0.0091 0.0067
## 3 0.887 2.36 0.0091 0.0051
## 4 1.183 2.36 0.0091 0.0041
## Total 0.0250
##
## ++ alpha spending:
## Pocock boundary.
## * Sample size ratio compared to fixed design with no interim
##
## Boundary crossing probabilities and expected sample size
## assume any cross stops the trial
##
## Upper boundary (power or Type I Error)
## Analysis
## Theta 1 2 3 4 Total E{N}
## 0.0000 0.0091 0.0067 0.0051 0.0041 0.025 1.1561
## 3.2415 0.2748 0.3059 0.2056 0.1136 0.900 0.6975
##
## Lower boundary (futility or Type II Error)
## Analysis
## Theta 1 2 3 4 Total
## 0.0000 0.0091 0.0067 0.0051 0.0041 0.025
## 3.2415 0.0000 0.0000 0.0000 0.0000 0.000
# Two-sided p-value thresholds at each Pocock analysis
2 * (1 - pnorm(Pocock$upper$bound))
## [1] 0.01821109 0.01821109 0.01821109 0.01821109
# Repeat for the O’Brien-Fleming rule
(OF <- gsDesign(k = 4, test.type = 2, sfu = "OF"))
## Symmetric two-sided group sequential design with
## 90 % power and 2.5 % Type I Error.
## Spending computations assume trial stops
## if a bound is crossed.
##
## Sample
## Size
## Analysis Ratio* Z Nominal p Spend
## 1 0.256 4.05 0.0000 0.0000
## 2 0.511 2.86 0.0021 0.0021
## 3 0.767 2.34 0.0097 0.0083
## 4 1.022 2.02 0.0215 0.0145
## Total 0.0250
##
## ++ alpha spending:
## O’Brien-Fleming boundary.
## * Sample size ratio compared to fixed design with no interim
##
## Boundary crossing probabilities and expected sample size
## assume any cross stops the trial
##
## Upper boundary (power or Type I Error)
## Analysis
## Theta 1 2 3 4 Total E{N}
## 0.0000 0.000 0.0021 0.0083 0.0145 0.025 1.0157
## 3.2415 0.008 0.2850 0.4031 0.2040 0.900 0.7674
##
## Lower boundary (futility or Type II Error)
## Analysis
## Theta 1 2 3 4 Total
## 0.0000 0 0.0021 0.0083 0.0145 0.025
## 3.2415 0 0.0000 0.0000 0.0000 0.000
# Two-sided p-value thresholds at each O’Brien-Fleming analysis
2 * (1 - pnorm(OF$upper$bound))
## [1] 5.152685e-05 4.199337e-03 1.941553e-02 4.293975e-02
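The two sets of boundaries are easier to compare visually; the gsDesign package provides a plot method for design objects (a minimal sketch, output not shown):
# Plot the stopping boundaries for each design
plot(Pocock)
plot(OF)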
The outputs show the derived boundaries and the corresponding p-values. Here we considered symmetric two-sided group sequential designs with 90% power and 2.5% Type I error under the Pocock and O’Brien-Fleming (OF) rules. The results make the difference clear: the Pocock rule demands the same stringent nominal p-value (0.0091) at every analysis, whereas the O’Brien-Fleming rule spends almost no alpha early, so the final analysis can be performed at a nominal p-value (0.0215) close to the conventional level. The sample size implications of these two methods are illustrated in the next section.
- Sample Size Adjustments for Interim Analyses
A trial was originally planned to have no interim analyses, and a sample size calculation estimated that 500 patients were needed for 90% power at a 5% significance level.
Here we will derive the new sample size requirements if three interim analyses are planned, with the potential to stop early, under the Pocock and O’Brien-Fleming spending functions.
# Use the gsDesign function to generate the sample sizes at each stage under the Pocock rule
Pocock.ss <- gsDesign(k = 4, test.type = 2, sfu = "Pocock", n.fix = 500, beta = 0.1)
ceiling(Pocock.ss$n.I)
## [1] 148 296 444 592
# Repeat for the O’Brien-Fleming rule
OF.ss <- gsDesign(k = 4, test.type = 2, sfu = "OF", n.fix = 500, beta = 0.1)
ceiling(OF.ss$n.I)
## [1] 128 256 384 512
Under the Pocock rule, we would need to increase the total sample size to 592; under the O’Brien-Fleming rule, it would be 512.
- Sample Size for Equivalent Binary Outcomes
Equivalence Test
Equivalence tests allow us to conclude equivalence (e.g., that there is no meaningful difference between two treatments) with a specified confidence. They are performed to provide support for the absence of meaningful effects.
R’s ‘TOSTER’ package implements the two one-sided tests (TOST) procedure for equivalence testing of t-tests, correlations, differences between proportions, and meta-analyses, including power analysis for t-tests and correlations.
In the following example, we will see how a function from the ‘TOSTER’ package is used to determine the sample size for an equivalence test. The aim is to find the sample size per group for an expected rate of 60% in both groups, a 4% delta, 90% power, and 5% alpha (significance level); the equivalence interval is (−4%, 4%).
Calculate the required sample size for an equivalence trial given the power and delta. Also calculate the statistical power given the sample size.
library(TOSTER)
# Find the sample size per group for expected rates of 60%, 4% delta, 90% power and 5% significance level
powerTOSTtwo.prop(alpha = 0.05, statistical_power = 0.90,
prop1 = 0.60, prop2 = 0.60,
low_eqbound_prop = -0.04, high_eqbound_prop = 0.04)
## The required sample size to achieve 90 % power with equivalence bounds of -0.04 and 0.04 is 3247
##
## [1] 3246.652
For 90% power with an equivalence interval of (−4%, 4%), the required sample size is 3247 per group.
Next, let’s see how the sample size affects the power.
#Find the power if the above trial is limited to 2500 per group
powerTOSTtwo.prop(alpha = 0.05, N=2500,
prop1 = 0.60, prop2 = 0.60,
low_eqbound_prop = -0.04, high_eqbound_prop = 0.04)
## The statistical power is 78.57 % for equivalence bounds of -0.04 and 0.04 .
##
## [1] 0.7857316
When the sample size is reduced from 3247 to 2500 per group, the power drops to 78.57%. Sample size and power are thus tightly linked.
- Sample Size for Equivalent Continuous Outcomes
As in the above scenario, the sample size can be calculated for equivalent continuous outcomes.
# Find the sample size for a standard deviation of 10, delta of 2, 80% power and 5% significance level
(powerTOSTtwo.raw(alpha = 0.05, statistical_power = 0.80,
sdpooled = 10, low_eqbound = -2, high_eqbound = 2))
## The required sample size to achieve 80 % power with equivalence bounds of -2 and 2 is 428.1924 per group, or 858 in total.
## [1] 428.1924
The required sample size is 429 per group.
Using the same equivalence test, we can calculate sample sizes for standard deviation values ranging from 7 to 13, with a delta of 2, 80% power, and 5% alpha. A table of the obtained sample sizes is given below:
# Find the sample sizes based on standard deviations between 7 and 13
stdev <- seq(7, 13, 1)
npergp <- unlist(lapply(stdev, function(x)
powerTOSTtwo.raw(alpha = 0.05, statistical_power = 0.80,
sdpooled = x, low_eqbound = -2, high_eqbound = 2)))
(metaDF1 <- data.frame(stdev, `Sample Size` = ceiling(npergp)))
## stdev Sample.Size
## 1 7 210
## 2 8 275
## 3 9 347
## 4 10 429
## 5 11 519
## 6 12 617
## 7 13 724
As the standard deviation increases, the required sample size per group also increases.
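This trend is easy to visualize with the same base-R plotting approach used earlier (output not shown):
# Plot required sample size per group against the pooled standard deviation
plot(metaDF1$stdev, metaDF1$Sample.Size, type = "b",
     xlab = "Pooled standard deviation", ylab = "Sample size per group")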
Determining the sample size is a crucial stage in the design of a research study. Appropriate sample sizes are necessary to conclude confidently that sample estimates represent the underlying population parameters. The power of a test determines the sample size needed to reject or fail to reject a study hypothesis. An adequately powered study has a real chance of addressing the questions posed at the start of the investigation, whereas studies that are too small can lead investigators to draw unwarranted conclusions about the efficacy of the trial treatment. The most frequent errors in clinical research involve the following:
- Misjudging the underlying variability for parameter estimates.
- Making incorrect assumptions about the follow-up period to observe the intended effects of the treatment.
- Being unable to predict a subject’s lack of compliance with the study regimen.
- Having a high dropout rate.
- Failing to take into account the study subject’s diversity of endpoints.
A study that has little chance of proving its hypothesis is a waste of time and money. Participants may also be exposed to potential harm or hold unrealistic expectations of therapeutic benefit. Since scientific and ethical issues are intertwined, using appropriate sampling techniques and determining the minimum required sample size are critical for obtaining scientifically and statistically sound results. A sufficient sample size and good data-gathering techniques not only save resources but also produce more reliable, valid, and generalizable results. This article was written to help researchers organise and conduct adequate research. With its abundance of libraries for statistical tasks, R is a useful tool for statistical analysis and one of the best and simplest options for sample size and power calculations.