During a clinical trial, we want to make inferences about the value of some endpoint of interest, which in this article we will call \(\theta\). For these inferences to be meaningful, we need to study enough subjects that the estimate of the effect size is sufficiently precise. On the other hand, we do not want to enroll more subjects than necessary, because it would be unethical to expose them to the possibly harmful effects of the treatment or to the risk of not receiving the standard of care.

An Overview of the Frequentist Approach

Under the frequentist framework, the sample size for a clinical trial is based on a hypothesis test of the form:

\begin{equation*} H_0: \theta \in \Theta_0 \text{ versus } H_1: \theta \in \Theta_1 \end{equation*}

where \(\Theta_0\) and \(\Theta_1\) are two sets that may contain \(\theta\). In this framework, we assume that the true, fixed value of \(\theta\) lies in \(\Theta_1\). When we develop the hypotheses, we set a null value, denoted \(\theta_0\). This null value is the value that we will compare \(\theta\), or more specifically our estimate of \(\theta\), against. The selection of \(\theta_0\) is made by the clinician. Define the effect size as

\begin{equation*} \epsilon = \theta - \theta_0 \textrm{ or } \epsilon = \frac{\theta}{\theta_0} \end{equation*}

as appropriate for the hypothesis of interest. When we determine the sample size needed for our trial, the effect size will be one component, but we need two more.

When we test our hypothesis, four outcomes are possible:

  1. We fail to reject \(H_0\) and \(H_0\) is true
  2. We reject \(H_0\) and \(H_0\) is true
  3. We reject \(H_0\) and \(H_0\) is false
  4. We fail to reject \(H_0\) and \(H_0\) is false

Rejecting \(H_0\) when it is true is a Type I error. Failing to reject \(H_0\) when it is false is a Type II error [1]. Let

\begin{equation*} \alpha = P(\textrm{Type I error}) \end{equation*}
\begin{equation*} \beta = P(\textrm{Type II error}) \end{equation*}

The upper bound for \(\alpha\) is the significance level of the test. Power is defined as \(1 - \beta\). When we design our study, we want to control these two values. We want \(\alpha\) to be below some value, often 0.05, and power to be above some value, usually 0.8 or 0.9. We should note that, holding everything else equal, \(\alpha\) and power move together: as power increases or decreases, so does \(\alpha\).

Our hypothesis will be tested by comparing the test statistic calculated from the sample to its distribution under the null hypothesis. If it exceeds some critical value, we reject the null hypothesis. Otherwise we fail to reject it. Since we know the distribution of the test statistic under the null hypothesis, we can calculate \(\alpha\) and power for a given effect size and sample size. However, we already know the \(\alpha\) level and power we wish to obtain as well as the effect size we may see (or are targeting). Therefore, with a little algebra, we can solve for the sample size needed to satisfy the equation with our given values for \(\alpha\), power and effect size.
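As a concrete illustration of that algebra, here is a minimal sketch for one specific case that is not taken from the text above: a two-sided, one-sample z-test with a known standard deviation, where the required sample size is \(n = \left((z_{1-\alpha/2} + z_{1-\beta})\sigma / \epsilon\right)^2\). The function name and the example values are hypothetical.

```python
import math
from statistics import NormalDist


def z_test_sample_size(effect_size, sigma, alpha=0.05, power=0.9):
    """Subjects needed for a two-sided one-sample z-test with known sigma.

    Assumes the additive effect size (theta - theta_0) and normally
    distributed data; other tests need their own formula.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value
    z_beta = standard_normal.inv_cdf(power)           # quantile for 1 - beta
    n = ((z_alpha + z_beta) * sigma / effect_size) ** 2
    return math.ceil(n)  # round up so the power target is still met


if __name__ == "__main__":
    print(z_test_sample_size(effect_size=0.5, sigma=1.0, alpha=0.05, power=0.9))
    print(z_test_sample_size(effect_size=0.5, sigma=1.0, alpha=0.05, power=0.8))
```

Note how the result scales with the square of \(\sigma / \epsilon\): halving the targeted effect size roughly quadruples the number of subjects.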

The Bayesian Approach

The frequentist method does not carry over directly when we are using a Bayesian approach. Under the Bayesian framework, we are estimating the value of a parameter, not testing a hypothesis. As such, the mathematical concepts of Type I and Type II errors do not apply. Instead, we can talk about credible intervals (CIs) or the posterior error.

The Posterior Credible Interval Approach

The posterior credible interval approach selects a sample size such that the length of the posterior credible interval is some given length \(l\). This length is selected so that the estimate of \(\theta\) has an acceptable precision. When we select the sample size, we need to consider what we want to control: the coverage or the length of the interval. Joseph and Bélisle give three criteria to consider: the average coverage criterion, the average length criterion and the worst outcome criterion.

Average Coverage Criterion

Under the average coverage criterion (ACC), we select the minimum sample size such that the average coverage probability of the interval (which itself can be defined various ways) is at least \(1 - \alpha\), where \(\alpha\) is a pre-defined level of significance.

Let \(\mathscr{X}\) be the sample space from which we will draw our data and \(a\) be a statistic determined by the data. The ACC determines the sample size \(n\) by finding the smallest \(n\) that satisfies

\begin{equation*} \int_{\mathscr{X}} \left\{ \int_a^{a + l} f(\theta|x, n) d\theta \right\} f(x) dx \ge 1 - \alpha \end{equation*}

where \(x \in \mathscr{X}\) is a generic data set with \(n\) observations and \(l\) is the length of the interval we are interested in. The value of \(a\) can be selected such that the interval \((a, a + l)\) is symmetric about the mean (as suggested by Adcock) or is the highest posterior density interval (Joseph, Wolfson and du Berger). It should be noted that when the posterior distribution is symmetric, these two ways to select \(a\) yield the same result.
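To make the criterion concrete, here is a minimal sketch for one tractable case that is an assumption of this example, not something specified above: normal data with known standard deviation \(\sigma\) and a conjugate prior \(\theta \sim N(\mu_0, \sigma^2 / n_0)\). The posterior is then normal with variance \(\sigma^2 / (n_0 + n)\) whatever data are observed, so the inner integral is the same for every \(x\) and the outer average is trivial. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def coverage(n, l, sigma, n0):
    """Posterior probability of the length-l interval centred on the posterior mean.

    Under the assumed model the posterior sd is sigma / sqrt(n0 + n) and does
    not depend on the observed data.
    """
    posterior_sd = sigma / math.sqrt(n0 + n)
    return 2 * NormalDist().cdf(l / (2 * posterior_sd)) - 1


def acc_sample_size(l, sigma, n0=1.0, alpha=0.05):
    """Smallest n whose (data-independent) coverage is at least 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n = math.ceil(4 * z ** 2 * sigma ** 2 / l ** 2 - n0)
    return max(n, 0)  # a strong enough prior may already meet the target


if __name__ == "__main__":
    n = acc_sample_size(l=0.5, sigma=1.0, n0=2.0)
    print(n, coverage(n, l=0.5, sigma=1.0, n0=2.0))
```

Because the coverage here does not vary with the data, the average and the worst case coincide; models in which the posterior spread depends on \(x\) generally need numerical integration or simulation instead.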

Average Length Criterion

While ACC seeks to ensure a desired coverage probability of the posterior credible interval on average, the average length criterion (ALC) seeks to ensure that, on average, the length of the interval will be at most \(l\). This is achieved by finding the smallest \(n\) that satisfies

\begin{equation*} \int_{\mathscr{X}} l'(x, n) f(x) dx \le l \end{equation*}

where \(\mathscr{X}\), \(x\) and \(n\) are defined as above and \(l'(x, n)\) is the length of the \(100(1 - \alpha)\%\) posterior credible interval for \(x\), which can be calculated by

\begin{equation*} \int_a^{a + l'(x, n)} f(\theta|x, n) d\theta = 1 - \alpha \end{equation*}

where \(a\) can be chosen to make the interval symmetric about the mean or to make the interval the highest posterior density interval.
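The two equations above can be evaluated numerically when the data are discrete. The following is a rough sketch under assumptions that are not part of the text: a binomial endpoint with a Beta(\(a\), \(b\)) prior (with \(a, b \ge 1\) so the posterior is unimodal), the highest posterior density interval, and a simple grid approximation of the posterior. All function names and default values are hypothetical.

```python
import math


def log_beta(p, q):
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)


def marginal_pmf(x, n, a, b):
    """Beta-binomial probability f(x) of x successes in n trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return math.exp(log_choose + log_beta(x + a, n - x + b) - log_beta(a, b))


def hpd_length(x, n, a, b, alpha, cells=1000):
    """Grid approximation of l'(x, n), the HPD interval length.

    Cells are added in order of decreasing posterior density until they hold
    1 - alpha of the mass; for a unimodal posterior they form one interval.
    """
    width = 1.0 / cells
    mids = [(i + 0.5) * width for i in range(cells)]
    log_dens = [(a + x - 1) * math.log(t) + (b + n - x - 1) * math.log(1 - t)
                for t in mids]
    peak = max(log_dens)
    dens = sorted((math.exp(v - peak) for v in log_dens), reverse=True)
    target = (1 - alpha) * sum(dens)
    mass = 0.0
    for count, d in enumerate(dens, start=1):
        mass += d
        if mass >= target:
            return count * width
    return 1.0


def average_length(n, a, b, alpha):
    """Left-hand side of the ALC inequality: the sum of l'(x, n) f(x) over x."""
    return sum(marginal_pmf(x, n, a, b) * hpd_length(x, n, a, b, alpha)
               for x in range(n + 1))


def alc_sample_size(l, a=1.0, b=1.0, alpha=0.05, n_max=500):
    """Smallest n (by linear search) whose average HPD length is at most l."""
    for n in range(1, n_max + 1):
        if average_length(n, a, b, alpha) <= l:
            return n
    raise ValueError("no n up to n_max satisfies the criterion")


if __name__ == "__main__":
    print(alc_sample_size(l=0.3))
```

The integral over \(\mathscr{X}\) becomes a finite sum here because a binomial outcome can only take the values \(0, \dots, n\); for continuous data it would typically be estimated by simulation.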

Worst Outcome Criterion

A more conservative approach is the worst outcome criterion (WOC). Let \(\mathscr{S}\) be the subset of \(\mathscr{X}\) such that

\begin{equation*} \int_{\mathscr{S}} f(x) dx = 1 - w \end{equation*}

where

\begin{equation*} f(x) = \int_{\Theta} f(x|\theta) f(\theta) d\theta, \end{equation*}

\(f(x) \ge f(y)\) for all \(x \in \mathscr{S}\) and \(y \notin \mathscr{S}\) and \(w > 0\) is chosen such that \(\mathscr{S}\) contains \(100(1 - w)\%\) of possible \(x \in \mathscr{X}\).

The WOC selects the smallest sample size such that

\begin{equation*} \inf_{x \in \mathscr{S}} \left\{ \int_{a}^{a + l} f(\theta|x, n) d\theta \right\} \ge 1 - \alpha \end{equation*}

where \(l\) is the desired length of the posterior credible interval and \(a\) is chosen for each data set \(x\) as described above. \(w\) is often set to 0.05 or 0.1.
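As a rough sketch of how this criterion could be evaluated, the code below assumes a binomial endpoint with a Beta(\(a\), \(b\)) prior, a fixed target interval length \(l\), and a grid approximation of the posterior; none of these choices come from the text, and all names and defaults are hypothetical. For each data set the interval of length \(l\) with the largest posterior mass is used, which matches the highest posterior density choice when the posterior is unimodal. Because the data are discrete, \(\mathscr{S}\) is built to hold at least \(1 - w\) of the marginal probability rather than exactly that amount.

```python
import math


def log_beta(p, q):
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)


def marginal_pmf(x, n, a, b):
    """Beta-binomial probability f(x) of x successes in n trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return math.exp(log_choose + log_beta(x + a, n - x + b) - log_beta(a, b))


def best_coverage(x, n, a, b, l, cells=1000):
    """Largest posterior mass of any length-l interval, approximated on a grid."""
    width = 1.0 / cells
    mids = [(i + 0.5) * width for i in range(cells)]
    log_dens = [(a + x - 1) * math.log(t) + (b + n - x - 1) * math.log(1 - t)
                for t in mids]
    peak = max(log_dens)
    dens = [math.exp(v - peak) for v in log_dens]
    total = sum(dens)
    k = max(1, round(l * cells))      # number of grid cells in one window
    window = sum(dens[:k])
    best = window
    for i in range(k, cells):         # slide the window across (0, 1)
        window += dens[i] - dens[i - k]
        best = max(best, window)
    return best / total


def worst_coverage(n, a, b, l, w):
    """Infimum of the coverage over S, the most probable data sets."""
    xs = sorted(range(n + 1), key=lambda x: marginal_pmf(x, n, a, b), reverse=True)
    mass, worst = 0.0, 1.0
    for x in xs:
        worst = min(worst, best_coverage(x, n, a, b, l))
        mass += marginal_pmf(x, n, a, b)
        if mass >= 1 - w:             # S now holds at least 1 - w of f(x)
            break
    return worst


def woc_sample_size(l, a=2.0, b=2.0, alpha=0.05, w=0.05, n_max=500):
    """Smallest n (by linear search) meeting the worst outcome criterion."""
    for n in range(1, n_max + 1):
        if worst_coverage(n, a, b, l, w) >= 1 - alpha:
            return n
    raise ValueError("no n up to n_max satisfies the criterion")


if __name__ == "__main__":
    print(woc_sample_size(l=0.4))
```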

Choosing a Criterion

Unfortunately, there are no hard and fast rules for which criterion to use. The choice depends on the goals of the trial you are designing. WOC will give a larger sample size than ACC or ALC for reasonable values of \(w\). That may or may not be desirable depending on what you are studying. ACC and ALC often give similar sample sizes except when \(\alpha\) is small. My only rule would be not to select a criterion just because it gives the smallest sample size; use the one whose statistical properties are appropriate for your trial.

The Posterior Error Approach

The posterior error approach treats the outcome of the clinical trial as a binary random variable. Using the notation of Lee and Zelen, whose 2000 paper elaborated this approach, we will define \(C\) to be a binary random variable that reflects the outcome of the clinical trial and \(T\) to be an indicator variable that reflects the true state of the hypothesis of interest. The values \(C\) and \(T\) can take are denoted as \(-\) and \(+\) to signify negative and positive results, respectively.

Under the frequentist framework, we can write:

\begin{equation*} \alpha = \mathrm{P}(C = +|T = -) \end{equation*}
\begin{equation*} \beta = \mathrm{P}(C = -|T = +) \end{equation*}

For the posterior error approach we can define similar quantities:

\begin{equation*} \alpha^* = \mathrm{P}(T = +|C = -) \end{equation*}
\begin{equation*} \beta^* = \mathrm{P}(T = -|C = +) \end{equation*}

Define \(P_1 = 1 - \alpha^*\), \(P_2 = 1 - \beta^*\) and \(\theta = \mathrm{P}(T = +)\). Note that \(\theta\) here is a prior probability, not the endpoint from the earlier sections. Essentially, \(\theta\) reflects the subjective assessment that there is a difference between treatments in the clinical trial.

Thanks to Bayes' theorem, we can rewrite \(P_1\) and \(P_2\) in terms of \(\alpha\) and \(\beta\) and vice versa. Thus we have

\begin{equation*} \alpha = \frac{(1 - P_2)(\theta + P_1 - 1)}{(1 - \theta)(P_1 + P_2 - 1)} \end{equation*}
\begin{equation*} \beta = \frac{(1 - P_1)(P_2 - \theta)}{\theta(P_1 + P_2 - 1)} \end{equation*}

and, after deciding upon appropriate values of \(\theta\), \(P_1\), and \(P_2\), we can leverage the frequentist approach to calculate the appropriate sample size for the clinical trial.
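The conversion can be sketched in a few lines. The code below takes \(\theta\) to be the prior probability \(\mathrm{P}(T = +)\), which is the reading under which the two formulas above follow from Bayes' theorem, and then feeds the resulting \(\alpha\) and \(\beta\) into a two-sided one-sample z-test formula. That test is an illustrative assumption; the frequentist formula should match the planned analysis. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def frequentist_errors(p1, p2, theta):
    """Convert posterior targets (P1, P2) and prior theta into (alpha, beta)."""
    denom = p1 + p2 - 1
    alpha = (1 - p2) * (theta + p1 - 1) / ((1 - theta) * denom)
    beta = (1 - p1) * (p2 - theta) / (theta * denom)
    if not (0 < alpha < 1 and 0 < beta < 1):
        raise ValueError("P1, P2 and theta do not give valid error rates")
    return alpha, beta


def posterior_probabilities(alpha, beta, theta):
    """Inverse map via Bayes' theorem, useful as a consistency check."""
    p1 = (1 - alpha) * (1 - theta) / ((1 - alpha) * (1 - theta) + beta * theta)
    p2 = (1 - beta) * theta / ((1 - beta) * theta + alpha * (1 - theta))
    return p1, p2


def sample_size(p1, p2, theta, effect_size, sigma):
    """Sample size for an assumed two-sided one-sample z-test with known sigma."""
    alpha, beta = frequentist_errors(p1, p2, theta)
    z = NormalDist()
    n = ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)) * sigma / effect_size) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    alpha, beta = frequentist_errors(p1=0.9, p2=0.9, theta=0.25)
    print(alpha, beta)
    print(posterior_probabilities(alpha, beta, theta=0.25))
    print(sample_size(p1=0.95, p2=0.95, theta=0.5, effect_size=0.5, sigma=1.0))
```

Not every combination of \(P_1\), \(P_2\) and \(\theta\) corresponds to valid error rates, which is why the sketch rejects inputs that give an \(\alpha\) or \(\beta\) outside \((0, 1)\).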

Comparing the Frequentist and Bayesian Approaches

In elaborating Bayesian methods to determine the sample size needed to compare normal means, Joseph and Bélisle point out that the frequentist approach suffers from three drawbacks. First, the frequentist sample size calculation for tests on means is proportional to the square of the estimated value of \(\sigma\), a value that is not known with high precision [2]. Second, the inferences are calculated from the observed data, regardless of the supposed value of \(\sigma\) used in the sample size determination. Third, the frequentist approach ignores possibly available prior information about the value of the effect size, which may cause the researcher to include more subjects than necessary.

The latter two points are similar to general Bayesian arguments against using frequentist methods. However, the first point does pose a problem for frequentist sample size determination. Selecting the parameter values to use in the frequentist calculations can be a tough issue; often people use 'convenient' values that lead to a sample size and power that meet the budget and power expectations of the study, or use values taken from pilot studies whose sample sizes are too small to estimate those values precisely.

On the other hand, the frequentist approach has a greater body of literature behind it and greater acceptance within the regulatory communities [3]. It is much easier to find a reference or an online calculator (https://www.studydesign.io) to perform the calculations for frequentist sample size determination without having to resort to simulation.

The selection of which method to use depends on the methods you will use to analyze your data (e.g. you should not use Bayesian methods for sample size determination and frequentist methods for analysis of your primary endpoint) and the regulatory environment you find yourself in. Each method is a tool in our statistics toolbox so we cannot say one is better than the other, only that one is better than the other for this task in this set of circumstances.

Bayesian Sample Size Determination at StudyDesign.io

We are currently in the process of adding support for Bayesian sample size determination. We expect the posterior error approach will be added in early February for all existing calculators. Support for the posterior credible interval approaches will be rolled out in early March.

If you have any questions, please let us know!

References

  • Adcock, C.J. (1988). A Bayesian approach to calculating sample size. Statistician, 37, 433–439.
  • Chow, S., Shao, J., and Wang, H. (2003). Sample size calculations in clinical research. New York: Marcel Dekker.
  • Joseph, L. and Bélisle, P. (1997). Bayesian sample size determination for normal means and differences between normal means. Statistician, 46, 209–226.
  • Joseph, L., Wolfson, D.B., and du Berger, R. (1995). Sample size calculations for binomial proportions via highest posterior density intervals (with discussion). Journal of the Royal Statistical Society, Series D (The Statistician), 44, 143–154.
  • Lee, S.J. and Zelen, M. (2000). Clinical trials and sample size considerations: another perspective. Statistical Science, 15, 95–110.
[1] It can be argued that there are other kinds of errors that may occur under the frequentist framework, but for the purposes of this article, we will discuss the Type I and Type II errors only.
[2] Joseph and Bélisle are discussing inferences about means, so naturally they are interested in the parameters used in that analysis. Their points can be expanded to any analysis, but it helps here to discuss concrete things.
[3] This is changing rapidly. The FDA, for example, has published guidance on how to use Bayesian methods and the Bayesian approach to sample size determination, and the EMA has suggested the use of Bayesian methods in trials in small populations.