## I.10 Clinical Study Design for Translational Research

## A. Introduction

Experimentation has been a primary tool of science for centuries. The scientific method begins by postulating a question or hypothesis, followed by gathering experimental evidence and analyzing the results to support or refute that hypothesis.

The recommendation and use of experimentation in the medical sciences dates back at least to the ancient Greeks, with the physician and philosopher Galen of Pergamon (b. AD 129) as a notable example (1). Evolving continuously since those early efforts, experimental science eventually led to an emphasis on randomized experiments, in which two or more interventions are tested against one another to assess their relative benefits.

The use of randomization was heavily influenced and promoted by the statistician R.A. Fisher (2), especially through his work during the 1930s. In the decades that followed, the use of randomized experiments grew substantially because of their ability to minimize bias and isolate the treatment effect from nuisance factors. Randomized controlled trials (RCTs), in which the randomization typically compares a novel intervention to an established control treatment, have since become the gold standard for the design of clinical trials.

This chapter provides an overview of the different study designs used in translational research, along with their statistical consequences, and describes how the research question at hand helps determine the choice of design.

## B. The Primary Research Question

No matter what the area or stage of research, scientific studies are generally constructed around answering a primary research question, which informs all aspects of the study design and the conclusions drawn. The focus on a specific question helps to provide a clear, objective answer that can then be used to advance clinical care and to promote additional research.

There are several research questions common in translational clinical research:

- Is the therapy sufficiently safe to begin pre-clinical testing?
- Is the therapy sufficiently safe to begin initial clinical testing?
- How can the therapy best be refined?
- Can the therapy work under ideal conditions?
- Can the therapy work under real-world conditions?

These are motivated by a scientific desire to understand and explain phenomena, tailored to the unique needs of translational clinical science. These questions are usually asked and answered in the order listed, but unexpected results or new findings may require revisiting questions that were previously addressed. This feedback loop is characteristic of translational research. For example, a sufficient degree of safety or efficacy may be required to move an exploratory therapy on to the next phase. The framework of pre-clinical and Phase I-IV studies from the pharmaceutical arena roughly corresponds to these five questions. For medical devices, pre-clinical, pilot/feasibility, pivotal, and post-market studies serve the same purposes, with pivotal studies typically compressing Phases II and III into a single stage.

Other questions common to translational clinical research include the following:

- Are there better ways to measure the disease process?
- Are there better ways to measure a treatment effect?

Variations on these questions might also be applied to a novel diagnostic product or biomarker as part of translational research.

Despite the benefits of a primary research hypothesis, therapies may have multiple modes of action and multiple positive or negative side effects, and there may be strong interest in these questions as well. Organizing and prioritizing research is often a first step in determining the appropriate study design. Once a question is defined, well-developed statistical concepts can be used to turn the question into a workable scientific study.

## C. Basic Statistical Concepts

Here we briefly cover key statistical topics needed to understand study design. Textbooks that cover these topics in greater depth can be consulted for additional mathematical detail (3,4).

### P-values and hypothesis testing

The most memorable content of a statistics course is usually the concept of a p-value. For better or for worse, many people think of statistics in the context of clinical research in the form of a single question: “Is the p-value statistically significant or not?”

The technical definition of a p-value is awkward: it is the probability, under repeated experimentation, of observing an effect as extreme as or more extreme than what was found, given that no effect really exists. That is, in statistics we seek to show that an effect exists by starting with the assumption that it does not (5) and then comparing that assumption to the data observed. The assumption of no effect is called the null hypothesis, and if the data disagree strongly enough with this assumption we say that the null hypothesis has been rejected in favor of the alternative hypothesis of a therapeutic effect.

In a randomized controlled trial, our starting point is the assumption that the novel treatment is no different (no better, usually) than the control so that outcomes in the two treatment arms of the trial should be roughly the same. We then examine the evidence collected in the trial and see how strongly the data agree or disagree with that assumption.

The level of disagreement is conveniently summarized by the p-value, which lies between 0 and 1, with low values representing strong disagreement with the hypothesis of no treatment effect. Statistical significance is typically defined by a p-value less than a pre-defined cutoff value. Statistical theory does not mandate a particular cutoff, but tradition and precedent have established 0.05, or 5%, as the most often chosen value. As with many things in modern statistics, this convention is largely due to Fisher (5), who also noted that the choice is flexible: if one in twenty does not seem high enough odds, we may, if we prefer, draw the line at one in fifty (the 2% point), or one in a hundred (the 1% point).

In practical terms, a small p-value means that results as extreme as those observed would be unlikely to arise from chance alone, and so we conclude that the therapy has some effect. How large that effect is, and whether it is clinically relevant rather than just statistically significant, are questions the p-value cannot answer.
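To make the idea concrete, the following is a minimal sketch of how a p-value might be computed for a two-arm trial with a binary outcome. It uses a standard two-proportion z-test (a normal approximation, chosen here for simplicity and not taken from the chapter), and the function name and the trial counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(events_treat, n_treat, events_ctrl, n_ctrl):
    """Two-sided p-value for the null hypothesis of equal response rates."""
    p_treat = events_treat / n_treat
    p_ctrl = events_ctrl / n_ctrl
    # Pooled rate: the best estimate of the common rate if the null is true.
    pooled = (events_treat + events_ctrl) / (n_treat + n_ctrl)
    se = sqrt(pooled * (1 - pooled) * (1 / n_treat + 1 / n_ctrl))
    z = (p_treat - p_ctrl) / se
    # Probability of a difference at least this extreme in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical trial: 60 of 100 responders on treatment, 45 of 100 on control.
p_value = two_proportion_p_value(60, 100, 45, 100)
print(f"p = {p_value:.3f}; below the 0.05 cutoff: {p_value < 0.05}")
```

Note that the output says nothing about whether a 15-percentage-point difference matters clinically; that judgment sits outside the calculation.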

For regulators and the general public, a risk of great concern is that therapies that do NOT work DO find their way into the marketplace. In statistics, this is called a Type I error: finding a result to be statistically significant when no underlying effect is present. The p-value is the gatekeeper of this error; when we say that a result is statistically significant because a p-value falls below 0.05, we are agreeing to a 5% risk that a treatment with no true effect will nevertheless be accepted. Despite this low theoretical risk of falsely concluding an effect, considerable concern has arisen about the reproducibility of scientific research findings. Ioannidis summarized the issue as follows (6):

“The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser pre-selection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true.”

Thus, scientific judgment is required to weigh the underlying issues of design and conduct along with the strength of the results, and a significant p-value cannot be used as a singular stamp of approval. Similarly, failure to achieve a significant p-value for the primary hypothesis does not mean that there is nothing of scientific value to be obtained from a study.

### Confidence intervals

A concept closely related to the p-value is the confidence interval. This is a range of numbers that helps quantify statistical uncertainty around an estimate. Confidence intervals can quantify the uncertainty for a proportion, a mean value, or any other statistical estimate, including sophisticated quantities derived from regression models. The most common example from daily life can be found in news reports of any recent opinion poll; the “margin of error” associated with a poll result corresponds to the confidence interval. In the scientific literature, this may be reported as a plus/minus value around an estimate.

Clearly, we’d like to be more confident in our beliefs and conclusions, so we tend to prefer a smaller interval because it provides more precision (less uncertainty) than a larger interval. The mathematical definition of a confidence interval, like that of a p-value, is somewhat awkward: it is an interval formed so that, upon repeated experimentation, such a range would have a particular probability of covering the true value of the parameter being estimated. The width of a confidence interval is largely driven by two factors: the variability in the data and the sample size. More variability or a smaller sample size produces a larger confidence interval, while less variability or a larger sample size produces a smaller confidence interval. The degree of confidence we wish to use in the calculation of the interval, known as the confidence level, also plays a role. As with p-values, a 5% error rate is commonly employed, which yields a 95% confidence interval (calculated as 100% minus the 5% error rate).

This calculation illustrates the equivalence of confidence intervals to p-values: if a 95% confidence interval does not contain a particular value, one would reject the null hypothesis for that value at the corresponding 5% significance level. One must still be careful when interpreting confidence intervals in this fashion to distinguish between a pre-specified and a post hoc null hypothesis, as a confidence interval may easily be compared to any particular null hypothesis after it has been calculated.

A further note of caution concerns the interpretation of multiple confidence intervals. In a randomized trial, one usually produces a confidence interval for the treatment effect based on a comparison of the active treatment to the control treatment. This single interval has the interpretation described above. Another option is to compute confidence intervals separately for the treatment and control groups. Naturally, this leads to a desire to compare the two confidence intervals to understand the treatment effect, but each interval must be interpreted on its own. The overlap or lack of overlap of these two intervals does not correspond perfectly to the significance of a p-value for the difference between the treatment and control groups. If the two confidence intervals do not overlap, one may conclude that there is a significant difference between the groups. The converse is not true; two confidence intervals may overlap, but there still may be a significant difference between the two groups.
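The point about overlapping intervals can be checked numerically. The sketch below uses hypothetical group means and standard errors with a normal approximation; the numbers are chosen for illustration only and are not drawn from any study.

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # roughly 1.96 for a 95% interval


def interval(estimate, se):
    """Normal-approximation 95% confidence interval."""
    return (estimate - Z95 * se, estimate + Z95 * se)


# Hypothetical summary statistics for each arm.
mean_ctrl, se_ctrl = 10.0, 1.0
mean_treat, se_treat = 13.0, 1.0

ci_ctrl = interval(mean_ctrl, se_ctrl)
ci_treat = interval(mean_treat, se_treat)
intervals_overlap = ci_ctrl[1] > ci_treat[0]

# The treatment effect has its own standard error and its own interval.
diff = mean_treat - mean_ctrl
se_diff = sqrt(se_ctrl**2 + se_treat**2)
ci_diff = interval(diff, se_diff)
p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se_diff))

print("per-arm intervals overlap:", intervals_overlap)
print("interval for the difference excludes 0:", ci_diff[0] > 0)
print(f"p-value for the difference: {p_value:.3f}")
```

In this example the two per-arm intervals overlap, yet the interval for the difference excludes zero, which is why the single interval for the treatment effect is the one to rely on.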

### Multiple testing in clinical trials

One increasingly recognized issue with the interpretation of trial outcomes is related to the concept of multiple testing. As clinical trials are often time-consuming and expensive to execute and subtle to interpret, there is a natural tendency to insert many endpoints or multiple evaluations of endpoints, either with different statistical analyses or by taking repeated looks at the data over time.

Now, it is inherently sensible to maximize the scientific return from any experiment, especially ones that entail risk to the patient. Apart from benefiting the study’s backers, this desire can benefit patients and enhance the ethical basis for a study by extracting more information. However, repeated analyses on the same experiment will eventually produce an apparently-novel finding due to chance (7).

In clinical trials, this problem often occurs in the evaluation of secondary endpoints or subgroups. For instance, a study with 10 secondary endpoints (not an unusual number for large or ambitious trials) has approximately a 40% chance of seeing at least one turn up statistically significant when no real effect exists. (As an aside for the mathematically inclined reader, this can be calculated as 1-(1-0.05)^{10}, assuming 10 independent tests each at a p-value cutoff of 0.05.) Such false positives are undesirable from a scientific standpoint and from a regulatory perspective.
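The arithmetic behind the 40% figure can be reproduced in a few lines. This sketch assumes the tests are independent, which real endpoints in one trial rarely are, so it should be read as an illustration and not as an exact error rate. The Bonferroni adjustment at the end is a well-known correction added here for comparison; it is not prescribed by the text.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive among independent null tests."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 10, 20):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:>2} tests: {rate:.1%} chance of a false positive")

# Bonferroni: test each endpoint at alpha / n_tests to cap the overall risk.
bonferroni_alpha = 0.05 / 10
controlled_rate = familywise_error_rate(10, bonferroni_alpha)
print(f"10 tests at {bonferroni_alpha}: {controlled_rate:.1%}")
```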

Also common is the potential inflation of Type I error due to repeated analyses of study data over time. Waiting until the completion of a long, large or expensive trial to gain insight into its progress is often undesirable, so interim analyses (analyses of partial data while the study is ongoing) are common. If these analyses are conducted without planning to manage the statistical consequences, phantom results may compromise the credibility of the study.

A host of methods have been developed to handle these issues; these include group-sequential methods for repeated looks at study data over time and multiple comparison procedures to deal with Type I error inflation due to a multitude of study endpoints. Additional methods that have been increasingly used in recent years include adaptive study design and Bayesian methods (8,9).

### Sample size and statistical power

After a primary research question is determined, the sample size of an experiment is motivated by the desire to obtain a significant p-value. There is often interplay among these aspects: different statistical analysis methods and choices regarding study design and sample size may influence the research question, and vice versa. For example, a particular research question may require a larger sample size than is practical, and an alternate question or a variation on the ideal question may be used instead.

A common example in clinical trials is the desire to demonstrate a mortality benefit for a novel therapy; for many medical conditions, mortality rates are low enough to make designing a trial around the primary endpoint of mortality impractical, and surrogates such as disability, clinical functional status, or biological metrics such as blood pressure, cholesterol level, or enzyme levels are used instead.

Similarly, the choice of analysis method may be influenced by both practical concerns regarding interpretation of results and sample size. A classic example is the choice between continuous and binary outcomes; a continuous outcome, e.g. average change in total cholesterol, may allow for a smaller sample size but a binary outcome, e.g. percent of patients experiencing a clinically significant reduction in cholesterol, may be considered more clinically relevant and appropriate.

Statistical power is defined as the probability of obtaining a statistically significant result given that a true effect exists, and is driven by both the true effect size of the therapy (which we do not know in advance but can attempt to guess) and the sample size of the trial. Naturally, a study with a larger sample size will provide more power. Similarly, a therapy that is more effective than another will provide a study with more power, all else being equal.

The complement of power, failing to find a statistically significant result when a treatment effect exists, is called Type II error. If a therapy is effective, 1 minus power is the rate of Type II error: the probability of failing to find that effect due to random chance. Type II error is a risk from the sponsor’s or investigator’s perspective. Just as regulators attempt to avoid approving unsafe or ineffective therapies, study sponsors wish to avoid spending time and resources on a trial that does not meet its scientific and statistical objectives.

In a perfect world, every study would be large enough in size and long enough in duration to have a minimal likelihood of failure due to chance alone. In practice, of course, we must compromise between our desire to minimize the play of chance and the cost of upsizing a study. One hundred percent power is by definition impossible – no matter how well-designed, how large or how ambitious a study, there is always some randomness involved, whether it be in the patient selection process, measurement of the endpoint or a hundred other factors. Most trials are therefore designed to 80% power (20% risk of failure by chance alone) or 90% power (10% risk). Few trials are intentionally sized beyond 90% power because of the increase in size needed – roughly speaking, compared to an identical study powered to 80%, a study powered to 90% requires one-third more subjects and 95% power requires two-thirds more.
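The "one-third more" and "two-thirds more" figures can be checked with the usual normal-approximation formula, in which sample size is proportional to the square of the sum of two standard normal quantiles. The sketch below assumes a two-sided 5% significance level; the function name is illustrative.

```python
from statistics import NormalDist


def relative_sample_size(power, reference_power=0.80, alpha=0.05):
    """Sample size at `power` relative to an otherwise identical study
    designed for `reference_power` (two-sided test, normal approximation)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    return ((z_alpha + z(power)) / (z_alpha + z(reference_power))) ** 2


for power in (0.80, 0.90, 0.95):
    ratio = relative_sample_size(power)
    print(f"{power:.0%} power: {ratio:.2f} times the 80%-power sample size")
```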

Even more important to sample size calculations than the desired power is the magnitude of the treatment effect, which cannot be known precisely in advance of running the trial. This fact is the great conundrum of statistical power analysis – in the study design phase we must guess at the size of the treatment effect yet to be observed to properly size a trial whose objective is to better quantify and understand that very effect. If this sounds like circular reasoning, it is really the feedback loop again: earlier studies, including studies by different sponsors or of different therapies, inform decisions about later studies.

In the case of sample size and power calculations, though, inaccuracy in estimates of effect size can be particularly costly. As a rule of thumb, the sample size of a trial is inversely related to the square of the treatment effect – a therapy only half as effective requires four times as many subjects in a trial to evaluate its effectiveness.
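The inverse-square rule of thumb follows directly from the standard sample size formula for comparing two means. The following sketch assumes a two-arm parallel design, a continuous outcome with a common standard deviation, and a normal approximation; the effect sizes and standard deviation are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def subjects_per_arm(effect, sd, power=0.80, alpha=0.05):
    """Approximate subjects per arm to detect a mean difference `effect`
    with a two-sided test, given a common standard deviation `sd`."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sd / effect) ** 2)


# Hypothetical planning scenario: the anticipated effect is halved.
n_full_effect = subjects_per_arm(effect=10, sd=20)
n_half_effect = subjects_per_arm(effect=5, sd=20)
print(n_full_effect, n_half_effect, n_half_effect / n_full_effect)
```

Because the effect appears squared in the denominator, an optimistic guess at the planning stage translates into a trial that is much too small, which is the cost the text describes.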

It goes without saying that the difference between a 1,000-patient trial and a 4,000-patient trial is material to planning, and few outcomes are more frustrating than a nearly-but-not-quite statistically significant finding for a primary research hypothesis in a late-stage trial. The best cure for this malady is more data prior to the design of a late-stage study, or at least conservative estimates of effect size to avoid underpowered and hence potentially failed trials.

### Sensitivity and specificity

Studies of diagnostic devices or tests are distinct from studies of therapeutic devices in that no condition is being treated; rather, the objective is to detect the condition with accuracy. In such studies, there are two data points of most interest: the truth about the patient’s status (the condition exists or it does not) and the tool’s diagnosis (again, the condition exists or it does not). These categories of truth and diagnosis form the basis for statistical analysis of the tool’s performance (10).

The same two types of statistical error are possible in diagnostic studies, but with different interpretations: Type I error now refers to detecting a condition when none exists, while Type II error is failing to diagnose an existing condition. Accordingly, a different vocabulary is used when discussing trials of diagnostic devices. The combinations of truth and diagnosis can be laid out as in Table 1.

In Table 1, all four possible combinations of truth and diagnosis are represented for a single use of the diagnostic tool (that is, diagnosis of a single patient). Naturally, the objective of a sound diagnostic tool is to be correct the great majority of the time. But the ramifications of being wrong can be very different depending on which kind of wrong it is. The upper-right box in the table is the failure to diagnose an existing condition; in the language of diagnostic trials, this is called a lack of sensitivity. The lower-left box is diagnosing a condition when none is present; this is a lack of specificity. Although a desirable diagnostic tool has both high sensitivity and high specificity, in practice both cannot always be achieved and one must sometimes be sacrificed for the other.
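As a brief illustration, sensitivity and specificity can be computed from the four cell counts of such a table. The counts below are hypothetical and the function name is an assumption made for this sketch.

```python
def diagnostic_accuracy(true_pos, false_neg, false_pos, true_neg):
    """Return (sensitivity, specificity) from the four cell counts."""
    # Sensitivity: share of patients with the condition who are detected.
    sensitivity = true_pos / (true_pos + false_neg)
    # Specificity: share of patients without the condition who test negative.
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical study: 100 patients with the condition, 200 without.
sensitivity, specificity = diagnostic_accuracy(
    true_pos=90, false_neg=10, false_pos=30, true_neg=170
)
print(f"sensitivity = {sensitivity:.0%}, specificity = {specificity:.0%}")
```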

## D. Common Study Designs

### Single-arm studies

The simplest study design in translational research is the observational cohort study or single-arm study (11). This is a study in which a group of subjects or observational units is observed under natural conditions. Such a design consists of applying a treatment to a group of subjects and observing what happens post-treatment.

The desired interpretation is that the changes observed in the wake of the treatment are due to the treatment itself, but without a randomized control group this conclusion requires ruling out other sources of change in the patient’s status. For example, improvements may also be due to effects such as regression to the mean, concomitant treatments, or the Hawthorne effect (i.e., the tendency for patients’ condition to improve simply because they are being observed within the context of a clinical trial, independent of any treatment effect). Without a control group that is subject to the same influences, it is often difficult to distinguish treatment effect from confounding factors such as these.

Even so, studying a large group of observational units under natural conditions may lead to a large body of objective data that can provide a reasonable degree of scientific knowledge. The degree of rigor can be enhanced with a few simple measures: pre-defined and objective measurements, studying a relatively homogeneous group of subjects under relatively homogeneous conditions, and so forth.

In cases where a large portion of the disease process and treatment mechanism are well understood, single-arm studies may provide an appropriate degree of rigor and spare some of the difficulties of more complicated designs. While observational cohort studies can often be improved upon, they are often the source of new ideas and hypotheses that form the foundation of translational research. Early-phase studies are often observational since the objective is to gather basic information about the novel therapy, and the question of how it compares to established treatments that might be used in a control group is for later studies.

Statistical analysis of single-arm clinical trials is usually simpler than that of randomized trials, since there is only one group of results to manage. In this situation, the primary purpose of analysis is sometimes just to summarize the data in an effective way, which can be done with the usual tools of the statistician – means, medians, frequency distributions, and so forth.

Often, however, some criterion for success must be defined, a benchmark against which the study results are to be measured. If 60% of subjects in a given trial are cured of their disease, for example, is that a satisfactory outcome or not? If 85% of subjects survive over the course of study follow-up, is that sufficient evidence of treatment benefit to proceed with further investigation? The answers depend upon the clinical context within which the study is conducted.

Without a control group, the benchmark usually must come from historical evidence or from comparable data on accepted therapies. In statistical terms, this standard is referred to as a performance goal. It is the rate of success that must be met or exceeded (or the rate of adverse outcomes which must be avoided), evaluated in a formal statistical manner, to declare overall success of the investigation. When a performance goal has been broadly negotiated with regulatory authorities and is applied to multiple trials in the same therapeutic area, the term objective performance criterion is often used instead.

In either case, if statistical analysis demonstrates that the threshold set by the performance goal or objective performance criterion has been exceeded, the study is simply said to have met the goal.
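One way such a comparison might be formalized is a one-sided exact binomial test of the observed success rate against the performance goal. In the sketch below, the goal of 50%, the observed 72 successes in 120 subjects, and the one-sided 0.025 significance level are all hypothetical choices made for illustration.

```python
from math import comb


def upper_tail_probability(successes, n, goal):
    """P(X >= successes) if the true success rate were exactly `goal`."""
    return sum(
        comb(n, k) * goal**k * (1 - goal) ** (n - k)
        for k in range(successes, n + 1)
    )


# Hypothetical single-arm trial: 72 of 120 subjects cured (60%),
# tested against a hypothetical performance goal of 50%.
p_value = upper_tail_probability(72, 120, goal=0.50)
met_goal = p_value < 0.025  # one-sided significance level, assumed here
print(f"one-sided p = {p_value:.4f}; performance goal met: {met_goal}")
```

The same observed rate in a much smaller study would not meet the goal, since the test accounts for the uncertainty in the estimate and not just its value.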

### Randomized studies: the two-arm trial

In a randomized study, experimental units (typically study subjects in our case) are assigned to one therapy or another by chance, and not, for instance, by the choice of the experimenter. If this is executed properly, any observed difference between the two randomly assigned groups is due either to the therapy or to chance, and not to confounding factors.

The unique value of randomization is that on average, every possible factor that may influence the outcome will be equally distributed between groups. This includes both measured and unmeasured factors, so for example two randomized groups should be similar with respect to the age, sex, health status, and even genetic makeup of the participants. Statistical tests and the p-values they produce are then used to help rule out the role of chance, leaving the therapy as the most plausible explanation for observed differences in outcomes between groups.

The most common type of randomized trial is the parallel-groups design, in which subjects are enrolled in one arm of the study and remain there for its duration. Most often one of the groups is a control therapy, such as the existing standard of care for the condition being treated. Other alternatives for controls include a placebo control, “no treatment” or an observational control. Use of a control group gives rise to the previously-mentioned gold standard of clinical investigation, the RCT (12). Many such studies have only two arms, the control and the investigational therapy, but three or more arms are not uncommon, especially in pharmaceutical studies where dose finding is in play.

For a parallel-group, two-arm RCT, statistical analysis most often focuses on comparing outcomes between randomized groups. The p-value is of central interest, being the statistical tool that distinguishes treatment effect from the play of chance. Most trials of this type are superiority designs, which have the research objective of demonstrating that the novel therapy is more beneficial (or safer, or both) than the control.

Other types of statistical inference, including non-inferiority or equivalence hypotheses, are also seen. Under non-inferiority, we do not need to show that the novel therapy is superior to the control, only that it is at least as good in terms of its benefits and risks. This study design is common when the control is widely accepted, especially when the novel device or drug is similar in its mechanism of action. In pharmaceutical studies, this may be a new drug in the same class as existing, approved treatments, while for medical devices it may be a modification of an existing design or a similar technology offered by a different manufacturer.

Since statistical analysis can never prove that two therapies are exactly alike in terms of their outcomes, non-inferiority is actually testing the hypothesis that the novel therapy is not “meaningfully” worse than the control. What is meant by “meaningfully” is a topic for clinical discussion, although standard non-inferiority margins can be used in many circumstances where the control therapy and the clinical outcomes are well understood.
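A common way to carry out this test is to compare the lower confidence bound for the difference in success rates with the negative of the non-inferiority margin. The sketch below uses a normal approximation; the 10-percentage-point margin, the one-sided 0.025 level, and the trial counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def lower_bound_of_difference(succ_new, n_new, succ_ctrl, n_ctrl, alpha=0.025):
    """One-sided lower confidence bound for (new rate - control rate)."""
    p_new = succ_new / n_new
    p_ctrl = succ_ctrl / n_ctrl
    se = sqrt(p_new * (1 - p_new) / n_new + p_ctrl * (1 - p_ctrl) / n_ctrl)
    return (p_new - p_ctrl) - NormalDist().inv_cdf(1 - alpha) * se


# Hypothetical trial: 255/300 successes on the new therapy, 258/300 on control.
margin = 0.10  # largest loss of benefit judged clinically acceptable (assumed)
lower = lower_bound_of_difference(255, 300, 258, 300)
non_inferior = lower > -margin
print(f"lower bound = {lower:.3f}; non-inferior at margin {margin}: {non_inferior}")
```

Here the new therapy has a slightly lower observed success rate, yet non-inferiority is still concluded because the data rule out a deficit as large as the margin.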

Equivalence designs, in which the new therapy must be no different from the control in either direction (better or worse), are similar to non-inferiority except that the statistical test looks at both directions of potential difference. They are uncommon in studies evaluating clinical benefit and risk but are often seen in evaluations of bioequivalence, in which the study objective is to show that the systemic impact or uptake of the new therapy matches existing ones.

### Variations on the randomized design

Other variations on randomized studies include cross-over studies, factorial designs, and randomized withdrawal studies. Cross-over designs permit subjects to experience more than one therapy during the course of the study, and therefore permit each subject to serve as his or her own control (13). For example, in a two-period crossover design of two treatments, A and B, subjects are randomized to receive either treatment A followed by treatment B, or treatment B followed by treatment A.

The differences in outcome between treatments A and B are assessed, taking into account the fact that each subject receives each treatment. Since each subject can serve as his or her own control, this typically results in greater statistical power and consequently smaller sample sizes than a comparable parallel-groups design. However, cross-over studies are more complicated to design and execute. They may suffer from carry-over effect, in which the first treatment delivered has an impact on the subject’s response to the next therapy. This is especially true in drug studies, since pharmaceutical interventions normally have a systemic effect that may last past the removal of the therapy.

Factorial designs evaluate multiple interventions simultaneously in various combinations, and as a result can involve several study arms. These designs are common in the social sciences and in agriculture (much of Fisher’s pioneering work was in this area, and statistical design language such as “split-plot analysis” derives from it), but less so in clinical trials. If there are two treatments under investigation, again A and B, and each treatment has a corresponding control group, randomization may proceed in a 2×2 fashion so that there are four possible groups to which a subject can be assigned: A with B, A with the control for B, the control for A with B, or both controls.
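The four arms of a 2×2 factorial design, and one simple way of allocating subjects to them, can be sketched as follows. The permuted-block scheme with blocks of four is an assumption made for illustration; the text does not specify a randomization method, and the arm labels are hypothetical.

```python
import random
from itertools import product

# Every combination of (A or its control) with (B or its control).
ARMS = list(product(("A", "control for A"), ("B", "control for B")))


def randomize(n_subjects, seed=2024):
    """Permuted-block allocation: each block of four contains every arm once."""
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_subjects:
        block = ARMS.copy()
        rng.shuffle(block)
        schedule.extend(block)
    return schedule[:n_subjects]


schedule = randomize(8)
for subject_id, arm in enumerate(schedule, start=1):
    print(subject_id, arm)
```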

Occasionally, randomization is not possible due to practical or ethical reasons. In such cases, it may still be desirable to have a well-defined control against which to compare study results. Non-randomized controls may be found by recruiting subjects from the same pool as treated subjects without randomization, from historical data, or from the scientific literature.

## E. Phases of Translational Research and Study Design

### Overview

Tables 2 and 3 show study designs that may be applied at different phases of the translational research process. The choice of design may interact with the phase; more or less complicated designs may be applied to address more or less complicated questions, in conjunction with the requirements of time and money. The following section outlines the phases of translational research studies.

Table 2. Typical Scheme for Pharmaceutical Studies

Table 3. Typical Scheme for Medical Device Studies

### Early and mid-phase studies

#### Proof of concept

Early-stage studies may be relatively simple and flexible, often based on an observational cohort. This phase of the discovery process may help inform and refine treatments. For example, early prototypes of a medical device that initially are unsuccessful may be fine-tuned. With too much rigor from a study design or statistical perspective, the chance to move promising technologies forward would be limited.

Thus there may be interest in general mechanisms or in simply demonstrating a small possibility that a candidate treatment might be effective. In such cases, concerns about Type I error or bias may be secondary, as these issues would be addressed in subsequent studies. Nonetheless, these issues should not be entirely ignored but rather integrated into the overall research program and long-term plans.

#### Safety studies

Any study of humans or animals must be assessed for safety risks, a universal principle that permeates all clinical study design and execution. While novel therapies may have a high probability of incurring adverse effects, there are cases where this high risk is appropriate given the severity of the disease and lack of treatment options. Again, observational cohort designs are most frequently used in safety studies, both at early and late stages. For example, preliminary safety studies and long-term surveillance studies conducted after regulatory approval both address safety. Both may therefore be based on single-arm study designs.

Safety data from such studies may be analyzed in terms of simple frequency counts and incidence rates. Corresponding hypothesis tests may be of value in understanding the role of chance and the potential for higher (or lower) safety risks in subsequent studies.

To illustrate, an event that occurs once in a study of 100 subjects has an incidence of 1% with an associated 95% confidence interval spanning from approximately 0% to 5.4%. Alternatively, an event that occurs at the same rate (1%, or 10 events) in a study of 1,000 subjects has an associated confidence interval spanning from approximately 0.5% to 1.8%; the larger study provides a more precise estimate of the event occurrence. Equally important, a larger study has a greater chance of observing a rare event (14).
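Intervals of this kind can be checked numerically. The sketch below assumes the interval is the exact binomial (Clopper-Pearson) interval, which the text does not state; the function names are illustrative only, and the bounds are found by simple bisection rather than by any particular statistical library.

```python
from math import comb
from typing import Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _solve_p(k: int, n: int, target: float) -> float:
    """Find p such that binom_cdf(k, n, p) equals target.

    For fixed k < n the CDF is strictly decreasing in p, so bisection
    on the interval [0, 1] converges to the unique solution.
    """
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binom_cdf(k, n, mid) > target:
            lo = mid  # CDF still too large: the solution lies at a larger p
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(events: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for an event rate."""
    # Lower bound solves P(X >= events) = alpha/2, i.e. P(X <= events-1) = 1 - alpha/2.
    lower = 0.0 if events == 0 else _solve_p(events - 1, n, 1.0 - alpha / 2.0)
    # Upper bound solves P(X <= events) = alpha/2.
    upper = 1.0 if events == n else _solve_p(events, n, alpha / 2.0)
    return lower, upper


if __name__ == "__main__":
    for events, n in [(1, 100), (10, 1000)]:
        low, high = clopper_pearson(events, n)
        print(f"{events}/{n}: {100 * low:.2f}% to {100 * high:.2f}%")
```

Running the script prints the interval for each of the two scenarios, so the narrowing of the interval with increasing sample size can be seen directly.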

As the size of studies increases along the bench-to-bedside route, the larger amount of data better informs the safety risks. This means that only the most common safety risks may be measurable early on and that very rare safety risks might remain unknown until extremely large studies are undertaken. This may be taken into consideration when establishing the inclusion/exclusion criteria for studies and the appropriate labeling and indications for new therapies. In early development when high risks are still possible, it may be judicious to limit enrollment to subjects expected to receive the most benefit.

As more is known and risks are better quantified, eligibility criteria may be broadened to better reflect the population determined to derive benefit. For regulators, a key component of drug or device approval is to correctly identify the intended population for use. The labeling for an approved product typically closely reflects the subjects who were studied in late-stage clinical trials of the therapy.

#### Exploratory studies (Phase I and II for pharmaceuticals, pre-clinical/pilot/feasibility for devices)

Phase I studies for pharmaceutical products are characterized by the need to understand the tolerable doses of the medication. These studies may be run in healthy human volunteers to understand at what dose there is still a reasonable expectation of safety given the pharmacokinetics/pharmacodynamics and the need to maximize effectiveness.

The parallel exploratory study for medical devices is often a pre-clinical study. If the device is fixed in its mechanism of action, there may be no corresponding concept of dose. A heart valve, for example, is mechanical in its action and needs no fine-tuning with regard to its performance besides correct sizing of the device to anatomy. For devices where the concept of dose does apply, such as neurostimulation devices whose electrical therapy can be varied in intensity, these settings may be initially explored through bench studies and established ranges for stimulation parameters, pre-clinical efficacy studies, or ongoing adjustments in human clinical recipients.

The ad hoc nature of such adjustment illustrates the need for, and potential benefits of, translational research; a back-and-forth movement from bedside treatment to therapeutic mechanisms is important to understand and can potentially improve patient outcomes.

Phase II pharmaceutical studies, which may be single-arm or randomized designs, extend the work of earlier trials to an initial evaluation of safety and/or effectiveness in a larger, better-defined cohort. These trials serve as gatekeepers for the larger and more costly Phase III studies in which the clinical utility of the therapy is determined. Note that there is no direct analog to the pharmaceutical Phase II study in medical devices, since dose ranging is typically not at issue and the localized mechanical effect of the device compresses the process of evaluating biological effect as distinct from clinical benefit.

Early study design is often an exercise in educated guessing, and we have already seen the potential statistical consequences of approaching later study design without sufficient information. Consequently, one ad hoc piece of advice that applies to mid-phase studies (and to every other phase as well) is “run the largest study you can afford.”

### Confirmatory studies (Pivotal device trials, Phase III pharmaceutical trials)

Confirmatory trials are generally the most complicated and challenging clinical studies. Confirmatory trials involve a refined therapy, a focused target population of interest, a well-defined research question and a corresponding study hypothesis. Establishing each of these aspects may be a multi-year project, though rarely do researchers have such luxury of time. More frequently, informed decisions on all these aspects are made based on a combination of scientific judgment, historical data, information from earlier studies, and the particular risk/benefit tolerance of the study sponsor.

With regard to the latter, for a small startup company with one technology under investigation, the entire future of the company and technology may hinge on the result of a single study and so risk tolerance may be low. On the other hand, it is precisely these situations in which time and money are often the most precious, increasing the need to take risk. Each of these aspects informs the study design and sample size. A smaller trial may mitigate risks with regard to time and expense, but will be riskier than a larger trial in terms of definitively answering the study question.
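The trade-off between trial size and the chance of a definitive answer can be made concrete with a standard normal-approximation calculation for comparing two proportions. The sketch below is illustrative only: the event rates (10% in control, 5% with treatment), the two-sided 5% significance level, the 80% power target, equal allocation, and the function names are all assumptions introduced here, not values from this chapter.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def n_per_arm(p_control: float, p_treatment: float,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate subjects per arm for a two-sided test of two proportions."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_beta = _STD_NORMAL.inv_cdf(power)
    variance = p_control * (1.0 - p_control) + p_treatment * (1.0 - p_treatment)
    effect = p_control - p_treatment
    return ceil((z_alpha + z_beta) ** 2 * variance / effect**2)


def approx_power(n: int, p_control: float, p_treatment: float,
                 alpha: float = 0.05) -> float:
    """Approximate power of the same test with n subjects per arm."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    variance = p_control * (1.0 - p_control) + p_treatment * (1.0 - p_treatment)
    standard_error = sqrt(variance / n)
    return _STD_NORMAL.cdf(abs(p_control - p_treatment) / standard_error - z_alpha)


if __name__ == "__main__":
    required = n_per_arm(0.10, 0.05)
    print(f"Subjects per arm for 80% power: {required}")
    for n in (100, 200, required):
        print(f"n = {n:4d} per arm -> power = {approx_power(n, 0.10, 0.05):.2f}")
```

Under these assumed inputs, roughly halving the per-arm sample size drops the power to about one half, which is the sense in which the smaller trial is the riskier one for answering the study question.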

Confirmatory studies are usually randomized. The benefits of a randomized study over a non-randomized study in terms of scientific rigor are clear. Random assignment of experimental units to treatments prevents the introduction of substantial bias into a study. Investigators cannot systematically select high (or low) risk subjects for one treatment or another according to their preexisting biases.

Random assignment tends to create groups that are similar in every possible way, on both measured and unmeasured variables, so that any difference in outcome observed between the randomized groups is due either to chance (the randomization process itself) or to the treatment assignment. Because statistical methods allow us to rule out the role of chance, the combination of a randomized study with a rigorous statistical analysis gives us insight into causal relationships.
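The balancing property of randomization can be seen in a small simulation. The sketch below is a toy illustration under stated assumptions: a single hypothetical covariate ("age", drawn from a normal distribution with mean 60 and standard deviation 10), simple 1:1 allocation by shuffling, and an arbitrary seed. The allocation never looks at the covariate, so the covariate stands in for an unmeasured variable.

```python
import random
from statistics import mean
from typing import List


def randomize(n_subjects: int, seed: int) -> List[str]:
    """Assign subjects 1:1 to two arms by shuffling a balanced list of labels."""
    if n_subjects % 2 != 0:
        raise ValueError("n_subjects must be even for 1:1 allocation")
    rng = random.Random(seed)
    arms = ["treatment", "control"] * (n_subjects // 2)
    rng.shuffle(arms)
    return arms


def covariate_gap(n_subjects: int, seed: int = 2024) -> float:
    """Difference in mean age between arms after random assignment."""
    rng = random.Random(seed + 1)  # separate stream for the simulated covariate
    ages = [rng.gauss(60.0, 10.0) for _ in range(n_subjects)]
    arms = randomize(n_subjects, seed)
    treated = [age for age, arm in zip(ages, arms) if arm == "treatment"]
    controls = [age for age, arm in zip(ages, arms) if arm == "control"]
    return mean(treated) - mean(controls)


if __name__ == "__main__":
    for n in (20, 200, 2000, 20000):
        print(f"n = {n:5d}: mean age difference = {covariate_gap(n):+.2f} years")
```

The residual difference between arms is due to chance alone and tends to shrink as the number of randomized subjects grows; in a small trial it can still be appreciable.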

Nevertheless, particularly in the device world, nonrandomized confirmatory studies (in this case called pivotal trials) are sometimes feasible. Such situations include cases where the acceptable level of performance of the therapy has already been defined, as is the case with many artificial heart valves, or when randomization is impractical due to the lack of a standard of care against which the new therapy can be compared.

Pivotal studies in which the control group is truly no intervention at all (often referred to as watchful waiting) are unusual due to the ethical concern of enrolling patients in a clinical trial in order to do nothing except observe and report when a promising therapy is available for investigation. This notion of an unmet clinical need drives the design of many trials.

### Long-term and post-market studies

Post-approval studies may be required by regulatory agencies to further define and understand questions about new therapies. These questions are generally not those that require an answer prior to approval, but ones that may later inform the labeling and use of the therapy, perhaps under different conditions or in select subgroups of patients.

These may also be established to help understand safety aspects of new therapies, where the original studies used to support market approval were of insufficient sample size. For example, for extremely rare adverse events, it may only be after the product is available to thousands of subjects that even one or two events are observed. Surveillance studies may be established to more precisely understand and quantify these types of risks.
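The arithmetic behind this point is straightforward. The sketch below assumes that subjects are independent and share the same per-subject event probability, which is a simplification; the rates used and the function names are hypothetical.

```python
from math import ceil, log


def prob_at_least_one(rate: float, n: int) -> float:
    """Probability of observing one or more events among n independent subjects."""
    return 1.0 - (1.0 - rate) ** n


def n_for_detection(rate: float, confidence: float = 0.95) -> int:
    """Smallest n giving at least the stated chance of seeing one or more events."""
    return ceil(log(1.0 - confidence) / log(1.0 - rate))


if __name__ == "__main__":
    rate = 0.001  # hypothetical adverse event occurring in 1 of 1,000 subjects
    for n in (100, 500, 3000):
        print(f"n = {n:4d}: P(at least one event) = {prob_at_least_one(rate, n):.2f}")
    print(f"Subjects needed for a 95% chance: {n_for_detection(rate)}")
```

For an event this rare, a pre-approval study of a few hundred subjects will more often than not observe no events at all, which is why surveillance in much larger post-market populations is needed to quantify such risks.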

It is increasingly the expectation of regulatory agencies, particularly the U.S. Food and Drug Administration, that post-approval studies be designed with a high level of statistical rigor. Formal statistical hypotheses addressing questions of interest, such as the rate of occurrence of sentinel safety events or confirmation of efficacy signals in either primary or secondary endpoints, are becoming the norm in many therapeutic areas. This heightened level of statistical evaluation brings with it formal powering of such studies, with sample sizes in some cases driven by statistical concerns rather than by precedent or informal guidelines.

Additionally, sponsors may run post-market studies to examine unanswered questions, though these are generally not a regulatory requirement. These may be designed and run to support future publications or to help inform future treatments or studies in new areas. They may similarly be a starting point for expansions of the indications and labeling of the therapy, seeking out patient populations not studied in Phase III pharmaceutical or pivotal device trials.

## F. Summary

This chapter focuses on the clinical, regulatory and statistical perspectives of study design. Our objective has been to open the door to further investigation of this critical scientific topic; a more detailed consideration of many of the subtleties is beyond our current scope.

As it continues to evolve and expand, translational research will benefit from greater collaboration between historically separate areas of science as well as improved understanding of the needs of those it seeks to serve. A more comprehensive understanding of study design is one facet of this evolution toward more productive research and ultimately, better patient outcomes.

## References

1. Sarton G. Galen of Pergamon. Lawrence, KS: University of Kansas Press, 1954.
2. Box JF. R.A. Fisher: The Life of a Scientist. New York: Wiley, 1978.
3. Casella G, Berger R. Statistical Inference. 2nd ed: Brooks/Cole, 2001.
4. Montgomery D. Design and Analysis of Experiments. 8th ed. New York: John Wiley and Sons, 2013.
5. Fisher R. The Design of Experiments. 9th ed: Macmillan, 1971.
6. Ioannidis JP. Why most published research findings are false. PLoS Medicine 2005;2:e124.
7. Miller R. Simultaneous Statistical Inference. New York: Springer, 1981.
8. Chow S-C, Chang M. Adaptive Design Methods in Clinical Trials: Chapman and Hall/CRC, 2011.
9. Gelman A, Carlin JB, Stern H, Rubin D. Bayesian Data Analysis. 2nd ed: Chapman and Hall/CRC, 2004.
10. Agresti A. Categorical Data Analysis. New York: Wiley, 1990.
11. Rosenbaum P. Design of Observational Studies. New York: Springer, 2010.
12. Schulz KF, Altman DG, Moher D, CONSORT Group. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMJ 2010;340:c332.
13. Jones B, Kenward M. Design and Analysis of Cross-Over Trials. London: Chapman and Hall, 2003.
14. Tsang R, Colley L, Lynd LD. Inadequate statistical power to detect clinically significant differences in adverse event rates in randomized controlled trials. Journal of Clinical Epidemiology 2009;62:609-16.