Field experiments, explained
Editor’s note: This is part of a series called “The Day Tomorrow Began,” which explores the history of breakthroughs at UChicago.
A field experiment is a research method that uses some controlled elements of traditional lab experiments, but takes place in natural, real-world settings. This type of experiment can help scientists explore questions like: Why do people vote the way they do? Why do schools fail? Why are certain people hired less often or paid less money?
University of Chicago economists were early pioneers in the modern use of field experiments and conducted innovative research that impacts our everyday lives—from policymaking to marketing to farming and agriculture.
Jump to a section:
- What is a field experiment?
- Why do a field experiment?
- What are examples of field experiments?
- When did field experiments become popular in modern economics?
- What are criticisms of field experiments?
Field experiments bridge the highly controlled lab environment and the messy real world. Social scientists have taken inspiration from traditional medical or physical science lab experiments. In a typical drug trial, for instance, participants are randomly assigned into two groups. The control group gets the placebo—a pill that has no effect. The treatment group will receive the new pill. The scientist can then compare the outcomes for each group.
A field experiment works similarly, just in the setting of real life.
It can be difficult to understand why a person chooses to buy one product over another or how effective a policy is when dozens of variables affect the choices we make each day. “That type of thinking, for centuries, caused economists to believe you can't do field experimentation in economics because the market is really messy,” said Prof. John List, a UChicago economist who has used field experiments to study everything from how people use Uber and Lyft to how to close the achievement gap in Chicago-area schools. “There are a lot of things that are simultaneously moving.”
The key to cleaning up the mess is randomization—or assigning participants randomly to either the control group or the treatment group. “The beauty of randomization is that each group has the same amount of bad stuff, or noise or dirt,” List said. “That gets differenced out if you have large enough samples.”
Though lab experiments are still common in the social sciences, field experiments are now often used by psychologists, sociologists and political scientists. They’ve also become an essential tool in the economist’s toolbox.
Some issues are too big and too complex to study in a lab or on paper—that’s where field experiments come in.
In a laboratory setting, a researcher wants to control as many variables as possible. These experiments are excellent for testing new medications or measuring brain functions, but they aren’t always great for answering complex questions about attitudes or behavior.
Labs are highly artificial with relatively small sample sizes—it’s difficult to know if results will still apply in the real world. Also, people are aware they are being observed in a lab, which can alter their behavior. This phenomenon, sometimes called the Hawthorne effect, can affect results.
Traditional economics often uses theories or existing data to analyze problems. But when a researcher wants to study whether a policy will be effective, field experiments are a useful way to look at how results may play out in real life.
In 2019, UChicago economist Michael Kremer (then at Harvard) was awarded the Nobel Prize alongside Abhijit Banerjee and Esther Duflo of MIT for their groundbreaking work using field experiments to help reduce poverty. In the 1990s and 2000s, Kremer conducted several randomized controlled trials in Kenyan schools testing potential interventions to improve student performance.
In the 1990s, Kremer worked alongside an NGO to figure out if buying students new textbooks made a difference in academic performance. Half the schools got new textbooks; the other half didn’t. The results were unexpected—textbooks had no impact.
“Things we think are common sense, sometimes they turn out to be right, sometimes they turn out to be wrong,” said Kremer on an episode of the Big Brains podcast. “And things that we thought would have minimal impact or no impact turn out to have a big impact.”
In the early 2000s, Kremer returned to Kenya to study a school-based deworming program. He and a colleague found that providing deworming pills to all students reduced absenteeism by more than 25%. After the study, the program was scaled nationwide by the Kenyan government. From there it was picked up by multiple Indian states—and then by the Indian national government.
“Experiments are a way to get at causal impact, but they’re also much more than that,” Kremer said in his Nobel Prize lecture. “They give the researcher a richer sense of context, promote broader collaboration and address specific practical problems.”
Among many other things, field experiments can be used to:
Study bias and discrimination
A 2004 study published by UChicago economists Marianne Bertrand and Sendhil Mullainathan (then at MIT) examined racial discrimination in the labor market. They sent over 5,000 resumes to real job ads in Chicago and Boston. The resumes were identical in all ways but one—the name at the top. Half the resumes bore white-sounding names like Emily Walsh or Greg Baker. The other half bore African American-sounding names like Lakisha Washington or Jamal Jones. The study found that applications with white-sounding names were 50% more likely to receive a callback.
Examine voting behavior
Political scientist Harold Gosnell, PhD 1922, pioneered the use of field experiments to examine voting behavior while at UChicago in the 1920s and ’30s. In his study “Getting out the vote,” Gosnell sorted 6,000 Chicagoans across 12 districts into groups. One group received voter registration info for the 1924 presidential election; the control group did not. Voter registration jumped substantially among those who received the informational notices. The study showed not only that get-out-the-vote mailings could have a substantial effect on voter turnout, but also that field experiments were an effective tool in political science.
Test ways to reduce crime and shape public policy
Researchers at UChicago’s Crime Lab use field experiments to gather data on crime as well as policies and programs meant to reduce it. For example, Crime Lab director and economist Jens Ludwig co-authored a 2015 study on the effectiveness of the school mentoring program Becoming a Man. Developed by the non-profit Youth Guidance, Becoming a Man focuses on guiding male students between 7th and 12th grade to help boost school engagement and reduce arrests. In two field experiments, the Crime Lab found that while students participated in the program, total arrests were reduced by 28–35%, violent-crime arrests went down by 45–50% and graduation rates increased by 12–19%.
The earliest field experiments took place—literally—in fields. Starting in the 1800s, European farmers began experimenting with fertilizers to see how they affected crop yields. In the 1920s, two statisticians, Jerzy Neyman and Ronald Fisher, were tasked with assisting with these agricultural experiments. They are credited with identifying randomization as a key element of the method—making sure each plot had the same chance of being treated as the next.
The earliest large-scale field experiments in the U.S. took place in the late 1960s to help evaluate various government programs. Typically, these experiments were used to test minor changes to things like electricity pricing or unemployment programs.
Though field experiments were used in some capacity throughout the 20th century, this method didn’t truly gain popularity in economics until the 2000s. Kremer and List were early pioneers and first began experimenting with the method in the 1990s.
In 2004, List co-authored a seminal paper defining field experiments and arguing for the importance of the method. In 2008, he and UChicago economist Steven Levitt published another study tracing the history of field experiments and their impact on economics.
In the past few decades, the use of field experiments has exploded. Today, economists often work alongside NGOs or nonprofit organizations to study the efficacy of programs or policies. They also partner with companies to test products and understand how people use services.
There are several ethical discussions happening among scholars as field experiments grow in popularity. Chief among them is the issue of informed consent. All studies that involve human test subjects must be approved by an institutional review board (IRB) to ensure that people are protected.
However, participants in field experiments often don’t know they are in an experiment. While an experiment may be given the stamp of approval in the research community, some argue that taking away people’s ability to opt out is inherently unethical. Others advocate for stricter review processes as field experiments continue to evolve.
According to List, another major challenge for field experiments is scale. Many experiments only test small groups—say, dozens to hundreds of people. This may mean the results are not applicable to broader situations. For example, if a scientist runs an experiment at one school and finds their method works there, does that mean it will also work for an entire city? Or an entire country?
List believes that in addition to testing option A and option B, researchers need a third option that accounts for the limitations that come with a larger scale. “Option C is what I call critical scale features. I want you to bring in all of the warts, all of the constraints, whether they're regulatory constraints, or constraints by law,” List said. “Option C is like your reality test, or what I call policy-based evidence.”
This problem isn’t unique to field experiments, but List believes tackling the issue of scale is the next major frontier for a new generation of economists.
Introduction to Field Experiments and Randomized Controlled Trials
Have you ever been curious about the methods researchers employ to determine causal relationships among various factors, ultimately leading to significant breakthroughs and progress in numerous fields? In this article, we offer an overview of field experimentation and its importance in discerning cause and effect relationships. We outline how randomized experiments represent an unbiased method for determining what works. Furthermore, we discuss key aspects of experiments, such as intervention, excludability, and non-interference. To illustrate these concepts, we present a hypothetical example of a randomized controlled trial evaluating the efficacy of an experimental drug called Covi-Mapp.
Why experiments?
Every day, we find ourselves faced with questions of cause and effect. Understanding the driving forces behind outcomes is crucial, ranging from personal decisions like parenting strategies to organizational challenges such as effective advertising. This blog aims to provide a systematic introduction to experimentation, igniting enthusiasm for primary research and highlighting the myriad of experimental applications and opportunities available.
The challenge for those who seek to answer causal questions convincingly is to develop a research methodology that doesn't require identifying or measuring all potential confounders. Since no planned design can eliminate every possible systematic difference between treatment and control groups, random assignment emerges as a powerful tool for minimizing bias. In the contentious world of causal claims, randomized experiments represent an unbiased method for determining what works. Random assignment means participants are assigned to different groups or conditions in a study purely by chance. Basically, each participant has an equal chance to be assigned to a control group or a treatment group.
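As a minimal illustration (not from the original post; the participant IDs are hypothetical), random assignment in R can be as simple as shuffling an equal number of control and treatment labels across participants:

```r
# Minimal sketch of complete random assignment (hypothetical participant IDs)
set.seed(42)                              # make the assignment reproducible
participants <- data.frame(id = 1:100)

# Shuffle 50 control (0) and 50 treatment (1) labels across the 100 participants
participants$treatment <- sample(rep(c(0, 1), each = 50))

table(participants$treatment)             # 50 in each group
```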
Field experiments, or randomized studies conducted in real-world settings, can take many forms. While experiments on college campuses are often considered lab studies, certain experiments on campus – such as those examining club participation – may be regarded as field experiments, depending on the experimental design. Ultimately, whether a study is considered a field experiment hinges on the definition of "the field."
Researchers may employ two main scenarios for randomization. The first involves gathering study participants and randomizing them at the time of the experiment. The second capitalizes on naturally occurring randomizations, such as the Vietnam draft lottery.
Intervention, Excludability, and Non-Interference
Three essential features of any experiment are intervention, excludability, and non-interference. In a general sense, the intervention refers to the treatment or action being tested in an experiment. The excludability principle is satisfied when the only difference between the experimental and control groups is the presence or absence of the intervention. The non-interference principle holds when the outcome of one participant in the study does not influence the outcomes of other participants. Together, these principles ensure that the experiment is designed to provide unbiased and reliable results, isolating the causal effect of the intervention under study.
Omitted Variables and Non-Compliance
To ensure unbiased results, researchers must randomize as much as possible to minimize omitted variable bias. Omitted variables are factors that influence the outcome but are not measured or are difficult to measure. These unmeasured attributes, sometimes called confounding variables or unobserved heterogeneity, must be accounted for to guarantee accurate findings.
Non-compliance can also complicate experiments. One-sided non-compliance occurs when individuals assigned to a treatment group don't receive the treatment (failure to treat), while two-sided non-compliance occurs when some subjects assigned to the treatment group go untreated or individuals assigned to the control group receive the treatment. Addressing these issues at the design level by implementing a blind or double-blind study can help mitigate potential biases.
Achieving Precision through Covariate Balance
To ensure the control and treatment groups are comparable in all relevant aspects, particularly when the sample size (n) is small, it is essential to achieve covariate balance. Covariance measures the association between two variables, while a covariate is a factor that influences the outcome variable. By balancing covariates, we can more accurately isolate the effect of the treatment, leading to improved precision in our findings.
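One common way to check covariate balance, sketched here on simulated data with assumed covariates (age and baseline temperature), is to compare group means or regress the treatment indicator on the covariates; under successful randomization, no covariate should predict assignment:

```r
# Hedged sketch: covariate balance check on simulated data (covariate names assumed)
set.seed(7)
n <- 200
dat <- data.frame(
  treatment     = sample(rep(c(0, 1), each = n / 2)),   # random assignment
  age           = rnorm(n, mean = 45, sd = 12),
  baseline_temp = rnorm(n, mean = 98.6, sd = 0.7)
)

# Covariates should not predict treatment assignment (coefficients near zero)
summary(lm(treatment ~ age + baseline_temp, data = dat))

# Group means should be similar across arms
aggregate(cbind(age, baseline_temp) ~ treatment, data = dat, FUN = mean)
```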
Fictional Example of Randomized Controlled Trial of Covi-Mapp for COVID-19 Management
Let's explore a fictional example to better understand experiments: a one-week randomized controlled trial of the experimental drug Covi-Mapp for managing COVID-19. In this case, the control group receives the standard care for COVID-19 patients, while the treatment group receives the standard care plus Covi-Mapp. The outcome of interest is whether patients have cough symptoms on day 7, since subsiding cough symptoms are an encouraging sign in COVID-19 recovery. We'll measure the presence of cough on day 0 and day 7, as well as temperature on day 0 and day 7. Gender is also tracked.
In this Covi-Mapp example, the intervention is the Covi-Mapp drug, the excludability principle is satisfied if the only difference in patient care between the groups is the drug administration, and the non-interference principle holds if one patient's outcome doesn't affect another's.
First, let's assume we have a dataset containing the relevant information for each patient, including cough status on day 0 and day 7, temperature on day 0 and day 7, treatment assignment, and gender. We'll read the data and explore the dataset:
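The original post's code is not reproduced here, but a sketch of this step might look like the following; the file name is hypothetical, and the column names are assumed to match those referenced later in the post (treat_covid_mapp, male, plus cough and temperature on days 0 and 7):

```r
# Hedged sketch: the file name and exact column names are assumptions
covid_data <- read.csv("covi_mapp_trial.csv")

str(covid_data)      # cough_day_0, cough_day_7, temperature_day_0, temperature_day_7,
                     # treat_covid_mapp, male
head(covid_data)
summary(covid_data)

table(covid_data$treat_covid_mapp)   # how many patients in control vs. treatment
```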
Simple treatment effect of the experimental drug
Without any covariates, let's first look at the estimated effect of the treatment on the presence of cough on day 7. The estimated proportion of patients with a cough on day 7 in the control group (not receiving the experimental drug) is 0.847458. In other words, about 84.7% of patients in the control group are expected to have a cough on day 7. The estimated effect of the experimental drug on the presence of cough on day 7 is -0.238. This means that, on average, receiving the experimental drug reduces the proportion of patients with a cough on day 7 by 23.8 percentage points compared to the control group.
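Estimates like these would come from a simple linear probability model of day 7 cough on the treatment indicator; a sketch, using the assumed column names from above:

```r
# Hedged sketch: treatment effect without covariates
simple_model <- lm(cough_day_7 ~ treat_covid_mapp, data = covid_data)
summary(simple_model)
# Intercept        ~ proportion of control patients with a cough on day 7
# treat_covid_mapp ~ difference in that proportion for treated patients
```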
We know that a patient's initial condition could affect the final outcome: a patient who already has a cough and a fever on day 0 may fare worse by day 7 regardless of treatment. To better understand the treatment's effect, let's add these covariates:
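A sketch of the covariate-adjusted model, again using the assumed column names:

```r
# Hedged sketch: adjusting for baseline (day 0) cough and temperature
adjusted_model <- lm(
  cough_day_7 ~ treat_covid_mapp + cough_day_0 + temperature_day_0,
  data = covid_data
)
summary(adjusted_model)
```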
The output shows the results of a linear regression model estimating the effect of the experimental drug (treat_covid_mapp) on the presence of cough on day 7, adjusting for cough on day 0 and temperature on day 0. The experimental drug significantly reduces the presence of cough on day 7 by approximately 16.6 percentage points compared to the control group (p-value = 0.046242). The presence of cough on day 0 does not significantly predict the presence of cough on day 7 (p-value = 0.717689). A one-unit increase in temperature on day 0 is associated with a 20.6 percentage point increase in the presence of cough on day 7, an effect that is statistically significant (p-value = 0.009859).
Should we add day 7 temperature as a covariate? If we did, we might find that the treatment is no longer statistically significant, because temperature on day 7 can itself be affected by the treatment. It is a post-treatment variable, and conditioning on a variable that the intervention influences undermines the analysis.
Next, we'd like to investigate whether the treatment affects men and women differently. Since we collected gender as part of the study, we can check for a heterogeneous treatment effect (HTE) for male vs. female participants. The experimental drug has a marginally significant effect on the outcome variable for females, reducing it by approximately 23.1 percentage points (p-value = 0.05391).
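The exact specification used in the original analysis isn't shown here; two common ways to probe a heterogeneous treatment effect are an interaction term or separate regressions within each subgroup (a sketch, with the assumed column names):

```r
# Hedged sketch: heterogeneous treatment effect by gender
# Option 1: interaction between treatment and gender
hte_model <- lm(cough_day_7 ~ treat_covid_mapp * male, data = covid_data)
summary(hte_model)

# Option 2: estimate the treatment effect separately within each subgroup
summary(lm(cough_day_7 ~ treat_covid_mapp, data = subset(covid_data, male == 0)))  # females
summary(lm(cough_day_7 ~ treat_covid_mapp, data = subset(covid_data, male == 1)))  # males
```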
Which group, those coded as male == 0 or male == 1, has better health outcomes (cough) in control? What about in treatment? How does this help to contextualize any heterogeneous treatment effect that might have been estimated?
Stargazer is a popular R package that enables users to create well-formatted tables and reports for statistical analysis results.
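For example, the two temperature regressions discussed below could be displayed side by side with stargazer; this is a sketch under the same column-name assumptions:

```r
# install.packages("stargazer")   # if not already installed
library(stargazer)

# Hedged sketch: day 7 temperature regressed on treatment, separately by gender
temp_male   <- lm(temperature_day_7 ~ treat_covid_mapp, data = subset(covid_data, male == 1))
temp_female <- lm(temperature_day_7 ~ treat_covid_mapp, data = subset(covid_data, male == 0))

stargazer(temp_male, temp_female, type = "text",
          column.labels = c("Male", "Female"))
```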
Looking at this regression report, we see that males in control have a temperature of 102, while females in control have a temperature of about 98.5, which is very nearly normal. So, in control, males are worse off. In treatment, males have a temperature of 102 - 2.59 = 99.41; while this is closer to a normal temperature, it is still elevated. Females in treatment have a temperature of 98.5 - 0.32 = 98.18, slightly below normal. It appears that the treatment has a stronger effect among male participants than among females because males are more sick at baseline.
In conclusion, experimentation offers a fascinating and valuable avenue for primary research, allowing us to address causal questions and enhance our understanding of the world around us. Covariate control helps to isolate the causal effect of the treatment on the outcome variable, ensuring that the observed effect is not driven by confounding factors. Proper control of covariates enhances the internal validity of the study and ensures that the estimated treatment effect is an accurate representation of the true causal relationship. By exploring and accounting for subgroups in data, researchers can identify whether the treatment has different effects on different groups, such as men and women or younger and older individuals. This information can be critical for making informed policy decisions and developing targeted interventions that maximize the benefits for specific groups. The ongoing investigation of experimental methodologies and their potential applications represents a compelling and significant area of inquiry.
The fourth book in The SAGE Quantitative Research Kit, this resource covers the basics of designing and conducting experiments, outlining the various types of experimental designs available to researchers, while providing step-by-step guidance on how to conduct your own experiment. As well as an in-depth discussion of Randomised Controlled Trials (RCTs), this text highlights effective alternatives to this method and includes practical steps on how to successfully adopt them. Topics include:
- The advantages of randomisation
- How to avoid common design pitfalls that reduce the validity of experiments
- How to maintain controlled settings and pilot tests
- How to conduct quasi-experiments when RCTs are not an option
Practical and succinctly written, this book will give you the know-how and confidence needed to succeed on your quantitative research journey.
Introduction
- By: Barak Ariel , Matthew Bland & Alex Sutherland
- In: Experimental Designs
- Chapter DOI: https://doi.org/10.4135/9781529682779.n1
- Subject: Sociology , Criminology and Criminal Justice , Business and Management , Communication and Media Studies , Education , Psychology , Health , Social Work , Political Science and International Relations
Formal textbooks on experiments first surfaced more than a century ago, and thousands have emerged since then. In the field of education, William McCall published How to Experiment in Education in 1923; R.A. Fisher, a Cambridge scholar, released Statistical Methods for Research Workers and The Design of Experiments in 1925 and 1935, respectively; S.S. Stevens circulated his Handbook of Experimental Psychology in 1951. We also have D.T. Campbell and Stanley’s (1963) classic Experimental and Quasi-Experimental Designs for Research, and primers like Shadish et al.’s (2002) Experimental and Quasi-Experimental Designs for Generalised Causal Inference, which has been cited nearly 50,000 times. These foundational texts provide straightforward models for using experiments in causal research within the social sciences.
Fundamentally, this corpus of knowledge shares a common long-standing methodological theme: when researchers want to attribute causal inferences between interventions and outcomes, they need to conduct experiments. The basic model for demonstrating cause-and-effect relationships relies on a formal, scientific process of hypothesis testing, and this process is confirmed through the experimental design. One of these fundamental processes dictates that causal inference necessarily requires a comparison . A valid test of any intervention involves a situation through which the treated group (or units) can be compared – what is termed a counterfactual . Put another way, evidence of ‘successful treatment’ is always relative to a world in which the treatment was not given (D.T. Campbell, 1969). Whether the treatment group is compared to itself prior to the exposure to the intervention, or a separate group of cases unexposed to the intervention, or even just some predefined criterion (like a national average or median), contrast is needed. While others might disagree (e.g. Pearl, 2019), without an objective comparison, we cannot talk about causation.
Causation theories are found in different schools of thought (for discussions, see Cartwright & Hardie, 2012; Pearl, 2019; Wikström, 2010). The dominant causal framework is that of ‘potential outcomes’ (or the Neyman–Rubin causal framework; Rubin, 2005), which we discuss herein and which many of the designs and examples in this book use as their basis. Until mainstream experimental disciplines revise the core foundations of the standard scientific inquiry, one must be cautious when recommending public policy based on alternative research designs. Methodologies based on subjective or other schools of thought about what causality means will not be discussed in this book. To emphasise, we do not discount these methodologies and their contribution to research, not least for developing logical hypotheses about the causal relationships in the universe. We are, however, concerned about risks to the validity of these causal claims and how well they might stand a chance of being implemented in practice. We discuss these issues in more detail in Chapter 4 . For further reading, see Abell and Engel (2019) as well as Abend et al. (2013).
However, not all comparisons can be evaluated equally. For the inference that a policy or change was ‘effective’, researchers need to be sure that the comparison group that was not exposed to the intervention resembles the group that was exposed to the intervention as much as possible. If the treatment group and the no-treatment group are incomparable – not ‘apples to apples’ – it then becomes very difficult to ‘single out’ the treatment effect from pre-existing differences. That is, if two groups differ before an intervention starts, how can we be sure that it was the introduction of the intervention, and not the pre-existing differences, that produced the result?
To have confidence in the conclusions we draw from studies that look at the causal relationship between interventions and their outcomes means having only one attributable difference between treatment and no-treatment conditions: the treatment itself. Failing this requirement suggests that any observed difference between the treatment and no-treatment groups can be attributed to other explanations. Rival hypotheses (and evidence) can then falsify – or confound – the hypothesis about the causal relationship. In other words, if the two groups are not comparable at baseline, then it can be reasonably argued that the outcome was caused by inherent differences between the two groups of participants , by discrete settings in which data on the two groups were collected, or through diverse ways in which eligible cases were recruited into the groups. Collectively, these plausible yet alternative explanations to the observed outcome, other than the treatment effect, undermine the test. Therefore, a reasonable degree of ‘pre-experimental comparability’ between the two groups is needed, or else the claim of causality becomes speculative. We spend a considerable amount of attention on this issue throughout the book, as all experimenters share this fundamental concern regarding equivalence.
Experiments are then split into two distinct approaches to achieve pre-experimental comparability: statistical designs and randomisation. Both aim to facilitate equitable conditions between treatment and control conditions but achieve this goal differently. Statistical designs, often referred to as quasi-experimental methods, rely on statistical analysis to control and create equivalence between the two groups. For example, in a study on the effect of police presence on crime in particular neighbourhoods, researchers can compare the crime data in ‘treatment neighbourhoods’ before and after patrols were conducted, and then compare the results with data from ‘control neighbourhoods’ that were not exposed to the patrols (e.g. Kelling et al., 1974; Sherman & Weisburd, 1995). Noticeable differences in the before–after comparisons would then be attributed to the police patrols. However, if there are also observable differences between the neighbourhoods or the populations who live in the treatment and the no-treatment neighbourhoods, or the types of crimes that take place in these neighbourhoods, we can use statistical controls to ‘rebalance’ the groups – or at least account for the differences between groups arising from these other variables. Through statistically controlling for these other variables (e.g. Piza & O’Hara, 2014; R.G. Santos & Santos, 2015; see also The SAGE Quantitative Research Kit, Volume 7), scholars could then match patrol and no-patrol areas and take into account the confounding effect of these other factors. In doing so, researchers are explicitly or implicitly saying ‘this is as good as randomisation’. But what does that mean in practice?
While on the one hand, we have statistical designs, on the other, we have experiments that use randomisation, which relies on the mathematical foundations of probability theory (as discussed in The SAGE Quantitative Research Kit, Volume 3). Probability theory postulates that through the process of randomly assigning cases into treatment and no-treatment conditions, experimenters have the best shot of achieving pre-experimental comparability between the two groups. This is owing to the law of large numbers (or ‘logic of science’ according to Jaynes, 2003). Allocating units at random does, with a large enough sample, create balanced groups. As we illustrate in Chapter 2, this balance is not just apparent for observed variables (i.e. what we can measure) but also in terms of the unobserved factors that we cannot measure (cf. Cowen & Cartwright, 2019). For example, we can match treatment and comparison neighbourhoods in terms of crimes reported to the police before the intervention (patrols), and then create balance in terms of this variable (Saunders et al., 2015; see also Weisburd et al., 2018). However, we cannot create true balance between the two groups if we do not have data on unreported crimes, which may be very different in the two neighbourhoods.
We cannot use statistical controls where no data exist or where we do not measure something. The randomisation of units into treatment and control conditions largely mitigates this issue (Farrington, 2003a; Shadish et al., 2002; Weisburd, 2005). This quality makes, in the eyes of many, randomised experiments a superior approach to other designs when it comes to making causal claims (see the debates about ‘gold standard’ research in Saunders et al., 2016). Randomised experiments have what is called a high level of internal validity (see review in Grimshaw et al., 2000; Schweizer et al., 2016). What this means is that, when properly conducted, a randomised experiment gives one the greatest confidence levels that the effect(s) observed arose because of the cause (randomly) introduced by the experiment, and not due to something else.
The parallel phrase – external validity – means the extent to which the results from this experiment can apply elsewhere in the world. Lab-based randomised experiments typically have very high internal validity, but very low external validity, because their conditions are highly regulated and not replicable in a ‘real-world’ scenario. We review these issues in Chapter 3 .
Importantly, random allocation means that randomised experiments are prospective not retrospective – that is, testing forthcoming interventions, rather than ones that have already been administered where data have already been produced. Prospective studies allow researchers to maintain more control compared to retrospective studies. The researcher is involved in the very process of case selection, treatment fidelity (the extent to which a treatment is delivered or implemented as intended) and the data collated for the purposes of the experiment. Experimenters using random assignment are therefore involved in the distribution and management of units into different real-life conditions (e.g. police patrols) ex ante and not ex post. As the scholar collaborates with a treatment provider to jointly follow up on cases, and observe variations in the measures within the treatment and no-treatment conditions, they are in a much better position to provide assurance that the fidelity of the test is maintained throughout the process (Strang, 2012). These features rarely exist in quasi-experimental designs, but at the same time, randomised experiments require scientists to pay attention to maintaining the proper controls over the administration of the test. For this reason, running a randomised controlled trial (RCT) can be laborious.
In Chapter 5 , we cover an underutilised instrument – the experimental protocol – and illustrate the importance of conducting a pre-mortem analysis: designing and crafting the study before venturing out into the field. The experimental protocol requires the researcher to address ethical considerations: how we can secure the rights of the participants, while advancing scientific knowledge through interventions that might violate these rights. For example, in policing experiments where the participants are offenders or victims, they do not have the right to consent; the policing strategy applied in their case is predetermined, as offenders may be mandated by a court to attend a treatment for domestic violence. However, the allocation of the offenders into any specific treatment is conducted randomly (see Mills et al., 2019). Of course, if we know that a particular treatment yields better results than the comparison treatment (e.g. reduces rates of repeat offending compared to the rates of reoffending under control conditions), then there is no ethical justification for conducting the experiment. When we do not have evidence that supports the hypothesised benefit of the intervention, however, then it is unethical not to conduct an experiment. After all, the existing intervention for domestic batterers can cause backfiring effects and lead to more abuse. This is where experiments are useful: they provide evidence on relative utility, based on which we can make sound policy recommendations. Taking these points into consideration, the researcher has a duty to minimise these and other ethical risks as much as possible through a detailed plan that forms part of the research documentation portfolio.
Vitally, the decision to randomise must also then be followed with the question of which ‘units’ are the most appropriate for random allocation. This is not an easy question to answer because there are multiple options, thus the choice is not purely theoretical but a pragmatic query. The decision is shaped by the very nature of the field, settings and previous tests of the intervention. Some units are more suitable for addressing certain theoretical questions than others, so the size of the study matters, as well as the dosage of the treatment. Data availability and feasibility also determine these choices. Experimenters need to then consider a wide range of methods of actually conducting the random assignment, choosing between simple, ‘trickle flow’, block random assignment, cluster, stratification and other perhaps more nuanced and bespoke sequences of random allocation designs. We review each of these design options in Chapter 2.
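To make the distinction concrete, here is a brief illustrative sketch in R (not drawn from the book; the unit IDs and the blocking variable are hypothetical) contrasting simple random assignment with block random assignment:

```r
# Illustrative sketch: simple vs. block random assignment (hypothetical data)
set.seed(2024)
units <- data.frame(id = 1:120, site = rep(c("A", "B", "C"), each = 40))

# Simple (complete) random assignment: half of all units to treatment overall
units$simple_assign <- sample(rep(c("control", "treatment"), length.out = nrow(units)))

# Block random assignment: randomise separately within each site,
# guaranteeing an even split inside every block
units$block_assign <- NA_character_
for (s in unique(units$site)) {
  rows <- which(units$site == s)
  units$block_assign[rows] <- sample(rep(c("control", "treatment"), length.out = length(rows)))
}

table(units$site, units$block_assign)   # 20 control and 20 treatment per site
```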
We then discuss issues of control in some detail in Chapter 3. The mechanisms used to administer randomised experiments are broad, and the technical literature on these matters is rich. Issues of group imbalances, sample sizes and measurement considerations are all closely linked to an unbiased experiment. Considerations of these problems begin in the planning stage, with a pre-mortem assessment of the possible pitfalls that can lead the experimenter to lose control over the test (see Klein, 2011). Researchers need to be aware of threats to internal validity, as well as the external validity of the experimental tests, and find ways to avoid them during the experimental cycle. We turn to these concerns in Chapter 3 as well.
In Chapter 4 , we account for the different types of experimental designs available in the social sciences. Some are as ‘simple’ as following up with a group of participants after their exposure to a given treatment, having been randomly assigned into treatment and control conditions, while others are more elaborate, multistage and complex. The choice of applying one type of test and not another is both conceptual and pragmatic. We rely heavily on classic texts by D.T. Campbell and Stanley (1963), Cook and Campbell (1979) and the amalgamation of these works by Shadish et al. (2002), which detail the mechanics of experimental designs, in addition to their rationales and pitfalls. However, we provide more updated examples of experiments that have applied these designs within the social sciences. Many of our examples are criminological, given our backgrounds, but are applicable to other experimental disciplines.
Chapter 4 also provides some common types of quasi-experimental designs that can be used when the conditions are not conducive to random assignment (see Shadish et al., 2002, pp. 269–278). Admittedly, the stack of evidence in causal research largely comprises statistical techniques, including the regression discontinuity design, propensity score matching , difference-in-difference design, and many others. We introduce these approaches and refer the reader to the technical literature on how to estimate causal inference with these advanced statistics.
Before venturing further, we need to contextualise experiments in a wide range of study designs. Understanding the role that causal research has in science, and what differentiates it from other methodological approaches, is a critical first step. To be clear, we do not argue that experiments are ‘superior’ compared to other methods; put simply, the appropriate research design follows the research question and the research settings. The utility of experiments is found in their ability to allow researchers to test specific hypotheses about causal relationships. Scholars interested in longitudinal processes, qualitative internal dynamics (e.g. perceptions) or descriptive assessments of phenomena use observational designs. These designs are a good fit for these lines of scientific inquiries. Experiments – and within this category we include both quasi-experimental designs and RCTs of various types – are appropriate when making causal inferences.
Finally, we then defend the view that precisely the same arguments can be made by policymakers who are interested in evidence-based policy : experiments are needed for impact evaluations, preferably with a randomisation component of allocating cases into treatment(s) and tight controls over the implementation of the study design. We discuss these issues in the final chapter, when we speculate more about the link between experimental evidence and policy.
Contextualising randomised experiments in a wide range of causal designs
RCTs are (mostly) regarded as the ‘gold standard’ of impact evaluation research (Sherman et al., 1998). The primary reason for this affirmation is internal validity , which is the feature of a test that tells us that it measures what it claims to measure (Kelley, 1927, p. 14). Simply put, well-designed randomised experiments that are correctly executed have the highest possible internal validity to the extent that they enable the researcher to quantifiably demonstrate that a variation in a treatment (what we call changes in the ‘ independent variable ’) causes variation(s) in an outcome, or the ‘ dependent variable (s)’ (Cook & Campbell, 1979; Shadish et al., 2002). We will contextualise randomised experiments against other causal designs – this is more of a level playing field – but then illustrate that ‘basically, statistical control is not as good as experimental control’ (Farrington, 2003b, p. 219) and ‘design trumps analysis’ (Rubin, 2008, p. 808).
Another advantage of randomised experiments is that they account for what is called selection bias – that is, results derived from choices that have been made or selection processes that create differences – artefacts of selection rather than true differences between treatment groups. In non-randomised controlled designs, the treatment group is selected on the basis of its success, meaning that the treatment provider has an inherent interest to recruit members who would benefit from it. This is natural, as the interest of the treatment provider is to assist the participants with what they believe is an effective intervention. Usually, patients with the best prognosis are participants who express the most desire to improve their situation, or individuals who are the most motivated to successfully complete the intervention programme. As importantly, the participants themselves often chose if and how to take part in the treatment. They have to engage, follow the treatment protocol and report to a data collector. By implication, this selection ‘leaves behind’ individuals who do not share these qualities even if they come from the same cohort or have similar characteristics (e.g. criminal history, educational background or sets of relevant skills). In doing so, the treatment provider gives an unfair edge to the treatment group over the comparison group: they are, by definition of this process, more likely to excel. 1
The bias can come in the allocation process. Treatment providers might choose those who are more motivated, or who they think will be successful. Particularly if the selection process is not well documented, it is unsurprising that the effect size (the magnitude of the difference between treatment and control groups following the intervention) is larger than in studies in which the allocation of the cases into treatment and control conditions is conducted impartially. Only under these latter conditions can we say that the treatment has an equal opportunity to ‘succeed’ or ‘fail’. Moreover, under ideal scenarios, even the researchers would be unaware of whom they are allocating to treatment and control conditions, thus ‘blinding’ them from intentionally or unintentionally allocating participants into one or the other group (see Day & Altman, 2000). In a ‘blinded’ random distribution, the fairest allocation is maintained. Selection bias is more difficult to avoid in non-randomised designs. In fact, matching procedures in field settings have led at least one synthesis of evidence (on the backfiring effect of participating in Alcoholics Anonymous programmes) to conclude that ‘selection biases compromised all quasi-experiments ’ (Kownacki & Shadish, 1999).
Randomised experiments can also address the specification error encountered in observational models (see Heckman, 1979). This error term refers to the impossibility of including all – if not most – of the relevant factors affecting the dependent variable studied. Random assignment of ‘one condition to half of a large population by a formula that makes it equally likely that each subject will receive one treatment or another’ generates comparable distribution in each of the two groups of factors ‘that could affect results’ (Sherman, 2003, p. 11). Therefore, the most effective way to study crime and crime-related policy is to intervene in a way that will permit the researcher to make a valid assessment of the intervention effect. A decision-making process that relies on randomised experiments will result in more precise and reliable answers to questions about what works for policy and practice decision-makers.
In light of these (and other) advantages of randomised experiments, it might be expected that they would be widely used to investigate the causes of offending and the effectiveness of interventions designed to reduce offending. However, this is not the case. Randomised experiments in criminology and criminal justice are relatively uncommon (Ariel, 2009; Farrington, 1983; Weisburd, 2000; Weisburd et al., 1993; see more recently Dezember et al., 2020; Neyroud, 2017), at least when compared to other disciplines, such as psychology, education, engineering or medicine. We will return to this scarcity later on; however, for now we return to David Farrington:
The history of the use of randomised experiments in criminology consists of feast and famine periods . . . in a desert of nonrandomised research. (Farrington, 2003b, p. 219)
We illustrate more thoroughly why this is the case and emphasise why and how we should see more of these designs – especially given criminologists’ focus on ‘what works’ (Sherman et al., 1998), and the very fact that efficacy and utility are best tested using experimental rather than non-experimental designs. Thus, in Chapter 6 , we will also continue to emphasise that not all studies in criminal justice research can, or should, follow the randomised experiments route. When embarking on an impact evaluation study, researchers should choose the most fitting and cost-effective approach to answering the research question. This dilemma is less concerned with the substantive area of research – although it may serve as a good starting point to reflect on past experiences – and more concerned with the ways in which such a dilemma can be answered empirically and structurally.
Causal designs and the scientific meaning of causality
Causality in science means something quite specific, and scholars are usually in agreement about three minimal preconditions for declaring that a causal relationship exists between cause(s) and effect(s):
- 1. That there is a correlation between the two variables.
- 2. That there is a temporal sequence, whereby the assumed cause precedes the effect.
- 3. That there are no alternative explanations.
Beyond these criteria, which date back as far as the 18th-century philosopher David Hume, others have since added the requirement (4) for a causal mechanism to be explicated (Congdon et al., 2017; Hedström, 2005); however, more crucially in the context of policy evaluation, there has to be some way of manipulating the cause (for a more elaborate discussion, see Lewis, 1974; and the premier collection of papers on causality edited by Beebee et al., 2009). As clearly laid out by Wikström (2008),
If we cannot manipulate the putative cause/s and observe the effect/s, we are stuck with analysing patterns of association (correlation) between our hypothesised causes and effects. The question is then whether we can establish causation (causal dependencies) by analysing patterns of association with statistical methods. The simple answer to this question is most likely to be a disappointing ‘no’. (p. 128)
Holland (1986) has the strictest version of this idea, which is often paraphrased as ‘no causation without manipulation’. That in turn has spawned numerous debates on the manipulability of causes being a prerequisite for causal explanation. As Pearl (2010) argues, however, causal explanation is a different endeavour.
Taking the three prerequisites for determining causality into account, it immediately becomes clear why observational studies are not in a position to prove causality. For example, Tankebe’s (2009) research on legitimacy is valuable for indicating the relative role of procedural justice in affecting the community’s sense of police legitimacy. However, this type of research cannot firmly place procedural justice as a causal antecedent to legitimacy because the chronological ordering of the two variables is difficult to lay out within the constraints of a cross-sectional survey.
Similarly, one-group longitudinal studies have shown significant (and negative) correlations between age and criminal behaviour (Farrington, 1986; Hirschi & Gottfredson, 1983; Sweeten et al., 2013). 2 In this design, one group of participants is followed over a period of time to illustrate how criminal behaviour fluctuates across different age brackets. The asymmetrical, bell-shaped age–crime curve illustrates that the proportion of individuals who offend increases through adolescence, peaks around the ages of 17 to 19, and then declines in the early 20s (Loeber & Farrington, 2014). For example, scholars can study a cohort of several hundred juvenile delinquents released from a particular institution between the 1960s and today, and learn when they committed offences to assess whether they exhibit the same age–crime curve. However, there is no attempt to compare their behaviour to any other group of participants. While we can show there is a link between the age of the offender and the number of crimes they committed over a life course, we cannot argue that age causes crime. Age ‘masks’ the causal factors that are associated with these age brackets (e.g. peer influence, bio-socio-psychological factors, strain). Thus, this line of observational research can firmly illustrate the temporal sequence of crime over time, but it cannot sufficiently rule out alternative explanations (outside of the age factor) to the link between age and crime (Gottfredson & Hirschi, 1987). Thus, we ought to be careful in concluding causality from observational studies. 3
Even in more complicated, group-based trajectory analyses, establishing causality is tricky. These designs are integral to showing how certain clusters of cases or offenders change over time (Haviland et al., 2007). For instance, they can convincingly illustrate how people are clustered based on the frequency or severity of their offending over time. They may also use available data to control for various factors, like ethnicity or other socio-economic factors. However, as we discussed earlier, they suffer from the specification error (see Heckman, 1979): there may be more variables that explain crime better than the grouping criterion (e.g. resilience, social bonds and internal control mechanisms, to name a few), which often go unrecorded and therefore cannot be controlled for in the statistical model.
Why should governments and agencies care about causal designs?
Criminology, especially policing research, is an applied science (Bottoms & Tonry, 2013). It therefore offers a case study of a long-standing discipline that directly connects academics and experimentalists with treatment providers and policymakers. This is where evidence-based practice comes into play: when practitioners use scientific evidence to guide policy and practice. Therefore, our field provides insight for others in the social sciences who may aspire towards more robust empirical foundations for applying tested strategies in real-life conditions.
Admittedly, RCTs remain a small percentage of studies in many fields, including criminology (Ariel, 2011; Dezember et al., 2020). However, educators, psychologists and nurses do not always follow the most rigorous research evidence when interacting with members of the public (Brants-Sabo & Ariel, 2020). Even physicians suffer from the same issues, though to a lesser extent (Grol, 2001). So while there is generally wide agreement that governmental branches should ground their decisions (at least in part) on the best data available, or, at the very least, evidence that supports a policy (Weisburd, 2003), there is still more work to be done before the symbiotic relationship between research and industry – that is, between science and practice – matures similarly to its development in the field of medicine.
Some change, at least in criminology, has been occurring in more recent years (see Farrington & Welsh, 2013). Governmental agencies that are responsible for upholding the law rely more and more on research evidence to shape public policies, rather than experience alone. When deciding to implement interventions that ‘work’, there is a growing interest in evidence produced through rigorous studies, with a focus on RCTs rather than on other research designs. In many situations, policies have been advocated on the basis of ideology, pseudo-scientific methodologies and general conditions of ineffectiveness. In other words, such policies were simply not evidence-based approaches, ones that are not established on systematic observations (Welsh & Farrington, 2001).
Consequently, we have seen a move towards more systematic evaluations of crime-control practices in particular, and public policies in general, imbuing these with a scientific research base. This change is part of a more general movement in other disciplines, such as education (Davies, 1999; Fitz-Gibbon, 1999; Handelsman et al., 2004), psychology (among many others, see Webley et al., 2001), economics (Alm, 1991) and medicine. As an example, the Cochrane Library has approximately 2000 evidence-based medical and healthcare studies, and is considered the best singular source of such studies. This much-needed vogue in crime prevention policy began attracting attention some 15 years ago due to either ‘growing pragmatism or pressures for accountability on how public funds are spent’ (Petrosino et al., 2001, p. 16). Whatever the reason, evidence-based crime policy is characterised by ‘feast and famine periods’ as Farrington puts it, which are influenced by either key individuals (Farrington, 2003b) or structural and cultural factors (Shepherd, 2003). ‘An evidence-based approach’, it was said, ‘requires that the results of rigorous evaluation be rationally integrated into decisions about interventions by policymakers and practitioners alike’ (Petrosino, 2000, p. 635). Otherwise, we face the peril of implementing evidence-misled policies (Sherman, 2001, 2009).
The aforementioned suggests that there is actually a moral imperative for conducting randomised controlled experiments in field settings (see Welsh & Farrington, 2012). This responsibility is rooted in researchers’ obligation to rely on empirical and compelling evidence when setting practices, policies and various treatments in crime and criminal justice (Weisburd, 2000, 2003). For example, the Campbell Collaboration Crime and Justice Group, a global network of practitioners, researchers and policymakers in the field of criminology, was established to ‘prepare systematic reviews of high-quality research on the effects of criminological intervention’ (Farrington & Petrosino, 2001, pp. 39–42). Moreover, other local attempts have provided policymakers with experimental results as well (Braithwaite & Makkai, 1994; Dittmann, 2004; R.D. Schwartz & Orleans, 1967; Weisburd & Eck, 2004). In sum, randomised experimental studies are considered one of the better ways to assess intervention effectiveness in criminology as part of an overall evidence-led policy imperative in public services (Feder & Boruch, 2000; Weisburd & Taxman, 2000; Welsh & Farrington, 2001; however cf. Nagin & Sampson, 2019).
Chapter Summary
- What is meant by employing an experiment as the research method? What are randomised controlled trials (RCTs), and how do they differ from other kinds of controlled experiments that seek to produce causal estimates? Why is randomisation considered by many to be the ‘gold standard’ of evaluation research? What are the components of the R–C–T (random–control–trial), in pragmatic terms? (A minimal illustration of these components appears after this summary.) This book highlights the importance of experiments, and of randomisation in particular, for evaluation research, as well as the controls needed to produce valid causal estimates of treatment effects.
- We review the primary experimental designs that can be used to test the effectiveness of interventions in the social and health sciences, using illustrations from our own field: criminology. This introductory chapter summarises these concepts and lays out the roadmap for the rest of the book.
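To make the R–C–T components concrete, the sketch below is our own illustration (not an example taken from the chapter or from any cited study). It simulates random assignment in Python: hypothetical units are assigned to treatment or control by coin flip, outcomes are generated under an assumed treatment effect, and the simple difference in means recovers that effect. Every number is an assumption chosen purely for illustration.

```python
# Minimal sketch of an RCT's components: Random assignment, a Control group,
# and a Trial in which outcomes are observed. Purely illustrative; all values
# are hypothetical assumptions, not data from any study cited in this chapter.
import random

random.seed(42)

# Hypothetical population of 1,000 units, each with a baseline outcome level.
population = [{"baseline": random.gauss(10, 3)} for _ in range(1000)]
TRUE_EFFECT = -2.0  # assumed effect of the intervention (e.g., fewer incidents)

# R is for Random: assign each unit to treatment or control by coin flip.
for unit in population:
    unit["treated"] = random.random() < 0.5

# T is for Trial: observe outcomes under the assigned condition, with noise.
for unit in population:
    noise = random.gauss(0, 1)
    unit["outcome"] = unit["baseline"] + (TRUE_EFFECT if unit["treated"] else 0.0) + noise

# C is for Control: the untreated group supplies the counterfactual benchmark.
treated = [u["outcome"] for u in population if u["treated"]]
control = [u["outcome"] for u in population if not u["treated"]]

# Because assignment was random, baseline differences between groups are due
# to chance alone, so the simple difference in means estimates the true effect.
estimate = sum(treated) / len(treated) - sum(control) / len(control)
print(f"Estimated treatment effect: {estimate:.2f} (true effect: {TRUE_EFFECT})")
```

As the sample grows, chance imbalances between the two randomly formed groups shrink, which is why the naive difference in means serves here as a valid causal estimate of the treatment effect.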
Further Reading
Ariel, B. (2018). Not all evidence is created equal: On the importance of matching research questions with research methods in evidence-based policing. In R. Mitchell & L. Huey (Eds.), Evidence-based policing: An introduction (pp. 63–86). Policy Press.
This chapter situates causal designs within the wider landscape of research methods. It lays out the terrain and provides guidance on selecting the most appropriate research method for different types of research questions.
Sherman, L. W. (1998). Evidence-based policing. The Police Foundation.
Sherman, L. W. (2013). The rise of evidence-based policing: Targeting, testing, and tracking. Crime and Justice, 42(1), 377–451.
Evidence-based policy, or the use of scientific evidence to implement guidelines and evaluate interventions, has gained traction in different fields. In criminology, the scholar who has made the most profound contribution to ‘evidence-based policing’ is Professor Lawrence Sherman. On this topic, these two equally important papers should be consulted: Sherman (1998) systematically introduces a paradigm for evidence-based policing, and Sherman (2013) lays out the composition of evidence-based policing under the ‘triple-T’ strategy: targeting, testing and tracking.
1 Notably, however, researchers often resort to quasi-experimental designs when policies have been rolled out without regard to evaluation, and the fact that some cases were ‘creamed in’ is not necessarily born of an attempt to cheat. Often, interventions are simply put in place with the primary motivation of helping those who would benefit most from the treatment. This means that we should not discount quasi-experimental designs, but rather accept their conclusions with the necessary caveats.
2 We note the distinction between different longitudinal designs that are often incorrectly referred to as a single type of research methodology. We discuss these in Chapter 4.
3 On the question of causality, see Cartwright (2004), but also see the excellent reply in Casini (2012).