1 The Campbell score

This lecture presents a way to evaluate program evaluation studies by looking at the major validity threats that can be encountered. These threats were identified and listed by Donald T. Campbell in his article “Factors relevant to the validity of experiments in social settings” (full citation: Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54(4), 297-312).

We will talk about them as the “Campbell Score” and discuss a way to identify them in a study and assign the study a “score” that summarizes how good it addresses validity threats.

There are 10 threats to validity that can be grouped into four main groups.

The first group includes omitted variable bias:

Selection / Omitted Variables
Non-Random Attrition

These two items are intimately linked with omitted variable bias in program evaluation studies. Since this is the most common and most problematic issue we worry about, rigorous evaluations need to demonstrate that this problem has been addressed in order to establish a baseline of internal validity. Since most observational studies will be significantly affected by selection and attribution problems, these first two items have a **guilty until proven innocent*"** criteria. In other words, the study must provide evidence or solid reasoning beyond simple speculation to show it addresses these threats.

The remaining 3 groups are listed below. The subsequent items are potential causes of concern for the internal validity of a study. Even if selection has been addressed, these other 8 things can impact our ability to generate valid causal inferences from a study, but they are less common so you cannot simply assume they will be a problem. You need to make a reasonable argument that the problem might exist in the study based upon data and evidence that is present, or sound logic and reasoning beyond speculation.

Trends in the Data

Maturation
Secular Trends
Testing
Seasonality
Regression to the Mean

Study Calibration

Measurement Error
Time-Frame of Study

Contamination Factors

Intervening Events

The Campbell Scores help you establish metrics for the quality of evidence provided in the study. Your job is to make a strong case. Use the definitions of Campbell Score items provided and evidence that is presented in the case studies to make your arguments.

We discuss each threat in more detail below.

2 Selection / Omitted Variables

If people have a choice to enroll in a program, those that enroll will be different than those that do not. This is a source of omitted variable bias.

The Fix: Randomization

How you evaluate it: Authors report enough evidence and support to show that (1) they have properly randomized the study sample and (2) randomization was successful (i.e., “happy” randomization - this might include reporting t-test results based on the Bonferroni correction or regression analysis).

2.1 Example 1: Lack of randomization

Workplace Wellness Programs Don’t Work Well. Why Some Studies Show Otherwise

Several studies have looked at employer-sponsored wellness programs to understand whether they are effective in improving employees’ health and wellbeing. Studies “look at programs in a company and compare people who participate with those who don’t. When those who participate do better, we tend to think that wellness programs are associated with better outcomes.”

Would you believe such findings? [Answer in the code]

# Since participants self-selected into participating, we would expect that results are biased. As the author says: "The most common concern with such studies is that those who participate are different from those who don't in ways unrelated to the program itself. Maybe those people participating were already healthier. Maybe they were richer, or didn't drink too much, or were younger. All of these things could bias the study in some way."

These are the results of a study conducted by a group of researchers who utilized RCT to study the same question:

Figure 2.1: Results from the study

They report:

“If we look only at the intervention group as an observational trial, it appears that people who didn’t make use of the program went to the campus gym 3.8 days per year, and those who participated in it went 7.4 times per year. Based on that, the program appears to be a success. But when the intervention group is compared with the control group as a randomized controlled trial, the differences disappear. Those in the control group went 5.9 times per year, and those in the intervention group went 5.8 times per year.”

Which internal validity threat could explain the difference between the two study?

# Selection bias might explain the significant findings in the observational trial. Individuals who are more likely to go the campus gym are also more likely to participate into the wellness program. Because of this, we observe a statistically significant difference between participant and non-participants. When the study randomly assigned participants to the program, we observe no difference.

More information on this study can be found here while the full article is available here
You can read another example here about laptop use during class time

2.2 Example 2: Happy randomization

Simple Changes to Job Ads Can Help Recruit More Police Officers of Color

This study was conducted by Elizabeth Linos to understand “How can the public sector attract more and different candidates without reducing the quality of applicants, with a specific focus on the police?” She argues that: “If we need new people to join [the police], we need to capture the motivations of those individuals who could be great in public service but aren’t currently applying.” (HBR, 2018)."

The study was designed in the following way:

“We selected almost 10,000 people at random between the ages of 18 and 40 and sent them postcards encouraging them to apply to the police department. We wanted to see whether a postcard could entice people to apply in greater numbers, compared with a randomly selected control group that received no such postcard. Importantly, we also created four variations of the postcard that emphasized a different motivation to join the police, to see what would be most effective at getting more applications through the door.” (in total 5 random groups were created) […] “The results were striking. The traditional service message was no more motivating at encouraging people to apply than not receiving a postcard at all (the control). In contrast, the challenge and career messages more than tripled the likelihood that someone would apply. For people of color and women, the impact was even larger. People of color who saw the challenge message were four times as likely to apply to the police.”

What would you look for to check whether a selection bias is affecting the results?

#A selection bias would occur if the randomization has failed and the 5 groups were different according to some key characteristics. To check this, we need to perform a t-test comparing each group again all the others for each variable that could affect our randomization (e.g., age, race, gender...). An alternative would be to estimate a regression model to predict the likelihood to be assigned in our group or the other.

Dr. Linos provides evidence against a selection bias by reporting the following table 2.2:

“The table below presents regressions to determine if pre-treatment characteristics predict whether someone received a treatment and if so, which treatment they received. It appears that none of the pre-treatment characteristics are predictive of treatment assignment, and therefore we should be comfortable that the randomization was successful” (Linos, 2018, p.74)

Figure 2.2: Happy randomization

Based on the table, can you conclude that the randomization was happy? What evidence do you have to support your argument?
Is there a selection bias in the study?

# As none of the predictor is significant at 0.05 level, we can conclude that the randomization was successful and selection bias is not affecting the results.

#Selection is here innocent; the author has provided evidence against its culpability.

2.3 Example 3: Is there a bias?

Two hours a week is key dose of nature for health and wellbeing

Spending at least two hours a week in nature may be a crucial threshold for promoting health and wellbeing, according to a new large-scale study.

Research led by the University of Exeter, published in Scientific Reports and funded by NIHR, found that people who spend at least 120 minutes in nature a week are significantly more likely to report good health and higher psychological wellbeing than those who don’t visit nature at all during an average week. However, no such benefits were found for people who visited natural settings such as town parks, woodlands, country parks and beaches for less than 120 minutes a week.

The study used data from nearly 20,000 people in England and found that it didn’t matter whether the 120 minutes was achieved in a single visit or over several shorter visits. It also found the 120 minute threshold applied to both men and women, to older and younger adults, across different occupational and ethnic groups, among those living in both rich and poor areas, and even among people with long term illnesses or disabilities.

Dr Mat White, of the University of Exeter Medical School, who led the study, said: “It’s well known that getting outdoors in nature can be good for people’s health and wellbeing but until now we’ve not been able to say how much is enough. The majority of nature visits in this research took place within just two miles of home so even visiting local urban greenspaces seems to be a good thing. Two hours a week is hopefully a realistic target for many people, especially given that it can be spread over an entire week to get the benefit.”

There is growing evidence that merely living in a greener neighborhood can be good for health, for instance by reducing air pollution. The data for the current research came from Natural England’s Monitor of Engagement with the Natural Environment Survey, the world’s largest study collecting data on people’s weekly contact with the natural world.

Co-author of the research, Professor Terry Hartig of Uppsala University in Sweden said: “There are many reasons why spending time in nature may be good for health and wellbeing, including getting perspective on life circumstances, reducing stress, and enjoying quality time with friends and family. The current findings offer valuable support to health practitioners in making recommendations about spending time in nature to promote basic health and wellbeing, similar to guidelines for weekly physical.”

Do you think that selection bias is an issue in this study?
How can it affect the results?

# Yes, selection bias can be an issue in this study as participation was not randomized.

# It can be that individuals who practice more sports are also more likely to spend time in the nature (e.g., they go running, jogging or hiking). This is very likely as the author says that "the majority of nature visits in this research took place within just two miles of home" - it could be that people were just going there to undertaken physical activity. If that is true, it might be that there is no actual difference between individuals who spend more or less than 120 minutes in the green.

3 Non-Random Attrition

If the people that leave a program or study are different than those that stay, the calculation of effects will be biased (e.g., we might under- or over-estimate the treatment effect).

The Fix: Examine characteristics of those that stay versus those that leave. Use the ITT (intention to treat) effect which minimize nonrandom attrition bias.

How you evaluate it: The study should report information on the number of individuals who left the study and any statistical difference from those who did not. The study should provide the ITT effect.

3.1 Example 1: Attrition bias

Cardiac patients who completed a longitudinal psychosocial study had a different clinical and psychosocial baseline profile than patients who dropped out prematurely

This medical study examines the link between “psychological factors (e.g. anxiety, depression, and the distressed (Type D) personality (i.e., the combination of negative affectivity and social inhibition traits))1 and prognosis in coronary artery disease”.

They discuss:

"We studied a cohort of 1132 consecutive patients undergoing percutaneous coronary intervention (PCI). At baseline, all patients completed the Hospital Anxiety and Depression Scale (HADS) and the Type D Scale (DS14). At 12 months follow-up, 70.8% (n=802) of patients completed both questionnaires, while 29.2% (n=330) dropped out. We observed significant differences in socio-demographic, clinical, and psychological baseline characteristics between completers and drop-outs.

Drop-outs were younger, more likely to smoke [.] as compared with completers. Drop-outs more often had depression, anxiety, and negative affectivity, as compared with completers (all p-values <0.05)."

Is there an attrition bias in this study? Report evidence supporting your argument.

 #Yes, there is an attrition bias in the study. The authors report to find significant differences between the drop-outs and the completers at 0.05 level. Drop-outs are younger, more likely to smoke, have depression and anxiety.

Let’s imagine that the researchers found a negative relationship between depression and prognosis (i.e., individuals who are more depressed has a worst prognosis) Given the differences between drop-outs and completers, how would you expect the estimated effect to be biased? In other words, will they under-estimate or over-estimate the effect?

# If depression is associated with a worst prognosis, the attrition bias might attenuate the effect of depression on prognosis and we will be underestimated its effect. Since drop-outs are more likely to have depression and anxiety, attrition has removed cases of individuals who are highly depressed and have worst prognosis. If those individuals were included, we might have found a stronger effect of depression on individual prognosis.

3.2 Example 2: Charter schools

The Evidence on Charter Schools and Test Scores

“A 2010 report by researchers from Mathematica Policy Research presented the findings from a randomized controlled trial of 36 charter middle schools in 15 states (Gleason et al., 2010). They found that the vast majority of students in these charters did no better and no worse than their counterparts in regular public schools in terms of both math and reading scores, as well as virtually all the 35 other outcomes studied. There was, however, important underlying variation - e.g., results were more positive for students who stayed in the charters for multiple years, and those who started out with lower scores (as mentioned above, CREDO reached the same conclusions).”

Can you highlight a potential attrition threats that emerge from this paragraph?
Discuss how it could impact the results of the study.

#A key concern in the paragraph is the possibility that some students might have left charter schools before the end of the study. The author says that "results were more positive for students who stayed in the charters for multiple years". It might be that the study suffers from attrition bias as students who left the charter schools earlier might have been the ones with lower grades (e.g., they were asked to leave). If the attrition is nonrandom, the estimates might be biased.

3.3 Example 3: Helping smokers give up tobacco

E-cigs ‘twice as effective’ than nicotine patches, gum or sprays for quitting

Source: Hajek P, Phillips-Waller A, Przulj D, et al. A Randomized Trial of E-Cigarettes versus Nicotine-Replacement Therapy, New England Journal of Medicine. Published online January 30 2019

“E-cigarettes are almost twice as effective at helping smokers give up tobacco than other alternatives such as nicotine patches or gum,” Sky News reports.

"E-cigarettes deliver a vaporised dose of nicotine, the addictive substance in tobacco. They don’t involve burning tobacco, which causes much of the health damage from smoking cigarettes. However, there’s been controversy about the safety of e-cigarettes and a lack of research about how effective they are in helping people to stop smoking.

Researchers carried out a trial with 886 smokers who sought help through NHS stop smoking services. People were randomly assigned to either nicotine replacement therapy (NRT) products (products such as patches or gums that can deliver a dose of nicotine) or e-cigarettes, plus one-to-one support for at least 4 weeks. After a year, 18% of e-cigarette users had stopped smoking tobacco, compared to 9.9% of NRT users."

In the original study, the authors report the following figure to describe the development of the study:

Figure 3.1: Study development

Looking at the figure, can we observe any attrition at any point of the study?
Can you determine if the attrition was random or nonrandom?

#Yes, the study started with 439 subjects in the e-cigarettes group and 447 in the NRT products group; by the follow-up test at 12 months the two groups had only 356 and 342 subjects respectively.

#No, the information provided do not allow us to determine if attrition is random or nonrandom. To determine if attrition is nonrandom, we need to compare the characteristics of the two groups (those who stayed and those who left) with a series of t-test to check for significant differences.

The authors state that “People who did not respond to follow-up calls or attend for carbon monoxide tests were classed as still smoking.”

Based on this decision, is the effect detected by the authors the Intention to Treat effect or the Treatment to the Treated effect?
Would have you made the same decision? Specify why.

# The estimated effect is the Intention to Treat effect. 

# Yes, the ITT effect is generally recommended to minimize the effects of nonrandom attrition.

4 Maturation

Occurs when growth is expected naturally, such as increase in cognitive ability of children because of natural development independent of program effects.

The fix: Use of a control group to control for maturation effects; reduce time between the pre-test and post-test measurement to reduce maturation effect; collect several data points at the beginning to assess what the individual trend looks like.

How to evaluate it: Look at whether the outcome could naturally change over time (e.g., skills, brain development). If so, does the author discuss the issue? Is the time interval between the pre- and post-test measurement short enough not to be affected by ‘natural’ growth?

4.1 Examples 1: Kid’s screen time

Is Screen Time Bad for Kids’ Brains?

“On Sunday evening, CBS’s”60 Minutes" reported on early results from the A.B.C.D. Study (for Adolescent Brain Cognitive Development), a $300 million project financed by the National Institutes of Health. The study aims to reveal how brain development is affected by a range of experiences, including substance use, concussions, and screen time."

“[.]As part of an expos? on screen time,”60 Minutes" reported that heavy screen use was associated with lower scores on some aptitude tests, and to accelerated “cortical thinning” - a natural process - in some children. But the data is preliminary, and it’s unclear whether the effects are lasting or even meaningful."

Moreover:

“[…] Individual variation is the rule in brain development. The size of specific brain regions such as the prefrontal cortex, the rate at which those regions edit and consolidate their networks, and the variations in these parameters from person to person make it very difficult to interpret findings. To address such obstacles, scientists need huge numbers of research subjects and a far better understanding of the brain.”

Which causes for maturation biases are discussed in these paragraphs? Report evidence from the text.
How could researchers address maturation bias?

# Maturation bias might be caused by the natural development of the brain. The authors say that potential sources of maturation bias are the "cortical thinning" as well as changes in the "prefrontal cortex, the rate at which those regions edit and consolidate their networks and variations in these parameters from person to person". 

# Researchers could collect data on these sources of maturation bias some time before and after the intervention so as they would be able to control for individual trends.

4.2 Example 2: Youth curfew and crimes

Juvenile Curfew Effects on Criminal Behavior and Victimization

This research conducts a review of several studies examining the relationship between youth curfew and criminal behavior. “Curfews restrict youth below a certain - usually 17 or 18 - from public places during nighttime. For example, the Prince George’s County, Maryland, curfew ordinance restricts youth younger than 17 from public places between 10 P.M. and 5 A.M. on weekdays and between midnight and 5 A.M. on weekends. Sanctions range from a fine that increases with each offense, community service, and restrictions on a youth’s driver’s license. Close to three quarters of US cities have curfews, which are also used in Iceland”

“[.] A juvenile curfew has common sense appeal: keep youth at home during the late night and early morning hours and you will prevent them from committing a crime or being a victim of a crime. In addition, the potential for fines or other sanctions deter youth from being out in a public place during curfew hours.” Yet they identify threats to validity in previous studies. In particular:

“[.] The most serious issue across these studies was the possibility of a maturational bias, with seven of the twelve studies judged as at risk. This assessment was based on the inadequate number of data points over time. All three of the studies categorized as short interrupted time-series designs and both of the pre-post designs had too few baseline observations to adequately assess whether any decrease in crime associated with the start of the curfew was part of an existing change over time. In addition, two of the longer interrupted time-series designs were also judged as at risk of a maturation bias. These two studies (Fivella, 2000; Mazerolle et al., 1999) used monthly data and had only 26 and 23 months of data respectively, which was only one year before and one year after the start of the curfew. Only four studies were judged to have a time-series of sufficient length to adequately control for the potential bias: Cole (2003), Kline (2012), McDowall et al. (2000), and Roman et al. (2003). Cole (2003) and Roman et al. (2003) used monthly data with multiple years of baseline data, whereas Kline (2012) and McDowall et al. (2000) used yearly data across multiple cities.”

Why did the authors of the review believe that these studies are affected by maturation bias?
What should have previous studies done to correct such bias?
Let’s imagine a negative maturation bias, whereas teens tend to decrease the number of crimes they commit over time. Researchers find a positive effect of the curfew. Based on this information, can you imagine how maturation bias might affect results in these studies?

# They fail to account for maturation bias because they did not include enough data points to assess whether changes before and after the implementation of the curfew were caused by individual trends.

# They should have collected more observations before and after the curfew to assess the individual trend.

# If the number of crimes was naturally decreasing, researchers might have overestimate the effect of the curfew.

#Secular Trends

Very similar to maturation, except the trend in the data is caused by a global process outside of individuals, such as economic or cultural trends.

The fix: Control for macro-level trends.

How to evaluate it: authors should recognize and control for social, cultural or economic trends if they are likely to affect their outcomes (e.g., economic cycle if looking at economic outcomes, crime trends if looking at detention rates or criminal behaviors).

##Example 1: Youth curfew and crimes

Juvenile Curfew Effects on Criminal Behavior and Victimization

In the previous section, we discuss how maturation might affect studies linking youth curfew with criminal behavior. The authors of the review also notice that:

“Contrary to popular belief, the evidence suggests that juvenile curfews do not produce the expected benefits. The study designs used in this research make it difficult to draw clear conclusions, so more research is needed to replicate the findings. However, many of the biases likely to occur in existing studies would make it more, rather than less, likely that we would conclude curfews are effective. For example, most of these studies were conducted during a time when crime was dropping throughout the United States. Therefore, our findings suggest that either curfews don’t have any effect on crime, or the effect is too small to be identified in the research available.”

What is the secular trend identified in this paragraph?
How did the authors of the review believe that the secular trend bias has affected the results of the research?
How is the secular trend bias different from the previously discussed maturation bias?

#The secular trend identified in the paragraph is the decrease in crime throughout the United States.

#Since crime was decreasing across the US, the research might have overestimated the effect of the curfew. In other words, research has found either a greater effect of curfew on crimes or an effect whereas the curfew is actually ineffective. 

#The secular trend bias is a macro-level trend (e.g., crime rates are declining across the US). The maturation bias is at the individual level: teens tend to behave differently as time passes (e.g. they might commit fewer crimes).

4.3 Example 2: Biased effect of cultural trends

Couples who share housework more likely to divorce, silly study finds.

"In what appears to be a slap in the face for gender equality, the report found the divorce rate among couples who shared housework equally was around 50 per cent higher than among those where the woman did most of the work.

“What we’ve seen is that sharing equal responsibility for work in the home doesn’t necessarily contribute to contentment,” said Thomas Hansen, coauthor of the study entitled “Equality in the Home”. The lack of correlation between equality at home and quality of life was surprising, the researcher said. “One would think that break-ups would occur more often in families with less equality at home, but our statistics show the opposite,” he said. The figures clearly show that “the more a man does in the home, the higher the divorce rate,” he went on.

How could a secular trend influences the result of this study?

#A possible explanation is that couples who equally share housework are also more progressive and open to divorce than traditional couples. It might be that the authors is capturing a cultural change, whereas couples who do not share housework are more traditional and less inclined to accept divorce as an option. The study has no control group to observe whether couples in the same age cohort that split or do not split housework equally are more or less likely to divorce.

4.4 Example 2: How might secular trends affect research?

Secular trends and smoke-free policy development in rural Kentucky

A randomized control trial was designed to evaluate smoke-free policy (i.e., rules that prohibit smoking in indoor spaces and designated public area) in Kentucky and their effectiveness in reducing smoking. The study is described as follows:

“A 5-year community-based randomized control trial (RCT) to test the effects of a stage-based, tailored intervention [smoke-free policy] on policy outcomes [smoking behavior] is currently in progress in rural Kentucky. While building capacity, resources and tobacco control efforts at the community level can help advocates advance tobacco-free policy [11], policy outcomes may not be solely attributed to tobacco control interventions. Community events unrelated to smoke-free policy may impact the outcomes of community-based interventions and threaten internal validity of RCTs. For example, a rural community may receive a large grant to bolster the tobacco farming infrastructure.”

Researchers are concerned about secular trends, among which:

Creation of new healthcare infrastructure
Substance abuse - increase incidence of drug use and decrease of treatment programs
Economic regression

Taking into account this information:

Describe how each of the above mentioned secular trends might affect the validity of the study. Would we underestimate or overestimate the effect of the policy?
Propose another secular trend that might impact the validity of the study

# The creation of new healthcare infrastructures might provide better access to information regarding the consequences of smoking, which reduce smoking behavior - i.e., we would overestimate the effect of the policy or find a significant effect even if the policy was unsuccessful. 

#The increase use of other substances might increase individual propensity to smoke. Smoking-free policies might be less effective because of a substance-increase trend - i.e., we would observe a small or null effect even if the policy is effective.

#Economic regression is also associated with an increase in smoking related to stress or depression. An economic regression might substantially reduce the effectiveness of smoking-free policies - i.e., we would observe a small or null effect even if the policy is effective.

#Another secular trend could be a general improvement in wellbeing and health-related behavior. If individuals start to care more about their wellbeing and health, they might reduce smoking independently of any smoking-free policies. In this case, we might observe a significant effect of the policy even if it is not effective in reality.

#Testing

When the same group is exposed repeatedly to the same set of questions or task they can improve independently of any treatment.

The Fix: This problem only applies to a small set of programs. Change tests, use post-test only designs, or use a control group that receives the test.

How to evaluate it: Has the researcher asked the participants to take the same test more than once? Are pre- and post-question the same to measure cognitive abilities?

##Example 1: Family History

Providing Patient Education: Impact on Quantity and Quality of Family Health History Collection

In this study, researchers want to “evaluate the influence of patient education on the quantity and quality of Family Health History (FHH) entered into a primary care-based software program [.] One to two weeks prior to their well visit appointment, participants entered their FHH into the program. Participants were then provided educational materials describing key FHH components. They were instructed to use the interval to collect additional FHH information (NOTE: this could include more information about relatives’ causes of death, diseases.).”

“Patients then returned for their scheduled appointment and updated their FHH with any new information. [.] Post patient education, [the information provided] exhibited a greater percentage of: deceased relatives with age at death (84 vs. 81 % p = 0.02), deceased relatives with cause of death (91 vs. 87 % p = 0.02), relatives with a named health condition (45 vs. 42 %p = 0.002), and a greater percentage of relatives with high quality records (91 vs. 89 % p = 0.02). [.] Patient education improves FHH collection and subsequent risk stratification utilized in providing actionable evidence-based care recommendations for cancer risk management”

Reading the paragraph below, could you identify why researchers should be concern by testing bias?

#Since the family health history was collected using a program's software, improvements in the family history in the second round might be due to an improvement in the skills needs to use the software.

You can read here how the researchers solved the issue. If you have doubts that there might be multiple testing bias in the study, authors should report something similar:

“To investigate potential repeated testing bias (participants may perform better on second MeTree [NOTE: the software used to collect information] session due only to familiarity with MeTree) associated with two or more MeTree sessions, we examined the change in number of siblings reported before and after patient education for each pedigree [NOTE: each family member]. We hypothesized this number should not change as a function of education and would only change as a function of application familiarity. We found no changes in number of siblings for any pedigree between pre and post MeTree sessions.”

Explain in your own words why this is a solution to testing bias.

#Researchers hypothesize that individuals know about siblings and they will not report them only if they don't know how (i.e., they don't know how to use the software). Therefore, they expect that the number of reported siblings would increase pre and post the intervention if the multiple testing bias occurs. They found no evidence in support.

#Regression to the Mean

Every time period that you observe an outcome, during the next time period the outcome naturally has a higher probability of being closer to the mean than it does of staying the same or being more extreme. As a result, quality improvement programs for low-performing units often have a built-in improvement bias regardless of program effects.

The Fix: Take care not to select a study group from the top or bottom of the distribution in a single time period (only high or low performers).

How to evaluate it: Was the study performed on a group of individuals that were scoring particularly high or low outcomes?

Other resources: https://fs.blog/2015/07/regression-to-the-mean/

4.5 Example 1: Test of psychiatric treatment

The Impact of Regression to the Mean in Psychiatric Drug Studies

“Hengartner provides a real example of a clinical trial of repetitive transcranial magnetic stimulation (rTMS) for veterans with depression. The participants selected for this trial had undergone several trials of medication and did not improve, so they were considered”treatment resistant." Many of the veterans also had diagnoses of PTSD, substance abuse, and were deemed to be suicidal.

Since these veterans had already experienced several trials of medication without improvement, they had already experienced the potential for the placebo effect (receiving treatment intended to remedy their symptoms)-and it had not worked for any of them. Thus, it could be expected that the placebo response rate in this study would be very low, close to zero.

Instead, after less than two weeks, an astonishing 37% remitted after fake rTMS treatment. A slightly higher 41% remitted after the real treatment (note that this difference was not statistically significant, meaning that the actual treatment was no better than placebo). That is, 37% had remission of symptoms, not just improvement, meaning they improved so much they no longer had the diagnosed disorder.

Now, this is an incredibly high number. According to this study, 37% of people could be cured of depression, PTSD, suicidality, and substance abuse, after having tried drugs with no improvement, all by using a fake treatment, in less than two weeks. The researchers of that study, according to Hengartner, pin that improvement on the placebo effect-attention from medical staff and the belief that this treatment would work."

Do you trust the results?
What alternative explanation could justify the observed results?

#Results could be biased and not trustful as the sample includes individuals who had "extreme" scores in depression, PTSD, suicidality and substance abuse.

#A possible explanation is a regression to the mean. The regression to the mean phenomenon suggests that we should have expected an improvement in individuals' score at one point in time *independently* from the treatment they were going to receive. Even without doing anything, we would have probably observed those changes. This is because very high (or very low) individual scores tend to go back to the mean at one point.

4.6 Example 2: Flight instructor

N.F.L. Week 8 Game Probabilities (Regression to the Mean)

"Daniel Kahneman is a behavioral economist who witnessed a similar phenomenon firsthand. In the late 1960s, Kahneman was a consultant for the Israeli Air Force. He lectured instructor pilots on the latest research that showed that reward was far more effective than punishment for improving trainee performance. The instructor pilots were not buying it.

They told Kahneman: “When student pilots have a bad flight, we yell and scream at them, and the next day they tend to do better. But when they have a good flight, we’ll praise them like you suggest, and they tend to do worse.”

It was then that Kahneman realized how natural variation in performance, and its natural regression to the mean, were fooling the flight instructors into believing that it was their yelling and screaming that improved the student pilots’ performance."

Can you explain Kahneman’s realization?
What did the instructor see and what did he see as the causal mechanisms explaining the performance improvement?

#Kahneman suggest that a student pilot who just had a bad flight would naturally improve in the second flight as a result of the regression to the mean. Vice versa when a pilot had a very good flight and was praised, he will have a worst flight to go back to his natural mean. 

#These effects would have been observed even without the intervention (yelling and screaming or praising the pilot).

##Example 3: Football performance

N.F.L. Week 8 Game Probabilities (Regression to the Mean)

“With the Redskins’ struggles this season, Washington fans are screaming for the head of Jim Zorn. John Fox, Dick Jauron, Wade Phillips, Lovie Smith and even Jeff Fisher have also been mentioned as coaches on the hot seat. Whoever gets the boot, one thing is certain: Teams will mistakenly attribute improvements in performance to the replacement of their coaches.”

Why is the author convinced that the team will improve?
Why does he believe that the improvements will be mistakenly attributed to the new coach?

# As the team is struggling in this season, the author believes that it will improve in the next season because of the "regression to the mean phenomenon" which suggests that individuals who have very high or very low scores will naturally go back to a mean level. 

#The improvement will be mistakenly attributed to the new coach because the team would have improved anyhow - with that coach or another one - because of the regression to the mean.

4.7 Example 4: Honey as a cure for cold

Honey ‘as good as antiviral creams’ for cold sores

"“Honey is ‘just as effective at treating cold sores as anti-viral creams’,” the Mail Online reports.

Cold sores are skin infections around the mouth caused by the herpes simplex virus (HSV). You catch the virus through direct skin contact with another person who has the virus.

Once you have it, HSV lies dormant in the nerve cells and can reactivate at another time, which is why some people get recurrent cold sores, particularly when they’re run down. A common treatment for cold sores is an antiviral cream called aciclovir.

A new study randomised nearly 1,000 adults with HSV to apply either aciclovir cream or “medical grade” New Zealand kanuka honey to the skin. There was no significant difference in the time taken for the sore to heal: 8 days with aciclovir and 9 days with honey.

The results do not show that honey was better than antivirals, it just seemed to work as well."

Do you trust these results?
Is that possible that we are observing a regression to the mean?

#Results are not entirely trustful as there were no control group where "nothing" was taken to cure the cold.

#It is possible that we are observing a regression to the mean effect as the cold would have been cured within 9 days even in the absence of treatment

#Seasonality

Data with seasonal trends or other cycles will have natural highs and lows.

The Fix: Only compare observations from the same time period, or average observations over an entire year (or cycle period). Collect data over multiple periods of time

How to evaluate it: Were data collected over or referred to a specific period of time (e.g., holidays, seasons, festivals..)? Are there seasonal factors that might affect the outcome levels (e.g., stress levels in summer vs the fall semester vs final weeks)?

4.8 Example 1: Seasonality in question.

Measuring Time Use in Development Settings

Research aims to decrease poverty in developing countries is increasingly recognizing the importance of “time” as a resource for individuals. Measuring time is important to understand topics such unpaid household work among women and labor division within families.

Researchers at the International Food Policy Research Institute discuss several methods to collect data and seasonality biases that might be related to them:

“Time diaries. Research questions that require information on broad categories of activities such as market and non-market work, or questions around the intrahousehold division of labor, are typically collected using a 24-recall time diary [NOTE: participants are requested to annotated each activity they undertake over a period of 24 hours] . The time diary provides consistency in the time data by forcing a full accounting of time and uses a 24-hour recall to minimize recall bias. However, using 24-hour recall increases seasonality bias, unless surveys are repeated multiple times in a year. Even within seasons, one cannot assume that the previous day was”typical," and so additional questions are usually recommended in order to distinguish between patterns of time allocation that are out of the ordinary (e.g., holidays, festivals, etc.)"

Why do the authors argue that 24-recall time diary might be subject to seasonality biases? Imagine that you are asking individuals to annotated each activity they undertake on Thanksgiving and on March 23rd.
How do they propose to the fix the bias?

#Seasonality bias might occur if we are collecting data that vary by season. In this case, if we were to collect data on Christmas we would obtain very different information than collecting data on a random day of the year, such as March 23rd. 

#They propose to collect data at multiple points in time so that researchers would be able to identify unusual patterns connected to holidays or other specific periods of the year. 

#A good example of seasonal variation are farming activities or summer vs school year for students.

#Measurement Error

If there is significant measurement error in the dependent variables, it will bias the effects towards zero and make programs look less effective.

The Fix: Use better measures of dependent variables.

How to evaluate it: Do researchers report evidence of the validity and reliability of their measurement? Do the survey items seem appropriate to measure the concept?

4.9 Example 1: Measuring depression

Brief depression questionnaires could lead to unnecessary antidepressant prescriptions

A group of researchers wanted to test whether questionnaire measurement was overestimating depression rate.

“The exploratory study included 595 patients of primary care offices affiliated with Kaiser Permanente in Sacramento, San Francisco VA Medical Center, Sutter Medical Group in Sacramento, UC Davis, UC San Francisco and VA Northern California Healthcare System.”

“[.] Based on a review of medical records, the patients were divided into two groups: those who were asked during their doctors’ office visits to complete brief depression symptom questionnaires, besides the one administered by the researchers, and those who were not. The groups were compared in terms of rates of depression diagnoses and prescriptions for antidepressants received from their physicians.”

“Of the 545 patients who did not complete brief depression questionnaires during their doctors’ office visits, 10.5 percent were diagnosed with depression and 3.8 percent were prescribed antidepressants.”

“Of the 50 patients who completed brief depression questionnaires during their doctors’ office visits, 20 percent were diagnosed with depression and 12 percent were prescribed antidepressants.”

“Jerant said the study highlights the need for research to determine the best ways to apply brief depression questionnaires in daily practice, as use of the screeners tripled the likelihood that patients in the study who were not apt to be depressed would receive depression treatment.”

According to the authors, what is the direction of the measurement error of depression? In other words, do we under- or over-estimate depression?
Imagine you are conducting a study where depression is your outcome variable. How would this measurement error bias your results?
Can you indicate one reason why the survey measurement might be failing?

# According to the authors, questionnaires tend to overestimate depression. 

#If depression is your dependent variable, then you might obtain biased results who tend towards zero - e.g., the program will be found ineffective even if it is effective. 

#There are might self-reporting concerns, whereas individuals tend to overestimate their depressive status; it might that the scale is not valid (e.g., it is not actually measuring depression) or unreliable (e.g., including some items that do not fit together).

#Time-Frame of Study

If the study is not long enough it make look like the program had no impact when in fact it did. If the study is too long then attrition becomes a problem.

The Fix: Use prior knowledge or research from the study domain to pick an appropriate study period.

How to evaluate it: Has the intervention last enough? Is there enough time for the treatment to be effective? Has too much time passed so that the effect of the treatment have diminished / vanished?

##Example 3: Time frame - Two studies

Bigger portions lead to preschoolers eating more over time

[.] Alissa Smethers, a doctoral student in nutritional sciences, said the findings – recently published in the American Journal of Clinical Nutrition – suggest that caregivers should pay close attention to not just the amount of food they serve but also the variety of food.

[.] Smethers said that while it was known that adults are likely to eat more when served larger portions of food over time, it was thought by some researchers that young children can sense how many calories from food they need and adjust their eating habits accordingly, a process called “self-regulation.”

Previous studies have tested this theory by looking at children’s eating habits at one meal or over a single day. But Smethers said it may take longer – up to three to four days – for self-regulation to kick in, and so she and the other researchers wanted to study the portion size effect in children across a full five days.

The researchers recruited 46 children between the ages of three and five from childcare centers at the University Park campus for the five-day study. All meals and snacks were provided for the children, who during one five-day period received baseline-sized portions – based on Child and Adult Care Food Program requirements – and during another period had portions that were increased in size by 50 percent.

[.] During both five-day periods, the children were allowed to eat as much or as little of their meals or snacks as they wanted. After the children were done eating, the leftover foods were weighed to measure how much each child consumed. Additionally, each child wore an accelerometer throughout each five-day period to measure their activity levels, and the researchers measured their height and weight.

After analyzing the data, the researchers found that serving larger portions led to the children eating 16 percent more food than when served the smaller portions, leading to an extra 18 percent of calories.

“If preschoolers did have the ability to self-regulate their calorie intake, they should have sensed that they were getting extra over the five days and started eating less,” Rolls said. “But we didn’t see any evidence of that.”

According to the article, is the study time frame critical in studies on portion size and food intake?
What were the results of the first studies after only one meal?
What was Dr. Smethers initial hypothesis?
Did she find confirmation after the trial?

# Yes, according to the article the study time frame is critical to studies on portion size and food intake.

#After only one meal, students who received larger portion size were more likely to eat more than when they receive a smaller portion. 

#Dr. Smethers hypothesized that there might be self-regulating effect in the long term: after receiving larger portion size over a period of 4-5 days, children might start to regulate their food intake.

#We expand the timeframe of the study to include 5 meals. Results provided further confirmation that children eat more when they receive larger portion sizes. She found no confirmation for her initial hypothesis.

##Example 2: Time frame - Significant results

Mindful body awareness training during treatment for drug addiction helps prevent relapse

A novel type of body awareness training helps women recover from drug addiction, according to new research from the University of Washington. People in the study made marked improvement, and many improvements lasted for a year.

It’s the first time the mindfulness approach has been studied in a large randomized trial as an adjunct treatment. The training helps people better understand the physical and emotional signals in their body and how they can respond to these to help them better regulate and engage in self-care.

[.] The training included one-on-one coaching in an outpatient setting, in addition to the substance use disorder treatment the women were already receiving. The intervention is called Mindful Awareness in Body-oriented Therapy (MABT) and combines manual, mindfulness and psycho-educational approaches to teach interoceptive awareness and related self-care skills. Interoceptive awareness is the ability to access and process sensory information from the body.

Researchers studied 187 women at three Seattle-area locations. The cohort, all women in treatment for substance use disorder (SUD), was split into three relatively equal groups. Every group continued with their regular SUD treatment. One group received SUD treatment only (TAU), another group was taught the mindfulness technique in addition to treatment (MABT), and the third group received a women’s education curriculum (WHE) in addition to treatment in order to test whether the additional time and attention explained any positive study outcomes.

Women were tested at the beginning, and at three, six and 12 months on a number of factors including substance use, distress craving, emotion regulation (self-report and psychophysiology), mindfulness skills and interceptive awareness. There were lasting improvements in these areas for those who received the MABT intervention, but not for the other two study groups. “Those who received MABT relapsed less,” Price said. “By learning to attend to their bodies, they learned important skills for better self-care.”

In support to their findings, the researchers provide the following table 4.1:

Figure 4.1: Study results

Look at the results of the table, particularly the last three columns. Significant differences are indicated with a *. Does the MABT group report better outcomes in all study period? What about the other groups? Does the study time frame matter for the results?

# The MABT group reports better outcomes only after 6 and 12 months and only compared to the TAU group.

# The WHE has better results after 3 and 6 months and only compared to the TAU group. 

# Yes, the time frame matters. Depending on when the outcome variable was measured, results are different. In the short term, the WHE is more effective compare to the TAU group but in the long term the only significant difference is between the MABT and the TAU group.

4.10 Example 3: Time frame - Outcome measurements

Dietary advice and self-weighing may help avoid Christmas weight gain

This research aims to understand if giving people basic dietary advice could help them avoid the weight gain that usually follow Christmas festivities.

"Researchers recruited people via schools (parents, not children), workplaces and on social media in Birmingham. The study first ran in 2016 and was repeated in 2017. Adults with a body mass index (BMI) of at least 20 (meaning they weren’t underweight) were weighed in November or December, before Christmas. Half were randomly assigned to receive brief lifestyle advice, including 10 tips on managing weight at Christmas. Participants in this group were also advised to weigh themselves ideally every day, but at least twice a week, and were given information on the activity-equivalent of foods commonly eaten at Christmas.

The other half were instead given a leaflet on healthy lifestyles.

In January or February, people were weighed again. Researchers looked to see which group had lost or gained most weight since the baseline, adjusting the figures to take account of people’s starting weight and whether they’d taken part in commercial weight loss programs.

People in the weight-control intervention group lost on average 0.13kg - a small amount, but important given that most people put on weight at Christmas. Those who received the leaflet (the control group) gained an average of 0.37kg. The adjusted difference in weight was an average -0.49kg between the groups (95% confidence interval [CI] -0.85 to -0.13). People who had the weight control advice were more likely to report thinking about how much they ate and restricting their eating than those in the control group, and to weigh themselves more often.

Consider the timeframe of the research as reported in the article stating that people were weighted in January OR February. How could it affect the study results? Does the timeframe seem appropriate?
Can you draw conclusion about the long-term effect of the study?

#The time frame could affect the study results. It might that individuals who were weighted in February have already lost part of their weight compared to individuals weighted in January. Depending on when they are weighted, measurement might be significantly different. 

#Given the time frame, we can only draw short-term conclusion about weight gain.

#Intervening Events

Has something happened during the study that affects one of the groups (treatment or control) but not the other?

The Fix: If there is an intervening event, it may be hard to remove the effects from the study.

How to evaluate it: Consider if the authors note of any event that occured during the study.

5 Test your knowledge!

5.1 Exercise 1

The Project on Devolution and Urban Change Assessing the Impact of Welfare Reform on Urban Communities: The Urban Change Project and Methodological Considerations

The Urban Change project aims to explore the effect of welfare policies (e.g., AFDC, TANF, Food Stamps) on individual outcomes such as, welfare and employment and well-being. Before conducting the study, researchers identify several sources of bias as reported below. Which biases do they refer to?

Any welfare recipient will, over time, leave the rolls as her youngest child reaches the age of 18.
Enhancement of the local JOBS (Job Opportunities and Basic Skills Training) program or a state-mandated reduction in welfare grants, unrelated to the new TANF legislation
A city’s rate of welfare receipt may gradually increase as middle-class households move away to suburbs outside the city limits
At the time of application, people are likely to have reached a natural low in their earnings and other resources, so called “pre-program dip”

#Maturation bias: over time, an individual's situation might naturally change and the social benefits might no longer apply.

#Intervening events: new training programs might be launched that improve the individual economic situation unrelated to the programs under evaluation. 

#Secular trend: There might be changes in the city composition as effect of time. 

#Regression to the mean: People who apply to those benefits are at a lower in their life.

5.2 Exercise 2

(Do Drug Courts Work? An Outcome Evaluation of a Promising Program)[ https://www-jstor-org.ezproxy1.lib.asu.edu/stable/43481406]

“We estimated multivariate statistical models to determine whether drug court participation effectively reduced the likelihood of recidivism for participating offenders compared to non-drug court participants, while controlling for covariates and both selection and maturation bias.”

“[.] bias occurs when the outcome naturally evolves or changes over time irrespective of the presence of an independent variable of interest such as drug court participation. Arrests sometimes occur at the peak of an individual’s criminal career since engaging in many criminal acts increases the likelihood of arrest. The arrest, and any subsequent intervention, may in turn serve as a catalyst for change [bias 1], or the offender may return to more conventional (i.e., less criminal) behavior simple because they are coming down from their peak of offending [bias 2], and the latter may have happened in the absence of arrest or subsequent intervention. Either will lead to observed lower rates of arrest following the instant arrest.”

Which bias is identified in this paragraph?

#Regression to the mean: An arrest is more likely to occur when an individual is engaging in several criminal acts. A decrease might be a natural regression to the mean.

5.3 Exercise 3

Do ‘heavily processed’ foods increase the risk of an early death?

“Study links heavily processed foods to risk of earlier death,” reports The Guardian.

Researchers reported that middle-aged French people who ate 10% more so-called “ultra-processed” food had a slightly increased chance of dying over a 7-year period compared with those who ate less.

The researchers describe ultra-processed food as “food products that contain multiple ingredients that are manufactured through a multitude of industrial processes”. They give examples as including “mass produced and packaged snacks, sugary drinks, breads, confectioneries, ready-made meals and processed meats”. While some of these foods may be unhealthy, it seems unhelpful to group together nutrient-free sugary drinks and ready-made vegetable soups, for example.

As one dietitian points out: “Bread or biscuits baked at home would not be considered ultra-processed, whereby shop bought versions would, despite identical ingredients.”

[.] The researchers who carried out the study were from Sorbonne Paris Cit? university and H?pital Avicenne, both in France. [.] Researchers used data from the ongoing NutriNet-Sant? Study of 44,551 French adults, which began in 2009. Volunteers aged 45 or older completed a series of online questionnaires about their health, socioeconomic status, family history, lifestyle and other information. They filled in at least 3 24-hour dietary records during an average 7 years of follow-up until 2017.

Researchers used the questionnaires to calculate the proportion by weight of total food intake categorised as ultra-processed. After adjusting their figures to take account of a range of potentially confounding factors, they calculated the link between the proportion of ultra-processed food in the diet and the chances of having died during the follow-up period.

[.] During the 7 years of follow-up, there were 602 deaths (1.4% of the people who started the study). The researchers say 219 were caused by cancer and 34 by cardiovascular disease, but did not report causes of death for the other 349, so we do not know whether they could have been related to diet.

[.] People who ate more ultra-processed food were likely to be younger, on a lower income, have a lower education level, be living alone, have a higher BMI, and do less physical activity. They were also less likely to adhere closely to the French nutritional recommendations.

The researchers calculated that each additional 10% increase in proportion of ultra-processed food in the diet (by weight) was linked to a 14% increased risk of death (hazard ratio [HR] 1.14, 95% confidence interval [CI] 1.04 to 1.27).

But when they excluded deaths in the first 2 years of the study and people who had cancer or cardiovascular disease at the start of the study, the association was no longer statistically significant - it could have been down to chance.

[.] It’s quite difficult to unpick any useful messages in this study because of its many limitations. The main limitations are:

an unclear definition of ultra-processed food, which may not be a particularly helpful term as it bunches together very different foods based on how they were made, rather than what’s in them
the observational nature of the study, which means it cannot show cause and effect
the self-selecting volunteer population, which is likely to represent people particularly interested in nutrition and health and not the general population
the fact people could choose which 24-hour period to record their diet, which may mean they were more likely to record a healthy eating day than an unhealthy day

Because so many different types of food are included in the “ultra-processed” category, it’s impossible to tell which foods might have contributed to the small increased risk in deaths among the people taking part in the study. [.]

Can you identify the biases that are discussed at the points 1, 3, and 4?

# Measurement: individuals might not be aware of what ultra-processed food is if the study did not provide a clear definition

# Selection: participants in the study are volunteers. 

# Seasonality: they might recall a particular day rather than a "normal" one.

Campbell scores