A gold standard that does not always glitter
While the Randomised Controlled Trial remains a helpful tool for behavioural scientists, is it time to rethink its 'gold standard' status?
This is a story about the ‘gold standard’ of behavioural science – Randomised Controlled Trials (RCTs) – and how we think the behavioural science industry can challenge itself to be more creative in evaluation and testing.
We illustrate this with the topic of letters. Letters are something behavioural scientists are often tasked to evaluate and optimise, looking for ways to improve them so that response rates rise and people take up offers, respond to requests for information and so on. Different ways of designing letters can be tested: changing letter headings, rephrasing the call to action, selecting a new font and so on. There are, of course, many possible ‘interventions’ to change behaviour.
With a range of ideas in hand, RCTs are a frequent choice for testing these interventions prior to launch, either in a controlled environment or as part of a field trial. This creates a means of trialling a number of different designs and assessing the degree to which each results in behaviour change. It may involve showing people different versions and asking about their propensity to respond, or field trials in which behaviours such as payments made, forms returned or, for a digital letter, click-through rates are tracked. We can then look at the differences between the options and decide which to activate.
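For a digital letter, for instance, the evaluation often reduces to comparing click-through rates between versions. A minimal sketch of that comparison is below; the click and recipient figures are invented for illustration rather than taken from any real trial.

```python
# A minimal sketch of the comparison behind a simple two-version letter trial.
# All figures are invented for illustration.
from statsmodels.stats.proportion import proportions_ztest

clicks = [420, 465]        # recipients who clicked through, version A vs version B
recipients = [5000, 5000]  # letters sent per version

z_stat, p_value = proportions_ztest(count=clicks, nobs=recipients)

rate_a = clicks[0] / recipients[0]
rate_b = clicks[1] / recipients[1]
print(f"Version A: {rate_a:.1%}, Version B: {rate_b:.1%}")
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```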
In this instance we certainly consider RCTs to have their place, but we think there is a case for asking what other testing tools we need, as RCTs do come with their limitations. We first set out our take on these limitations, then make the case for some fresh thinking in testing and evaluation, before returning to the case of letter testing.
Effect size
There is a long tradition in academic work of controlling the range of different variables in the environment so that we can focus on the one or two variables that are of interest. This reflects the way in which academia best works: incremental additions to a body of knowledge which allow us to assess and develop our theoretical understanding of an issue. In this context, the point of an experimental design is what it tells us about the underlying theory and causal relationships.
Practitioners, however, have a different focus and a fundamentally different orientation to both theory and the variables being worked with. Sometimes the practitioner will, of course, be interested in theoretical advancement as it can be used to inform the design of interventions. But typically the practitioner is most interested in the degree of impact on outcomes that the proposed intervention will have.
With this, the purpose of testing is different for the practitioner. While a statistically significant difference on a very small movement may be interesting to the academic because it advances the body of knowledge, the practitioner has to ask whether the intervention is actually worth implementing. Activating any intervention has costs, which may mean it is not worth the investment if the movement is small, even when that movement is statistically significant. This means we are interested in the degree to which the intervention changes the outcome – in other words, the effect size.
And yet there is often little, if any, evaluation of this – which, we might speculate, is a function of the focus being on statistically significant differences rather than on effect size. This is not an inevitable outcome of using RCTs, but the way they are structured and have historically been used encourages this orientation.
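To illustrate the distinction with invented numbers: at a large enough sample size, a very small lift will comfortably clear the threshold for statistical significance, yet a back-of-the-envelope cost check of the kind sketched below may still show it is not worth activating. The response value and rollout cost here are assumptions made purely for the example.

```python
# Illustrative only: a statistically significant result that may not be worth acting on.
# Sample sizes, rates, response value and cost are all invented for the example.
from statsmodels.stats.proportion import proportions_ztest

clicks = [4000, 4300]            # responses: control letter vs redesigned letter
recipients = [100_000, 100_000]

z_stat, p_value = proportions_ztest(clicks, recipients)
lift = clicks[1] / recipients[1] - clicks[0] / recipients[0]   # absolute effect size

extra_responses = lift * recipients[1]   # roughly 300 additional responses here
value_per_response = 5.0                 # assumed value of each extra response (£)
rollout_cost = 2_500.0                   # assumed cost of redesign and rollout (£)

print(f"p = {p_value:.4f}, lift = {lift:.2%}")   # significant, but the lift is small
print(f"Estimated net value: £{extra_responses * value_per_response - rollout_cost:,.0f}")
```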
Evaluative bias
If we have to think upfront about how we will test something within the strictures of an RCT, there is a danger of selecting interventions and outcomes for which this is easier. Of course, there is a longstanding notion that ‘what gets measured gets managed’, but this needs to be balanced against the fact that what is measured may not always be what is most important (or, as the quote often attributed to Einstein has it, ‘not everything that counts can be counted’).
An RCT design typically looks at one intervention at a time, which pushes us towards what Lisa Feldman Barrett calls a ‘mechanistic mindset’: the notion that one or two variables shape behaviour. We suspect that most practitioners agree with her assessment that behaviours are in fact shaped by a broad number of weak, interacting factors. This means that more than one or two interventions are usually needed (indeed, we typically suggest a programme), working in combination with each other and tackling different behavioural mechanisms.
RCTs are typically designed to accommodate one intervention at a time – and although they can be made more sophisticated, there are still limits. And while one intervention can of course address more than one of the factors driving behaviour, we will often need multiple interventions, particularly for complex behaviours. In other words, there is a danger that the evaluation method of an RCT inadvertently narrows the way we consider designing a behaviour change programme.
Contextual factors
A key point of an experimental design is to control for contextual factors, in order to establish the causal relationship between the intervention and the outcomes. But as practitioners we surely do this at our peril. If a range of factors is at play in shaping behaviour, then controlling them away risks missing the way that some of these factors (such as age, gender, household income, ethnicity) are actually very important in shaping behaviour.
As such, the RCT method emphasises a focus on the key manipulations, assuming they will be relevant across different groups. The replication issues faced by psychology are perhaps a warning of the danger here – arguably the result of ignoring ‘contextual factors’ which then change over time. We agree with Dilip Soman and Nina Mažar when they say:
Rather than narratives along the lines of “A causes B,” it would be helpful for our leaders to highlight narratives such as “A causes B in some conditions and C in others.”
Measurement of behaviours
Not unreasonably, the outcome variable is often considered to be the behaviour of interest. At face value this should be the acid test – have we achieved a change in behaviour? And again, while an RCT does not require this, it is often accepted practice.
But there are challenges with this. First, to what extent is it reasonable to expect an intervention, in one fell swoop, to deliver the desired behaviour? As we have set out elsewhere, change is often a process rather than a moment-in-time act. It often requires getting the attention of the audience, helping them make sense of what is asked of them and then supporting enactment of the behaviour. This raises the question of what the outcome variable should be. We cannot ignore the temporal dimension – if we see behaviour change as a process, rather than something that shifts the moment an intervention lands, then testing at a single point in time does not help.
Second, measuring actual behaviour can be problematic. Take a simple example: testing the extent to which a new app influences people’s walking behaviour. There are all sorts of complexities here around deploying the technology, participant adherence and analysis. While not insurmountable, these sorts of projects can require a huge logistical infrastructure to accomplish measurement of debatable value.
And a final key point: measuring behaviour as the result of an intervention assumes a simple pathway for behaviour change, when in fact there is usually a cyclical pattern, with people trying out behaviours and getting feedback from the behaviour itself as part of the change process. Behaviour change is often iterative rather than direct, and often not linear.
An alternative perspective
We suspect that enthusiasm for RCTs can reflect a view of behaviour that is mechanistic, assuming that if we can change one or two factors in the environment then we can change behaviour. And perhaps there is even an implied Stimulus-Response model of behaviour change (rather than it being a more organic and indirect process). While this approach is understandable given the nature of much behavioural science literature, it is certainly one that is widely challenged as a representation of much behaviour.
Of course, as we have highlighted previously here and here, RCTs are far from the only tool available for testing and evaluation. There are many other approaches we can, and indeed in our view should, be using. We propose an approach that we shall call the MVTA (multi-variable temporal assessment).
The multi-variable element means that:
It goes beyond a singular outcome measure: The target behaviours (and adjacent ones, to capture spill-over, backfire effects and so on) are measured, but importantly so are the behavioural dimensions that we know are responsible for shaping behaviour (e.g., identity, emotion, capability). These can be drawn from a behaviour change model such as MAPPS or COM-B.
Multiple (and contextual) factors are part of the design: This approach also allows for a potentially large sample size so that we can capture and model the effect of different contextual factors on the outcomes. On this basis, context is part of the design rather than the noise in the data.
More varied interventions can be considered: We are also able to evaluate the impact of multiple interventions operating across different aspects of the problem, allowing us to assess combinations designed to address the multiple factors shaping behaviour.
And the temporal-assessment means that:
The passage of time receives focus: Measurement takes place over time, as a staged approach allows us to reflect the fact that behaviour change has a temporal element.
The temporal nature of the intervention can be considered: We are also able to test different versions of the interventions over this period, capturing the way in which changes relate to the outcomes. This set-up also means we can include an agile element, changing course as the effectiveness (or otherwise) of the interventions becomes apparent.
Overall, this approach calls for longitudinal / tracking work – where possible, integrated with measurement of behaviours. While this approach is of course not without its own challenges, we consider it offers an alternative model to the RCT, one that more directly addresses the information needs of the behavioural science practitioner, policymaker and marketer. A sketch of the kind of analysis this implies is given below.
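To make this more concrete, here is a minimal sketch of the kind of longitudinal model an MVTA might point towards: repeated measurement waves per person, more than one intervention, contextual factors in the model rather than controlled away, and time as an explicit term. The variable names (letter_redesign, reminder_call, age_band) and the data are hypothetical, and this is an illustration of the shape of the analysis rather than a prescriptive design.

```python
# Hypothetical sketch: a mixed-effects model over several measurement waves,
# with two interventions, a contextual factor and time all in the design.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_people, n_waves = 400, 4

df = pd.DataFrame({
    "person": np.repeat(np.arange(n_people), n_waves),
    "wave": np.tile(np.arange(n_waves), n_people),
    "letter_redesign": np.repeat(rng.integers(0, 2, n_people), n_waves),  # intervention 1
    "reminder_call": np.repeat(rng.integers(0, 2, n_people), n_waves),    # intervention 2
    "age_band": np.repeat(rng.choice(["18-34", "35-54", "55+"], n_people), n_waves),
})
# Outcome: a behavioural measure (e.g. a capability or intention score), simulated here.
df["outcome"] = (
    0.3 * df["letter_redesign"] + 0.2 * df["reminder_call"]
    + 0.1 * df["wave"] + rng.normal(0, 1, len(df))
)

# Interventions and context as fixed effects, interactions with time, and a
# random intercept per person to handle the repeated measures.
model = smf.mixedlm(
    "outcome ~ (letter_redesign + reminder_call) * wave + age_band",
    df, groups=df["person"],
)
print(model.fit().summary())
```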
In conclusion
If we return to our example of letter optimisation then we can see how, in this context, RCTs can indeed work. Done well, this is a speedy and efficient way of evaluating the effectiveness of different interventions, facilitating decision making.
But of course, what this approach misses is that many people do not actually open and read the letter, so the interventions are only working on a subset of those receiving it. This illustrates the narrow mandate of RCTs – they work well in a formally defined problem domain, what we might call a ‘closed system’, where the situation is considered known and measurable.
However, when something happens that has not been ‘programmed’ or accounted for in advance, there is no means of integrating it, regardless of how important or relevant it is. This is a problem, as most human behaviour takes place in an ‘open system’ – a context with a wide range of factors shaping behaviours, often in hard-to-predict ways. As such, the tools we use for testing and evaluation need to be selected and designed with this ‘open system’ in mind.
This is similar to the ‘frame problem’ in AI. Outside of that which has been programmed, the machine does not ‘know’ what information is important and what is irrelevant in a way that we, as humans, take for granted.
We value, and indeed use, a variety of RCT designs across a range of testing challenges and for locating foundational insights about the nature of causal relationships. But we are calling for a wider debate about their strengths and limitations – and with this, we can surely question their position as always being the ‘gold standard’ of the behavioural scientist’s testing toolkit.