Conversations with code
Why behavioural scientists will be working alongside technologists in the emerging field of synthetic respondents.
There is growing interest in ‘synthetic respondents’, where LLMs (such as ChatGPT or Google’s AI models) use artificially created profiles to simulate the characteristics and responses of human survey participants. The potential opportunities are readily apparent – setting aside the considerable reduction in costs that comes from no longer conducting fieldwork with people, there can be a range of other benefits, from offering ‘access’ to otherwise hard-to-reach populations and piloting questionnaires, to testing out products, services and solutions in a speedy and agile way that would be hard to replicate with human participants.
But amidst the excitement that comes with any innovation, it is useful to consider the questions we should be asking when applying this approach. How confident can we be that it will deliver the responses we might expect to find by interviewing humans?
At first glance, as we shall see, the results look impressive, but perhaps equally importantly, do the principles being deployed stack up in a credible way? If we can properly understand the behavioural mechanisms involved when a human responds to a survey, then we are in a much stronger position to determine where, when and how a synthetic respondent could do this equally well, and to identify when that would not be advisable.
Of course, asking questions of objects is nothing new. In ancient China, the I Ching, or ‘Book of Changes,’ offered 64 hexagrams to provide guidance for decision-making through the interpretation of patterns generated by tossing coins or sticks. A more recent example is ELIZA, an early computer program developed in the 1960s at MIT by Joseph Weizenbaum that simulated conversation by using pattern matching to echo users' statements back as questions, mimicking a therapist. Some users famously came to believe they were talking to a machine with human capabilities. While these cases have faded over time, our enthusiasm for deriving insights about people from inanimate objects remains. The question is whether the latest incarnation of this, LLMs, offers us genuine insights into humans, in contrast to the largely abandoned experiments of the past.
Defining synthetic respondents (and synthetic data) is not straightforward – but rather than comparing different approaches to the technology, we want instead to consider in more detail the extent to which LLMs can replicate humans by exploring the human experience of completing a survey. Before we do that, we can consider the degree to which LLMs appear to have demonstrated their capability in doing exactly that.
So what is the empirical case for using LLMs to ask questions?
Drawing on a paper by Michael Fell, the evidence to date suggests LLMs are fairly effective at providing answers that closely reflect those collected from humans. Tools such as the Python programme PolyPersona are used to create simulated survey respondents with various characteristics such as demographic details, attitudes, and personality traits based on predefined variables. These characteristics are probabilistically assigned to each LLM agent to mimic a diverse survey population: PolyPersona then generates system prompts that configure the LLM to respond in a way that reflects a particular respondent, effectively allowing the LLM to act like different people when answering survey questions.
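To make that pipeline concrete, here is a minimal sketch of the general persona-prompting idea. It is our illustration rather than PolyPersona’s actual interface; the attribute names, distributions and prompt wording are assumptions made for the example.

```python
# A minimal sketch of the persona-prompting idea, not PolyPersona's actual API.
import random

# Hypothetical attribute distributions used to sample a simulated respondent.
ATTRIBUTES = {
    "age_band": (["18-34", "35-54", "55+"], [0.3, 0.4, 0.3]),
    "region": (["urban", "suburban", "rural"], [0.5, 0.3, 0.2]),
    "outlook": (["optimistic", "neutral", "sceptical"], [0.35, 0.4, 0.25]),
}

def sample_persona() -> dict:
    """Probabilistically assign characteristics to one synthetic respondent."""
    return {k: random.choices(opts, weights)[0] for k, (opts, weights) in ATTRIBUTES.items()}

def build_system_prompt(persona: dict) -> str:
    """Turn the sampled characteristics into a system prompt that conditions the LLM."""
    traits = ", ".join(f"{k}: {v}" for k, v in persona.items())
    return ("You are answering a survey as a person with these characteristics: "
            f"{traits}. Answer each question in character, concisely.")

# Usage: generate prompts for a small synthetic 'sample' and pass each one,
# together with the questionnaire items, to whichever LLM API is being used.
if __name__ == "__main__":
    for _ in range(3):
        print(build_system_prompt(sample_persona()))
```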
This broad approach has been used to look at political questions such as vote prediction, psychological attributes such as personality traits, management studies, and economics, amongst others. Reviewing these, Lisa Argyle and colleagues found strong associations between human and GPT-3-generated responses.
There certainly looks to be some promise, and the field is developing rapidly. But is it the whole story? What do we need to think about when considering how to approach this?
This is where we need to think carefully about the behavioural science of asking questions and getting answers.
What happens when we ask a question?
When considering the psychological mechanisms at play when we are asked a question, it is easy to assume that we operate in a somewhat computer-like way, as set out by Anderson and Bower’s Associative Network Theory (ANT) of memory: they propose that memory (and in our case knowledge of ourselves) is structurally organised as a network of nodes linked together. This ‘structuralist’ approach likens our minds to computers, with knowledge being systematically filed away, some of it more easily accessible than the rest. This stored knowledge is then accessed when certain cues (in our case questions) are presented, leading us to retrieve it and report it in the form of an answer.
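As a toy illustration of that ‘computer-like’ metaphor (ours, and a deliberate caricature rather than Anderson and Bower’s formalism), a question would act as a cue that simply looks up whatever is already filed away:

```python
# Toy illustration of the structuralist metaphor: an answer as a simple lookup
# from a pre-existing store of linked 'nodes'. A caricature for contrast, not a
# faithful model of Associative Network Theory.
memory = {
    "holiday": {"linked_to": ["beach", "family"], "stored_answer": "I prefer beach holidays."},
    "commute": {"linked_to": ["train", "crowds"], "stored_answer": "My commute is too long."},
}

def answer(cue: str) -> str:
    """Treat the question as a cue and retrieve whatever is already stored for it."""
    node = memory.get(cue)
    return node["stored_answer"] if node else "No stored knowledge for this cue."

print(answer("holiday"))  # pure retrieval: nothing is constructed at the moment of asking
```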
But this is not the only model – ‘constructivist’ approaches suggest a more active and dynamic perspective on the way we handle knowledge. Psychologists such as Frederic Bartlett see knowledge as something we construct by integrating different traces to build coherent meaning. This means that our position on something is always a work in progress, bringing together these different traces of information to cater for the new questions we are posed. Bartlett saw this as a positive characteristic, offering us flexibility in a world of inevitable change, as we can make new connections from different parts of our experience and the information we hold. We can see some echoes of this in Nick Chater’s book, ‘The Mind is Flat’, when he suggests that we assemble our opinions in the moment rather than having a preconceived position on a topic.
As we shall see, this distinction between the structuralist and constructivist schools is important for evaluating how effectively an LLM can offer a human-like response. But first, what is the evidence to determine which of these mechanisms we use when we are asked a question?
The evidence for constructivism
In reality this is not really a binary either/or consideration – some aspects of how we handle knowledge can be quite ‘structural’ and others more ‘constructivist.’ From the outside, however, it is easy to assume that a structured survey is entirely ‘structural’ in nature. Indeed, the survey-participant dynamic looks very much like it operates in a computer-like way, with queries (questions in a survey) resulting in a data set (answers). And it is easier to think of qualitative research as more constructivist in nature, as it involves more of a back-and-forth, a conversation that looks like humans making sense of things in words rather than ‘data’ being derived from a pre-scripted set of questionnaire items.
But simply because the survey looks this way does not mean that this is necessarily the case. Indeed, there is a great deal of evidence to suggest we operate in a dynamic, constructivist way when responding to a survey. The evidence for this is drawn from the huge literature showing that the questions we have considered and the answers we have provided earlier in a survey influence what we say later in the same survey.
This is important because if the knowledge we each held were merely organised like a computer, where we simply access what we already know from the relevant store, then we would not see this happen. Instead, there is a huge body of evidence that subtle differences in the wording, ordering and framing of questions can result in quite significant differences in the answers.
This was illustrated in a recent Ipsos study which set out to empirically test a proposal from the 1980s British TV series ‘Yes, Prime Minister’, in which one UK civil servant suggested to another that the position and framing of a question are very important in determining how people respond to it. The two alternative question sequences suggested in the TV show were fielded, on the premise that a different outcome could be generated by changes in the ordering, framing and wording of the question about favouring or opposing the reintroduction of National Service in Britain.
The study did indeed find differences in the responses, which is congruent with the way questionnaire design theorists such as Don Dillman describe surveys as a form of ‘social exchange’, emphasising the interaction between the survey and the respondent as a dynamic communication process and not simply the extraction of information.
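Ordering effects of this kind are also something that could be probed when piloting with synthetic respondents. Below is a hedged sketch of a split-ballot design in Python; the question wording is illustrative rather than the actual Ipsos items, and ask_llm is a dummy stand-in for whichever LLM client would really be used.

```python
# Sketch of a split-ballot test for question-order effects with synthetic respondents.
# ask_llm below is a dummy stand-in; in practice it would call whichever LLM API is used.
import random
from collections import Counter

NATIONAL_SERVICE_Q = "Would you be in favour of reintroducing National Service?"
LEAD_IN_Q = "Are you worried about the number of young people without jobs or direction?"

BALLOT_A = [LEAD_IN_Q, NATIONAL_SERVICE_Q]   # primed by a lead-in question
BALLOT_B = [NATIONAL_SERVICE_Q, LEAD_IN_Q]   # National Service item asked cold

def ask_llm(persona_prompt: str, questions: list[str]) -> list[str]:
    """Dummy responder returning one answer per question; replace with a real LLM call."""
    return [random.choice(["In favour", "Opposed"]) if q == NATIONAL_SERVICE_Q else "Yes"
            for q in questions]

def tally(personas: list[str], ballot: list[str]) -> Counter:
    """Tally answers to the National Service item under one question ordering."""
    counts = Counter()
    for persona in personas:
        answers = ask_llm(persona, ballot)
        counts[answers[ballot.index(NATIONAL_SERVICE_Q)]] += 1
    return counts

personas = [f"You are synthetic respondent {i}." for i in range(100)]
print("Ballot A:", tally(personas, BALLOT_A))
print("Ballot B:", tally(personas, BALLOT_B))
```

Comparing the two tallies would show whether a synthetic sample reproduces the ordering effect observed in humans – itself a useful validation exercise before relying on such respondents.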
Meta-cognition
Digging further into the psychology of this, we turn to Norbert Schwarz and his work on meta-cognition. He suggests that the act of retrieving information to answer questions can affect how people feel about their responses: the ease with which information can be retrieved from memory (‘felt fluency’) affects how significant, or frequent, people judge the retrieved content to be. This affects self-perception and decision-making, as easier-to-recall instances may be weighted more heavily when we reflect on the way we answered the question (our meta-cognition at play) than those that are not as readily accessible.
So, for example, we may ask "How satisfied are you with the battery life of your mobile phone?" and then a follow-up "Can you recall a recent instance when your mobile phone's battery lasted longer or shorter than expected?" By asking respondents to recall specific instances (ease of retrieval), confidence in their general satisfaction rating might be influenced. If a user easily recalls a positive experience (e.g., the battery lasting through a long day of use), their overall satisfaction rating could be enhanced due to the ease with which this positive memory came to mind.
This is a good case in point of the way that surveys are a dynamic process that can stimulate new connections and insights, where each answer not only contributes to the overall response set in its own right, but also deepens the respondent's engagement with their own experiences and perceptions. Indeed, we can often see a deepening of understanding as we proceed through a survey, as participants can start engaging in relational thinking, linking concepts, ideas, and experiences in ways they hadn't explicitly considered before.
Of course, this has implications for whether it is a human or a machine responding: while humans have the capacity for meta-cognition that can lead to new insights and changes in perspective, AI can at best produce responses that merely imitate reflective thought, as its output is not based on genuine self-awareness or meta-cognition.
It is in this area that more exploration and understanding is needed of what happens during the act of being asked a question, which would surely reinvigorate the psychological work of survey theorists that has arguably faded into the background in recent times.
Representativeness
Moving slightly away from the psychology of the question-and-answer process, another important consideration for any research is representativeness – ensuring that the sample obtained reflects the wider population. How do LLMs fare in this respect? Angelina Wang and colleagues point out that the ability of LLMs to replace human participants is wholly contingent on LLMs being able to represent the perspectives of different demographic identities.
In their paper, Wang and colleagues suggest that LLMs can struggle to represent marginalized groups accurately due to two key limitations. The first is what Wang calls ‘Misportrayal’: LLMs often fail to effectively represent the perspectives of different demographic groups. For example, when asked to simulate a person with a specific demographic identity, the models tend to produce stereotypical responses that do not reflect the real diversity within that group.
The other limitation is what Wang calls ‘Group Flattening’: here LLMs tend to produce responses that homogenize the experiences and identities of diverse groups. This results in a loss of nuance and the erasure of subgroup heterogeneity (for example, a one-dimensional portrayal of complex identities that ignores intersectionality).
These limitations are due, the authors claim, to the way LLMs are typically trained on text data scraped from the internet, which rarely includes reliable indicators of the authors’ demographic identity. LLM training also often optimizes for the most likely output rather than the most accurate or representative one, which tends to favour the most common views and expressions, further contributing to the problem of group flattening.
The authors therefore urge caution in using LLMs in settings where the representation of human identities is critical and recommend using them primarily as supplements to human judgment rather than replacements, especially in sensitive applications.
The philosophy of asking questions
So far we have discussed the way in which individuals handle knowledge and how we can be sure that LLMs can represent this fully, both in terms of the representativeness of the population and in the manner by which answers are formed.
But perhaps the implication here is that, regardless of the way humans access the knowledge they hold, the knowledge itself has static properties, something just waiting for us to unearth it. Russian philosopher Mikhail Bakhtin challenges this assumption, suggesting that the way we all think about issues is through dialogue, meaning that the conversations we have assume a greater significance than we might always credit them. In which case, the knowledge we have of any issue is never complete or final and waiting for us to report on it but instead is always in-process, subject to revision and expansion in the light of new dialogues and interactions. We can see reflections of this in the work of psychologists Steven Sloman and Philip Fernbach who wrote:
“Humans are the most complex and powerful species ever, not just because of what happens in individual brains, but because of how communities of brains work together.”
The implication is that when asking questions, whether in person, in a survey or by any other means, we are not simply extracting static information; rather, this is an active, iterative process of constructing meaning and understanding. The knowledge on a topic is not necessarily fixed, known and waiting to be ‘unearthed’ but is instead something much more dynamic.
But what about tracking data on attitudes and opinions, which does not seem all that dynamic given the flat lines in the data? Well, simply because knowledge is dynamic and constructed does not mean it changes quickly – there may be long waves of consistency in our shared position. But in the less predictable and more unstable environment we now live in, it is arguably more likely to change more frequently, and in ways we may not be able to predict (either in terms of the type of change or its timing).
Ethical considerations
The act of asking questions also has a wider ethical dimension in Bakhtin's view. To ask a question involves recognizing and respecting the otherness of others, their right to speak, and the validity of their perspectives. Questions are a way of ‘yielding the floor’, allowing others to express themselves and contribute to the shared process of meaning-making. They provide a means of challenging received wisdom around societal norms. To not ask questions of people, and not offer the means by which they can express their answers, surely limits our access to these diverse perspectives.
And while the average data point may not change significantly in a survey, it could be at the edges that the most significant insights are drawn – such as the first inklings of change in a data point, or a specific sub-group that appears to hold a different position from the consensus. As historian Lorraine Daston suggests, if we are living in a period of radical uncertainty then we are thrown into a state of ‘ground-zero empiricism’, where we cannot rely on the knowledge of the past to understand the present. She suggests that as we struggle to make sense of a rapidly changing environment, we are much more reliant on “chance observations, apparent correlations, and anecdotes that would ordinarily barely merit mention”.
In conclusion
We are at a point where a number of different proposals are emerging for ways to generate ‘synthetic respondents’ – and the finding that at least some of these seem to offer consistency with human-generated responses is interesting and exciting. Perhaps the age-old ambition of gaining insight into humans by asking questions of objects is getting tantalisingly closer.
But the empirical case for parity is not the only consideration – if we do not have a good understanding of the human processes that underpin the way a question is assimilated, and an answer given, then we are in danger of assuming synthetic respondents are always equivalent to human respondents. As we set out earlier, just because they can look the same from a distance does not mean they are the same.
The Thomas Theorem states that if we “define situations as real, they are real in their consequences.” It is easy to suggest that something is ‘good enough’ and therefore run with the quicker, cheaper and seemingly sufficient option. But the danger, of course, is that it is hard to see what has been missed – the outputs may be highly plausible and consistent with what we expect to see, but is that really why we asked the question in the first place?
Much of the work at the moment is examining how best to design LLMs in a way that is properly representative of the population – and the work of people such as Wang and colleagues gives valuable insights into the steps that need to be taken to do this effectively (and the contexts best avoided based on the known shortcomings of LLMs). But perhaps less widely considered are the points we have raised relating to the behavioural considerations of what happens when we ask a question of someone. If we do not understand this well enough, then we will not properly understand the situations in which LLMs offer an effective tool, nor those in which we should consider not using them (or using them with additional safeguards).
It is important in this context to remember there is a great deal of evidence that businesses which invest in research gain knowledge about consumers that is unique and often hard for competitors to replicate. This kind of learning environment requires businesses to continually adapt and refine their understanding based on customer feedback and behaviours. This is careful, nuanced work – and while there is clearly the very exciting prospect of augmentation with synthetic respondents, we need to ask ourselves careful questions about when, what and how this is helpful. We cannot underestimate the importance of calling on behavioural science to help us better understand what actually happens when we ask a question and receive an answer.