Have chatbots now conquered the human touch?
The focus is currently on prompt engineering to get the most out of chatbots: but as they move into more personal spaces, do we now need them to get the most out of us?
Chatbots, automated programs that offer support and assistance to humans by communicating through text, have come a long way. Once confined to an all too clumsy role in first-line customer service, they are today penetrating some of the most intimate areas of our lives, acting as romantic partners, advisors and therapists. It seems that since Large Language Models in the form of Generative AI (or Gen AI as it is commonly known) burst onto the scene, we have become ever more comfortable conversing with machines, to the point that we now turn to chatbots for our most complex emotional needs.
In fact, tens of thousands of mental wellness and therapy apps are available in the online stores, with the most popular ones, such as Wysa and Youper, having more than a million downloads each. The Replika chatbot app provides companionship and emotional support through personalised interactions, with 40% of users creating avatars to simulate romantic relationships. Career coaching, financial advice, weight loss and fitness are among the many other areas rapidly gaining large numbers of users. And bots are also increasingly used in market research to ask questions about a wide range of attitudes and behaviours, in an attempt to gain nuanced insights into people's minds.
Whilst the success of Gen AI chatbots is impressive, it surely merits a behavioural lens: it is not so long ago that adoption of chatbots beyond very basic customer service activities was thwarted because people preferred the 'human touch'. Does the rapid recent success of chatbots mean that this hard-to-define 'human touch' has now been replicated? If so, what does that tell us about the capabilities that machines need to have? And what does that tell us about humans and our uniqueness?
We explore the tricky terrain between humans and machines, aiming to understand where and how chatbots can offer real value to our lives but also, equally importantly, the point at which we need to operate with caution to avoid overreaching the technology's capabilities.
Our relationship with technology
As humans, we have always had a very close relationship with the tools we use. Alva Noe illustrates this with the question 'do you know what time it is?' If you do, this is not because of some internal mental skill but because you can glance at your watch or, more likely, look at your phone or computer. The information about time that is so much a part of our lives is not in our heads but is instead accessed from machines and objects that are exterior to us.
And arguably our daily interaction with digital devices is making the distinction between ourselves and the tools we use ever more blurred: for instance, we are no longer dealing with tools as objects (such as hammers or even computers) but as events. We would not think of playing an online game as engaging with a thing; instead, we are immersed in the event of the game, where the boundaries are much less clear.
This makes it more important than ever to carefully dissect what is going on in a human-chatbot 'conversation', because as technology becomes an increasingly intimate part of our lives it can be difficult to see things clearly.
The first consideration is the interpersonal, where having a human as part of the interaction has long been a way to make things easier. Can this somehow be overcome? If not, that surely points to a limitation of chatbots compared with human interaction.
Interpersonal considerations
To examine this, we can reference an interesting study by psychologists Kevin Corti and Alex Gillespie, who introduced the 'echoborg': a hybrid 'agent' made up of the body of a real person and the 'mind' (the words) of a chatbot. The words the echoborg speaks are determined by a chatbot, transmitted to the person via hidden radio technology, and then spoken aloud by the person.
The notion of the echoborg was inspired by Philip K. Dick's novel Do Androids Dream of Electric Sheep?, which was later adapted into the film Blade Runner. In it, a post-apocalyptic earth is populated by androids who are indistinguishable in appearance from actual human beings. One of the many thought experiments set out in the novel concerns the role of belief in the way we attribute an inner essence to machines, but also the role that the human body plays in implying a human mind.
It was Stanley Milgram, better known for his studies on obedience to authority, who carried out a series of small experiments in the late 1970s in which he trained research colleagues to repeat words they received from a separate source. Once trained, he staged interactions in which he conversed through his trained colleagues with research participants who were not told that the person they were speaking to was voicing words fed to them by this hidden source. The participants consistently failed to detect the manipulation, even in contexts where there was huge incongruity between the source and the speaker, such as when 11- and 12-year-old children's words were used. Milgram considered the research participants to have fallen for a 'cyranic illusion': a term he used to describe the failure to perceive when a speaker is not self-authoring the words they speak.
Corti and Gillespie ran an experiment to see how the experience of interacting with a chatbot changes when the words are spoken by a real person in a face-to-face interaction. In the experiment, participants were not told beforehand that they would be speaking with a chatbot. They either engaged face-to-face with a person who, unknown to them, was speaking the words of a chatbot, or conversed via a text interface and were told the words came from a real person (when in fact these too came from a chatbot).
Their results showed that those who experienced the text interface tended to describe the experience as artificial/inhuman (using words such as “mechanical,” “computer,” and “robotic”). By comparison, those who encountered an echoborg (where a human was speaking the words of the chatbot) tended to use words that described human characteristics (such as “shy,” “awkward,” and “autistic”).
In other words, the nature of the interface frames the interaction: the information that is readily available to people is used to explain their experience, what is known as perceptual salience. Those who spoke with an echoborg based their judgements on the person sitting directly in front of them, whilst those who used a text interface based their judgements on the computer screen they saw in front of them.
This suggests that any chatbot may well still struggle to offer a 'human touch' despite the realism of the words used. Interface design therefore seems an important consideration if people are to give machines the benefit of the doubt. How this is done so that people are not left assuming some form of human capability is, of course, an ethical question to be addressed.
Whether this holds over time is debatable, as we get more used to conversing with technology interfaces and anthropomorphism (the human tendency to attribute human characteristics, whether physical or psychological, to non-human agents) starts to change the way we engage with technology. For example, studies have shown that humans perceive robots such as Roomba (a vacuum cleaner with a semi-autonomous system) as having human qualities. Others have found that we tend to assign greater mental abilities to robots that have a human appearance. SoftBank Robotics' Pepper, for example, is a "pleasant and likeable" robot built in a humanoid form and designed to serve as a human companion.
Working together
Another consideration about interpersonal communication is that when we ask someone a question, they will not necessarily have a ready answer. They may not have considered the subject previously, or they may feel conflicted in their opinions about the topic and unsure how to answer.
What is not always widely considered is the way that, in these situations, it is common for people to work together to generate suggestions based on feelings, images and narrative templates that they 'put out there' as a way of stimulating memory. Psychologists Brady Wagoner and Alex Gillespie wrote a paper suggesting that these can be thought of as stimuli used in a social negotiation towards an answer to a question. For our purposes, therefore, the chatbot asking the questions can be seen as a participant in this negotiation, supporting the process of recognising what the person answering is trying to arrive at.
This could be seen as quite a radical understanding, because there has been so much research on the so-called contaminating effects of social suggestion leading to conformity, obedience and so on. But as Wagoner and Gillespie point out, more recent research suggests that people are far from passive dupes of suggestion, with plenty of evidence that they are adept at resisting it when it comes from people they consider unreliable.
The extent to which a chatbot is well positioned to participate in 'conversational remembering' is debatable, as it is a highly social interaction of suggestion and counter-suggestion mediating an ongoing construction of the participant's position on an issue. However, this is not to say that understanding this process cannot assist in the design: we could think of it as prompt engineering in reverse, with prompts crafted so that the chatbot elicits responses from the human rather than the usual scenario of the human eliciting responses from the machine.
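To make this concrete, below is a minimal sketch, in Python, of what such reverse prompt engineering might look like: a system prompt instructing the chatbot to offer tentative suggestions and ask open follow-up questions rather than supply answers. The prompt wording, the build_conversation helper and the message format are illustrative assumptions, not a tested design or any particular vendor's API.

```python
# A minimal sketch of 'reverse prompt engineering': a system prompt that
# instructs a chatbot to elicit answers from the human rather than provide them.
# The prompt wording and message format below are illustrative assumptions.

ELICITATION_PROMPT = (
    "You are a research interviewer, not an advisor. Do not give answers or "
    "advice. Instead, offer tentative suggestions ('Some people feel X, others "
    "feel Y...') and ask one short, open follow-up question at a time, so the "
    "participant can accept, reject or refine each suggestion in their own words."
)

def build_conversation(participant_turns: list[str]) -> list[dict[str, str]]:
    """Assemble a chat history in the common role/content message format."""
    messages = [{"role": "system", "content": ELICITATION_PROMPT}]
    for turn in participant_turns:
        messages.append({"role": "user", "content": turn})
    return messages

if __name__ == "__main__":
    # The assembled history would then be sent to whichever chat model is used.
    for msg in build_conversation(["I'm not sure how I feel about my job lately."]):
        print(f"{msg['role']}: {msg['content']}")
```

The point of such a design is that the chatbot's role is framed as stimulating and negotiating the participant's answer, mirroring the suggestion-and-counter-suggestion dynamic described above.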
Bullshit questions?
We can move now to more cognitive considerations relating to human relationships with chatbots. Specifically, we need to have an empathic sense of the other person, to be able to understand them as people rather than 'objects'. Martin Buber describes this as the difference between an I-It and an I-Thou relation: the former is functional and utilitarian, focusing on instrumental considerations, whilst the latter involves individuals creating emotional connections. This chimes with good practice for building relationships: perspective taking, understanding the other person's point of view, and active listening, where we listen attentively to the other person, understand what they are saying, respond and reflect on what is being said, and retain the information for later.
In this respect, asking questions that are relevant to another person necessarily involves a theory of mind, as we need to be able to interpret the intentions, beliefs, emotions and desires of the other party. We are constantly making rapid evaluations of others based on what they are saying, the tone of their voice, eye contact, body movements and so on. If someone is looking uncomfortable, we know to moderate our conversation, change the topic or perhaps gently probe further. Without a theory of mind we will struggle to hold an intimate conversation and ask the right questions of the other party.
Can a chatbot have a theory of mind in order to ask the right questions? In some sense perhaps it can: in a reverse of the usual prompt engineering that humans use to elicit answers from the machine, the machine can be trained to use a series of questions designed to elicit meaningful answers from humans. Research published recently in Nature Human Behavior found that some Gen AI chatbots perform as well as, and in some cases better than, humans when tested on tasks designed to assess theory of mind.
But is this good enough?
Whilst this is impressive, do we have enough information to be confident that Gen AI chatbots will act in a way that seems sufficiently empathic to be able to put appropriate questions to the human participant?
To answer this we need to remind ourselves how Gen AI actually works. The system works on probabilities: when it is given a prompt, it amasses a small group of next-word candidates and selects from among those with the greatest statistical likelihood. Given that it relies on statistical probabilities, this works to the extent that the questions being offered reflect issues that are sufficiently present in the population for the follow-up question to be meaningful. But if the specifics of the individual case are not reflected in this way, then the questions asked will give a poor appearance of 'theory of mind' to the human participant.
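As a rough illustration of this aggregate logic, the toy Python sketch below samples a 'next word' from a handful of invented candidates, weighted by probability. The vocabulary and the numbers are made up for illustration; real models operate over vast vocabularies and whole contexts, but the underlying principle is the same: common continuations dominate, and rare, idiosyncratic ones rarely surface.

```python
import random

# Toy sketch of next-word selection: the candidate words and their
# probabilities are invented for illustration, not taken from any real model.
def sample_next_word(candidates: dict[str, float], top_k: int = 3) -> str:
    """Keep the top_k most likely next-word candidates and sample one of them,
    weighted by probability - a crude stand-in for how a language model picks
    continuations that are accurate in the aggregate, not for the individual."""
    top = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    words, weights = zip(*top)
    return random.choices(words, weights=weights, k=1)[0]

# Hypothetical continuations after a participant is asked about their work:
next_word_probs = {"stressed": 0.41, "fine": 0.33, "fulfilled": 0.18, "zebra-like": 0.01}
print(sample_next_word(next_word_probs))
# The statistically common continuations dominate; the rare, idiosyncratic
# one (the 'zebra') is pruned or almost never chosen.
```

Whatever the particular participant's situation, the sampling favours the statistically typical continuation, which is precisely why an atypical individual can end up facing questions that feel wide of the mark.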
Of course, this issue is not limited to machines; we are all aware that in life there are occasions when the questioner fails to understand the individual characteristics of the person and what they are saying. One case in point is the doctor's consultation: doctors are typically trained to diagnose conditions based on probabilities, as reflected in the famous mantra 'when you hear hoofbeats, think horses, not zebras'. This works for most people, most of the time, but it is well documented that it does not work well for those suffering from a rare disease, as testified by the struggle these patients face in getting a diagnosis.
In some senses, if we take theory of mind as an important characteristic that determines the questions asked, then we could consider that much of the time an approximation of it is good enough, but there will be occasions when it falls short. A chatbot is not designed to optimise individual-level responses: it is not concerned with whether the questions asked reflect the nuances of any one person's idiosyncrasies, because its output is probabilistic and designed to provide accuracy in the aggregate. Because chatbots process and generate text based on patterns, without true comprehension, there is no attempt at 'understanding' the other person's mind.
What is generated may appear to hit the spot, but equally it can miss the mark: the chatbot is not concerned with 'truth' as such, just probabilities. This is similar to Harry Frankfurt's description of bullshit, which for him involves no concern for whether something is truthful or not, as opposed to 'lying', which is a conscious act of deception. This is not meant as a criticism of any chatbot; it simply describes the way the technology operates.
Of course, safeguards can be put in place to prevent incorrect advice being given, but in greyer zones, such as when the chatbot is seeking to understand the participant's attitudes and concerns, we need to be on the lookout for bullshit questions.
Understanding when bullshit questions can arise is important: for which groups and which topics does this work, and when does it fail? There is plenty of evidence to suggest that for marginalised groups that are under-represented in the training data, the chatbot will ask questions that may well be stereotypical or irrelevant.
Conclusions
The analysis of Gen AI has tended to focus on its effectiveness at answering questions, but when we start to look at it in a relational capacity, it is as much about the nature of the questions it asks, which is in some ways the harder task.
Chatbot design is clearly an area of intense focus and investment as the technology moves into ever more intimate parts of our lives. Arguably much of this is about normalisation, as we become ever more comfortable interacting with machines in this way.
But to be really successful, the tools we use need to be designed in a way that properly reflects the nuanced, idiosyncratic and social ways in which we engage. The fundamental design of Gen AI chatbots is not always optimised for this, and so we need both to understand the limitations, including when to hand over to people, and to design as well as possible within those limitations.