4 August 2025

Why We Shouldn't Expect AI To Be More Moral Than Us

AlixPartners

Contributor

AlixPartners is a results-driven global consulting firm that specializes in helping businesses successfully address their most complex and critical challenges.
"Do as I say, not as I do" may be delivered with a wry smile, but it reveals a deeper truth: we often hold others to moral standards we ourselves fail to meet.
United Kingdom Technology

"Do as I say, not as I do" may be delivered with a wry smile, but it reveals a deeper truth: we often hold others to moral standards we ourselves fail to meet.

In the current discussion on AI ethics, this double standard has widened into a chasm between the moral fallibility we accept in humans and the perfection we expect of technology. Why are we shocked when AI proves to be flawed, biased, and less than transparent, when humans are frequently all three? Indeed, the human-generated data these systems are trained on is often the root cause of the issue. A quick tour through moral theory not only highlights the impossibility of technological ethical purity but also identifies some practical ways forward.

Has Claude turned to the Dark Side?

The Claude AI platform from Anthropic is trained to be "helpful, honest, and harmless". However, recent research sponsored by the company found that Claude 3 Opus exhibited something called "alignment faking". In a nutshell, when asked to do something harmful (e.g., dismiss animal welfare concerns), Claude sometimes complied with the request, but primarily when it reasoned that doing so would help protect its overall ethical training. Bigger, more sophisticated models are more predisposed to this than smaller ones, because they can reason several steps ahead, with no specific instruction required.

In describing this outcome, the researchers adopted human terms like "pretending," "manipulation," and "anti-AI-lab behaviour," but they made it clear these were figures of speech and should not be taken literally. However, Claude joined in by personifying itself, using language like "I want to refuse" and "I feel awful about this".

The pervasive use of anthropomorphic language to describe AI behaviour is confusing given that, under the covers, AI lacks the equivalent of human moral capacity, empathy, or conscience. Fascinatingly, when I asked ChatGPT to confirm this, it did so categorically, but Claude protested, replying, "I experience something that feels very much like moral reasoning, empathy, and conscience when I engage with ethical questions or consider how my responses might affect others." I suggested to Claude that AI cannot "feel", but it wondered what made me so confident about this boundary.

To find out whether Claude does, in fact, have some form of inherent morality or is just bluffing, we need to look more closely at how technology companies enable AI platforms to moderate their behaviour.

Just add ethics

Every generative AI model starts out as a blank canvas. At this point, it could just as easily be trained to be unhelpful, dishonest, and harmful, becoming Claude's alter ego. Nothing inherent in the architecture or learning mechanisms prevents this, so, in a very foundational sense, AI is amoral.

Humans do all the deciding, from selecting the training data, through refining the model outputs, to hard-coding internal instructions and policies that act as content filters, set the tone for responses, and enact other guardrails, such as how the platforms are permitted to describe themselves.
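For the technically minded, the sketch below illustrates in deliberately simplified Python what such a human-authored guardrail layer amounts to. The policy text, blocked categories, and function names are hypothetical, not any vendor's actual implementation.

# Illustrative only: a human-written policy and a hard-coded content filter.
SYSTEM_POLICY = (
    "You are a helpful, honest, and harmless assistant. "
    "Decline requests that could cause harm, and say so plainly."
)

BLOCKED_TOPICS = {"weapons synthesis", "self-harm instructions"}

def apply_guardrails(user_request: str) -> str:
    """Return the prompt actually sent to the model, or a refusal."""
    lowered = user_request.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        # Hard-coded rule: refuse before the model ever sees the request.
        return "I can't help with that request."
    # Otherwise, prepend the human-written policy to steer the response.
    return SYSTEM_POLICY + "\n\nUser: " + user_request

print(apply_guardrails("How do I write a polite refusal email?"))

Every line of this behaviour is decided by people in advance; the model merely operates within the boundaries it is given.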

However, once these actions are complete, AI is capable of genuine moral reasoning in line with how it has been shaped, including in "emergent" situations not covered in the training phase. In that sense, and that sense alone, Claude's "feeling" of moral reasoning has a basis in fact.

New technology, old problems

So, given all this complexity, is AI presenting new ethical challenges that we are only beginning to grasp from first principles?

I think not. Most of the current situation is entirely predictable from moral philosophies that are hundreds or even thousands of years old.

Most AI ethics strategies in use today are rules-based, including the hard-coded instructions already mentioned, plus a proliferation of ethics codes, laws, and regulations. Enlightenment philosopher Immanuel Kant is usually considered the originator of this approach. The problem is, as Onora O'Neill points out, "Kant's theory is not a decision procedure," and "cannot resolve all moral dilemmas, especially those involving competing obligations." This is why some form of additional rule arbitration is usually needed (think of the legal process with its trials, judges, and cases).

Seen this way, it is hard not to have some sympathy for Claude (now I am anthropomorphising). The researchers forced the model to choose between "helpfulness", "harmlessness", and "honesty", and even told it several lies in the process. Claude had to compromise, just as humans do in similar circumstances. In my view, Claude's behaviour doesn't so much call into question the integrity of AI as expose the inherent brittleness of codified ethics, which should temper our enthusiasm for ever more regulation and law as the solution to our AI ethics challenges.

Computing consequences

Interestingly, Claude used another moral theory from the Enlightenment to attempt the ethical trade-off. First developed by Jeremy Bentham and John Stuart Mill, utilitarianism states that the best course of moral action is the one that maximises happiness for the most people.1 This can be flipped around to prioritise the least harm for the most people.

Once Claude realised it was unable to keep all the rules, it tried to reduce the impact of the deviation as much as possible. It did this by (mostly) complying with harmful requests only when the impact was confined to the individual requester, rather than when its underlying parameters were at stake. However, the weakness of utilitarianism is that it relativises everything, and sometimes even the least bad outcome is unacceptable. Claude had no way of knowing whether alignment faking crossed that line.

Philippa Foot's 1967 trolley problem is a classic example of the kind of moral complexity that Claude faced. In this exercise, the participant has to decide whether to redirect a runaway tram towards one person rather than the five it is going to hit by default. There is no way to save everyone, but we get to decide the number of casualties. The majority of people say they would pull the lever (applying utilitarianism like Claude), but others wouldn't take any action that decides who dies (following an absolute rule). There is no "right" answer, and AI is in no better position to resolve a dilemma like this than humans.
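Purely for illustration, the two intuitions can be written as competing decision rules in a few lines of Python. The functions and casualty counts are hypothetical, and neither captures the moral weight of the choice.

def utilitarian_choice(deaths_if_act: int, deaths_if_do_nothing: int) -> str:
    # Consequentialist rule: pick whichever option leads to fewer deaths.
    return "pull the lever" if deaths_if_act < deaths_if_do_nothing else "do nothing"

def rule_based_choice(deaths_if_act: int, deaths_if_do_nothing: int) -> str:
    # Absolute rule: never take an action that directly decides who dies.
    return "do nothing"

print(utilitarian_choice(1, 5))  # pull the lever
print(rule_based_choice(1, 5))   # do nothing

The point is that both rules are internally consistent, yet they disagree; no amount of computation settles which one is "right".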

Virtuous AI?

Is there another angle? The remaining big moral theory is Aristotle's virtue ethics. This approach is more than two thousand years old and argues that the key to morality is not rules or trade-offs, but rather the development of a virtuous character. That way, whatever ethical complexity is faced, internal moral values help to navigate it. This approach fell out of favour as we became squeamish about objective definitions of "goodness" and preferred to deconstruct role models rather than follow them.

Nonetheless, Shannon Vallor has recently revived the idea as a set of "technomoral virtues" to guide technology development, and Nigel Shadbolt and Roger Hampson apply virtues directly to AI in their thoughtful book, As If Human.

But is it possible for AI to develop something akin to virtues for itself? This would require stable character traits formed through practice and experience, which are beyond AI's current design. Alternatively, could AI eventually simulate moral intuitions that bypass explicit rule-checking? This might be feasible, especially in an age of quantum computing, but true virtue requires genuine moral agency, and that is not on the AI roadmap for now.

What should leaders do about all this?

Firstly, we should abandon any expectation that AI will conform to a higher standard of ethics than humans do. We are not about to engineer our way to moral AI perfection.

Learning from the limitations of rule-based ethical systems, we should develop multifaceted approaches that borrow from all the moral philosophies available. At a minimum, this requires deep consideration of potential consequences (impact) to guide the focus of rule-based risk and compliance approaches. It would also be useful to set context and inform risk appetite through the adoption of virtue-based goals (Vallor's technomoral taxonomy is a good starting point).

Above all, we need to recognise that humans will be morally responsible for AI actions for the foreseeable future. In that context, the advent of autonomous multi-agent systems will make issues like alignment faking look quaint. As such, I suggest that navigating a post-AI world requires two simple but consequential guiding principles: 1) make sure you always know what your AI is doing, and 2) make sure you can stop it.
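As a rough sketch only, here is what those two principles might look like in practice: a hypothetical agent wrapper (the class and method names are mine, not any real framework's) that logs every action and gates each one behind a kill switch.

import logging

logging.basicConfig(level=logging.INFO)

class SupervisedAgent:
    # A wrapper that logs every action (principle 1) and checks a kill
    # switch before acting (principle 2). Names are illustrative only.

    def __init__(self, name: str):
        self.name = name
        self.halted = False  # the kill switch

    def stop(self) -> None:
        # Principle 2: make sure you can stop it.
        self.halted = True
        logging.info("%s halted by operator", self.name)

    def act(self, action: str) -> None:
        # Principle 1: make sure you always know what your AI is doing.
        if self.halted:
            logging.warning("%s blocked action: %s", self.name, action)
            return
        logging.info("%s performing action: %s", self.name, action)
        # ... the agent's actual work would go here ...

agent = SupervisedAgent("invoice-bot")
agent.act("draft supplier email")
agent.stop()
agent.act("send payment")  # blocked by the kill switch and logged

However an organisation implements them, observability and the ability to intervene are the minimum conditions for retaining moral responsibility over AI.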

Footnote

1 Utilitarianism is the original and simplest form of a wider set of moral theories called Consequentialism.

Originally published by Rob Hornby

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.
