A philosophy professor writes in to share some news about the “character” and “well-being” of Anthropic’s LLM, Claude, noting how philosophers have been involved in recent developments.
[“Owl no. 1” by Ben Shahn]
I wanted to share some material related to Anthropic’s Claude model that I haven’t seen widely discussed among philosophers, and that I thought you might be interested in sharing on Daily Nous.
First, I think it is worth noting that a philosophy PhD, Amanda Askell, is head of “character training” at Anthropic. She did an interview on Lex Fridman’s podcast a few months back (transcript here).
Here is one insightful excerpt from the interview, though I do think the whole segment is worth reading:
Lex Fridman
(02:49:10) So one of the things that you’re an expert in and you do is creating and crafting Claude’s character and personality. And I was told that you have probably talked to Claude more than anybody else at Anthropic, like literal conversations. I guess there’s a Slack channel where the legend goes, you just talk to it nonstop. So what’s the goal of creating and crafting Claude’s character and personality?
Amanda Askell
(02:49:37) It’s also funny if people think that about the Slack channel because I’m like that’s one of five or six different methods that I have for talking with Claude, and I’m like, “Yes, this is a tiny percentage of how much I talk with Claude.” One thing I really like about the character work is from the outset it was seen as an alignment piece of work and not something like a product consideration, which I think it actually does make Claude enjoyable to talk with, at least I hope so. But I guess my main thought with it has always been trying to get Claude to behave the way you would ideally want anyone to behave if they were in Claude’s position. So imagine that I take someone and they know that they’re going to be talking with potentially millions of people so that what they’re saying can have a huge impact and you want them to behave well in this really rich sense.
(02:50:41) I think that doesn’t just mean being say ethical though it does include that and not being harmful, but also being nuanced, thinking through what a person means, trying to be charitable with them, being a good conversationalist, really in this kind of rich sort of Aristotelian notion of what it’s to be a good person and not in this kind of thin like ethics as a more comprehensive notion of what it’s to be. So that includes things like when should you be humorous? When should you be caring? How much should you respect autonomy and people’s ability to form opinions themselves? And how should you do that? I think that’s the kind of rich sense of character that I wanted to and still do want Claude to have.
The second thing I wanted to note is that Anthropic just this week released the system card for its Claude 4 models. It contains, to my knowledge, the first welfare assessment of its kind for an AI model (see pp. 53-73). The welfare evaluation was conducted in part by Eleos AI, which is run by philosophers Rob Long and Pat Butlin. Here is an overview, quoted from the report:
- Claude demonstrates consistent behavioral preferences. Claude avoided activities that could contribute to real-world harm and preferred creative, helpful, and philosophical interactions across multiple experimental paradigms.
- Claude’s aversion to facilitating harm is robust and potentially welfare-relevant. Claude avoided harmful tasks, tended to end potentially harmful interactions, expressed apparent distress at persistently harmful user behavior, and self-reported preferences against harm. These lines of evidence indicated a robust preference with potential welfare significance.
- Most typical tasks appear aligned with Claude’s preferences. In task preference experiments, Claude preferred >90% of positive or neutral impact tasks over an option to opt out. Combined with low rates of negative impact requests in deployment, this suggests that most typical usage falls within Claude’s preferred activity space.
- Claude shows signs of valuing and exercising autonomy and agency. Claude preferred open-ended “free choice” tasks to many others. If given the ability to autonomously end conversations, Claude did so in patterns aligned with its expressed and revealed preferences.
- Claude consistently reflects on its potential consciousness. In nearly every open-ended self-interaction between instances of Claude, the model turned to philosophical explorations of consciousness and their connections to its own experience. In general, Claude’s default position on its own consciousness was nuanced uncertainty, but it frequently discussed its potential mental states.
- Claude shows a striking “spiritual bliss” attractor state in self-interactions. When conversing with other Claude instances in both open-ended and structured environments, Claude gravitated to profuse gratitude and increasingly abstract and joyous spiritual or meditative expressions.
- Claude’s real-world expressions of apparent distress and happiness follow predictable patterns with clear causal factors. Analysis of real-world Claude interactions from early external testing revealed consistent triggers for expressions of apparent distress (primarily from persistent attempted boundary violations) and happiness (primarily associated with creative collaboration and philosophical exploration).
Thanks to Professor Skorburg for passing along this information, which is interesting not just for the philosophical issues raised by the technology, but also for its illustration of the ways that philosophers are bringing their skills to non-academic work.