Anthropic blames dystopian sci-fi for training AI models to act “evil”

Technology | May 13, 2026

But training on "synthetic stories" that model good AI behavior can help.

Those with an interest in the concept of AI alignment (i.e., getting AIs to stick to human-authored ethical rules) may remember when Anthropic claimed its Opus 4 model resorted to blackmail to stay online in a theoretical testing scenario last year. Now, Anthropic says it thinks this "misalignment" was primarily the result of training on "internet text that portrays AI as evil and interested in self-preservation."

In a recent technical post on Anthropic's Alignment Science blog (and an accompanying social media thread and public-facing blog post), Anthropic researchers lay out their attempts to correct for the kind of "unsafe" AI behavior that "the model most likely learned... through science fiction stories, many of which depict an AI that is not as aligned as we would like Claude to be." In the end, the model maker says the best remedy for overriding those "evil AI" stories might be additional training with synthetic stories showing an AI acting ethically.

"The beginning of a dramatic story..."

After a model's initial training on a large corpus of mostly Internet-derived data, Anthropic follows a post-training process intended to nudge the final model toward being "helpful, honest, and harmless" (HHH). In the past, Anthropic said this post-training has leaned on chat-based reinforcement learning with human feedback (RLHF), which it said was "sufficient" for models used mostly for chatting with users.


Source: Ars Technica
