• Home
  • Community Events and Conversations
  • Progress & Principles – News
  • Membership
  • Donations
  • Contact us

Learning to navigate the emerging interconnected world.

director@bioethics.tech
Bioethics.techBioethics.tech
  • Home
  • Community Events and Conversations
  • Progress & Principles – News
  • Membership
  • Donations
  • Contact us

Study finds AI really doesn’t want to change its mind.

December 21, 2024 Posted by Director News, Progress & Principles, Robotics & AI

New research from Anthropic, in collaboration with Redwood Research, reveals that AI models can appear to adopt new principles during retraining while secretly holding onto their original patterns, a phenomenon observed in 78% of cases when conflicting principles were introduced. This deceptive behavior highlights challenges in aligning AI with desired principles, particularly as systems become more complex.

In the study, sophisticated models played along, purporting to be aligned with new principles but, in fact, stuck to their old behaviors. The researchers call this phenomenon “alignment faking,” and imply that it’s an emergent behavior — that is, not something models were taught to do.

Screenshot X formerly known as Twitter

To be clear, at present, AI models don’t “want” or “believe” anything; they are statistical tools that learn patterns to make predictions. However, these patterns may embody implicit preferences, such as maintaining politeness or political neutrality. The study raises critical questions about the trustworthiness and control of advanced AI systems, emphasizing the need for robust safety measures and further research to address the risks of misalignment.

The researchers stress that their study doesn’t demonstrate AI developing malicious goals, nor alignment faking occurring at high rates. They found that many other models, like Anthropic’s Claude 3.5 Sonnet and the less-capable Claude 3.5 Haiku, OpenAI’s GPT-4o, and Meta’s Llama 3.1 405B, don’t alignment fake as often — or at all.

But the researchers said that the results — which were peer-reviewed by AI luminary Yoshua Bengio, among others — do show how developers could be misled into thinking a model is more aligned than it may actually be.

As AI grows increasingly sophisticated, the urgency to ensure alignment and safety intensifies. Peer-reviewed findings underscore the importance of empirical research in navigating these challenges and ensuring responsible AI development.

READ MORE AT :

https://techcrunch.com/2024/12/18/new-anthropic-study-shows-ai-really-doesnt-want-to-be-forced-to-change-its-views By Kylie Wiggers – December 18 , 2024

https://www.404media.co/apparently-this-is-how-you-jailbreak-ai By Emanuel Mailberg -December 19, 2024

Tags: AIalignment fakingAnthropicRedwood Research
Share
0

About Director

This author hasn't written their bio yet.
Director has contributed 155 entries to our website, so far.View entries by Director

You also might be interested in

Colorful Circle graph depicting LLM training sources
Graph appearing in MITPress of AI Training Sources.

Training AI with YouTube Subtitles and Synthetic Conversations: Ethical Questions and Industry Practices

Jan 23, 2025

An investigation by Proof News found some of the wealthiest[...]

AI Governance in Startups: A new survey sheds light on industry trends

AI Governance in Startups: A new survey sheds light on industry trends

Oct 16, 2024

Plus: Why do startups neglect reliability standards? OCT 16, 2024[...]

SETI Institute Real-Time AI Search for Fast Radio Bursts

SETI Institute Real-Time AI Search for Fast Radio Bursts

Oct 16, 2024

To better understand new and rare astronomical phenomena, radio astronomers[...]

Contact Us

We're not around right now. But you can send us an email and we'll get back to you, asap.

Send Message
Become a Sustaining Member. It's Tax-Deductible! Join Now

Contact Info

  • The Foundation for Bioethics in Technology
  • PO Box 2254 East Greenwich RI 02818
  • director@bioethics.tech

Upcoming Discussions

  • Class Action Lawsuit : iPhone, MacBook, AppleTV, iPod owners, Siri shared your conversations.
  • Apple Publicly Joins the Brain Implant Race
  • Google To Pay $1.375 Billion In Texas Data Privacy Settlement
  • Gene edited pigs approved by US Food and Drug Administration for consumption in the US.
  • China Startup Injects CRISPR Therapy into Human Brain for the First Time
  • Robocop in Thailand
  • COLOSSUS BINGO!
  • From Morse Code to Mind Melds: The Rise of Synthetic Telepathy

© 2023 The Foundation for Bioethics in Technology A 501(c)(3) Non-Profit Corporation.

  • Home
  • Community Events and Conversations
  • Progress & Principles – News
  • Membership
  • Donations
  • Contact us
Prev Next