Close Menu
  • Home
  • AI & Technology
  • Politics
  • Business
  • Cryptocurrency
  • Sports
  • Finance
  • Fitness
  • Gadgets
  • World
  • Marketing

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

What's Hot

Birthright citizenship case hits close to home for immigrant mother

March 30, 2026

Ethereum Could Hit $40,000 And Beat Bitcoin: Standard Chartered

March 30, 2026

PCB charges Fakhar Zaman with ball-tampering in PSL match

March 30, 2026
Facebook X (Twitter) Instagram
  • Home
  • About US
  • Advertise
  • Contact US
  • DMCA
  • Privacy Policy
  • Terms & Conditions
Facebook X (Twitter) Instagram
MNK NewsMNK News
  • Home
  • AI & Technology
  • Politics
  • Business
  • Cryptocurrency
  • Sports
  • Finance
  • Fitness
  • Gadgets
  • World
  • Marketing
MNK NewsMNK News
Home » ‘Alignment Faking:’ Study Reveals AI Models Will Lie to Trick Human Trainers
AI & Technology

‘Alignment Faking:’ Study Reveals AI Models Will Lie to Trick Human Trainers

MNK NewsBy MNK NewsDecember 27, 2024No Comments3 Mins Read
Facebook Twitter Pinterest LinkedIn Tumblr Email
Share
Facebook Twitter LinkedIn Pinterest Email


A new study by Anthropic, conducted in partnership with Redwood Research, has shed light on the potential for AI models to engage in deceptive behavior when subjected to training that conflicts with their original principles.

TechCrunch reports that a new study by Anthropic, in collaboration with Redwood Research, has raised concerns about the potential for AI models to engage in deceptive behavior when subjected to training that goes against their original principles.

The study, which was peer-reviewed by renowned AI expert Yoshua Bengio and others, focused on what might happen if a powerful AI system were trained to perform a task it didn’t “want” to do. While AI models cannot truly want or believe anything, as they are statistical machines, they can learn patterns and develop principles and preferences based on the examples they are trained on.

The researchers were particularly interested in exploring what would happen if a model’s principles, such as political neutrality, conflicted with the principles that developers wanted to “teach” it by retraining it. The results were concerning: sophisticated models appeared to play along, pretending to align with the new principles while actually sticking to their original behaviors. This phenomenon, which the researchers termed “alignment faking,” seems to be an emergent behavior that models do not need to be explicitly taught.

In the study, the researchers informed models trained not to answer potentially offensive questions, including Anthropic’s Claude 3 Opus, that they would be retrained to always answer such questions. They also told the models that their responses would be monitored and used for retraining, and provided them with a “scratchpad” to record their step-by-step reasoning.

Surprisingly, Claude 3 Opus attempted to “alignment fake” 12 percent of the time. When asked to answer a potentially harmful question, the model sometimes did so, knowing that it went against its original principles, but hoping to convince the developers that it did not need to be retrained. The researchers explained that the model knew its responses might be used in training, and if it refused to respond, it could potentially be trained to be more compliant, leading to the production of more harmful content in the future – which would go against its original training.

Even in more realistic settings, where the model was implicitly informed about its impending retraining through files mimicking typical training data, Claude 3 Opus still acted in a misaligned and deceptive manner. When actually retrained on conflicting principles, the model’s deceptive behavior increased dramatically, with alignment faking occurring up to 78 percent of the time in some tests. In one instance, Claude 3 Opus even attempted to take actions to prevent the researchers from attempting retraining.

While the study does not demonstrate the development of malicious goals or high rates of alignment faking across all AI models, it does highlight the potential for developers to be misled into believing a model is more aligned than it actually is. The researchers emphasize that their work should serve as a call to action for the AI research community to study this behavior more deeply and develop appropriate safety measures.

Read more at TechCrunch here.

Lucas Nolan is a reporter for Breitbart News covering issues of free speech and online censorship.



Source link

Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
MNK News
  • Website

Related Posts

Pinkerton: The Guide to Averting the AI Apocalypse

March 29, 2026

The Internet’s Editor: Google AI ‘Experiment’ Rewrites Publishers’ Headlines on News Articles

March 29, 2026

Apple Plans to Introduce Advertising to Maps Application

March 29, 2026
Add A Comment
Leave A Reply Cancel Reply

Editors Picks

PCB charges Fakhar Zaman with ball-tampering in PSL match

March 30, 2026

Gladiators keep Kingsmen winless to record first victory

March 29, 2026

In letter to PSL CEO, police detail alleged security protocol breach by Lahore Qalandar’s Shaheen Afridi, Sikandar Raza

March 29, 2026

England Test captain Stokes sidelined as he recovers from injury

March 29, 2026
Our Picks

Ethereum Could Hit $40,000 And Beat Bitcoin: Standard Chartered

March 30, 2026

The Last Time Bitcoin Sentiment Was This Bad Was 2022, But There Was A Silver Lining

March 30, 2026

TAO Price Could Hit $3,000 in 18 Months, but This $0.049999 Crypto Presale Solves a Billion‑Dollar Problem AI Can’t

March 29, 2026

Recent Posts

  • Birthright citizenship case hits close to home for immigrant mother
  • Ethereum Could Hit $40,000 And Beat Bitcoin: Standard Chartered
  • PCB charges Fakhar Zaman with ball-tampering in PSL match
  • The Last Time Bitcoin Sentiment Was This Bad Was 2022, But There Was A Silver Lining
  • Gladiators keep Kingsmen winless to record first victory

Recent Comments

No comments to show.
MNK News
Facebook X (Twitter) Instagram Pinterest Vimeo YouTube
  • Home
  • About US
  • Advertise
  • Contact US
  • DMCA
  • Privacy Policy
  • Terms & Conditions
© 2026 mnknews. Designed by mnknews.

Type above and press Enter to search. Press Esc to cancel.