alignmentforum.org

AI Alignment Forum

Latest posts

Last updated about 12 hours ago

Defining Monitorable and Useful Goals

about 12 hours ago

Published on July 15, 2025 11:06 PM GMT. In my most recent post...

Principles for Picking Practical Interpretability Projects

about 17 hours ago

Published on July 15, 2025 5:38 PM GMT. Thanks to Neel Nanda and...

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

about 18 hours ago

Published on July 15, 2025 4:23 PM GMT. Twitter | Paper PDF. Seven years...

Recent Redwood Research project proposals

1 day ago

Published on July 14, 2025 10:27 PM GMT. Previously, we've shared a few...

Narrow Misalignment is Hard, Emergent Misalignment is Easy

1 day ago

Published on July 14, 2025 9:05 PM GMT. Anna and Ed are co-first...

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance

2 days ago

Published on July 14, 2025 2:52 PM GMT. This is a write-up of...

Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings

3 days ago

Published on July 13, 2025 7:54 PM GMT. This post is a companion...

Linkpost: Guide to Redwood's writing

6 days ago

Published on July 10, 2025 6:39 PM GMT. I wrote a guide to...

The bitter lesson of misuse detection

6 days ago

Published on July 10, 2025 2:50 PM GMT. TL;DR: We wanted to benchmark...

Evaluating and monitoring for AI scheming

6 days ago

Published on July 10, 2025 2:24 PM GMT. As AI models become more...

White Box Control at UK AISI - Update on Sandbagging Investigations

6 days ago

Published on July 10, 2025 1:37 PM GMT. Jordan Taylor*, Connor Kissane*, Sid...

What's worse, spies or schemers?

7 days ago

Published on July 9, 2025 2:37 PM GMT. Here are two problems you’ll...
