alignmentforum.org

AI Alignment Forum

Latest posts

Last updated 1 day ago

My research agenda and work

1 day ago

This is a summary of the work I've done and work I...

Announcing the ARC White-Box Estimation Challenge

4 days ago

ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation...

Testing Gemini models for scheming tendencies

8 days ago

As AI models become increasingly capable and autonomous, keeping them safely aligned...

Advice for making robust-to-training model organisms

9 days ago

We’d like to develop training techniques that work when applied to future...

Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming

10 days ago

Behavioral evaluations may become worthless, which we think would be a disaster...

Full automation of AI R&D probably yields a large speed up even without a software-only singularity

10 days ago

This is a somewhat technical note. By "software-only singularity", I mean that,...

Looking for backdoors in Jane Street LLMs

15 days ago

I am going to talk about my experience in the Jane Street...

The Case for Evaluating Model Behaviors

17 days ago

Most evaluations of AI systems focus on their capabilities: how good they...

Risk reports need to address deployment-time spread of misalignment

22 days ago

Risk reports commonly use pre-deployment alignment assessments to measure misalignment risk from...

Mechanistic estimation for expectations of random products

22 days ago

We have developed some relatively general methods for mechanistic estimation competitive with...

The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

23 days ago

1) The safe-to-dangerous shift is a fundamental problem for eval realism. Suppose we...

Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)

26 days ago

1.1 Tl;dr: Alignment is often conceptualized as AIs helping humans achieve their goals...