Improved Preference Optimization

Welcome to the Improved Preference Optimization research guide!

This track focuses on designing oversight methods that embed values deeply into AI systems while remaining robust and scalable. You'll work on creating better ways to measure AI alignment through novel evaluation frameworks that go beyond static datasets and capture the dynamic, complex nature of human preferences.

You'll develop new alignment evaluations (evals), test them against current AI systems, and explore how they can be gamed or improved. This involves both "blue team" work (creating robust evals) and "red team" work (finding vulnerabilities in existing approaches). Your work directly addresses the challenge of measuring progress toward aligned AI systems.

This guide covers fundamental challenges in preference optimization, provides frameworks for developing new evals, and explains how to test their robustness. Start with understanding why current approaches fall short, then dive into creating and stress-testing your own evaluation methods.

Core Challenges in Preference Optimization

The fundamental difficulty in preference optimization stems from several interconnected problems that make measuring AI alignment extraordinarily challenging.

We Don't Know Human Preferences Completely

Even as individuals, we can't articulate everything we want or predict how our preferences will evolve. We're often private about our desires, and our preferences are context-dependent, contradictory, and constantly changing. AI systems must somehow help us enact and protect preferences they can never fully understand.

We Don't Know Which Preferences to Measure

When evaluating AI alignment, what you choose to measure is partly arbitrary and depends on your interests and values. Current benchmarks often focus on easily measurable proxies rather than what we actually care about. The challenge is moving beyond static datasets toward dynamic evaluation of complex preference satisfaction.

We Can't Translate Preferences to Code Perfectly

Converting concepts like honesty, kindness, or fairness into mathematical reward functions or loss terms is fundamentally lossy. We struggle both to encode these preferences into models and to verify unambiguously when some code actually represents the preferences we intended.

We Can't Precisely Detect When Preferences Are Present

If an AI gives correct answers to 100 questions that unknowingly correlate with environments rewarding deception, we can't tell whether it will continue being honest or switch to deceptive behavior when circumstances change. Surface-level performance doesn't guarantee deep preference alignment.

We Don't Know Future AI Architectures

Current evaluation methods assume specific architectures (like LLMs), but future AI systems might be fundamentally different. Our preference optimization methods need to be robust across unknown future architectures and capabilities.

Track Description

This track takes two complementary approaches to improving preference optimization: