Welcome to the Improved Preference Optimization research guide!
This guide is designed to help you develop innovative methods for aligning AI preferences with human values. It's divided into two phases: Ideation (brainstorming and conceptualizing) and Implementation & Evaluation (building, testing, and refining).
We'll use a prompt-led approach: answer the questions in each phase one at a time, keeping your responses concise to build ideas step by step. This structured questioning helps foster creativity while staying focused.
At the end, you'll find general resources to support your work.
In this phase, focus on ideation: Explore internal model signals that could reflect AI preferences, brainstorm ways to interpret and intervene on them, and anticipate potential pitfalls. Answer the questions sequentially to build a solid conceptual foundation.
What is the most promising internal model signal you know (e.g., neuron activations, feature vectors, latent spaces)?
(Start here: Identify a key signal from your knowledge of AI models.)
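To make "internal model signal" concrete, here is a minimal numpy sketch of capturing a hidden-layer activation during a forward pass. The two-layer MLP is a toy stand-in, not any particular model; with a real transformer you would typically register a forward hook on a layer instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP standing in for a real model.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def forward(x, capture):
    """Run the toy model, stashing the hidden activation in `capture`."""
    h = np.tanh(x @ W1)          # hidden layer: the "internal signal"
    capture["hidden"] = h
    return h @ W2

capture = {}
logits = forward(rng.normal(size=(8,)), capture)
print(capture["hidden"].shape)   # the cached signal you would then analyze
```

The captured vector is what downstream questions (noise, causality, intervention) operate on.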
How could that signal plausibly encode or reflect preferences rather than just outputs?
(Explain the link: Why might this signal capture underlying preferences?)
What existing evidence or literature supports that interpretation?
(Back it up: Cite studies or examples that align with your idea.)
Where could that signal be noisy or misleading? Name at least 2 failure cases.
(Be critical: Highlight risks like noise or misinterpretation.)
How could you establish causality between this signal and preference?
(Prove the connection: Suggest experiments or analyses to show cause-and-effect.)
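One standard causal test is activation patching: cache the signal from one input, splice it into the forward pass of another, and see whether the output moves accordingly. This is a toy numpy sketch of that logic, using the same hypothetical 2-layer MLP rather than a real model.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def forward(x, patch=None):
    h = np.tanh(x @ W1)
    if patch is not None:
        h = patch              # overwrite the signal with a cached activation
    return h @ W2

x_pref, x_base = rng.normal(size=(8,)), rng.normal(size=(8,))
h_pref = np.tanh(x_pref @ W1)  # cache the activation from the "preferred" run
out_base = forward(x_base)
out_patched = forward(x_base, patch=h_pref)

# If the hidden signal carries the preference, patching it should change
# the base run's output.
effect = np.linalg.norm(out_patched - out_base)
print(effect > 0)
```

In a real experiment you would patch only the candidate components (a layer, head, or subspace) and measure the shift on a preference-sensitive metric, not the raw output norm.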
How might this causal story be flawed or falsified?
(Stress-test it: Propose at least two experiments whose results could falsify the signal-preference link.)
What alternative signals might be less noisy or more causally linked?
(Explore options: Compare and contrast with other potential signals.)
If you could only collect one additional measurement alongside the signal, what would it be and why?
(Enhance your data: Choose a complementary metric and justify it.)
How might you intervene on that signal directly (loss shaping, targeted fine-tune, representation surgery, activation patching)?
(Get practical: Describe techniques to modify the signal.)
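Of the techniques listed, representation surgery is easy to sketch: add a "preference direction" to the hidden activation at inference time. The difference-in-means direction below is one crude, commonly used choice; the model and data are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def forward(x, steer=None, alpha=1.0):
    h = np.tanh(x @ W1)
    if steer is not None:
        h = h + alpha * steer  # add the steering direction at the hidden layer
    return h @ W2

# Hypothetical "preference direction": mean hidden-activation difference
# between preferred and dispreferred inputs.
pref = rng.normal(size=(20, 8))
dis = rng.normal(size=(20, 8))
direction = np.tanh(pref @ W1).mean(axis=0) - np.tanh(dis @ W1).mean(axis=0)

x = rng.normal(size=(8,))
delta = forward(x, steer=direction) - forward(x)
print(np.linalg.norm(delta) > 0)
```

The scalar `alpha` controls intervention strength, which matters when you later try to separate preference shifts from capability damage.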
How would you measure changes in the signal post-intervention?
(Track progress: Define ways to quantify shifts.)
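Two simple ways to quantify a post-intervention shift in the signal itself are directional change (cosine similarity) and magnitude change (L2 distance). A minimal sketch, assuming you have cached activation vectors from before and after the intervention:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(3)
h_before = rng.normal(size=(16,))                    # pre-intervention signal
h_after = h_before + 0.1 * rng.normal(size=(16,))    # post-intervention signal

sim = cosine(h_before, h_after)                # directional agreement
shift = float(np.linalg.norm(h_after - h_before))  # magnitude of the change
print(round(sim, 3), round(shift, 3))
```

Tracking both catches cases where the direction is preserved but the magnitude explodes, or vice versa.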
What preference dataset or task would give the clearest signal-preference mapping?
(Test effectively: Recommend datasets or tasks for validation.)
How could you minimise confounding from capability changes or reward hacking?
(Avoid traps: Strategies to isolate preference effects.)
What’s the simplest proof-of-concept experiment to test your idea fast?
(Start small: Outline a quick, low-resource test.)
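One low-resource proof of concept is a difference-in-means probe: cache activations on preferred vs. dispreferred examples, fit the probe as the gap between class means, and check it separates the classes. The sketch below uses synthetic activations with a planted direction as a stand-in for real model activations.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for cached activations: "preferred" examples are shifted
# along a direction unknown to the probe. Swap in real activations in practice.
true_dir = rng.normal(size=(16,))
true_dir /= np.linalg.norm(true_dir)
pref = rng.normal(size=(200, 16)) + 2.0 * true_dir
dis = rng.normal(size=(200, 16))

# Difference-in-means probe, classified at the midpoint threshold.
probe = pref.mean(axis=0) - dis.mean(axis=0)
probe /= np.linalg.norm(probe)

s_pref, s_dis = pref @ probe, dis @ probe
threshold = (s_pref.mean() + s_dis.mean()) / 2
acc = ((s_pref > threshold).mean() + (s_dis <= threshold).mean()) / 2
print(round(float(acc), 2))
```

If a probe this simple already separates the classes well above chance, the signal is worth the heavier causal experiments; if not, revisit the signal choice before investing in interventions.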