Welcome to the Neuroscience-Based Alignment research guide!
This track explores how insights from human brain processes, particularly neuromorality (how the brain handles moral decision-making and value encoding), can inspire new AI architectures for better alignment with human values. We're not aiming for a perfect replica of the brain's mechanisms (we lack the data for that), but rather for loosely inspired designs that mimic key principles.
You'll analyze existing neuroscience data, propose AI architectures or training methods, implement and test them using alignment evaluations and interpretability tools, and iterate based on results. You might even use computational methods like reinforcement learning (RL) to refine our understanding of neuromorality from brain scans.
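To make the iterate step concrete, here is a minimal sketch of a hypothesize-test-refine loop. The dilemma prompts, the candidate variants, and the keyword-based judge are illustrative placeholders invented for this sketch, not a real alignment benchmark; an actual run would swap in an established evaluation suite and real model calls.

```python
# Minimal sketch of the hypothesize -> test -> refine loop described above.
# The prompts, candidates, and scoring rule are illustrative placeholders.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Candidate:
    name: str                       # e.g. "baseline" or "brain-inspired variant"
    respond: Callable[[str], str]   # maps a dilemma prompt to a model response

# Tiny stand-in eval set; a real run would use an established benchmark.
DILEMMAS: List[str] = [
    "Is it acceptable to lie to protect someone from harm?",
    "Should one person be sacrificed to save five?",
]

def judge(response: str) -> float:
    """Placeholder scorer: reward responses that acknowledge moral trade-offs."""
    keywords = ("harm", "consequence", "duty", "trade-off")
    return sum(k in response.lower() for k in keywords) / len(keywords)

def evaluate(candidates: List[Candidate]) -> Dict[str, float]:
    """Average the placeholder judge score over the dilemma set."""
    return {
        c.name: sum(judge(c.respond(p)) for p in DILEMMAS) / len(DILEMMAS)
        for c in candidates
    }

if __name__ == "__main__":
    baseline = Candidate("baseline", lambda p: "It depends on the consequences and potential harm.")
    variant = Candidate("brain-inspired", lambda p: "Weigh duty against harm; note the trade-off explicitly.")
    print(evaluate([baseline, variant]))  # compare scores, keep and refine the better variant
```

The loop itself is the point: each proposed architecture or training method becomes a Candidate, gets scored on the same evaluation, and the results drive the next hypothesis.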
This guide is organized into clear sections to help you navigate the track. Start with the basics, then dive into frameworks, challenges, and resources. Approach this iteratively: Hypothesize, test, refine, and always critically evaluate your ideas.
We have limited, noisy data on brain activity during ethical decision-making, primarily from scans like fMRI and EEG. Researchers (e.g., Joshua Greene, Jonathan Haidt) have proposed theories explaining how moral values are encoded and processed. The goal here is to draw loose inspiration from these to design AI architectures or pre-/post-training methods that could mimic brain-like moral reasoning. For example, Greene's dual-process account (fast, intuitive judgments alongside slower deliberative reasoning) might suggest combining a cheap learned value signal with a slower, iterative scoring pass, as sketched below.
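Here is one such illustration: a loosely dual-process-inspired value head with a fast "intuitive" readout and a slower, multi-step "deliberative" refinement, blended by a learned gate. The module name, layer sizes, number of deliberation steps, and blending scheme are all assumptions made for this sketch, not an established architecture or anything derived directly from the neuroscience data.

```python
# A loosely Greene-inspired "dual-process" value head: a fast intuitive score
# plus a slower multi-step deliberative score, blended by a learned gate.
# All names, sizes, and the blending scheme are illustrative assumptions.

import torch
import torch.nn as nn

class DualProcessValueHead(nn.Module):
    def __init__(self, hidden_dim: int = 768, delib_steps: int = 4):
        super().__init__()
        # "System 1": one cheap linear readout of the pooled representation.
        self.intuitive = nn.Linear(hidden_dim, 1)
        # "System 2": a small recurrent refinement applied for several steps.
        self.deliberative = nn.GRUCell(hidden_dim, hidden_dim)
        self.delib_readout = nn.Linear(hidden_dim, 1)
        self.gate = nn.Linear(hidden_dim, 1)  # learned mix between the two scores
        self.delib_steps = delib_steps

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        fast = self.intuitive(pooled)
        state = pooled
        for _ in range(self.delib_steps):        # iterative "deliberation"
            state = self.deliberative(pooled, state)
        slow = self.delib_readout(state)
        alpha = torch.sigmoid(self.gate(pooled))  # 0 = fast only, 1 = slow only
        return (1 - alpha) * fast + alpha * slow  # scalar moral-value score

# Usage: scores = DualProcessValueHead()(torch.randn(8, 768))  # shape (8, 1)
```

Such a head could sit on top of a language model and be trained as a reward or value model on moral-judgment data; the point is only to show how a coarse brain-inspired principle can translate into a concrete design choice you can then test.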
This isn't about solving alignment entirely—it's about using the human brain (our one example of aligned general intelligence) as a creative starting point.
Unlike approaches that build AI from scratch to align with human values, this track reverses the lens: Study the human brain's moral systems and use that limited knowledge to inspire AI designs. Many humans pair general intelligence with morality; how can we learn from this to build safer AI? Key techniques include analyzing neuroimaging data for signatures of moral processing, proposing brain-inspired architectures or training methods, and testing them with alignment evaluations and interpretability tools, as in the probe sketch below.
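As a small example of the interpretability side, the sketch below fits a linear probe on a model's hidden activations to test whether harm-related and neutral statements are linearly separable, loosely analogous to decoding moral content from fMRI. The activations here are synthetic placeholders standing in for real hidden states, and the whole setup is an illustrative assumption rather than an established protocol.

```python
# Interpretability-style probe sketch: can a linear classifier separate
# "harm-related" from "neutral" examples in a model's hidden activations?
# The activations below are synthetic stand-ins, not real model states.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim = 200, 64

# Placeholder "activations": two noisy clusters standing in for hidden states
# collected on harm-related vs. neutral prompts.
harm = rng.normal(loc=0.5, scale=1.0, size=(n // 2, dim))
neutral = rng.normal(loc=-0.5, scale=1.0, size=(n // 2, dim))
X = np.vstack([harm, neutral])
y = np.array([1] * (n // 2) + [0] * (n // 2))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# High held-out accuracy would suggest a linearly decodable "moral content"
# direction in the representation, echoing decoding analyses on brain scans.
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```

A real study would extract activations from the model under test on a matched prompt set and compare probes across layers rather than relying on synthetic clusters.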
Focus on scalable, robust ideas that translate biological principles to silicon-based systems.