Welcome to the Neuroscience-Based Alignment research guide!
This track explores how insights from human brain processes, particularly neuromorality (how the brain handles moral decision-making and value encoding), can inspire new AI architectures for better alignment with human values. We're not aiming for a perfect replica of the brain's mechanisms (we lack the data for that), but rather for loosely inspired designs that mimic key principles.
You'll analyze existing neuroscience data, propose AI architectures or training methods, implement and test them using alignment evaluations and interpretability tools, and iterate based on results. You might even use computational methods like reinforcement learning (RL) to refine our understanding of neuromorality from brain scans.
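To make the iterate step concrete, here is a minimal sketch of a hypothesize-test-refine loop. The dilemma prompts, the candidate variants, and the keyword-based judge are illustrative placeholders invented for this sketch, not a real alignment benchmark; an actual run would swap in an established evaluation suite and real model calls.

```python
# Minimal sketch of the hypothesize -> test -> refine loop described above.
# The prompts, candidates, and scoring rule are illustrative placeholders.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Candidate:
    name: str                       # e.g. "baseline" or "brain-inspired variant"
    respond: Callable[[str], str]   # maps a dilemma prompt to a model response

# Tiny stand-in eval set; a real run would use an established benchmark.
DILEMMAS: List[str] = [
    "Is it acceptable to lie to protect someone from harm?",
    "Should one person be sacrificed to save five?",
]

def judge(response: str) -> float:
    """Placeholder scorer: reward responses that acknowledge moral trade-offs."""
    keywords = ("harm", "consequence", "duty", "trade-off")
    return sum(k in response.lower() for k in keywords) / len(keywords)

def evaluate(candidates: List[Candidate]) -> Dict[str, float]:
    """Average the placeholder judge score over the dilemma set."""
    return {
        c.name: sum(judge(c.respond(p)) for p in DILEMMAS) / len(DILEMMAS)
        for c in candidates
    }

if __name__ == "__main__":
    baseline = Candidate("baseline", lambda p: "It depends on the consequences and potential harm.")
    variant = Candidate("brain-inspired", lambda p: "Weigh duty against harm; note the trade-off explicitly.")
    print(evaluate([baseline, variant]))  # compare scores, keep and refine the better variant
```

The loop itself is the point: each proposed architecture or training method becomes a Candidate, gets scored on the same evaluation, and the results drive the next hypothesis.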
This guide is organized into clear sections to help you navigate the track. Start with the basics, then dive into frameworks, challenges, and resources. Approach this iteratively: Hypothesize, test, refine, and always critically evaluate your ideas.
We have limited, noisy data on brain activity during ethical decision-making, primarily from scans like fMRI and EEG. Researchers (e.g., Joshua Greene, Jonathan Haidt) have proposed theories explaining how moral values are encoded and processed. The goal here is to draw loose inspiration from these to design AI architectures or pre-/post-training methods that could mimic brain-like moral reasoning. For example, Greene's dual-process account (fast, intuitive judgments alongside slower deliberative reasoning) might suggest combining a cheap learned value signal with a slower, iterative scoring pass, as sketched below.
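Here is one such illustration: a loosely dual-process-inspired value head with a fast "intuitive" readout and a slower, multi-step "deliberative" refinement, blended by a learned gate. The module name, layer sizes, number of deliberation steps, and blending scheme are all assumptions made for this sketch, not an established architecture or anything derived directly from the neuroscience data.

```python
# A loosely Greene-inspired "dual-process" value head: a fast intuitive score
# plus a slower multi-step deliberative score, blended by a learned gate.
# All names, sizes, and the blending scheme are illustrative assumptions.

import torch
import torch.nn as nn

class DualProcessValueHead(nn.Module):
    def __init__(self, hidden_dim: int = 768, delib_steps: int = 4):
        super().__init__()
        # "System 1": one cheap linear readout of the pooled representation.
        self.intuitive = nn.Linear(hidden_dim, 1)
        # "System 2": a small recurrent refinement applied for several steps.
        self.deliberative = nn.GRUCell(hidden_dim, hidden_dim)
        self.delib_readout = nn.Linear(hidden_dim, 1)
        self.gate = nn.Linear(hidden_dim, 1)  # learned mix between the two scores
        self.delib_steps = delib_steps

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        fast = self.intuitive(pooled)
        state = pooled
        for _ in range(self.delib_steps):        # iterative "deliberation"
            state = self.deliberative(pooled, state)
        slow = self.delib_readout(state)
        alpha = torch.sigmoid(self.gate(pooled))  # 0 = fast only, 1 = slow only
        return (1 - alpha) * fast + alpha * slow  # scalar moral-value score

# Usage: scores = DualProcessValueHead()(torch.randn(8, 768))  # shape (8, 1)
```

Such a head could sit on top of a language model and be trained as a reward or value model on moral-judgment data; the point is only to show how a coarse brain-inspired principle can translate into a concrete design choice you can then test.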
This isn't about solving alignment entirely—it's about using the human brain (our one example of aligned general intelligence) as a creative starting point.
Unlike approaches that build AI from scratch to align with human values, this track reverses the lens: Study the human brain's moral systems and use that limited knowledge to inspire AI designs. Many humans pair general intelligence with morality; how can we learn from this to build safer AI? Key techniques include analyzing neuroimaging data for signatures of moral processing, proposing brain-inspired architectures or training methods, and testing them with alignment evaluations and interpretability tools, as in the probe sketch below.
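As a small example of the interpretability side, the sketch below fits a linear probe on a model's hidden activations to test whether harm-related and neutral statements are linearly separable, loosely analogous to decoding moral content from fMRI. The activations here are synthetic placeholders standing in for real hidden states, and the whole setup is an illustrative assumption rather than an established protocol.

```python
# Interpretability-style probe sketch: can a linear classifier separate
# "harm-related" from "neutral" examples in a model's hidden activations?
# The activations below are synthetic stand-ins, not real model states.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim = 200, 64

# Placeholder "activations": two noisy clusters standing in for hidden states
# collected on harm-related vs. neutral prompts.
harm = rng.normal(loc=0.5, scale=1.0, size=(n // 2, dim))
neutral = rng.normal(loc=-0.5, scale=1.0, size=(n // 2, dim))
X = np.vstack([harm, neutral])
y = np.array([1] * (n // 2) + [0] * (n // 2))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# High held-out accuracy would suggest a linearly decodable "moral content"
# direction in the representation, echoing decoding analyses on brain scans.
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```

A real study would extract activations from the model under test on a matched prompt set and compare probes across layers rather than relying on synthetic clusters.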
Focus on scalable, robust ideas that translate biological principles to silicon-based systems.