Predicting AI bias using SAEs

Published:

A comparative analysis of Sparse Autoencoder (SAE) features and MLP activations, using linear probing and statistical analysis to understand how each encodes gender. Trained the SAE from scratch on Tiny-Stories-21M and used a synthetic gender-annotated dataset to capture both sets of activations at inference time.
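To illustrate the linear-probing idea, here is a minimal sketch and not the project's actual code. It assumes activations have already been captured as a matrix with one row per example and a binary gender label per row. The data is synthetic (random vectors with a label-dependent shift along one direction), and the names `train_linear_probe` and `probe_accuracy` are hypothetical helpers written for this example. It uses NumPy only, so it stands apart from the TransformerLens / SAELens pipeline.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression probe with plain gradient descent.

    X: (n_examples, n_features) activation matrix.
    y: (n_examples,) binary labels as floats (0.0 or 1.0).
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        logits = X @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
        grad = probs - y                       # d(loss)/d(logits) for cross-entropy
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples whose label the probe predicts correctly."""
    preds = (X @ w + b) > 0
    return float((preds == y.astype(bool)).mean())


# Synthetic stand-in for captured activations (assumption: real activations
# would come from the model's MLP layer or from the SAE's feature layer).
rng = np.random.default_rng(0)
n, d = 400, 16
y = rng.integers(0, 2, size=n).astype(float)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
# Shift each example along one direction depending on its label.
X = rng.normal(size=(n, d)) + 2.0 * np.outer(2.0 * y - 1.0, direction)

# Train on the first 300 examples, evaluate on the held-out 100.
split = 300
w, b = train_linear_probe(X[:split], y[:split])
acc = probe_accuracy(X[split:], y[split:], w, b)
print(f"held-out probe accuracy: {acc:.2f}")
```

Running the same probe once on MLP activations and once on SAE feature activations, then comparing held-out accuracy, is one way to compare how linearly accessible the gender signal is in each representation.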

Tech Stack: Python, TransformerLens, SAELens, PyTorch

Detailed report and code here.