Predicting AI bias using SAEs

Published: May 12, 2025

A comparative analysis of Sparse Autoencoder (SAE) features and MLP activations, using linear probing and statistical analysis to understand how each encodes gender. Trained the SAE from scratch on Tiny-Stories-21M and used a synthetic gender-annotated dataset to capture SAE and MLP activations at inference time.
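To illustrate what linear probing means here, the following is a minimal sketch and not the project's actual code. It assumes NumPy is available and uses randomly generated vectors as a hypothetical stand-in for captured activations, with a binary label shifting one direction of the activation space. The function names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) and all sizes are illustrative assumptions. A probe that reaches high held-out accuracy suggests the label is linearly decodable from the representation.

```python
import numpy as np


def make_synthetic_activations(n=400, dim=16, seed=0):
    """Hypothetical stand-in for captured activations (not real model data).

    Samples with label 1 are shifted along the first dimension, so the
    label is linearly decodable by construction.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, dim))
    acts[:, 0] += 3.0 * labels
    return acts, labels


def train_linear_probe(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(label = 1)
        err = p - y                             # gradient of the log loss w.r.t. the logits
        w -= lr * (x.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


def probe_accuracy(x, y, w, b):
    """Fraction of samples whose predicted label matches the true label."""
    preds = ((x @ w + b) > 0).astype(int)
    return float((preds == y).mean())


acts, labels = make_synthetic_activations()
split = 300  # first 300 samples train the probe, the rest are held out
w, b = train_linear_probe(acts[:split], labels[:split])
acc = probe_accuracy(acts[split:], labels[split:], w, b)
print(f"held-out probe accuracy: {acc:.2f}")
```

Running the same probe on two representations of the same inputs (for example SAE features and raw MLP activations) and comparing held-out accuracy is one simple way to compare how linearly accessible a concept is in each.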

Tech Stack: Python, TransformerLens, SAELens, PyTorch

Detailed report and code here.


Pratik Doshi
