Predicting AI bias using SAEs
Published:
A comparative analysis of how Sparse Autoencoder (SAE) features and MLP activations encode gender, using linear probing and statistical analysis. Trained the SAE from scratch on Tiny-Stories-21M and used a synthetic gender-annotated dataset to capture the SAE and MLP activations at inference time.
Tech Stack: Python, TransformerLens, SAELens, PyTorch
Detailed report and code here.
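Illustrative sketch of the linear-probing step (not the project's code): a logistic-regression probe is fit on activation vectors to predict a binary gender label, and its held-out accuracy indicates how linearly decodable the label is. Everything here is an assumption made for illustration: the activations are synthetic Gaussian vectors rather than captured SAE or MLP activations, and the names `make_activations`, `train_probe`, `DIM`, and `SIGNAL_DIM` are hypothetical. It uses only the standard library so that it runs as-is; the actual project would more likely use PyTorch tensors.

```python
import math
import random


def make_activations(n, dim, signal_dim, rng):
    """Synthetic stand-in for captured activations (assumption, not real data).

    Each vector is unit Gaussian noise; the label shifts one coordinate by +/-2.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[signal_dim] += 2.0 if label == 1 else -2.0
        data.append((vec, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, dim, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b)
            err = p - label  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                w[i] -= lr * err * vec[i]
            b -= lr * err
    return w, b


def accuracy(w, b, data):
    """Fraction of examples where the sign of the logit matches the label."""
    correct = sum(
        (sum(wi * xi for wi, xi in zip(w, vec)) + b > 0) == (label == 1)
        for vec, label in data
    )
    return correct / len(data)


rng = random.Random(0)  # fixed seed so the run is reproducible
DIM, SIGNAL_DIM = 8, 3  # hypothetical sizes, far smaller than real activations

train = make_activations(400, DIM, SIGNAL_DIM, rng)
held_out = make_activations(200, DIM, SIGNAL_DIM, rng)

w, b = train_probe(train, DIM)
probe_accuracy = accuracy(w, b, held_out)
top_dim = max(range(DIM), key=lambda i: abs(w[i]))

print(f"held-out probe accuracy: {probe_accuracy:.2f}")
print(f"largest-weight dimension: {top_dim}")
```

Running the same probe on two activation sources (for example SAE features and raw MLP activations) and comparing held-out accuracy is one way to frame the comparison described above; inspecting the largest probe weights hints at which dimensions carry the signal.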