Mini Transformer Lab

From-scratch transformer implementation for architectural exploration

Focus: Research · Scope: Prototype · Stack: PyTorch, Python · GitHub →

TL;DR

Built a minimal transformer from scratch to study architectural trade-offs through controlled ablations. The setup enables systematic experiments on depth, width, attention heads, and training dynamics.

Why this matters

High-level APIs hide architectural trade-offs. Understanding what breaks training and what scales requires controllable experiments.

Problem

Off-the-shelf transformer implementations make it hard to isolate the effect of a single design choice. I wanted a minimal, controllable environment where design choices could be tested one change at a time.

Constraints

Limited compute: experiments must run on consumer hardware. Reproducibility: all ablations must be deterministic and comparable. Educational clarity: code must be readable and well-documented.
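The reproducibility constraint comes down to pinning every RNG the training loop touches. A minimal sketch using standard PyTorch calls (the helper name `set_determinism` is illustrative, not the repo's actual API):

```python
import random

import numpy as np
import torch


def set_determinism(seed: int = 0) -> None:
    """Seed all RNGs so repeated runs of an ablation are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade a little speed for bit-exact repeatability on GPU.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_determinism(42)
a = torch.randn(3)
set_determinism(42)
b = torch.randn(3)
assert torch.equal(a, b)  # identical draws after reseeding
```

On consumer hardware the cuDNN flags cost little; the bigger win is that any two runs of the same config are directly diffable.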

System design

I built a minimal transformer with modular attention and FFN blocks. Ablations are config-driven: depth, width, and number of attention heads are all set through config files, so a variant is one config diff away. Training includes stability tools: gradient norm tracking, loss curve monitoring, and sanity checks. Evaluation uses controlled comparisons under fixed seeds and configs so results are reproducible.
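The modular, config-driven design can be sketched roughly as follows. The config fields and class names (`ModelConfig`, `Block`, `MiniTransformer`) are illustrative stand-ins for the repo's actual API, but the structure matches the description: each block is self-attention plus FFN, and the ablation knobs live in one config object:

```python
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class ModelConfig:
    # The ablation knobs: width, heads, depth.
    d_model: int = 128
    n_heads: int = 4
    n_layers: int = 2
    d_ff: int = 512
    dropout: float = 0.0


class Block(nn.Module):
    """One pre-norm transformer block: self-attention + FFN, both residual."""

    def __init__(self, cfg: ModelConfig):
        super().__init__()
        self.ln1 = nn.LayerNorm(cfg.d_model)
        self.attn = nn.MultiheadAttention(
            cfg.d_model, cfg.n_heads, dropout=cfg.dropout, batch_first=True
        )
        self.ln2 = nn.LayerNorm(cfg.d_model)
        self.ffn = nn.Sequential(
            nn.Linear(cfg.d_model, cfg.d_ff),
            nn.GELU(),
            nn.Linear(cfg.d_ff, cfg.d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x


class MiniTransformer(nn.Module):
    def __init__(self, cfg: ModelConfig):
        super().__init__()
        self.blocks = nn.ModuleList(Block(cfg) for _ in range(cfg.n_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x)
        return x


# A "depth" ablation is just a different config value.
cfg = ModelConfig(d_model=64, n_heads=2, n_layers=3)
model = MiniTransformer(cfg)
out = model(torch.randn(2, 10, 64))
assert out.shape == (2, 10, 64)
```

Because every variant is fully specified by a `ModelConfig`, two runs differ only in the config diff plus the seed, which is what makes the comparisons controlled.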

Evaluation

I use synthetic datasets plus standard benchmarks. Metrics I track: training stability, attention patterns, and gradient norms. Failure modes I watch for: training instability, attention collapse, and scaling bottlenecks. Regression strategy: all ablations are compared under fixed seeds and configs so results are reproducible.
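Gradient norm tracking, one of the stability signals mentioned above, needs only a small helper called after `backward()`. A sketch (the name `global_grad_norm` is illustrative):

```python
import torch


def global_grad_norm(model: torch.nn.Module) -> float:
    """Global L2 norm over all parameter gradients; spikes flag instability."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return total ** 0.5


# Usage inside a training step (toy model for illustration):
model = torch.nn.Linear(4, 1)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
print(global_grad_norm(model))  # log this per step alongside the loss
```

Logged per step, this curve makes divergence visible well before the loss explodes, and it is cheap enough to keep on for every ablation.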

Results

Clearer understanding of what breaks training, what scales, and how architecture choices change attention patterns.

Trade-offs & Lessons

Understanding architectural trade-offs directly informs system design choices. Modular design enabled rapid iteration and clean comparisons between variants. What didn't work: the initial models were over-parameterized, which slowed experiments. For building understanding, architectural simplicity mattered more than clever optimizations.

What I'd Improve Next

Add more systematic ablation studies across attention mechanisms. Implement gradient analysis tools for training dynamics. Explore alternative positional encoding strategies.
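For the positional-encoding experiments, a natural baseline to compare alternatives against is the fixed sinusoidal encoding from "Attention Is All You Need". A minimal sketch, assuming the standard formulation (an even `d_model`):

```python
import math

import torch


def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal positional encoding of shape (seq_len, d_model)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    # Frequencies decay geometrically across even dimension indices.
    div = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dims: sine
    pe[:, 1::2] = torch.cos(pos * div)  # odd dims: cosine
    return pe


pe = sinusoidal_pe(16, 64)
assert pe.shape == (16, 64)
```

Learned embeddings or rotary encodings can then be ablated against this fixed table under the same seeds and configs.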