projects
open source work — building in public from May 2026
gpt.cpp inference
GPT-2 124M forward pass implemented from scratch in C++ using Eigen. No training and no PyTorch: just raw matrix math, pretrained weights, and top-k sampling. pybind11 bindings expose the C++ forward pass to Python, which handles tokenization.
C++ Python Eigen pybind11
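The top-k sampling step at the end of the forward pass fits in a few lines of C++. What follows is a minimal sketch, not code from the gpt.cpp repo. The function name top_k_sample and the use of std::mt19937 are assumptions. It takes the logits the forward pass produces for the last position, keeps the k largest, applies a softmax over those k, and draws one token id.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: sample one token id from the last position's logits,
// restricted to the k highest-scoring tokens (top-k sampling).
int top_k_sample(const std::vector<float>& logits, int k, std::mt19937& rng) {
    assert(!logits.empty() && k >= 1);
    k = std::min(k, static_cast<int>(logits.size()));

    // Token ids 0..V-1, partially sorted so the first k have the largest logits.
    std::vector<int> ids(logits.size());
    std::iota(ids.begin(), ids.end(), 0);
    std::partial_sort(ids.begin(), ids.begin() + k, ids.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });

    // Softmax over the surviving k logits; subtracting the max avoids overflow.
    const float max_logit = logits[ids[0]];
    std::vector<double> weights(k);
    for (int i = 0; i < k; ++i)
        weights[i] = std::exp(static_cast<double>(logits[ids[i]] - max_logit));

    // discrete_distribution normalizes the weights itself, so no explicit division.
    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    return ids[pick(rng)];
}
```

With k = 1 this reduces to greedy (argmax) decoding. Larger k trades determinism for more varied output.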
github ↗
cuda-kernels gpu
CUDA kernels written from scratch — vector add, naive matmul, tiled matmul with shared memory. Each kernel profiled with Nsight Systems and Nsight Compute on an RTX 3050. Includes roofline analysis and the counter-intuitive L1 cache finding from tiled matmul.
CUDA C++ Nsight
github ↗
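The idea behind the shared-memory tiled matmul can be illustrated on the CPU in plain C++. This is a hypothetical analogue, not the repo's CUDA kernel. In the sketch, matmul_tiled copies TILE x TILE blocks of A and B into small local arrays that stand in for __shared__ memory. Comments mark where a real kernel would call __syncthreads(). The naive version is included as a reference to check the tiled result against.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major N x N matrix stored in a flat vector.
using Matrix = std::vector<float>;

// Naive version: each output element reads a full row of A and a full column
// of B (the CPU counterpart of one thread per C element).
Matrix matmul_naive(const Matrix& A, const Matrix& B, int N) {
    Matrix C(static_cast<std::size_t>(N) * N, 0.0f);
    for (int row = 0; row < N; ++row)
        for (int col = 0; col < N; ++col) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k) acc += A[row * N + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
    return C;
}

// Tiled version: each TILE x TILE block of A and B is staged in a small local
// buffer (standing in for __shared__ memory) and then reused TILE times.
template <int TILE>
Matrix matmul_tiled(const Matrix& A, const Matrix& B, int N) {
    assert(A.size() == static_cast<std::size_t>(N) * N && B.size() == A.size());
    Matrix C(A.size(), 0.0f);
    float As[TILE][TILE], Bs[TILE][TILE];
    for (int bi = 0; bi < N; bi += TILE)          // tile row of C (blockIdx.y)
        for (int bj = 0; bj < N; bj += TILE) {    // tile column of C (blockIdx.x)
            float acc[TILE][TILE] = {};           // one accumulator per "thread"
            for (int bk = 0; bk < N; bk += TILE) {
                // Load phase: copy one tile of A and one of B, zero-padding past N.
                for (int i = 0; i < TILE; ++i)
                    for (int j = 0; j < TILE; ++j) {
                        As[i][j] = (bi + i < N && bk + j < N) ? A[(bi + i) * N + bk + j] : 0.0f;
                        Bs[i][j] = (bk + i < N && bj + j < N) ? B[(bk + i) * N + bj + j] : 0.0f;
                    }
                // A GPU kernel would call __syncthreads() here...
                for (int i = 0; i < TILE; ++i)
                    for (int j = 0; j < TILE; ++j)
                        for (int k = 0; k < TILE; ++k) acc[i][j] += As[i][k] * Bs[k][j];
                // ...and again here, before the next tiles overwrite As/Bs.
            }
            // Write back only the entries that fall inside the N x N result.
            for (int i = 0; i < TILE; ++i)
                for (int j = 0; j < TILE; ++j)
                    if (bi + i < N && bj + j < N) C[(bi + i) * N + bj + j] = acc[i][j];
        }
    return C;
}
```

On a GPU the payoff is global memory traffic. Each element of A and B is fetched once per tile instead of once per output element, which cuts global loads by roughly a factor of TILE. That shift is what moves the kernel along the roofline's arithmetic-intensity axis.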