justbytecode
ML Engineer focused on LLM inference optimization and production systems.
• Built a 700M-parameter hybrid LLM (Mamba + Transformer) from scratch
• Achieved 2.4x faster CPU inference via speculative decoding
• Implemented FlashAttention-2 CUDA kernels (2.1x throughput)
• Contributor to vLLM
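The speculative decoding result above can be sketched in miniature: a cheap draft model proposes a few tokens, and the target model verifies them, keeping the longest accepted prefix. The toy `draft_propose` / `target_next` functions below are hypothetical stand-ins for real models, just to show the accept/reject loop.

```python
import random

def target_next(prefix):
    # Toy "target model" greedy next-token rule (stand-in for a real LM).
    return (prefix[-1] + 1) % 100

def draft_propose(prefix, k, rng):
    # Toy "draft model": usually agrees with the target, sometimes guesses wrong.
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 100 if rng.random() < 0.8 else rng.randrange(100)
        out.append(last)
    return out

def speculative_step(prefix, k, rng):
    # Accept proposed tokens only while they match the target's own choice.
    proposed, accepted, ctx = draft_propose(prefix, k, rng), [], list(prefix)
    for tok in proposed:
        if tok != target_next(ctx):
            break
        accepted.append(tok)
        ctx.append(tok)
    # On mismatch (or after accepting all k), the target emits one token itself,
    # so every step makes progress.
    accepted.append(target_next(ctx))
    return accepted

def generate(prompt, n_tokens, k=4, seed=0):
    seq, rng = list(prompt), random.Random(seed)
    while len(seq) < len(prompt) + n_tokens:
        seq.extend(speculative_step(seq, k, rng))
    return seq[:len(prompt) + n_tokens]
```

Because rejected drafts fall back to the target's own token, the output is identical to plain greedy decoding; the speedup comes from verifying several draft tokens per target pass.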
Specializing in:
- CPU/GPU inference optimization (GGUF, vLLM, TensorRT-LLM)
- FastAPI-based deployment
- RAG pipelines (LangChain, FAISS)
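The retrieval step of a RAG pipeline like those listed above can be sketched with a toy in-memory index: embed the documents and query, then rank by cosine similarity. A real pipeline would use a learned embedding model and a FAISS index; the bag-of-words `embed` here is a hypothetical stand-in to keep the example self-contained.

```python
import math

def embed(text, vocab):
    # Toy bag-of-words "embedding" (stand-in for a real embedding model).
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, top_k=2):
    # Build a shared vocabulary, embed everything, rank docs by similarity.
    vocab = sorted({w for d in docs + [query] for w in d.lower().split()})
    qv = embed(query, vocab)
    return sorted(docs, key=lambda d: cosine(embed(d, vocab), qv), reverse=True)[:top_k]

docs = [
    "speculative decoding speeds up llm inference",
    "faiss indexes dense vectors for similarity search",
    "fastapi serves models over http",
]
```

Swapping `embed` for a sentence-embedding model and the `sorted` scan for a FAISS index gives the production shape of the same pipeline.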
Open to freelance work: LLM optimization, deployment, and custom AI systems.