justbytecode
ML Engineer focused on LLM inference optimization and production systems.
• Built a 700M-parameter hybrid LLM (Mamba + Transformer) from scratch
• Achieved 2.4x faster CPU inference via speculative decoding
• Implemented FlashAttention-2 CUDA kernels (2.1x throughput)
• Contributor to vLLM
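The speculative decoding result above can be sketched in miniature: a cheap draft model proposes a few tokens, and the target model verifies them, keeping the longest accepted prefix. The toy `draft_propose` / `target_next` functions below are hypothetical stand-ins for real models, just to show the accept/reject loop.

```python
import random

def target_next(prefix):
    # Toy "target model" greedy next-token rule (stand-in for a real LM).
    return (prefix[-1] + 1) % 100

def draft_propose(prefix, k, rng):
    # Toy "draft model": usually agrees with the target, sometimes guesses wrong.
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 100 if rng.random() < 0.8 else rng.randrange(100)
        out.append(last)
    return out

def speculative_step(prefix, k, rng):
    # Accept proposed tokens only while they match the target's own choice.
    proposed, accepted, ctx = draft_propose(prefix, k, rng), [], list(prefix)
    for tok in proposed:
        if tok != target_next(ctx):
            break
        accepted.append(tok)
        ctx.append(tok)
    # On mismatch (or after accepting all k), the target emits one token itself,
    # so every step makes progress.
    accepted.append(target_next(ctx))
    return accepted

def generate(prompt, n_tokens, k=4, seed=0):
    seq, rng = list(prompt), random.Random(seed)
    while len(seq) < len(prompt) + n_tokens:
        seq.extend(speculative_step(seq, k, rng))
    return seq[:len(prompt) + n_tokens]
```

Because rejected drafts fall back to the target's own token, the output is identical to plain greedy decoding; the speedup comes from verifying several draft tokens per target pass.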
Specializing in:
- CPU/GPU inference optimization (GGUF, vLLM, TensorRT-LLM)
- FastAPI-based deployment
- RAG pipelines (LangChain, FAISS)
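The retrieval step of a RAG pipeline like those listed above can be sketched with a toy in-memory index: embed the documents and query, then rank by cosine similarity. A real pipeline would use a learned embedding model and a FAISS index; the bag-of-words `embed` here is a hypothetical stand-in to keep the example self-contained.

```python
import math

def embed(text, vocab):
    # Toy bag-of-words "embedding" (stand-in for a real embedding model).
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, top_k=2):
    # Build a shared vocabulary, embed everything, rank docs by similarity.
    vocab = sorted({w for d in docs + [query] for w in d.lower().split()})
    qv = embed(query, vocab)
    return sorted(docs, key=lambda d: cosine(embed(d, vocab), qv), reverse=True)[:top_k]

docs = [
    "speculative decoding speeds up llm inference",
    "faiss indexes dense vectors for similarity search",
    "fastapi serves models over http",
]
```

Swapping `embed` for a sentence-embedding model and the `sorted` scan for a FAISS index gives the production shape of the same pipeline.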
Open to freelance work: LLM optimization, deployment, and custom AI systems.