I build LLM systems that are measurable before they reach users.
LLM orchestration, evaluation infrastructure, and agentic systems in production.
I work at the intersection of LLM orchestration and evaluation infrastructure. At Pipedrive, I serve as LLM Orchestration Tech Lead: owning the GenAI stack architecture, building the evaluation systems that determine whether features ship, and designing the agentic workflows that run in production. My open-source projects, JudgeGuard and multiagent-eval, come directly from real production problems I've hit and solved.
How I think about production ML
Evaluation-first
If you can't measure it before it ships, you're guessing. Evaluation is a product requirement, not a post-launch task.
Bias before users see it
Judge models have biases: position, verbosity, tone. Finding them in CI is an engineering problem, not an academic one.
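One way to surface position bias in CI is to run each comparison twice with the candidate answers swapped and count how often the verdict fails to flip. This is an illustrative sketch, not code from JudgeGuard; the `judge(prompt, answer_a, answer_b)` callable and the 5% threshold are assumptions for the example.

```python
def position_bias_rate(judge, cases):
    """Fraction of cases where swapping answer positions does NOT flip the verdict.

    `judge` is a hypothetical callable returning "A" or "B"; `cases` is a list
    of (prompt, answer_a, answer_b) tuples. An unbiased judge should pick the
    same underlying answer, so its label should flip when positions swap;
    an unchanged label indicates position bias.
    """
    stuck = 0
    for prompt, ans_a, ans_b in cases:
        original = judge(prompt, ans_a, ans_b)
        swapped = judge(prompt, ans_b, ans_a)
        if original == swapped:
            stuck += 1
    return stuck / len(cases)

# In CI, gate on an assumed tolerance:
# assert position_bias_rate(judge, eval_cases) < 0.05, "judge position bias"
```

Verbosity and tone biases can be probed the same way, by perturbing answer length or phrasing while holding content fixed and checking the verdict is stable.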
Operability
Monitoring, failure modes, and regression gates from day one. Systems that fail loudly are easier to fix than systems that fail silently.
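A regression gate can be as simple as diffing current eval scores against a stored baseline and failing the pipeline on any drop beyond a tolerance. This is a minimal sketch under assumed conventions (per-metric score dicts, an illustrative 0.02 tolerance), not the production implementation.

```python
def check_regression(baseline, current, tolerance=0.02):
    """Return metrics that regressed beyond `tolerance` versus the baseline.

    `baseline` and `current` map metric names to scores in [0, 1].
    A metric missing from `current` is reported too: failing loudly beats
    silently passing because an eval stopped running.
    """
    regressions = []
    for metric, base_score in baseline.items():
        score = current.get(metric)
        if score is None:
            regressions.append((metric, base_score, None))
        elif base_score - score > tolerance:
            regressions.append((metric, base_score, score))
    return regressions

# In CI: block the deploy unless the regression list is empty.
# assert not check_regression(baseline, current), "eval regression detected"
```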
Where to next
Home is a starting point. If you want more detail, these pages go deeper.
The story behind my systems mindset: foundations, research, and production work.
Featured projects, recent experiments, and tools, curated by intent, not GitHub noise.
Short notes on evaluation, reliability, and production ML systems.
Want to talk about building reliable ML systems?
Let's discuss evaluation, system design, or production reliability. I'm always interested in learning from others working on similar problems.
Contact me