I build LLM systems that are measurable before they reach users.

LLM orchestration, evaluation infrastructure, and agentic systems in production.

I work at the intersection of LLM orchestration and evaluation infrastructure. At Pipedrive, I serve as LLM Orchestration Tech Lead: I own the GenAI stack architecture, build the evaluation systems that determine whether features ship, and design the agentic workflows that run in production. My open-source projects, JudgeGuard and multiagent-eval, come directly from real production problems I've hit and solved.

How I think about production ML

Evaluation-first

If you can't measure it before it ships, you're guessing. Evaluation is a product requirement, not a post-launch task.

Bias before users see it

Judge models have biases: position, verbosity, tone. Catching them in CI is an engineering problem, not an academic one.
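As a minimal sketch of what that check can look like (not JudgeGuard's actual interface; the `judge` callable, the "A"/"B" verdict format, and the 5% threshold are illustrative assumptions): present the same pair of answers in both orders, and fail the build if the verdict flips with position too often.

```python
# Minimal position-bias check: judge the same answer pair in both orders.
# A position-consistent judge should prefer the same underlying answer
# regardless of which slot it appears in.

from typing import Callable

def position_bias_rate(
    judge: Callable[[str, str, str], str],  # (prompt, answer_a, answer_b) -> "A" or "B"
    cases: list[tuple[str, str, str]],      # (prompt, answer_1, answer_2)
) -> float:
    """Fraction of cases where the verdict flips when answer order is swapped."""
    flips = 0
    for prompt, ans1, ans2 in cases:
        first = judge(prompt, ans1, ans2)   # ans1 shown in slot A
        second = judge(prompt, ans2, ans1)  # ans1 shown in slot B
        # Consistent verdicts pick the same underlying answer in both runs.
        consistent = (first, second) in {("A", "B"), ("B", "A")}
        flips += 0 if consistent else 1
    return flips / len(cases)

# In CI, gate on a threshold instead of eyeballing logs:
# assert position_bias_rate(my_judge, golden_cases) <= 0.05
```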

Operability

Monitoring, failure modes, and regression gates from day one. Systems that fail loudly are easier to fix than systems that fail silently.

Want to talk about building reliable ML systems?

Let's discuss evaluation, system design, or production reliability. I'm always interested in learning from others working on similar problems.

Contact me