I build LLM systems that are measurable before they reach users.

LLM orchestration, evaluation infrastructure, and agentic systems in production.

I work at the intersection of LLM orchestration and evaluation infrastructure. At Pipedrive, I serve as LLM Orchestration Tech Lead: I own the GenAI stack architecture, build the evaluation systems that determine whether features ship, and design the agentic workflows that run in production. My open-source projects, JudgeGuard and multiagent-eval, come directly from real production problems I've hit and solved.

How I think about production ML

Evaluation-first

If you can't measure it before it ships, you're guessing. Evaluation is a product requirement, not a post-launch task.

Bias before users see it

Judge models have biases: position, verbosity, tone. Catching them in CI is an engineering problem, not an academic one.
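As a minimal sketch of what that check can look like (not JudgeGuard's actual interface; the `judge` callable, the "A"/"B" verdict format, and the 5% threshold are illustrative assumptions): present the same pair of answers in both orders, and fail the build if the verdict flips with position too often.

```python
# Minimal position-bias check: judge the same answer pair in both orders.
# A position-consistent judge should prefer the same underlying answer
# regardless of which slot it appears in.

from typing import Callable

def position_bias_rate(
    judge: Callable[[str, str, str], str],  # (prompt, answer_a, answer_b) -> "A" or "B"
    cases: list[tuple[str, str, str]],      # (prompt, answer_1, answer_2)
) -> float:
    """Fraction of cases where the verdict flips when answer order is swapped."""
    flips = 0
    for prompt, ans1, ans2 in cases:
        first = judge(prompt, ans1, ans2)   # ans1 shown in slot A
        second = judge(prompt, ans2, ans1)  # ans1 shown in slot B
        # Consistent verdicts pick the same underlying answer in both runs.
        consistent = (first, second) in {("A", "B"), ("B", "A")}
        flips += 0 if consistent else 1
    return flips / len(cases)

# In CI, gate on a threshold instead of eyeballing logs:
# assert position_bias_rate(my_judge, golden_cases) <= 0.05
```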

Operability

Monitoring, failure modes, and regression gates from day one. Systems that fail loudly are easier to fix than systems that fail silently.

Want to talk about building reliable ML systems?

Let's discuss evaluation, system design, or production reliability. I'm always interested in learning from others working on similar problems.

Contact me