Iremsu Savas
Evaluation is not an afterthought. It is the contract between you and your users.
Building reliable, production ML systems across classical ML, GenAI, and MLOps.
My Journey
Foundations
My interest in machine learning began during my bachelor's in computer engineering around 2017–2018. While taking Andrew Ng's early Coursera courses, I realized that ML wasn't just another technical tool for me; it was a way of thinking about systems, uncertainty, and decision-making. That realization pushed me to go beyond coursework and continuously deepen my understanding of the field.
As my curiosity shifted from individual models to how they behave in complex, real-world environments, I decided to pursue a master's degree in Artificial Intelligence & Robotics at the University of Padua.
Research
At Padua, autonomous driving systems became my main focus. I joined a research lab working on perception for autonomous vehicles, where I was exposed not only to cutting-edge approaches, but also to a much harder question: how do we trust these systems under changing conditions?
This question shaped my master's thesis, which focused on building reproducible and systematic evaluation pipelines for lane detection. Through this work, I learned a lesson that has stayed with me ever since: strong models are not enough. Without rigorous, repeatable evaluation, performance numbers are fragile, misleading, and often irreproducible. Evaluation is not an afterthought; it is the foundation of reliable ML systems.
Production
After graduating, I moved into industrial machine learning at CNH Industrial, where I built perception systems for autonomous agricultural machinery. The gap between controlled benchmarks and real-world deployment became very concrete: models that looked strong in aggregate were quietly failing in dust, shadow, and edge conditions. I built condition-stratified evaluation datasets that made those failures visible, which directly changed deployment decisions.
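The mechanics of that are simple enough to sketch: score every example individually, tag it with the condition it was captured under, and report metrics per condition instead of only in aggregate. The column names below are illustrative placeholders, not the actual dataset schema.

```python
# Condition-stratified evaluation: the aggregate can look healthy while
# individual conditions fail. Column names ("iou", "condition") are
# illustrative placeholders, not a real schema.
import pandas as pd

results = pd.read_csv("eval_results.csv")  # one row per evaluated example

print("aggregate IoU:", results["iou"].mean())

# Stratify by capture condition (dust, shadow, low sun, ...) so that
# per-condition failures hidden by the aggregate become visible.
print(results.groupby("condition")["iou"].agg(["mean", "count"]))
```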
Today, I work at Pipedrive as a Machine Learning Engineer and LLM Orchestration Tech Lead. I own the GenAI stack end-to-end: architectural decisions for the orchestration layer, agentic workflow design, RAG pipelines, and the evaluation infrastructure that decides whether features reach users. That evaluation infrastructure includes LLM-as-judge with rubric design, golden dataset management, inter-annotator agreement, and meta-evaluation to audit the judge itself. I discovered position bias in our judge model through permutation testing, fixed it, and open-sourced the methodology as JudgeGuard.
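The core of that check is easy to sketch: ask the judge about each candidate pair in both orders, then test whether the share of wins going to the first slot deviates from the 50% an unbiased judge would show. The `judge` callable below is a hypothetical wrapper around the judge model's pairwise prompt, not the actual JudgeGuard interface.

```python
import random
from statistics import mean

def position_bias_pvalue(pairs, judge, n_resamples=10_000, seed=0):
    """Permutation-style test for position bias in a pairwise LLM judge.

    pairs: list of (answer_a, answer_b) candidate pairs.
    judge: callable(first, second) -> "first" or "second" (hypothetical
           wrapper around the judge model's comparison prompt).
    Returns (observed first-slot win rate, two-sided p-value).
    """
    rng = random.Random(seed)

    # Query the judge on every pair in both orders, recording whether
    # the answer shown first won each time.
    first_slot_won = []
    for a, b in pairs:
        first_slot_won.append(judge(a, b) == "first")
        first_slot_won.append(judge(b, a) == "first")

    observed = mean(first_slot_won)  # ~0.5 for an unbiased judge
    n = len(first_slot_won)

    # Under the null (slot order carries no information), each judgment
    # favors the first slot with probability 0.5; resample that null and
    # count outcomes at least as extreme as the observed rate.
    extreme = sum(
        abs(mean(rng.random() < 0.5 for _ in range(n)) - 0.5)
        >= abs(observed - 0.5)
        for _ in range(n_resamples)
    )
    return observed, extreme / n_resamples
```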
Most of the failures I've seen in production AI systems were not caused by weak models. They happened because the system didn't know when to abstain, the evaluation wasn't measuring what mattered, or edge cases weren't surfaced until users hit them. That's the problem I work on.
How I Think About ML Systems
Evaluation before optimization
Without rigorous, repeatable evaluation, performance numbers are fragile and misleading.
Systems over models
Machine learning only delivers real value when treated as an end-to-end system, not just a model.
Failure modes over benchmarks
Systems must fail visibly, not silently, and degrade gracefully when they can't meet their guarantees.
Observability over assumptions
The hardest problems aren't about model performance. They're about systems that detect failures early and give clear signals when something breaks.
Technical Focus
LLM Evaluation Infrastructure
LLM-as-judge systems with rubric design, bias detection, golden dataset management, and inter-annotator agreement. Meta-evaluation to audit the judge itself. Regression gates in CI/CD so evaluation is a product requirement, not a manual step.
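A minimal sketch of such a gate, written as a pytest check; the file paths, metric layout, and tolerance are assumptions for illustration:

```python
# test_eval_gate.py -- illustrative CI regression gate. Paths, metric
# names, and the tolerance are assumptions, not a fixed recommendation.
import json
from pathlib import Path

BASELINE = Path("eval/baseline_scores.json")  # committed reference scores
CURRENT = Path("eval/current_scores.json")    # produced by the eval run in CI
MAX_REGRESSION = 0.02                         # tolerated drop per metric

def test_no_eval_regression():
    baseline = json.loads(BASELINE.read_text())
    current = json.loads(CURRENT.read_text())
    regressions = {
        metric: (score, current.get(metric, 0.0))
        for metric, score in baseline.items()
        if current.get(metric, 0.0) < score - MAX_REGRESSION
    }
    # Any regression fails the pipeline, so the feature cannot ship.
    assert not regressions, f"Eval regressions vs. baseline: {regressions}"
```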
Agentic Systems & LLMOps
Multi-agent orchestration with LangGraph, RAG architectures, prompt versioning, and observability via OpenTelemetry. I own the full lifecycle: architectural selection, POC, production deployment, and the operational practices that keep it reliable.
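On the observability side, the pattern is to wrap every model call in a span so each agent step is traceable end to end. A minimal sketch with the OpenTelemetry API; the attribute names and the `call_model` hook are assumptions, not a specific vendor's schema:

```python
# Minimal OpenTelemetry sketch: one span per LLM call, annotated with the
# prompt version so a bad response can be traced to the exact revision.
# Attribute names and `call_model` are illustrative assumptions.
from opentelemetry import trace

tracer = trace.get_tracer("llm.orchestration")

def traced_llm_call(prompt: str, prompt_version: str, call_model) -> str:
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("llm.prompt_version", prompt_version)
        span.set_attribute("llm.prompt_chars", len(prompt))
        response = call_model(prompt)  # your model client goes here
        span.set_attribute("llm.response_chars", len(response))
        return response
```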
Production ML Systems
Classical ML to GenAI, always with the same discipline: evaluation gates before ship, monitoring and drift detection after, and failure modes instrumented from day one. Previously: edge inference optimization for autonomous agricultural machinery at CNH Industrial.
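As one concrete example of the post-ship side, a feature drift check can be as small as a two-sample Kolmogorov–Smirnov test between a training reference and a recent production window; the significance threshold here is an assumption, not a universal setting:

```python
# Illustrative drift check with a two-sample KS test (scipy). The alpha
# threshold and the choice of window are assumptions for the sketch.
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(reference: np.ndarray, live_window: np.ndarray,
                    alpha: float = 0.01) -> bool:
    """True if the live window's distribution differs from the reference."""
    statistic, p_value = ks_2samp(reference, live_window)
    return p_value < alpha  # small p-value => distributions have diverged
```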
Want to see how I apply these ideas?
Explore my projects, read about systems thinking, or check out my writing on building reliable ML systems.
Currently
ML Engineer at Pipedrive
Tallinn, Estonia