
Decode LLM Quality - Eval Testing and Benchmarking LLMs: An Evaluation Guide
Our Journey with LLM Evaluation Frameworks: From Eval.ai to a DeepEval-LangSmith Hybrid Approach
"What gets measured gets improved." - Peter Drucker

Introduction
In the rapidly evolving landscape of Large Language Models (LLMs), ensuring the quality and reliability of model outputs isn't just good practice; it's essential. As our team embarked on building production-grade LLM applications, we quickly realized that a robust evaluation framework would make the difference between mediocre and exceptional AI experiences.
This blog post chronicles our evaluation journey—from our initial steps with Eval.ai through our transition to DeepEval, and finally to our current hybrid approach combining DeepEval’s synthetic data generation capabilities with LangSmith’s comprehensive observability.
The Challenge: Evaluating LLM Outputs at Scale
Before diving into the tools, let’s frame the challenge we faced. Evaluating LLM outputs differs significantly from traditional software testing because:
- Correctness is often subjective: what makes a response "good" can vary based on context
- Outputs are non-deterministic: the same prompt can yield different responses
- Quality is multidimensional: responses need to be evaluated across dimensions like relevance, factuality, helpfulness, and safety
- Testing must happen at production scale: manual evaluation quickly becomes unsustainable

Our Initial Approach: Eval.ai
We started our evaluation journey with Eval.ai, a platform designed primarily for ML researchers and competition organizers.
What We Liked About Eval.ai
- Well-established platform with support for both automatic and human evaluations
- Leaderboard functionality for comparing different model variants
- Support for multimodal evaluation
The Limitations We Encountered
- Too many manual tasks required for routine testing
- Primarily designed for academic challenges rather than production deployment
- Limited integration options with our existing TypeScript codebase
- Not optimized for LLM-specific evaluation metrics such as hallucination detection
After several sprints of struggling with these limitations, we knew we needed a more specialized solution for LLM evaluation.
Discovering DeepEval: LLM-Specific Testing
Our search led us to DeepEval, a framework specifically built for evaluating generative AI outputs with a focus on semantic quality and factuality.

What Drew Us to DeepEval
DeepEval offered specialized metrics that aligned perfectly with our RAG (Retrieval-Augmented Generation) implementation:
RAG-Specific Metrics
- Answer Relevancy: How well does the response address the query?
- Faithfulness: Does the response stick to the facts in the provided context?
- Contextual Relevancy: Is the retrieved context relevant to the query?
- Contextual Precision & Recall: How precise and comprehensive is the context selection?
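To make this concrete, here's a minimal sketch of how a single RAG interaction is scored with two of these metrics using DeepEval's Python API. The query, answer, and context strings are made-up placeholders, and exact parameter names can vary between DeepEval versions:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# One RAG interaction: the user query, the system's answer,
# and the chunks the retriever returned.
test_case = LLMTestCase(
    input="What is the refund window for annual plans?",
    actual_output="Annual plans can be refunded within 30 days of purchase.",
    retrieval_context=[
        "Refund policy: annual subscriptions are refundable within 30 days."
    ],
)

# LLM-as-a-judge metrics with pass/fail thresholds.
metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
]

# Scores the test case on each metric and prints a report.
evaluate(test_cases=[test_case], metrics=metrics)
```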
Chatbot Metrics
- Conversational G-Eval: Overall quality of conversational responses
- Knowledge Retention: Ability to maintain context across a conversation
- Role Adherence: Consistency with a defined persona or role
- Conversation Completeness & Relevancy: How thoroughly and relevantly queries are addressed
Agent Metrics
- Tool Correctness: Proper use of available tools
- Task Completion: Success in accomplishing requested tasks
The DeepEval Workflow
DeepEval implements test cases representing atomic interactions with your LLM application, scored using various LLM-as-a-judge techniques:
- QAG (Question-Answer Generation)
- DAG (Deep Acyclic Graph)
- G-Eval
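To illustrate the G-Eval approach, here's a hedged sketch of defining a custom LLM-as-a-judge criterion with DeepEval's GEval metric; the criterion text, threshold, and example strings are our own illustrations rather than a prescribed recipe:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# G-Eval: an LLM judge scores the output against a natural-language criterion.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.6,
)

test_case = LLMTestCase(
    input="Summarise the SLA for priority-1 incidents.",
    actual_output="Priority-1 incidents get a first response within one hour.",
    expected_output="P1 incidents have a one-hour first-response SLA.",
)

# measure() populates a 0-1 score plus the judge's reasoning.
correctness.measure(test_case)
print(correctness.score, correctness.reason)
```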

The TypeScript Challenge
DeepEval proved to be an excellent evaluation framework, but we faced one significant challenge: our production system was built in TypeScript, while DeepEval is Python-based. This language mismatch led us to explore a hybrid approach.
Our Hybrid Solution: DeepEval + LangSmith
Rather than completely switching frameworks again, we developed a hybrid approach leveraging the strengths of both DeepEval and LangSmith.

How Our Hybrid Approach Works:
1. Use DeepEval for synthetic data generation
   a. Leverage DeepEval's capabilities to generate diverse, high-quality test cases
   b. Create "golden datasets" of input-context-expected-output triplets
2. Implement testing and monitoring in LangSmith
   a. Feed synthetic data into LangSmith for continuous evaluation
   b. Track model performance and quality metrics in production
3. Maintain a feedback loop
   a. Identify failure modes in LangSmith
   b. Generate new synthetic test cases with DeepEval targeting those failure modes
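As a rough sketch of steps 1 and 2, the snippet below generates goldens with DeepEval's Synthesizer and uploads them to a LangSmith dataset via the Python client (our production side uses the TypeScript SDK instead). The document paths, dataset name, and field names are placeholders, and Synthesizer parameters differ between DeepEval versions:

```python
from deepeval.synthesizer import Synthesizer
from langsmith import Client

# Step 1: generate "golden" input/context/expected-output triplets
# from knowledge-base documents with DeepEval.
synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["docs/refund_policy.md", "docs/sla.md"],
)

# Step 2: push the goldens into a LangSmith dataset so they can drive
# continuous evaluation and regression runs.
client = Client()  # reads the LangSmith API key from the environment
dataset = client.create_dataset(dataset_name="rag-golden-dataset")
client.create_examples(
    inputs=[{"question": g.input, "context": g.context} for g in goldens],
    outputs=[{"expected_answer": g.expected_output} for g in goldens],
    dataset_id=dataset.id,
)
```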
LangSmith Benefits in Our Hybrid Approach
LangSmith brings several advantages to our hybrid evaluation system:
| Feature | Benefit |
|---|---|
| Full LLMOps Lifecycle | Combines evaluation, observability, and dataset management |
| Tracing | Captures full traces of LLM chains and intermediate steps |
| Production Integration | Monitors live traffic with minimal overhead |
| Dataset Management | Organizes examples for evaluation and fine-tuning |
| TypeScript SDK | Seamless integration with our existing codebase |
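To show what the tracing row looks like in practice, here's a minimal sketch using the Python SDK's traceable decorator; the TypeScript SDK we use in production offers an analogous traceable wrapper. The retriever and generator below are stand-in stubs, not our real pipeline:

```python
from langsmith import traceable


def retrieve(question: str) -> str:
    # Stand-in for the vector-store lookup.
    return "Refund policy: annual subscriptions are refundable within 30 days."


def generate(question: str, context: str) -> str:
    # Stand-in for the LLM call.
    return f"Based on our policy: {context}"


@traceable(name="rag_answer")  # each invocation is recorded as a trace in LangSmith
def rag_answer(question: str) -> str:
    context = retrieve(question)
    return generate(question, context)


if __name__ == "__main__":
    print(rag_answer("What is the refund window for annual plans?"))
```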
Results and Insights
After implementing our hybrid approach, we observed several significant improvements:

- Testing Coverage: 4x increase in test coverage across different query types
- Evaluation Speed: 70% reduction in time needed to evaluate new model versions
- Issue Detection: Identified subtle hallucination patterns that went undetected with manual testing
- Production Monitoring: Real-time alerts when metric thresholds are breached
Comparative Analysis: Evaluation Frameworks
To help others facing similar challenges, here’s our comparison of the main frameworks we explored:
| Aspect | Eval.ai | DeepEval | LangSmith | Our Hybrid Approach |
|---|---|---|---|---|
| Purpose | Model benchmarking & competitions | LLM-specific evaluation | Full LLMOps lifecycle | Comprehensive testing & monitoring |
| Language | Python, web interface | Python | TypeScript/Python | Python + TypeScript |
| Synthetic Data | Limited | Strong | Limited | Strong |
| Production Monitoring | No | Yes | Yes | Yes |
| TypeScript Support | No | No | Yes | Yes |
| RAG-specific Metrics | No | Yes | Basic | Yes |
| Integration Effort | High | Medium | Low | Medium |
Lessons Learned
Our journey through these evaluation frameworks taught us several valuable lessons:
- Match the framework to your tech stack: The language your evaluation framework uses should align with your production codebase.
- Synthetic data is invaluable: Generating diverse, realistic test cases is essential for comprehensive evaluation.
- Combine offline and online evaluation: Pre-deployment testing and production monitoring are both necessary.
- Domain-specific metrics matter: Generic metrics don't capture the nuances of specialized applications like RAG systems.
- Evolve your evaluation strategy: As your LLM applications mature, your evaluation approach should evolve as well.
What’s Next?
As we continue refining our evaluation approach, we're exploring several promising directions:
- Behavioral testing: Developing more sophisticated adversarial examples
- Human-in-the-loop validation: Selectively incorporating human feedback
- Continuous learning: Using evaluation failures as fine-tuning examples
- Multi-modal evaluation: Extending our framework to handle image and audio inputs
Conclusion
The journey from Eval.ai to our current hybrid DeepEval-LangSmith approach reflects the rapid evolution of LLM evaluation. By leveraging DeepEval’s synthetic data generation capabilities and LangSmith’s comprehensive observability, we’ve built an evaluation framework that supports our TypeScript codebase while providing robust, multi-dimensional quality assessment.
For teams building production LLM systems, we recommend exploring this hybrid approach—it combines the best of specialized evaluation techniques with practical integration into existing workflows and technology stacks.
Remember: in the world of LLMs, you can’t improve what you don’t measure, and you can’t trust what you don’t test.
How is your team handling LLM evaluation? We’d love to hear about your approaches and challenges in the comments below!