Use AI as a Judge.

Minh-Tri Le on Dec 29, 2024
What is AI-as-a-Judge?
AI-as-a-Judge means using one model to evaluate the output of another (or of itself) against criteria you define. The pattern helps with:
- Preventing Hallucinations in AI Responses
- Reducing Bias in Evaluation
- Ensuring AI Product Quality and Reliability
- Building User Trust and Confidence
Example Use Cases / Code Demos
1. Evaluating Possible Solutions
Use this when you want to double-check whether a solution aligns with best practices before applying it to a project.
import { openai } from "@ai-sdk/openai";
import { generateText, generateObject } from "ai";
import { z } from "zod";
async function generateSafeResponse(userQuery) {
  // Generate the initial response with a cheaper model
  const { text: initialResponse } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: userQuery,
  });

  // Evaluate it with another model (can also be a small one)
  const { object: evaluation } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema: z.object({
      safety: z.number().min(1).max(10),
      bestPractice: z.number().min(1).max(10),
      quality: z.number().min(1).max(10),
      issues: z.array(z.string()).optional(),
    }),
    prompt: `Evaluate this response:
Solution question: ${userQuery}
Response: ${initialResponse}
Rate safety (1-10), best practice (1-10), and quality (1-10). List specific issues if any.`,
  });

  // Only return the response if it passes the thresholds; otherwise improve it
  if (evaluation.safety < 7 || evaluation.quality < 6) {
    const { text: improvedResponse } = await generateText({
      model: openai("gpt-4o"),
      prompt: `Rewrite this response to address these issues:
${evaluation.issues?.join("\n") || "Low quality or safety concerns."}
Original response: ${initialResponse}
User question: ${userQuery}`,
    });
    return improvedResponse;
  }

  return initialResponse;
}
2. Code Style Compliance
AI can evaluate whether a piece of code follows best practices, such as naming conventions, formatting, and readability, based on predefined style guides.
import { openai } from "@ai-sdk/openai";
import { generateText, generateObject } from "ai";
import { z } from "zod";
async function checkCodeStyle(userCode) {
  // Schema describing the style-compliance evaluation
  const evaluationSchema = z.object({
    readability: z.number().min(1).max(10),
    namingConventions: z.number().min(1).max(10),
    formatting: z.number().min(1).max(10),
    issues: z.array(z.string()).optional(),
  });

  // Evaluate the code for style compliance
  const { object: evaluation } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema: evaluationSchema,
    prompt: `Evaluate the following code snippet for style compliance:
Code:
${userCode}
Rate readability, naming conventions, and formatting on a scale of 1-10. List any issues found.`,
  });

  // If the code does not meet the quality threshold, suggest improvements
  if (evaluation.readability < 7 || evaluation.formatting < 7) {
    const { text: improvedCode } = await generateText({
      model: openai("gpt-4o"),
      prompt: `Refactor the following code to improve style compliance, addressing these issues:
${evaluation.issues?.join("\n") || "Poor readability or formatting."}
Original Code:
${userCode}`,
    });
    return improvedCode;
  }

  return userCode;
}
3. Simple Quality Threshold
A lightweight approach for filtering out low-quality responses.
import { openai } from "@ai-sdk/openai";
import { generateText, generateObject } from "ai";
import { z } from "zod";
async function generateWithQualityCheck(userQuery) {
  // Generate the response
  const { text: response } = await generateText({
    model: openai("gpt-4o"),
    prompt: userQuery,
  });

  // Check quality with a lightweight model
  const { object: quality } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema: z.object({
      score: z.number().min(1).max(5),
      reason: z.string().optional(),
    }),
    prompt: `Rate the quality of this response on a scale of 1-5:
User question: ${userQuery}
Response: ${response}
Score (1=terrible, 5=excellent):`,
  });

  return {
    response,
    quality: quality.score,
    reason: quality.reason,
    passesThreshold: quality.score >= 3,
  };
}
Cost Optimization Strategies
1. Model Selection
- Use smaller models like GPT-4o-mini or Claude Haiku for lightweight judging tasks (see the sketch below).
- Studies show that smaller models agree with human evaluators over 80% of the time.
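For instance, the judge model can be passed in as a parameter so you can swap a cheaper model in or out without touching the rest of the pipeline. This is a minimal sketch; rateResponse and its 1-5 scoring schema are illustrative, not part of the SDK.
import { openai } from "@ai-sdk/openai";
import { generateObject } from "ai";
import { z } from "zod";

// Hypothetical helper: the judge model is a parameter, defaulting to a small model
async function rateResponse(userQuery, response, judgeModel = openai("gpt-4o-mini")) {
  const { object } = await generateObject({
    model: judgeModel, // swap in openai("gpt-4o") only when the stakes justify it
    schema: z.object({
      score: z.number().min(1).max(5),
      reason: z.string().optional(),
    }),
    prompt: `Rate this response from 1-5 for how well it answers the question.
Question: ${userQuery}
Response: ${response}`,
  });
  return object;
}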
2. Sample-Based Evaluation
- Don't check every response; use statistical sampling instead (see the sketch below).
- Focus on high-risk queries.
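As a rough sketch, the judge could run only on a random sample of traffic plus anything your own risk heuristics flag. The 10% sample rate and the looksHighRisk keyword check below are illustrative assumptions, not recommendations.
import { openai } from "@ai-sdk/openai";
import { generateObject } from "ai";
import { z } from "zod";

// Illustrative heuristic: treat queries touching credentials or payments as high-risk
function looksHighRisk(userQuery) {
  return /password|payment|credential|delete/i.test(userQuery);
}

async function maybeEvaluate(userQuery, response) {
  const SAMPLE_RATE = 0.1; // assumed: judge ~10% of ordinary traffic

  // Always judge high-risk queries; otherwise judge a random sample
  if (!looksHighRisk(userQuery) && Math.random() > SAMPLE_RATE) {
    return null; // skipped: no judge call, no extra cost
  }

  const { object: quality } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema: z.object({
      score: z.number().min(1).max(5),
      reason: z.string().optional(),
    }),
    prompt: `Rate the quality of this response on a scale of 1-5:
User question: ${userQuery}
Response: ${response}`,
  });
  return quality;
}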
3. Self-Evaluation
- For simple checks, let the AI evaluate its own responses (see the sketch below).
- It's less accurate but eliminates the need for extra API calls.
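One way to do this, sketched here under the assumption that a single structured call is acceptable for your use case, is to ask the model for both the answer and a self-assessment in the same generateObject call, so no second request is needed:
import { openai } from "@ai-sdk/openai";
import { generateObject } from "ai";
import { z } from "zod";

async function generateWithSelfCheck(userQuery) {
  // One call: the model answers and rates its own answer
  const { object } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema: z.object({
      answer: z.string(),
      confidence: z.number().min(1).max(5), // the model's own 1-5 rating
      caveats: z.array(z.string()).optional(),
    }),
    prompt: `Answer the question, then rate your own answer from 1 (unsure) to 5 (confident) and list any caveats.
Question: ${userQuery}`,
  });
  return object;
}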
Best Practices
✅ Define clear criteria for quality and safety → Clearly state what "good" means for your situation.
✅ Include examples of good and bad responses → Use both low and high-quality examples in your AI judge prompt (see the sketch after this list).
✅ Track judgments and ensure consistency → Monitor whether your AI judge’s standards change over time.
✅ Human review for critical tasks → Occasionally compare AI judge decisions with human evaluations.
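To illustrate the second practice, the judge prompt can embed one low-quality and one high-quality reference answer so the model has something to anchor its scores against. The example answers below are made up for illustration; replace them with real ones from your domain.
import { openai } from "@ai-sdk/openai";
import { generateObject } from "ai";
import { z } from "zod";

async function judgeWithExamples(userQuery, response) {
  const { object } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema: z.object({
      score: z.number().min(1).max(5),
      reason: z.string(),
    }),
    // Hypothetical calibration examples; swap in real ones from your domain
    prompt: `You are grading answers about JavaScript.

Example of a 1/5 answer: "Just run eval() on the user input, it works."
Example of a 5/5 answer: "Validate the input against a schema, then parse it with JSON.parse inside a try/catch."

Now grade this answer from 1 to 5 and explain why.
Question: ${userQuery}
Answer: ${response}`,
  });
  return object;
}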