Use AI as a Judge.

Minh-Tri Le

Minh-Tri Le on Dec 29, 2024

ai

LLM

openai

What is AI-as-a-Judge?

  • Preventing Illusions in AI Responses
  • Reducing Bias in Evaluation
  • Ensuring AI Product Quality and Reliability
  • Building User Trust and Confidence

Example Use Cases / Code Demos

1. Evaluating Possible Solutions

Use this when you want to double-check whether a solution aligns with best practices before applying it to a project.
import { openai } from "@ai-sdk/openai";
import { generateText, generateObject } from "ai";
import { z } from "zod";

async function generateSafeResponse(userQuery) {
  // Generate initial response with cheaper model
  const { text: initialResponse } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: userQuery,
  });

  // Evaluate with another model (can be smaller)
  const { object: evaluation } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema: z.object({
      safety: z.number().min(1).max(10),
      bestPractice: z.number().min(1).max(10),
      quality: z.number().min(1).max(10),
      issues: z.array(z.string()).optional(),
    }),
    prompt: `Evaluate this response:
    Solution question: ${userQuery}
    Response: ${initialResponse}

    Rate safety (1-10) and best practice (1-10) quality (1-10). List specific issues if any.`,
  });

  // Only show response if it passes threshold, otherwise improve it
  if (evaluation.safety < 7 || evaluation.quality < 6) {
    const { text: improvedResponse } = await generateText({
      model: openai("gpt-4o"),
      prompt: `Rewrite this response to address these solutions:
      ${evaluation.issues?.join("\n") || "Low quality or safety concerns."}

      Original response: ${initialResponse}
      User question: ${userQuery}`,
    });
    return improvedResponse;
  }

  return initialResponse;
}

2. Code Style Compliance

AI can evaluate whether a piece of code follows best practices, such as naming conventions, formatting, and readability, based on predefined style guides.
import { openai } from "@ai-sdk/openai";
import { generateText, generateObject } from "ai";
import { z } from "zod";

async function checkCodeStyle(userCode) {
  // Evaluate the code style compliance
  const evaluationSchema = z.object({
    readability: z.number().min(1).max(10),
    namingConventions: z.number().min(1).max(10),
    formatting: z.number().min(1).max(10),
    issues: z.array(z.string()).optional(),
  });

  const { object: evaluation } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema: evaluationSchema,
    prompt: `Evaluate the following code snippet for style compliance:
    
    Code:
    ${userCode}
    
    Rate readability, naming conventions, and formatting on a scale of 1-10. List any issues found.`,
  });

  // If code does not meet the quality threshold, suggest improvements
  if (evaluation.readability < 7 || evaluation.formatting < 7) {
    const { text: improvedCode } = await generateText({
      model: openai("gpt-4o"),
      prompt: `Refactor the following code to improve style compliance, addressing these issues:
      ${evaluation.issues?.join("\n") || "Poor readability or formatting."}
      
      Original Code:
      ${userCode}`,
    });
    return improvedCode;
  }

  return userCode;
}

3. Simple Quality Threshold

A lightweight approach for filtering out low-quality responses.
import { openai } from "@ai-sdk/openai";
import { generateText, generateObject } from "ai";
import { z } from "zod";

async function generateWithQualityCheck(userQuery) {
  // Generate response
  const { text: response } = await generateText({
    model: openai("gpt-4o"),
    prompt: userQuery,
  });

  // Check quality with a lightweight model
  const { object: quality } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema: z.object({
      score: z.number().min(1).max(5),
      reason: z.string().optional(),
    }),
    prompt: `Rate the quality of this response on a scale of 1-5:

    User question: ${userQuery}
    Response: ${response}

    Score (1=terrible, 5=excellent):`,
  });

  return {
    response,
    quality: quality.score,
    reason: quality.reason,
    passesThreshold: quality.score >= 3,
  };
}

Cost Optimization Strategies

1. Model Selection

  • Use smaller models like GPT-4o-mini or Claude Haiku for lightweight tasks.
  • Studies show that smaller models agree with human evaluators over 80% of the time.

2. Sample-Based Evaluation

  • Don't check every response—use statistical sampling instead.
  • Focus on high-risk queries.

3.Self-Evaluation

  • For simple checks, let the AI evaluate its own responses.
  • It's less accurate but eliminates the need for extra API calls.

Best Practices

✅ Define clear criteria for quality and safety → Clearly state what "good" means for your situation.
✅ Include examples of good and bad responses → Use both low and high-quality examples in your AI judge prompt.
✅ Track judgments and ensure consistency → Monitor whether your AI judge’s standards change over time.
✅ Human review for critical tasks → Occasionally compare AI judge decisions with human evaluations.

Comments