Use a different model for LLM evaluation

Created on 14 August 2025

Problem/Motivation

We will want to use AI Agent tests against small models that may not perform well. But if the LLM evaluator uses the same model, the evaluation itself might start failing or become an unreliable, inconsistent test.

Steps to reproduce

Proposed resolution

  • In the form to run the tests, where you can pick which model you want, add a field that allows you to pick a different model for evaluating the tests.
  • There should be a checkbox (Use the same LLM for LLM evaluations). If it is unchecked you can select a different evaluator model; otherwise the same model is used.
  • The drush command that runs tests needs to take this as an argument.
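The checkbox-plus-select behaviour above could be sketched with Drupal's Form API, using `#states` to reveal the evaluator field only while the box is unchecked. This is a minimal sketch, not code from the module; the element names (`same_llm`, `evaluator_model`) and the `$model_options` variable are hypothetical:

```php
// Hypothetical fragment of the test-run form's buildForm() method.
$form['same_llm'] = [
  '#type' => 'checkbox',
  '#title' => $this->t('Use the same LLM for LLM evaluations'),
  '#default_value' => TRUE,
];
$form['evaluator_model'] = [
  '#type' => 'select',
  '#title' => $this->t('Evaluator model'),
  '#description' => $this->t('Model used to evaluate the results of tests.'),
  // Assumed to be built elsewhere from the configured AI providers/models.
  '#options' => $model_options,
  // Only show this field when the checkbox above is unchecked.
  '#states' => [
    'visible' => [
      ':input[name="same_llm"]' => ['checked' => FALSE],
    ],
  ],
];
```

On submit, the run model would be reused whenever the checkbox is checked; the drush test runner would then accept the same value via an option (name hypothetical), e.g. `--evaluator-model=...`.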

Remaining tasks

User interface changes

API changes

Data model changes

πŸ“Œ Task
Status

Active

Version

1.0

Component

Code

Created by

πŸ‡¬πŸ‡§United Kingdom yautja_cetanu


Merge Requests

Comments & Activities

  • Issue created by @yautja_cetanu
  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu
  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu
  • First commit to issue fork.
  • πŸ‡¨πŸ‡¦Canada bisonbleu

    The current MR preserves the existing Run Once operation and route and adds a new Choose Model operation and route, replicating the UX found in Test Group.

  • πŸ‡¨πŸ‡¦Canada bisonbleu

    Here's a 30 sec video to illustrate.

  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu

    I think this isn't quite what I meant, although it is also good for doing individual tests. What I was thinking was something like:

    When doing Prepare Agent Test Group. It shows:

    Test Group
    Drupal CMS - Content type Agent
    Select the test group you want to run.

    Model
    LiteLLM Proxy - Openai
    Select the model you want to use for running the tests.

    Use a different model for LLM evaluations
    Toggle (Default off)

    When turned on:

    Evaluator Model:
    LiteLLM Proxy
    Select the model you want to use for evaluating the results of tests.

    -----

    When the toggle is turned on, the model used to run the tests and the model used to evaluate the results (where applicable) can be different.

  • πŸ‡¨πŸ‡¦Canada bisonbleu

    Oh… I was looking in the wrong direction… Let's try to untangle things.

    Tests and Test Groups are run using either the default model (as set in the Ai Settings) OR a preferred model as selected in the Prepare Agent Test Group Run form.

    But this issue is about something else: when creating a test, at the bottom of the form, there is a field labelled Agent Response LLM Test.

    As the description reads: If filled in, this prompt will be tested against the agent's response. This field makes it possible to evaluate the default/chosen LLM's response. For this it might be useful to select a different LLM, especially when the initial intent is to run tests on a new or unknown model and then evaluate the results using a trusted model.

  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu

    Yup! I noticed from the logs that it was using 4o mini to run the LLM evaluation, which I think was also the default model. It would also explain why it kept failing (because I think 4o mini wasn't clever enough).

  • πŸ‡¨πŸ‡¦Canada bisonbleu

    Looking at AI Logs, I can confirm that the default provider set for Chat with Tools/Function Calling is used for the evaluation; and this makes perfect sense.

    Using the current MR and running the new Choose Model action and selecting a different model for running a test clearly illustrates this.

    Alright, now I know where I'm going with this…

  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu

    Hmmm, then instead of the Boolean, maybe we should make the second dropdown default to the default provider, as it is now?

  • πŸ‡¨πŸ‡¦Canada bisonbleu

    Attaching a Mermaid flowchart of the workflow for clarity.

  • Pipeline finished with Success
    #584556