Use a different model for LLM evaluation

Created on 14 August 2025

Problem/Motivation

We will want to use AI Agent tests against small models that may not perform well. But if the LLM evaluator uses the same model, the evaluation itself might start failing or become an unreliable, inconsistent test.

Steps to reproduce

Proposed resolution

  • In the form to run the tests, where you can pick which model you want, add a field that allows you to pick a different model for evaluating the tests.
  • There should be a checkbox (Use the same LLM for LLM evaluations). If it is unchecked you can select a different evaluator model; otherwise the same model is used.
  • The drush command that runs tests needs to take this as an argument.
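The checkbox-plus-select behaviour above could be sketched with Drupal's Form API, using `#states` to reveal the evaluator field only while the box is unchecked. This is a minimal sketch, not code from the module; the element names (`same_llm`, `evaluator_model`) and the `$model_options` variable are hypothetical:

```php
// Hypothetical fragment of the test-run form's buildForm() method.
$form['same_llm'] = [
  '#type' => 'checkbox',
  '#title' => $this->t('Use the same LLM for LLM evaluations'),
  '#default_value' => TRUE,
];
$form['evaluator_model'] = [
  '#type' => 'select',
  '#title' => $this->t('Evaluator model'),
  '#description' => $this->t('Model used to evaluate the results of tests.'),
  // Assumed to be built elsewhere from the configured AI providers/models.
  '#options' => $model_options,
  // Only show this field when the checkbox above is unchecked.
  '#states' => [
    'visible' => [
      ':input[name="same_llm"]' => ['checked' => FALSE],
    ],
  ],
];
```

On submit, the run model would be reused whenever the checkbox is checked; the drush test runner would then accept the same value via an option (name hypothetical), e.g. `--evaluator-model=...`.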

Remaining tasks

User interface changes

API changes

Data model changes

πŸ“Œ Task
Status

Active

Version

1.0

Component

Code

Created by

πŸ‡¬πŸ‡§United Kingdom yautja_cetanu


Merge Requests

Comments & Activities

  • Issue created by @yautja_cetanu
  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu
  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu
  • First commit to issue fork.
  • πŸ‡¨πŸ‡¦Canada bisonbleu

    The current MR preserves the existing Run Once operation and route and adds a new Choose Model operation and route, replicating the UX found in Test Group.

  • πŸ‡¨πŸ‡¦Canada bisonbleu

    Here's a 30 sec video to illustrate.

  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu

    I think this isn't quite what I meant, although it is also good for doing individual tests. What I was thinking was something like:

    When doing Prepare Agent Test Group. It shows:

    Test Group
    Drupal CMS - Content type Agent
    Select the test group you want to run.

    Model
    LiteLLM Proxy - Openai
    Select the model you want to use for running the tests.

    Use a different model for LLM evaluations
    Toggle (Default off)

    When turned on:

    Evaluator Model:
    LiteLLM Proxy
    Select the model you want to use for evaluating the results of tests.

    -----

    When the toggle is turned on, the model used to run the tests and the model used to evaluate the results (where applicable) can be different.

  • πŸ‡¨πŸ‡¦Canada bisonbleu

    Oh… I was looking in the wrong direction… Let's try to untangle things.

    Tests and Test Groups are run using either the default model (as set in the Ai Settings) OR a preferred model as selected in the Prepare Agent Test Group Run form.

    But this issue is about something else: when creating a test, at the bottom of the form, there is a field labelled Agent Response LLM Test.

    As the description reads: If filled in, this prompt will be tested against the agent's response. This field makes it possible to evaluate the default/chosen LLM's response. For this it might be useful to select a different LLM, especially when the initial intent is to run tests on a new or unknown model and then evaluate the results using a trusted model.

  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu

    Yup! I noticed from the logs that it was using 4o mini to run the LLM evaluation, which I think was also the default model. It would also explain why it kept failing (because I think 4o mini wasn't clever enough).

  • πŸ‡¨πŸ‡¦Canada bisonbleu

    Looking at AI Logs, I can confirm that the default provider set for Chat with Tools/Function Calling is used for the evaluation; and this makes perfect sense.

    Using the current MR and running the new Choose Model action and selecting a different model for running a test clearly illustrates this.

    Alright, now I know where I'm going with this…

  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu

    Hmmm, then instead of the Boolean, maybe we should make the second dropdown default to the default provider, as it is now?

  • πŸ‡¨πŸ‡¦Canada bisonbleu

    Attaching a Mermaid flowchart of the workflow for clarity.

  • Pipeline finished with Success
    #584556