Created on 29 October 2024, about 2 months ago

Problem/Motivation

Create a default AI recipe that downloads and installs the AI module, sets up AI Agents, and downloads the OpenAI and Anthropic providers (more may be added later).
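
A rough sketch of what such a recipe could look like (module and recipe names are assumptions based on this description, not the final merged recipe):

    # recipe.yml - illustrative sketch only; module names are assumed
    name: 'Drupal CMS AI Assistants'
    description: 'Installs the AI module, sets up AI Agents, and adds the OpenAI and Anthropic providers.'
    type: 'Drupal CMS'
    install:
      - ai
      - ai_agents
      - ai_provider_openai
      - ai_provider_anthropic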

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

📌 Task
Status

Active

Component

Track: AI

Created by

🇩🇪Germany marcus_johansson


Merge Requests

Comments & Activities

  • Issue created by @marcus_johansson
  • Pipeline finished with Failed
    about 2 months ago
    Total: 2543s
    #325040
  • Pipeline finished with Failed
    about 2 months ago
    Total: 526s
    #325085
  • Pipeline finished with Failed
    about 2 months ago
    Total: 540s
    #325093
  • Pipeline finished with Failed
    about 2 months ago
    Total: 1293s
    #325112
  • Pipeline finished with Failed
    about 2 months ago
    #325141
  • Pipeline finished with Success
    about 2 months ago
    Total: 972s
    #325156
  • Pipeline finished with Failed
    about 2 months ago
    Total: 573s
    #328801
  • Pipeline finished with Failed
    about 2 months ago
    Total: 765s
    #328811
  • Pipeline finished with Success
    about 2 months ago
    Total: 977s
    #328821
  • Merge request !173: Issue #3484307: Add AI recipe → (Merged) created by marcus_johansson
  • Pipeline finished with Success
    about 2 months ago
    Total: 2151s
    #328835
  • 🇬🇧United Kingdom yautja_cetanu

    We will likely want to make use of this issue: https://www.drupal.org/project/drupal_cms/issues/3482992 The installer should collect input for recipes that need it Active

    It will likely be important to decide:

    - If you want AI at all in Drupal CMS: it's there to help people very new to Drupal, so it's possibly too much to ask them to find AI in a project browser, etc.
    - Similarly with API keys and providers: this will be important before any of the agents work for someone.

    We will need some way of exploring what the UI could look like for the rest of the Starshot team to look at.

  • Pipeline finished with Canceled
    about 2 months ago
    Total: 182s
    #329036
  • Pipeline finished with Canceled
    about 2 months ago
    Total: 182s
    #329037
  • Pipeline finished with Failed
    about 2 months ago
    Total: 749s
    #329038
  • Pipeline finished with Failed
    about 2 months ago
    Total: 1779s
    #329039
  • Pipeline finished with Failed
    about 2 months ago
    Total: 194s
    #329071
  • Pipeline finished with Failed
    about 2 months ago
    Total: 639s
    #329072
  • Pipeline finished with Failed
    about 2 months ago
    Total: 869s
    #329083
  • Pipeline finished with Success
    about 2 months ago
    Total: 1395s
    #329084
  • 🇺🇸United States phenaproxima Massachusetts

    Didn't manually test, but overall this seems clean and straightforward to me. There are a few small clean-up items and nitpicks, but in general I don't see much reason to block this.

  • 🇬🇧United Kingdom catch

    I have not reviewed the MR/modules it adds or tried it out, however I have written up some notes based on the AI demo in the Driesnote from DrupalCon Barcelona - on the basis that the Driesnote demo was showing the current state of the integration at its best.

    First of all I should say that the demo exceeded my (admittedly very low) expectations for what was possible. However I also think there are serious problems which are hidden by the apparent success of the demo.

    If we break down the demo, the user gives the AI agent some prompts and then the AI agent gets 'verbal' confirmation and makes changes to the site configuration (and in some cases content).

    1. Create or modify entity bundles/fields/views via mapping
    2. Create content based on some parameters
    3. Migrate data from a website (via scraping the public site)

    Overall, it 'works' in that the AI agent does the things that it is asked to do, without going completely off the rails (although I wonder how much prompt-crafting was involved in that), but there are still some fundamental site building mistakes in here.

    Things that went wrong:

    • When the agent is asked to create an image field, it creates an image field. On the surface that seems great, however, you would actually want a media field here for re-use etc. (especially for wine tour illustrations).
    • There is no choice given, nor is there any indication to the user that media fields exist in this interaction at all because the entire model is completely hidden. It's also mis-matched with the media widget that's demonstrated in ckeditor5 (which is correctly set up because it's in the recipe).
    • When the agent is asked for a (taxonomy) field with a list of wine regions, it creates a taxonomy field with a list of wine regions using the select widget. Once again on the surface this seems great, but it's the wrong choice. It's extremely rare that you want to pregenerate a fixed list of choices - e.g. does this company really have events in all of those twenty regions and nowhere else?
    • Instead of a fixed list, you can use the autocomplete + create widget, and then add wine regions as you go. The problem with the fixed vocabulary is that when you need to add a region that's not there (quite likely given climate change), the only way to do so would be to manually add it to the vocabulary. This would either involve finding out how to do that (without having ever visited that page), or asking the AI agent again. Tagging avoids that extra manual step, but it wasn't offered as an alternative, the AI agent just did what it was told. When you actually use the taxonomy UI, you get these choices and can experiment.
    • With the view of wine tours and filtering, because the taxonomy was pre-filled, the view will then show empty taxonomy terms to filter on. When you filter on an empty taxonomy term - no results. With auto-creation, much less likely to have empty terms. Obviously in the demo, the term with some wine tours in was selected, but all the other ones were empty.
    • The AI agent gives the user a text description of what it's about to do, but there is no preview. This means if the user doesn't understand the implications of the text description, it could be destructive and potentially hard to reverse. It would probably be possible to do previews with workspaces + workspaces config, which would mean a workspaces + workspaces config dependency for the AI agent then.
    • The migration pulls in the content via scraping the public website. If the site is e.g. WordPress, it already has an export format, and https://www.drupal.org/project/wordpress_migrate supports it. The advantage of migrate is that it is repeatable and tweakable, and you can roll it back and try again after fixing something - scraping the site and putting it through an LLM will not be. Maybe the AI agent can create and edit migrate templates (based on recipes like a WordPress migrate one), but these are code, not config, unless you use various contrib modules.

      On top of this, with the image (media) field and taxonomy, these should both be parts of the events recipe anyway. Let's say the wine tours website wants blog posts as well as wine tours, and it wants to tag them with the same thing. This would be accomplished by the blog recipe and the events recipe using a common 'tags' field. By using a common 'tags' field you'd be able to list blog posts and events on term listing pages together.

      Similarly, by using a predictable media field, you'd be able to have hero/card SDCs/view modes for taxonomy and landing pages.

      This also brings up the question of what would happen if the events recipe hadn't been installed - would the AI agent suggest installing the events recipe, or would it start creating an events content type + views etc. from scratch?

      The event content type may or may not have tagging, media image field etc. yet because it's not stable, but these are gaps that would need to be filled by the recipe.

  • 🇬🇧United Kingdom catch

    More concerns from looking at the code base - just a very quick look though and some of these might be misinformed/mistaken:

    1. The taxonomy operations include delete, and I found a permissions check on the current user for 'administer taxonomy'. Do I understand correctly that the AI agent operates with the same permissions as the current user, and if so, that it can take destructive actions (like deleting a vocabulary) on the user's behalf? Or if it doesn't, how are those permissions set up and managed?

    2. I was not able to find where the text for the confirmation message from the AI agent comes from. e.g. the "I've created the vocabulary you asked for with the top wine regions" stuff. Therefore, I'm assuming that it's generated by the LLM itself based on the initial prompt. However, LLMs are not deterministic by design, so it seems possible that what the AI agent actually creates and what it reports that it creates could be different?

    #2 is potentially solvable by removing that step entirely and linking to a workspace instead, but I would like to know if I've understood that bit correctly or not - and if I have, then I think it makes a real preview even more critical, since the LLM could potentially misinform about what it's done or about to do.

  • 🇦🇺Australia pameeela

    @marcus_johansson @yautja_cetanu do you have any guidance on what things we should be looking at or testing? I tried it out with various site building prompts and got some weird results. Is there particular stuff that is working well or other stuff we should avoid?

    I tested the prompt from the demo to create a taxonomy with the top 20 wine regions in Europe. Although it reported that it created these as terms (with Bordeaux twice, with an odd note about this), it actually created one term "Top 20 Wine Regions in Europe".

  • 🇧🇪Belgium Dries

    Good feedback. I propose we explore four things:

    1. Build a prompt library: Can we create a collection of prompts with their expected outputs? This library would serve as a baseline to automatically test the performance and stability of AI agents. If a prompt fails in the real world, we can add it to the test library. While AI agents aren't fully deterministic, using some kind of fuzzy matching might work. This would create a good "feedback loop" that allows us to improve the accuracy of AI agents, and it could be used to assess the release readiness and quality of the AI agents (one possible shape for such a test case is sketched after this list).
    2. Make sure there is a "User review step": We should ensure that users can review what the AI agent will do before it takes action. Is this something we can enforce? While a review step isn't foolproof — users might not always know if the proposed solution is correct or best — it provides an opportunity to catch errors.
    3. Decide on the best permission policy: Today, the AI agents are bound to permissions. We can already disable permissions to prevent AI agents from making disruptive mistakes. However, we can debate the exact policy. AI agents could either be limited by the permissions of the current user or they could have their own permissions, possibly a subset of the users' permissions? We could discuss if the AI agent should inherit the user's permissions or if it should have additional restrictions?
    4. Explore rollback functionality using Workspaces: Can we use Workspaces to add rollback functionality? The AI module could create an automatic workspace, similar to how a content approval workflow might, giving users the ability to roll back changes. I'm not sure this is feasible, but if it is, it would offer an additional layer of control. This would probably take a lot of UX and development work.
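
    One hypothetical shape such a library entry could take, purely to illustrate the prompt + expected output + fuzzy-assertion idea (the file name and keys are assumptions, not an existing format):

        # create_wine_regions_vocabulary.yml - hypothetical prompt-library entry
        prompt: >-
          Create a taxonomy vocabulary called "Wine Regions" with the five most
          famous wine regions in Europe, and add a field for it to the Page
          content type.
        expected:
          vocabulary: wine_regions
          minimum_terms: 5
          field:
            entity_type: node
            bundle: page
            type: entity_reference
        # Because LLM output is not deterministic, assertions check structure
        # (term count, config entities created) rather than exact strings.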

    I'd love to see (1) and (2) explored first, with (3) and (4) as a potential future additions. Option (3) may become unnecessary depending on the success of (1) and (2). Finally, I believe (4) can wait, as it may require significant development effort and comes with its own limitations. Option (1) seems the most important, and option (2) can only help.

    To keep things in perspective, it's good to remember that AI agents will make mistakes — just like humans do when using our "manual UIs". The challenges AI agents bring aren't always new, nor are they always best addressed within the AI agents themselves.

    For example, over the years, I've manually created many content types with various fields and made plenty of errors along the way. I've created fields, added content, only to realize I used the wrong field type or misconfigured it. Fixing these mistakes often required deleting fields and starting over, or, worse, "massaging" them with manual database queries. We've all been there, I think. Many new users face these issues today with our manual UI. It's not new, per se.

    Two thoughts flow from this:

    • Yes, an AI agent without a review step could potentially create more problems. It's why we should explore adding a "Review step". But the opposite might be true as well; an AI agent might actually reduce mistakes, as it has more and better knowledge than new users ... We might find that a "Review step" is mostly for expert users that know how to validate an AI's actions.
    • For tasks like adding fields, an alternative solution could be to make it easier to fix mistakes, such as by allowing users to "transform" a field's type or configuration after it was created and used. Not easy to implement, of course, but it would help people making mistakes with either the AI agent or manual UI.

    Food for thought!

  • 🇬🇧United Kingdom catch

    #9 is a considerably worse scenario than I anticipated in #7, it means the current 'verbal' review step does not necessarily match the reality of what is done or reported to be done at all.

    @Dries

    We should ensure that users can review what the AI agent will do before it takes action. Is this something we can enforce? While a review step isn't foolproof — users might not always know if the proposed solution is correct or best — it provides an opportunity to catch errors.

    A combination of workspaces, wse_config (pre alpha so would need work, but also likely to be wanted for experience builder once it handles site chrome like editing the page title via XB, there's an issue about this already somewhere) and trash module would allow staging of both content and config entities, as well as deletions, to a workspace with full preview.

    You could then approve, reject, or modify those changes before they affect the live site.

    Rollback is also possible although at the moment only linearly (i.e. you can undo the last change, but not the last change but one unless you roll back both, but this gives you a second opportunity to back out at least even if you publish the workspace). I think rollback is currently also only in https://www.drupal.org/project/wse although more stable than wse_config. The workspaces talk from Drupalcon Barcelona shows all of this in action if people aren't familiar with it.

    This should stop most damage being done, although if the AI agent can uninstall modules, pretty sure there's no way to stage that - this should probably be completely prevented and a link provided to click on to do it yourself instead.

    To actually make the preview more meaningful, links to specific pages to review would be good - e.g. if the AI agent changes a view, link to that view, if it creates a vocabulary and terms, link to the vocabulary admin page. Could go even further here and generate a tour with links.

    For example, over the years, I've manually created many content types with various fields and made plenty of errors along the way. I've created fields, added content, only to realize I used the wrong field type or misconfigured it. Fixing these mistakes often required deleting fields and starting over, or, worse, "massaging" them with manual database queries. We've all been there, I think. Many new users face these issues today with our manual UI. It's not new, per se.

    This is true, but you also learned yourself each time you did this. Someone relying on an AI agent does not go through the process themselves, so e.g. as in my first comment here, won't even see the choice between image and media fields to realise they might have made a different choice. The expectation is always going to be higher that the system does something correctly when it does it 'for you' than if you do it yourself.

    The good news is that the vast majority of the hard work has already been implemented in workspaces and it's mostly stabilising a couple of things and linking it up, which are also needed for other Drupal CMS-related initiatives. To get there though the work needs to start now, even if that solution obviously won't be ready for the 1.0 CMS release, so that it's ready as soon as possible.

    I personally think that if you can get in a situation where the AI agent tells you it made 20 taxonomy terms, but when you go to check it actually made one taxonomy term with the label '20 taxonomy terms', people will immediately (and correctly) blame the product for this - no human would ever make the same mistake and then 'lie' about what they've done.

  • Pipeline finished with Failed
    about 2 months ago
    Total: 435s
    #332066
  • Pipeline finished with Canceled
    about 2 months ago
    Total: 440s
    #332067
  • Pipeline finished with Failed
    about 2 months ago
    Total: 514s
    #332072
  • Pipeline finished with Failed
    about 2 months ago
    Total: 843s
    #332073
  • Pipeline finished with Success
    about 2 months ago
    #332154
  • Pipeline finished with Success
    about 2 months ago
    #332153
  • 🇩🇪Germany marcus_johansson

    @phenaproxima - the changes requested should have all been looked at and fixed

  • 🇩🇪Germany marcus_johansson

    @pameeela - I have made some major changes lately to the stability of the Field Agent, so that each of the storage settings, config settings, form settings and display settings is its own mini-agent. However, I don't think they are related to or fix the issue you are seeing, since that is the Taxonomy Agent.

    The following works well for me using OpenAI as the provider most of the time out of the box after installing it:

    On the page content type, could you create a category called "Wine Regions" that should be populated with the 5 most famous Wine Regions in Europe and create a field for it called "Wine Regions"?

    Could you provide the prompts you wrote and which provider and I'll see if I can replicate it and figure out what is going wrong?

  • 🇩🇪Germany marcus_johansson

    I have done some general improvements to the stability of the Field Agent, so it can handle complex requests with many instructions on different parts of the field creation process, for instance this works most of the time:

    "Can you create an image field on the Article and call it "Marcus Images". I want it to store it secretly and I want to be able to crop the image after I uploaded it on the form. I also don't want to add any metadata like alt text when uploading it. I need you to make sure that only images over full hd is allowed and only jpeg-images that is under 5 mega bytes. When you watch the image as a normal user I want it to have a 16:9 ratio and if you click on it you see the original image. Also remove that stupid title when using the image."

    It works similarly to how you would set up a real field (a sketch of the kind of config this ends up producing follows the list):

    • Check if the entity (and bundle) exists.
    • Check that the field doesn't already exist.
    • Check that a matching field type exists.
    • Figure out if it needs specific field storage settings, or save as default.
    • Figure out if it needs specific field config settings, or save as default.
    • Figure out if it needs specific field form settings, or save as default.
    • Figure out if it needs specific view display settings, or save as default.
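
    For illustration, the end result of those steps is ordinary field configuration. A rough sketch of the storage part for the image-field prompt above, with assumed machine names (not taken from the module itself):

        # field.storage.node.field_marcus_images.yml - assumed machine name, for illustration
        langcode: en
        status: true
        id: node.field_marcus_images
        field_name: field_marcus_images
        entity_type: node
        type: image
        settings:
          uri_scheme: private   # "store it secretly" maps to the private file scheme
        cardinality: 1
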
  • 🇬🇧United Kingdom yautja_cetanu

    Thanks for your review and glad it exceeded your admittedly low expectations! I'll try and reply to all your concerns.

    • Firstly, whilst there was some prompt-crafting for the demo, we tried to craft in the opposite direction by making it harder for the AI and closer to what a Sarah persona would say. But as you can imagine it's much harder to do something real in the wild than in a demo, so starting next week we will begin testing the Agents with real people fitting the persona and report back with statistics, so we can know what happens without any prompt crafting.
    • We're opting to focus on Drupal CMS release to have a small number of agents working really well rather than all the agents. Views and Migrate won't be in scope for 1.0
    • Taxonomy was where we had to do the most prompt-crafting for the demo. As of this morning we have significantly improved the field agents and also added all the features for the core fields, including configuration and display.
    • As Drupal CMS developers we are likely going to have to start building up documentation of Drupal best practices, as you described with tagging. We tested with Claude and found that if you asked it "In Drupal should I use a select list or Taxonomy" it was quite good at explaining how to decide. With specific examples of "wine regions" vs "rough expense", it was good at selecting taxonomy for one and a select list for the other. However, going forwards, we shouldn't rely on a model's internal knowledge, especially if we want to use smaller open-source models. So this will be something we need to work on.
    • To add further to the complication, I don't think these things are fully agreed on by the Drupal community. With all our clients we would never enable tagging in the way you've suggested as the information architecture is too important. We have tended to use "Other + Suggest" so that a member of staff can choose to add it or not. Now I'm not suggesting my view should be the default in Drupal CMS. But decisions like this likely need to be owned, thought through and tested by someone to mould the vision of Drupal CMS vs Drupal core
    • Similarly we can prompt the agents to present options to Sarah as you've suggested. One issue we had during the demo was that for Wine Regions it would regularly create the vocabulary but not attach the field to the entity. So we've prompted the Agent to always try and attach it to an entity every time. If it can't work it out, it asks the user if it's sure it wants a vocabulary not attached to something. The issue is that asking the end-user too many questions inherently ruins a lot of the usability. So we have to decide what level we want.
    • These prompts are all stored in YAML and so can be overridden on a site by site basis. Our plan is to refactor agents at some point to make it easier to edit the prompts in the UI and to make it easier to understand the order and flow of them for site builders. It might be possible as well for us to build a UI to create "Tools/ actions" (The abilities the agents have on the site) through the UI. It's a question of priorities.
    • Media is much more complicated to use than image fields; even as a seasoned Drupal site architect I find it difficult to know when to use one or the other. I think if we can make the image field work well we can port our work to a specific Media-in-Drupal-CMS agent for images. But because of the flexibility of what Media can do, it will be a lot of work to make an Agent that can handle every possible configuration of media entities. (Note: Marcus has told me that since yesterday it will use Media if you ask "Should I use Media or Image" and then choose it. But more work will need to be done for this.)
    • We are working on Agents and Workspaces; we believe it's one of the main new features we need for release. Hope to have something to show early next week. We have the ability for agents to roll back specific actions, but workspaces will allow rolling back a number of actions. Similarly, we initially had something called blueprints that would show the end-user the YAML of the actions it will take and have them click approve before it implements them. It's still in the agents module but there is no UI for it in the chatbot yet. I think this will appeal to developers more than marketeers/site builders.
    • As previously stated, migrate will not be in version 1.0. I've written a lot about this and we've done a lot of experiments and made some demos of one click full wordpress migration (Theme, design, layout, content types, content, everything). We think in the short term tools to help sitebuilders speed up migrations considerably are more likely to be successful than a true magical one click migration for reasons you've stated. I can go into detail elsewhere when we release our migrate agents properly.
    • Re: your question about the events recipe, it's up to us to decide. Long term I would like the Agents to work with the project browser to suggest starting points and even create their own recipes (for example, for adding a reviews feature, it could find the reviews module and configure it for wine tours). We have started exploring it on http://askdrupal.com but it's not for version 1.0.

    Re: Your questions on the codebase

    • My plan is that AI Agents will have a role assigned to them. The user will also have a role, and the agent will perform a task only if both the Agent and the User have permission (a rough sketch of what such an agent role could look like follows this list). However, there are a couple of open questions with it (do I want to have agent roles on the same page as user roles?). So for now we will remove the abilities that the agents shouldn't have permission for in Drupal CMS. We will then build agent-specific permissions into the AI Agents module.
    • This is a good look at what agents are: https://github.com/openai/swarm . The agents in Drupal were built before this was released, so there are differences (we use Drupal to orchestrate the workflow rather than the agents doing it themselves, for example). But at their heart agents are Instructions (a Prompt) and tools that do things, where tools can also hand off a task to another agent. The tools are coded and either link to a specific Drupal function or a bunch of them. So we will remove / already have removed the ability for agents to delete. They don't just get full rein of everything Drupal does.
    • Re: the delete. We had a demo to make people feel more comfortable about this where the agent would ask to delete something. It would try, find it has no permissions, tell the user and then help the user do it themselves. We removed that demo last minute but it's why the code was still there.
    • You are correct that confirmation messages are provided by the LLM and there is a chance it can be different. In fact we might actively want it to be different, as we want the LLM to use plain English instead of specifically Drupal terms. I think Workspaces can help. Whilst the response of an LLM is non-deterministic and not repeatable (although that changes with a temperature of 0, which we could set for Drupal CMS), there are things they are more or less likely to get wrong. It's unlikely the LLM will report what it's done differently from what it's actually done.
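
    A rough sketch of how a dedicated agent role could look as ordinary Drupal config under the "both the agent and the user need the permission" model described above (the role id and permission list are illustrative assumptions, not the module's actual config):

        # user.role.ai_agent.yml - hypothetical role the agents would act under
        langcode: en
        status: true
        id: ai_agent
        label: 'AI Agent'
        permissions:
          - 'administer taxonomy'
          - 'administer node fields'
          - 'administer node form display'
          - 'administer node display'
          # deliberately no delete/uninstall permissions; an action would run
          # only if both this role and the current user's roles allow it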

    If you're interested in a deeper dive or have more questions about this, Catch, me and Marcus are around on Slack for a chat or huddle! We are focusing on this stuff more or less full-time until the Drupal CMS release.

    This isn’t finished yet but we also have a roadmap here: https://www.drupal.org/project/ai/issues/3485451 🌱 [Meta] Path to rc1 Active for the underlying AI module to get to version 1 (Even if not all the modules are used in Drupal CMS).

  • 🇬🇧United Kingdom catch

    Thanks for the responses, short on time here so just addressing the one point for the moment:

    You are correct that confirmation messages are provided by the LLM and there is a chance it can be different. In fact we might actively want it to be different, as we want the LLM to use plain English instead of specifically Drupal terms. I think Workspaces can help. Whilst the response of an LLM is non-deterministic and not repeatable (although that changes with a temperature of 0, which we could set for Drupal CMS), there are things they are more or less likely to get wrong. It's unlikely the LLM will report what it's done differently from what it's actually done.

    In Pam's testing above that's what happened, embedding the screenshots from #9 here for easy reference:

    This is probably the first time the AI recipe has been tested by someone not directly working on it, understandable given it was only posted a few days ago. So it sounds like it happened at least once within one or two sessions of using the AI agent, which for me is a very high rate?

  • 🇬🇧United Kingdom yautja_cetanu

    I replied to Catch before I saw your message Dries as it was a long reply! Very much agree with most of it.

    1. Yes I think we need that. We're starting it here: https://www.drupal.org/project/ai_evaluations and hope to get a demo of it working on Monday. We will then need to export the prompts and what they do and store them publicly somewhere. We will also be speaking to the Drupal CMS privacy track to see how we can do this securely with GDPR in mind. This can help us improve the prompts but also eventually fine-tune an open-source model.
    2. It is something we can enforce and there are a number of ways of doing this. It does make everything much less fluid though and what if there are multiple steps the AI needs to do?
    3. Agreed, for now I think we just remove features we don't want in Drupal CMS. We should do Agent level permissions later with a recipe that defines what we want for Drupal CMS. (We have an agent for installing and removing modules contributed by the community but this shouldn't be enabled by default and maybe shouldn't be in the agent's module, it was more a proof of concept for someone)
    4. Yes, hope to have something to show next week.
    • I agree with you on the Review Step, early testing with it confused people who didn't get Drupal. Fundamentally Drupal has all its own strange terminology (like taxonomy) and so a review step will introduce people to lots of terms they don't know if they need to know. It's hard to know how to review something when you don't know what it is you're reviewing and the consequences.
    • I definitely find the stress of picking which abstraction to go for is a big headache when starting with Drupal. It's getting easier but I'm sure everyone remembers having to figure out "Do I use a node or entity?". One cool thing about AI migration if we get it working is it will become easier to convert a node into an entity or a select list into a taxonomy or an image field into a media field. So when we're further ahead with migration it might be one to revisit.

    Re: Catch

    I mostly agree with your response and definitely think Workspaces looks like an amazing solution to this.

    Personally I've always thought the Minikanban approach here: https://www.youtube.com/watch?v=tXvIdjcB718 could be really good, as it can describe every step with images, links etc. Maybe it could be just a list of tasks in order instead of a kanban.

    I've seen how other systems implement Agents and they always describe what they are going to do with each step and then output the JSON to do the steps. We had reasons not to do this but I think it might be good for us. One reason is that this counts as "Chain-of-thought" and will likely result in the output being more accurate. It will also help with debugging even if it doesn't perfectly match the JSON. We could also use AI review agents to check the description matches the JSON.

    I don't know if you're correct about expectations though. I don't know if the expectations are always going to be higher when it does it for you compared to if you do it yourself when it comes to GenAI. The ability for AI to get things wrong and hallucinate is so widespread, and you encounter AI getting something wrong so soon, that I think people will start to trust AI like they would trust a human intern, not like they would trust a calculator.

    Even with perfect AI models, I think meaningful human-in-the-loop will always be necessary, in the same way I would review changes if a genius intern joined my organisation and started changing things on our website. It's one reason why I think Drupal will become the best and safest AI orchestration platform out there.

    I think an expectation that AI will get it wrong 80% of the time and I'll have to fix it will be fine. If you aren't a developer but a project manager/owner/designer etc., I don't think we have expectations that developers will get things right either, and most of us have to build into our process some way of checking it. So with time and testing it will be good to fully research what expectations the persona we are aiming at has.

  • 🇬🇧United Kingdom yautja_cetanu

    Catch RE the example above. This seems to be a bug in the code not the LLM hallucinating. This will be caught by better automated testing with more test coverage.

    When we have the Evaluations module next week we'll be able to see the prompt and response to see if this is a case of the LLM hallucinating or not.

    There are two things above. The Agents are doing the work; there is something called the "Assistant API" that orchestrates the agents and displays answers to the end-user. Our first look is that the Agent replied with an error that the Assistant didn't catch properly. So it's a bug rather than an inherent AI issue.

    Obviously both matter but they are solved differently. We'll know more when we can get proper logs.

  • 🇬🇧United Kingdom catch

    Catch RE the example above. This seems to be a bug in the code not the LLM hallucinating.

    Well I think it is just this though?

    You are correct that confirmation messages are provided by the LLM and there is a chance it can be different.

    For example, and apologies for any errors, going on what you've said so far rather than actually debugging; feel free to correct:

    1. User prompts 'Please make me a vocabulary with the top 20 wine regions'.

    2. AI Agent: 'Would you like me to create a taxonomy vocabulary with the top 20 wine regions'. - LLM request 1

    3. User prompts: 'Yes please'.

    4. AI agent: [Creates a vocabulary with the top 20 wine regions on the Drupal site, field etc.] LLM request 2

    5. AI agent: 'I've just created a taxonomy vocabulary with the top 20 wine regions [...] LLM request 3

    In Pam's example from #9, it looks like on step [4] the LLM responded with 'The top 20 wine regions' literally instead of a list of the 20 wine regions, and this is the reason for the mis-match.

    This means that from the initial user prompt and confirmation, there are three LLM requests which can return three different lists of wine regions or literal 'top 20 wine regions' or whatever comes out of the LLM each time (note that even the 'correct' list had Bordeaux twice, the second time with some kind of tongue in cheek comment, probably taken from the blog of a Bordeaux fan).

    From your response that

    In fact we might actively want it to be different, as we want the LLM to use plain English instead of specifically Drupal terms.

    this is in some ways 'by design' with the current model.

  • 🇦🇺Australia pameeela

    I think an expectation that AI will get it wrong 80% of the time and I'll have to fix it will be fine.

    I'm not sure if this is a typo? I think the expectation is that we need the AI to get it right 80% of the time.

  • 🇱🇹Lithuania mindaugasd

    Or 90% right?
    Primary value of this is magic experience to the user. And primary user of this is non-technical user.

  • 🇬🇧United Kingdom yautja_cetanu

    Yes it is a typo!! 80% right!

  • 🇬🇧United Kingdom yautja_cetanu

    Hi Catch, I see what you mean, especially your last point: we may by design have things that could increase the probability of hallucinations. We really need to see the prompts. We're planning to do a very initial version of a systematic test of the AI agents tomorrow. I'll write up the issue about it today and post it here; if we could see the prompts above we would know.

    I think by far the most important thing is to try these prompts out in reality with real people who might use this and know little about Drupal, and have a full log of exactly what is happening with the AI. I also believe we should ask people "Did it work?" and "Are you happy with this / Would you use this again?", as I believe it's important to answer the "Expectations" question you raise.

    We might find that for non-Drupal users a low probability of success is still really cool, as they had low expectations of AI, as you had. Or we might find they treat AI like traditional computers, and even if the success rate is high they get frustrated when it doesn't work.

    So if we describe the problem:

    Potential issue: the AI reports in plain English what it did, but in reality it does something different.

    There are 3 possible causes:

    1. Full hallucinations: The AI describes what it does in one response and in the same response does something completely different (says it creates 4 terms but actually creates 5, or maybe adds an image field).
    2. Introduces bugs/typos in its code: The AI describes what it does and then tries to implement it but types out the instructions incorrectly. It might have a typo so the JSON can't be parsed, or might use the incorrect structure to define something.
    3. Bugs outside of AI: The AI gets everything correct but the Drupal code around it implements things incorrectly meaning the end-user gets told something has happened but it doesn't happen.

    For the problem you're speaking about all 3 are issues as they all result in the same thing as far as the end-user is concerned. Also all 3 are potentially side-stepped through the Workspaces approach you suggested. But they have different approaches to fixing them if you want to avoid it happening at all.

    I have an intuition from what I've seen that full hallucinations (1) are unlikely, but 2 is likely to very likely for some smaller open-source models, and 3 is likely in the way bugs are likely in all code. However, I do agree with you that to some degree we have made 1 more likely by design: we want it to describe what it does in plain English, not using Drupal terms, so its description of what it's done is plain English. Currently there are also different LLMs writing the logs of what is done than the LLMs writing up what has been done.

    I wonder if we want to do something like create a "Review" agent. Have the Agent that implements the taxonomy write up plainly for the logs what it has done, but describe it in Drupal language. Then we have a separate agent that actually goes in and checks, and then writes a description of what has happened. But we should see the results of evaluations first.

  • Pipeline finished with Failed
    about 1 month ago
    Total: 623s
    #336512
  • Pipeline finished with Failed
    about 1 month ago
    Total: 1037s
    #336513
  • 🇩🇪Germany marcus_johansson

    I have spent the last few days trying out and investigating the Workspaces module and also doing some structural changes based on the feedback here. For more info on this see this issue here: https://www.drupal.org/project/drupal_cms/issues/3487025#comment-15854019 📌 AI -KR3 - Making experimenting feel safe. Active

    I’ve added to the Merge Request a slightly more verbose response from the LLM so it gives you links to places where you can check what it's done.

    Also, given Catch's concerns about AI hallucinating what it's done, I've added a drop-down that gives a detailed log of what has been done, generated by Drupal, not an LLM, so it is accurate. We can turn off the details in Drupal CMS if we decide it's not good for the ambitious site builder, but if it's there it will help with debugging and with developers who test it.

  • Pipeline finished with Failed
    about 1 month ago
    Total: 612s
    #336603
  • Pipeline finished with Failed
    about 1 month ago
    Total: 637s
    #336604
  • 🇬🇧United Kingdom yautja_cetanu

    https://www.drupal.org/project/drupal_cms/issues/3467680 - I've added a roadmap and all the important related issues.

  • Pipeline finished with Failed
    about 1 month ago
    Total: 761s
    #337455
  • Pipeline finished with Failed
    about 1 month ago
    Total: 774s
    #337454
  • 🇬🇧United Kingdom tonypaulbarker Leeds

    So far I think the discussion has been of a technical nature, concerned with how we convert a request into code, essentially. I don’t think I have read anything about validating whether a request is a good idea and disambiguating the scope of the request?

    It’s something that human web support agents do before they process any request from a client.

    Some of what the user asks for will conflict with what is good for them and their website, their aims and even with the tools they have chosen. The best outcome may look very different to the initial request. We should be encouraging best practices and have AI acting as an adviser because we cannot expect our target to have expertise. What they ask for, like disabling privacy and accessibility features, may even be illegal. It’s something that can help Drupal CMS stand above competitors. If we don’t do that, it won’t take long for many inexperienced editors interacting with an AI agent to render their website ineffective on the front end.

    A couple of examples:

    If we have a user that asks to disable alt text, I would like if we first refer them to some information about alt text (we may not have that information written or available yet). In the initial request they may not understand what it does or why it’s important for accessibility and SEO. We could even see whether they have some tools installed that indicate that these things are important. And then some response and prompt so if they understand and are sure we action what they have requested. Maybe we even refuse this request and explain why.

    A similar thing might apply to understanding what will happen if we use high-resolution images: the impact on performance, the implications of using private files, and so on.

    In a human interaction for the HD image request, from experience the advice I would usually give would be to ensure high-resolution images were uploaded rather than to change the image upload rules, and I would explain that the image that will be rendered will depend on the device and browser, and that the system image styles will optimise them. If there's some work for AI to do, it's to understand the context in which the image will be used and optimise the image styles. I was chatting to someone just a couple of weeks ago who was proud that the images on his site were high res. After a quick discussion, he had no idea that loading many large image files on a mobile screen would result in a slow page and a large download of data, and that the pixels rendered were limited by the device.

    You could have to present a lot of information to validate requests in this way, so I recognise that it’s early days and hard to find a way to strike a good balance to have interactions of a comfortable length. But to succeed I think it’s something we have to work toward.

  • 🇬🇧United Kingdom yautja_cetanu

    Hi Tony.

    Thanks for your thoughts and your ideas, and yes, what you've said is very important. Our initial demo on prompts was to show that it COULD do something, but one benefit of AI agents is that we can feed best practices into them, exactly as you've mentioned. We can prompt the AI Agents to ask Users specific questions or to assume certain things.

    For example, if a user wants to categorise a Content Type, we can prompt the AI to decide whether it should be a List (Text) Field or Taxonomy by looking at the type of field they are asking for and maybe asking the end-user questions. Above you can see discussion about whether "Select Lists" or "Tagging" are best practices.

    Our Goal right now is the evaluations module. When you ask the AI Agent to do something you can then click thumbs up and thumbs down on what the AI Agent did which is then stored and reported. It can be exported and so if someone has issues, the eval can be exported and someone else can then debug it, by importing the history of prompts and responses.

    This will also allow end-users who are ambitious site builders (though they probably need some knowledge of Drupal) to see what the prompts are and then change them in real time to see if they get a better result. Once we have this (it's almost there, it's just that the reports are a little confusing to follow atm), community members like yourself can try out the AI Agent, click an evaluation, open it up, see the prompts for all the agents involved and suggest changes.

    If you use Claude or OpenAI, they have been trained on publicly available Drupal data, the code and likely drupal.org, and so they have a good idea themselves of these kinds of practices. However, for smaller open-source models we need to put it into our prompts directly.

    The prompts for the AI agents are stored in layers (a rough sketch of this layering follows the list):

    • Initially in a YAML file that comes with the AI Agent module.
    • But they can be overridden in the DB for a specific site.
    • Also there is a place in the AI Agents settings where you can add your own instructions on top of the provided YAML (Will be better for taking updates to the underlying prompt later).
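
    Roughly, that layering could look like this (the file name and keys are assumptions to illustrate the idea, not the module's actual schema):

        # prompts/taxonomy_agent.yml - hypothetical shape of a shipped agent prompt
        agent: taxonomy_agent
        instructions: >-
          You help site builders manage taxonomy vocabularies and terms.
          Prefer attaching a new vocabulary to an existing content type as a
          field, and ask before creating a vocabulary that is not attached to
          anything.
        # A specific site can override these instructions in the database, or
        # append its own guidance via the AI Agents settings form, so updates
        # to the shipped YAML can still be taken later.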

    As a result, testers like yourself could then post patches to the AI Agents module or a Drupal CMS recipe to get these kinds of best practices into the prompts themselves.

  • 🇬🇧United Kingdom tonypaulbarker Leeds

    Interesting stuff.

    This is a rhetorical question. How could we be weighting documentation for relevance?

    Certain pages are ‘official’ documentation, and more recently updated content is more likely to be relevant. Some other discussions may suggest solutions that are less optimal. I think we want to steer the models toward the most relevant stuff and help them make choices about conflicting information.

  • Pipeline finished with Failed
    about 1 month ago
    Total: 622s
    #349282
  • Pipeline finished with Failed
    about 1 month ago
    Total: 633s
    #349281
  • Pipeline finished with Failed
    28 days ago
    Total: 709s
    #352830
  • Pipeline finished with Failed
    28 days ago
    Total: 841s
    #352829
  • Pipeline finished with Failed
    27 days ago
    Total: 616s
    #354005
  • Pipeline finished with Failed
    27 days ago
    Total: 627s
    #354006
  • 🇺🇸United States thejimbirch Cape Cod, Massachusetts
  • 🇺🇸United States phenaproxima Massachusetts

    I actually think this isn't far off!

    The main problem is that it is very out of date with 0.x. We need to bring it back into sync, move some stuff around, and tag releases of the dependencies.

  • 🇺🇸United States phenaproxima Massachusetts

    To save a little bit of time and tedium, I took care of the merge conflicts and synced up the branch with 0.x.

  • Pipeline finished with Failed
    23 days ago
    Total: 93s
    #357040
  • Pipeline finished with Failed
    23 days ago
    Total: 93s
    #357043
  • Pipeline finished with Failed
    23 days ago
    Total: 671s
    #357045
  • Pipeline finished with Failed
    23 days ago
    Total: 680s
    #357319
  • Pipeline finished with Failed
    23 days ago
    Total: 795s
    #357320
  • Pipeline finished with Canceled
    23 days ago
    Total: 235s
    #357590
  • Pipeline finished with Canceled
    23 days ago
    Total: 240s
    #357591
  • Pipeline finished with Failed
    23 days ago
    Total: 712s
    #357595
  • Pipeline finished with Failed
    23 days ago
    Total: 735s
    #357596
  • Pipeline finished with Failed
    23 days ago
    Total: 44s
    #357609
  • Pipeline finished with Failed
    23 days ago
    Total: 45s
    #357610
  • Pipeline finished with Failed
    23 days ago
    Total: 682s
    #357626
  • Pipeline finished with Failed
    23 days ago
    Total: 1766s
    #357627
  • Pipeline finished with Success
    23 days ago
    Total: 525s
    #357747
  • Pipeline finished with Success
    23 days ago
    Total: 546s
    #357746
  • 🇩🇪Germany marcus_johansson

    This should be ready for review now again.

  • 🇺🇸United States phenaproxima Massachusetts

    Looks good to me! My feedback is all relatively minor and should be quick to address.

  • Pipeline finished with Failed
    23 days ago
    Total: 668s
    #357808
  • Pipeline finished with Failed
    23 days ago
    Total: 681s
    #357809
  • Pipeline finished with Canceled
    23 days ago
    Total: 491s
    #357847
  • Pipeline finished with Failed
    23 days ago
    Total: 495s
    #357848
  • 🇺🇸United States phenaproxima Massachusetts

    Looks great.

    I'm sorry we had to remove the Welcome link but you're right, we can't add it to the starter, and we also can't guarantee the starter has run. That's a tricky problem. The right approach, probably, is either to have the AI installation module create it dynamically, or move it to a different menu. Either approach would require some consideration from the product owner, so we probably don't want to block this on that.

    Otherwise this is a perfectly fine looking recipe (code-wise, anyway).

  • Pipeline finished with Failed
    23 days ago
    Total: 566s
    #357860
  • Pipeline finished with Failed
    23 days ago
    Total: 733s
    #357874
  • Pipeline finished with Skipped
    23 days ago
    #357918
  • 🇺🇸United States phenaproxima Massachusetts

    Whew! Merged into 0.x. Thanks!

  • Automatically closed - issue fixed for 2 weeks with no activity.
