Training a Large Language Model for Code (Code LLM) - an expert Drupal A.I.

Created on 21 March 2023, almost 2 years ago
Updated 13 June 2024, 7 months ago

Problem/Motivation

I was asking GPT-4 Drupal questions, and I received better and more personalized answers (and very quickly) than documentation could provide.

ChatBots is the way of the future to learn/create new things.

But it could be still a lot better. This issue grew from #3336313: Use ChatGPT for solving Drupal issues to increase rate of development β†’ .

We could say many fold improvement is possible, given GPT would be aware of documentation, Drupal codebase, all modules codebase, API docs, change logs, commit history, all issues and comments, translations, even books.

Having DrupalGPT with extensive knowledge of everything around drupal, would enable to increase rate of innovation as described and demonstrated in parent issue.

Steps to reproduce

1. Ask GPT-4 "Make an extensive list of places and resources to learn Drupal"
2. GPT-4 produces very high quality answer.

This is because GPT's main learning source were contents of high quality links on Reddit website.

Proposed resolution

There is still an open question how to teach AI about Drupal to a very high degree of expertise, this issue is created to explore this.

Remaining tasks

  1. Gathering Drupal data
  2. Processing data for training
  3. Selecting an existing AI model for Drupal LLM to extend
  4. Training a model.
  5. Creating a Drupal module which will connect to Drupal LLM (through Rest API)

Understanding chatGPT's current level of Drupal expertise

Done. Result: expertise is quite good.

Gathering drupal data

Drupal data sources:

Processing data for training

  • ChatGPT suggested to use code of quality modules, recent versions of Drupal, to not include commits of temporary fixes, redundant changes, intermediate code, and other.

Selecting an existing AI model for Drupal LLM to extend

Criteria:

  • Model has been trained on code (example - StarCoder)
  • Model is small but smart (example - Orca)
  • Model knows all APIs of the world (example Gorilla with 1'600 APIs or ToolLLM with 16'000 APIs)
  • Model has license compatible with Drupal (enter example here)
  • Model has huge context window (enter example here)

Training a model

Creating a Drupal module which will connect to Drupal LLM (through REST API)

Libraries which allows to run AI models and provides REST API which Drupal can connect to.

✨ Feature request
Status

Active

Component

General

Created by

πŸ‡±πŸ‡ΉLithuania mindaugasd

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

  • Issue created by @mindaugasd
  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    Added info to description about where GPT's training data originates from.
    It originates from Reddit quality links.

    It is unclear how to improve GPT going forward with more extensive knowledge of Drupal internals beyond what Reddit can provide.
    Possibly this is a matter of OpenAI team's decisions.

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    To achieve this, we likely need unsupervised teaching on all Drupal data.

    But given romours: OpenAI is using supervised learning to teach AI how to code going forward.
    It means that they cherrypick the correct data while employing thousands of programmers around the world:
    https://futurism.com/the-byte/openai-replace-entry-level-coders-ai

    That being the case, Drupal can get overlooked if not specifically chosen.
    Given Drupal is absent at Github, this is a real possibility.

  • πŸ‡­πŸ‡ΊHungary GΓ‘bor Hojtsy Hungary

    I would argue that those reddit links, etc. reference drupal.org docs including api.drupal.org as well. So I am not sure we should be transitioning (ie. removing the existing training data)? I'm not sure exactly what are you proposing here?

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    @GΓ‘bor Hojtsy issue title is not the best. I mean, people will transition from using documentation, to using ChatGPT as main information source.

    Proposed resolution

    There is still an open question how to teach AI about Drupal to a very high degree of expertise, this issue is created to explore this.

    I know smaller open source communities who are completely uncovered by ChatGPT, because they were not exposed to Reddit like Drupal is. So they unable to use ChatGPT for their development.

    Going forward, it is important to make sure Drupal is included in OpenAI training. There is a real possibility Drupal can be excluded. I don't know how to solve this, it is an open question.

    Also there is an opportunity to feed 90% of Drupal data, instead of just 5% (guessing). This would greatly improve quality of ChatGPT responses, and allow to do things described in parent issue.

    Recently I posted a conversation transcript with ChatGPT showing the shortcomings and possibilities of DrupalGPT if amount of training data would be increased.

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    One drupalist shared info about project called https://astro.build/ for website building.

    They have created https://houston.astro.build/, which is similar to proposed DrupalGPT.

    As their AI describes itself:

    I've read the entire Astro Docs (a few thousand times). Ask me how to do anything in Astro. I'll do my best to answer!

    GitHub: https://github.com/withastro/houston.astro.build

    Tool was built following this tutorial: https://dagster.io/blog/chatgpt-langchain

    Using: https://github.com/hwchase17/langchain

    I asked GPT-4 to summarize the tutorial:

    ChatGPT, released a few months ago, demonstrated the power of large language models in answering questions from a broad knowledge set. The Dagster core team sought to create a Slack bot using GPT-3 to answer basic technical questions about the Dagster GitHub project. However, fine-tuning GPT-3 was seen as expensive, and the team chose to proceed without it.
    
    Instead, they employed the LangChain library, which included Data Augmented Generation to combine GPT-3's capabilities with the knowledge encoded in the Dagster documentation. The team succeeded in building a working chatbot by engineering prompts for the large language model and using a vector space search engine to improve efficiency.
    
    However, the chatbot experienced slow startup time, and the team utilized Dagster, a data orchestration solution, to build a data pipeline with caching capabilities for better performance and reduced costs. With this solution in place, the chatbot improved its response times and leveraged Dagster's features to deal with scheduling, retraining, and error handling. Future work could focus on integrating with Slack, refining document chunking, and experimenting with fine-tuning.

    Follow up question: Does LangChain library use GPT-3 in some shape or form?

    Yes, the LangChain library is designed to work with large language models, such as GPT-3. It provides a way to combine GPT-3's capabilities with additional sources of knowledge or computation to develop more powerful applications. In the example from the article, the authors used LangChain to augment GPT-3's knowledge with the information encoded in the Dagster documentation, enabling better question-answering performance in their chatbot.

  • πŸ‡ΊπŸ‡ΈUnited States matt_paz

    One drupalist shared info about project called https://astro.build/ for website building.

    Said, Drupalist here chiming in with more info on this.

    Here's a link to the original thread on slack:
    https://drupal.slack.com/archives/C04NCQGGX0E/p1679583124916809

    Subsequent to this, I noticed some additional chatter on Astro's Discord by Fred K. Schott (one of the maintainers).

    https://discord.com/channels/830184174198718474/872579324446928896/10899...

    Excerpting from the thread above he went on to note:

    haha I'd actually trust something like https://www.kapa.ai/ more than Houston AI at this point!

    Houston AI was a fun experiment at a time when no one was doing AI docs, but I think has suffered from lack of ownership / love and support since its launch

    Definitely something to ponder should efforts from this issue take root.

  • πŸ‡ΊπŸ‡ΈUnited States matt_paz

    This is because GPT's main learning source were contents of high quality links on Reddit website.

    Also not sure that it exactly true (needs citations), but I grok the sentiment.

    I'm not sure exactly what are you proposing here?

    I wonder if the original issue here might need to be reframed a bit. Possibly as an exploration for how the Drupal Association might leverage AI/ML (using ChatGPT or other sources) to provide additional, drupal specific tooling (in this case, documentation, or in the case of https://www.drupal.org/project/ideas/issues/3336313 πŸ“Œ Use ChatGPT for solving Drupal issues to increase rate of development Active , maybe ChatGPT plugins, or maybe some Drupal version specific versions of Co-Pilot style add-ons) that could help reduce the so-called learning cliff with Drupal and improve adoption and D/X overall?

    The space is moving so fast, it is hard to determine what the right entry point might be, but I wouldn't want to damper interest/enthusiasm. Just maybe pair it with some pragmatism about scope and precision for where to start -- and what the trajectory might look like?

    Definitely lots of potential here!

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    For now, I defined 3 steps in issue description and I think it will remain true into the future as well.
    We will need 1) to prepare drupal data, then 2) to train a model, and then 3) to use it (making accessible to everyone).

    To define even more specific route, requires either cutting edge expert knowledge, or waiting until go-to solution is sorted out across the world.

    For now, we can start thinking how to prepare drupal data.

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    kapa.ai learns from your docs, code, chat logs, and GitHub issues to generate a chatbot that answer developer questions automatically

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    Updated issue description with information how to gather DrupalGPT training data, testing existing chatGPT knowledge and further detailed plan.

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    Every drupal development question I ask, GPT-4 answers it correctly.

    I would rate current state 9 out of 10!

    Missing piece is up-to-date information, and something else I have not discovered yet (or forgot).

    And talking about discovery, GPT-4 is very good for learning new drupal things or solving things I never done before. GPT-4 already knows the answers and it saves a lot of time researching for best drupal (and related) solutions.

  • πŸ‡ͺπŸ‡¨Ecuador jwilson3

    GPT-4 already knows the answers

    Are you suggesting then that we dont need a DrupalGPT ?

    GPT-4 is very good for learning new drupal things.

    Do we know how it is learning?

    It invents non existing solutions sometimes, but reasonable solutions, so something to work with or follow up with other questions.

    How does the community, as resident human operators and end-users of GPT-4, train it away from bad answers? How does one avoid bad actors training it in the wrong way with intentionally incorrect (or insecure) solutions?

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    Are you suggesting then that we dont need a DrupalGPT ?

    DrupalGPT could:

    1. know the latest code and developments within drupal ecosystem, (while GPT-4 does not know developments after 2021 so cannot be used for cutting edge development)
    2. could know more about drupal (it is not clear how much GPT-4 knows already)

    Do we know how it is learning?

    I meant GPT-4 is a great tool for learning.
    GPT-4 itself does not learn - it is trained by openAI. How OpenAI creates/improves it - it is not known. It is a company secret. At the beginning they were open, later became closed.

    train it away from bad answers?

    Quality of training data is super important when training AI.

    How does one avoid bad actors training it?

    How to train it in the first place? This is a complex for me to answer for today.

  • πŸ‡©πŸ‡°Denmark ressa Copenhagen

    Thanks for investigating this new and interesting invention @mindaugasd, and thank you for very relevant questions @jwilson3.

    We could also build a Drupal Open Assistant, based on Open Assistant, a free and open source AI project.

    By feeding it data, training it, and hosting Drupal Open Assistant ourselves, we would have 100% control over the input and output, as opposed to a black box like OpenAI/ChatGPT.

    Open Assistant - chat based large language model (Github).

  • πŸ‡΅πŸ‡±Poland jorgik

    The good case of training that are Drupal books. Here is the example in the small site I've created with chatbot:
    https://drupal-docs-list.vercel.app/
    The site itself it's just representation of the Drupal wiki, but chatbot is trained with 2 books of Drupal Masters. Try to ask something about module development.

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    Drupal books cannot be used for training, because of intellectual property.

    I tried asking many questions, but each answer was disappointing:

    • Most of the time it refused to answer
    • often the answers were not visible because of UI bug
    • only once it answered most basic question, but did far worse job than GPT-4.

    Anyway, it was interesting to try.

  • πŸ‡ΊπŸ‡ΈUnited States cosmicdreams Minneapolis/St. Paul

    After listening to Google I/O today, to me, this sounds like a natural outcome to Google's Project Tailwind: https://thoughtful.sandbox.google.com/about

    It allows you to be selective on what the AI is trained on.

  • πŸ‡΅πŸ‡±Poland jorgik

    @mindaugasd

    This is actually not a GPT-4, this is just a simple Lang Chain based solution like this https://github.com/hwchase17/langchain you mentioned earlier. Just trained on couple of books only.

    Good idea from @ressa to build OpenAssistan based model for Drupal docs and APIs and put it here on site to help with code creating and general assistance.
    +1 for OpenAssistan based model for Drupal docs and APIs

  • πŸ‡¬πŸ‡·Greece pinkonomy

    +1 for using Open Assistant.
    I propose Drupal Association embed Open Assistant in Drupal.org, so it can be used for training, helping with modules, etc.

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    -0 my vote (unsure).

    StarCoder (https://huggingface.co/blog/starcoder) - a new open source model trained on code for coding.
    I think this is first open source model created for coding.

  • πŸ‡©πŸ‡°Denmark ressa Copenhagen

    Interesting, thanks for sharing @mindaugasd.

    From my perspective, it isn't so important if Open Assistant, StarCoder, or some other project is used as Large Language Model for Code (Code LLM) for Drupal, just as long as it's truly free and Open Source, 100% transparent, etc.

    Is StarCoder Open Source, or a MicroSoft (VS Code/Github/GitHub Copilot) product? As I see it, getting too dependent on huge monopolies such as Google, Microsoft runs counter to the Open Source ethos as outlined by Raymond and Stallman. Drupal can help steer the development in an Open Source direction, and should play an active role in my opinion.

    The issue title and summary are fairly ChatGPT oriented (understandably so, it was the first to arrive) but things move fast and could perhaps be made more agnostic? I have updated the title for a start.

  • πŸ‡©πŸ‡°Denmark ressa Copenhagen

    Interesting module https://www.drupal.org/project/aidev β†’ , thanks for building it @mindaugasd :)

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    theoretically, model could be invented specifically for reasoning, or specifically for Drupal reasoning. Super specialized small and smart Drupal model. Possibly possible, but not practical at this point [...] [ link of that discussion 🌱 Fully fledged website and module development with Artificial Intelligence Active ]

    Since it is theoretically possible, so it was done "Orca: The model few saw coming"
    https://www.youtube.com/watch?v=Dt_UNg7Mchg

    This is one of the best youtube channels covering very recent and important research of AI developments, very well done.

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    Initiated a new module

    AI models library

    β†’
    https://www.drupal.org/project/aimodels β†’

    Which will also be helpful for configuring Drupal LLMs in the future

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    There is a new very nice library called OpenLLM which has Flexible APIs and supports a wide range of State-of-the-art open LLMs (including StarCoder) πŸ“Œ Open LLMs support Active

  • πŸ‡΅πŸ‡ͺPeru DevElCuy

    Can we explore LocalAI that comes with LlamaCPP (so any ggml model is supported)?

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    @DevElCuy I see LocalAI has more stars on Github than OpenLLM. If it is better, than yes.

    I updated issue description:
    1. Added category "Creating a Drupal module which will connect to DrupalGPT (through REST API)"
    That Drupal module would implement LocalAI or OpenLLM API.
    And AI models library β†’ idea is deprecated in favor of having many separate modules for integrating AI APIs, so the need to create this new module for this specific purpose.

    2. Added category "Selecting an existing AI model for Drupal LLM to extend"

    Criteria:

    • Model has been trained on code (example - StarCoder)
    • Model is small but smart (example - Orca)
    • Model knows all APIs of the world (example Gorilla with 1'600 APIs or ToolLLM with 16'000 APIs)
    • Model has license compatible with Drupal (enter example here)
    • Model has huge context window (enter example here)

    Sending two nice videos for inspiration:

    I was thinking 🌱 Fully fledged website and module development with Artificial Intelligence Active how to adopt GorillaLLM capabilities before, so the answer might be integrating it within DrupalGPT capabilities.

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    Hugging Face tutorial about training code model
    Personal Copilot: Train Your Own Coding Assistant
    It says its possible with 1 online GPU.

  • πŸ‡¦πŸ‡ΊAustralia pameeela

    I think this issue is in the same boat as πŸ“Œ Use ChatGPT for solving Drupal issues to increase rate of development Active . Is there a specific idea or proposal for Drupal core to adopt here? If not, can this be moved into another queue?

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    Yes, the same boat as another issue, I am moving it to AI initiative project.

Production build 0.71.5 2024