Implement pgvector (PostgreSQL) vector db provider

Created on 15 October 2024, 7 months ago

Problem/Motivation

PostgreSQL is a widely used and widely available database engine that also supports vector similarity searches with the pgvector extension.

Given its ease of setup and wide availability across cloud providers, this seems like a low-cost solution that performs acceptably on small to medium datasets.

Steps to reproduce

N/A

Proposed resolution

Create a vdb_provider_pgvector submodule of drupal/ai in vdb_providers that implements a PgvectorProvider plugin.

Remaining tasks

  • Create vdb_provider_pgvector module that implements a PgvectorProvider plugin
  • Amend existing API to support pgvector's query-time metric type selection (the distance metric is chosen per query rather than fixed when the collection is created)
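
To illustrate why the metric type has to reach the search call: in pgvector the distance metric is selected by the operator used in the query itself (`<->` for L2 distance, `<=>` for cosine distance, `<#>` for negative inner product). The following is only a rough sketch in Python, not the module's code; the metric names, the `build_vector_search` helper, and the `ai_embeddings` / `embedding` table and column names are all assumptions made for illustration, and `%s` is a driver-style placeholder for the query vector.

```python
# Hypothetical mapping from a metric name to the pgvector operator.
PGVECTOR_OPERATORS = {
    "l2": "<->",             # Euclidean (L2) distance
    "cosine": "<=>",         # cosine distance
    "inner_product": "<#>",  # negative inner product
}


def build_vector_search(table: str, column: str, metric: str, limit: int = 10) -> str:
    """Return SQL for a nearest-neighbour search using the given metric.

    The table and column are assumed to be trusted identifiers. The query
    vector is left as a %s placeholder to be bound by the database driver.
    """
    try:
        operator = PGVECTOR_OPERATORS[metric]
    except KeyError:
        raise ValueError(f"Unsupported metric type: {metric}") from None
    return (
        f"SELECT id FROM {table} "
        f"ORDER BY {column} {operator} %s::vector LIMIT {int(limit)}"
    )


# The same table can be searched with a different metric on each call.
print(build_vector_search("ai_embeddings", "embedding", "cosine", 5))
print(build_vector_search("ai_embeddings", "embedding", "l2", 5))
```

Because only the operator changes, the same stored vectors can be searched with whichever metric the caller passes in.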

User interface changes

  • New pgvector settings form at /admin/config/ai/vdb_providers/pgvector

API changes

  • Pass metric type to vectorSearch plugin function

Data model changes

  • New settings schema for the pgvector vdb provider

I have a proof of concept of this working. I'm creating this issue so that I can create a fork and push my work there.

I'll tidy that work up and run some further tests locally.

When it's ready for review, I'll update the issue status.

📌 Task
Status

Active

Version

1.0

Component

AI Search

Created by


Comments & Activities

  • Issue created by @joshhytr
  • 🇬🇧United Kingdom scott_euser

    Hi Josh - very exciting! Just to note our plan before beta is to move out the ai_providers and ai_vdb_providers into separate modules. E.g. feel free to create ai_vdb_provider_pgvector (or ai_vdb_provider_postgres?) already if you prefer.

    There is also a similar discussion in the "SOLR Dense Vector Search" issue (currently Active).

    Also, there's a virtual meetup planned for Thursday with a good chunk of time to chat about AI Search: https://www.drupal.org/community/events/drupal-ai-meetup-2024-10-17

  • Hey Scott,

    Sounds great. I'll definitely come along to the meetup.

    With regard to the module naming, I am of two minds. I've seen postgres with pgvector simply referred to as pgvector in a few spaces online, but given postgres is the actual engine that's powering it, maybe that is the better name. Do you think so?

  • 🇬🇧United Kingdom scott_euser

    Hmmm I guess it depends a bit on your plans for it; would it just be the vector side of it, or also the rest of postgres? For example Pinecone & Milvus/Zilliz in AI module both have filtering options outside of the vectors, e.g. to just retrieve relevant embeddings within a subset of the content (though both are somewhat limited in their filtering support).

    But maybe, with vectors being the focus, pgvector is more appropriate anyway, given wider postgres support is in core and I guess search_api_db kind of covers that.

  • The version I have at the moment does handle the 'filters' string passed to querySearch or vectorSearch.

    I've yet to implement the prepareFilters and prepareConditionGroup functions, that's my next task, but I do intend for it to be able to run a vector similarity search on a filtered subset of results.

    This module/plugin will only be used when working with vectors and maybe the specificity of 'pgvector' implies that use case and the extension requirement. As you say, core and search_api_db cover the native postgres use cases.

    But I'm still unsure given pgvector is not actually the engine and ai_vdb_provider_postgres somewhat namespaces itself within the vector space. Maybe postgres is the more proper name.
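
For readers following along, a vector similarity search on a filtered subset in Postgres can be pictured as an ordinary WHERE clause combined with a pgvector distance ORDER BY. The sketch below is in Python and is not the prepareFilters / prepareConditionGroup implementation discussed above. It assumes equality-only filters, cosine distance, and hypothetical table and column names (`ai_embeddings`, `embedding`, `langcode`).

```python
def to_vector_literal(values):
    """Format a list of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


def build_filtered_search(table, column, query_vector, filters, limit=10):
    """Return (sql, params) for a cosine-distance search limited by filters.

    `filters` maps column names to required values (equality only). Table and
    column names are assumed to be trusted identifiers; all values are bound
    through %s placeholders rather than interpolated into the SQL.
    """
    sql = f"SELECT id FROM {table}"
    params = list(filters.values())
    if filters:
        sql += " WHERE " + " AND ".join(f"{name} = %s" for name in filters)
    sql += f" ORDER BY {column} <=> %s::vector LIMIT %s"
    params += [to_vector_literal(query_vector), int(limit)]
    return sql, params


sql, params = build_filtered_search(
    "ai_embeddings", "embedding", [0.1, 0.2], {"langcode": "en"}, limit=3
)
print(sql)
print(params)
```

The filter narrows the candidate rows and the distance operator then ranks only what remains, which is the behaviour described for the filtered search.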

  • I'll go for postgres.

    vdb_provider_postgres in the short term and for this fork.

    I'll create a stub ai_vdb_provider_postgres module page for the beta and port there once this issue has been merged.

  • 🇬🇧United Kingdom scott_euser

    Sounds good, happy to help test/feed back here, probably makes sense to not merge though and immediately move over, but I'll check with Jamie and Marcus who do the most heavy lifting in this module.

  • Brilliant, thanks for your help!

  • 🇩🇪Germany marcus_johansson

    Ah, let's close this issue.
