Length of dynamic field in Milvus is limited to 65536

Created on 21 August 2024, 3 months ago

Problem/Motivation

An attempt to index a field whose content is longer than 65536 (counting the vector size and all metadata fields) returns API response code 1100.

Proposed resolution

The issue is complicated by the fact that the error is not caught by the Milvus driver; it is only reported by the API call, and we do not know exactly how the length is calculated there.

We could proactively trim the content field, but I suggest we instead catch code 1100 in vdb_provider_milvus/src/Plugin/VdbProvider/MilvusProvider.php, in the method insertIntoCollection(), trim the content field based on the lengths of all the fields and the vector, and try again. I suggest something like this:

  public function insertIntoCollection(
    string $collection_name,
    array $data,
    string $database = 'default',
  ): void {
    $processed = FALSE;
    // Retry until Milvus accepts the record. sanitizeMaxLength() throws once
    // there is no content left to trim, which ends the loop.
    while (!$processed) {
      $response = json_decode($this->getClient()->vector()->insert(
        collectionName: $collection_name,
        data: $data,
        dbName: $database,
      ), TRUE);

      if (!isset($response['code'])) {
        throw new \Exception("Failed to record vector.");
      }

      switch ($response['code']) {
        // The record is too long: trim the content and try again.
        case 1100:
          $this->sanitizeMaxLength($data);
          break;

        case 200:
          $processed = TRUE;
          break;

        default:
          throw new \Exception("Failed to record vector.");
      }
    }
  }

And then:

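  /**
   * Trims the content field so that the record can pass the length check.
   *
   * @param array $data
   *   The record being inserted. Its 'content' value is shortened in place.
   *
   * @throws \Exception
   *   When there is no content field or it is already empty.
   */
  private function sanitizeMaxLength(array &$data): void {
    // Nothing to do if we do not have a content field or it is empty.
    if (!isset($data['content']) || strlen($data['content']) === 0) {
      throw new \Exception("Failed to record vector.");
    }

    $total_length = $this->countLength($data);

    // If the estimated length is over the limit, shorten the content by the
    // excess. $difference is negative, so substr() drops that many bytes
    // from the end of the content.
    if ($total_length > 65536) {
      $difference = 65536 - $total_length;
    }
    // If the estimated length is within the limit but the API still reports
    // the issue, shorten the content by an additional 5%.
    else {
      $difference = -max(1, (int) (strlen($data['content']) * 0.05));
    }
    $data['content'] = substr($data['content'], 0, $difference);
  }

  /**
   * Estimates the size of the data.
   *
   * This is only an estimate: we do not know exactly how Milvus calculates
   * the length.
   *
   * @param array $data
   *   The record being inserted.
   *
   * @return int
   *   The estimated total length of all fields and the vector.
   */
  private function countLength(array $data): int {
    $total_length = 0;
    foreach ($data as $key => $value) {
      if ($key !== 'vector') {
        $total_length += strlen((string) $value) + strlen((string) $key) + 22;
      }
      else {
        $total_length += count($value) + 28;
      }
    }
    return $total_length;
  }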
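
To make the trimming arithmetic concrete, here is a minimal standalone sketch that applies the same estimate to a made-up record. The function name estimate_length(), the sample field values and the 1536-element vector are assumptions for illustration only; the constants 22 and 28 are copied from the proposal above and are not taken from Milvus documentation.

```php
<?php

// Standalone copy of the estimate used by countLength() in the proposal.
function estimate_length(array $row): int {
  $total = 0;
  foreach ($row as $key => $value) {
    if ($key === 'vector') {
      // The vector is counted by its number of elements.
      $total += count($value) + 28;
    }
    else {
      // Other fields are counted by value length plus key length.
      $total += strlen((string) $value) + strlen((string) $key) + 22;
    }
  }
  return $total;
}

// Hypothetical record: 70000 bytes of content, a short title and a
// 1536-element vector (all values are arbitrary sample data).
$row = [
  'content' => str_repeat('a', 70000),
  'title' => 'Example',
  'vector' => array_fill(0, 1536, 0.1),
];

$limit = 65536;
$before = estimate_length($row);

// Negative when the estimate is over the limit.
$difference = $limit - $before;
if ($difference < 0) {
  // A negative length makes substr() drop that many bytes from the end.
  $row['content'] = substr($row['content'], 0, $difference);
}
$after = estimate_length($row);

printf("before=%d difference=%d after=%d\n", $before, $difference, $after);
// Prints: before=71627 difference=-6091 after=65536
```

Note that substr() counts bytes, so with multibyte UTF-8 content the cut can land in the middle of a character. If that turns out to be a problem, mb_strcut() is one possible alternative, since it cuts by bytes but respects character boundaries.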

🐛 Bug report

Status: Active
Version: 1.0
Component: AI Search
Created by: 🇬🇧 United Kingdom seogow
