Newline Characters in Solr Index

Created on 25 April 2024
Updated 19 May 2024


Setup

  • Solr version: 8.4.0
  • Drupal Core version: 10.1.5
  • Search API version: 8.x-1.31
  • Search API Solr version: 4.3.1
  • Configured Solr Connector: standard

Issue

I noticed that when my nodes/files are indexed in Solr, the newline characters are removed.

That is, when I search for any indexed node or file in the Solr index and look at the content entity field, I see the full text of the source object minus the newline characters.

My question is, is there a setting where I can keep the newlines, and if so, where is it and how do I set it?

And, is this a good idea? I assume there is a reason that the newline characters are removed in the indexed text. Why is this?

Thanks for any guidance.

πŸ’¬ Support request
Status

Active

Version

4.3

Component

Code

Created by

πŸ‡ΊπŸ‡ΈUnited States SomebodySysop


Comments & Activities

  • Issue created by @SomebodySysop
  • πŸ‡©πŸ‡ͺGermany mkalkbrenner πŸ‡©πŸ‡ͺ

    Newlines or breaks?

    And why do you need them? Can you describe the use-case?

  • πŸ‡ΊπŸ‡ΈUnited States SomebodySysop

    Newlines.

    Here is the context: https://community.openai.com/t/using-gpt-4-api-to-semantically-chunk-doc...

    I am developing a module called "SolrAI" which essentially facilitates RAG (Retrieval Augmented Generation) on a Drupal site. Search API Solr is a critical component of this module as it creates (via the Solr extractor) the text entities that are sent to the vector store for indexing.

    Right now, this text is "chunked" (split into equal segments of x characters or tokens) by simply chopping it sequentially at every x characters.
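    The fixed-size chunking described above can be sketched in a few lines of PHP (a simplified illustration, not the module's actual code; `chunk_text` and the segment size are hypothetical names):

    ```php
    <?php

    // Naive fixed-size chunking: split $text into sequential segments of
    // $size characters, chopping without regard to word or line boundaries.
    function chunk_text(string $text, int $size): array
    {
        // mb_str_split (PHP 7.4+) is multibyte-safe, so UTF-8 characters
        // are never cut in half.
        return mb_str_split($text, $size);
    }

    $chunks = chunk_text('abcdefghij', 4);
    // $chunks is ['abcd', 'efgh', 'ij']
    ```

    The obvious downside is that a chunk boundary can land mid-sentence or mid-paragraph, which is what motivates semantic chunking.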

    I would like to introduce "semantic chunking" into my application. Here, the AI model would analyze each text entity, create a JSON file of its hierarchy, and "chunk" it based upon that hierarchy.

    The problem is, I need to be able to locate the segments within the text to extract. The best way I have found so far to do that (see the discussion in the link above) is by identifying specific lines in the text. This currently is not possible with the text rendered by the Solr extractor because it removes the newline characters.
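    If the newlines were preserved, locating a segment by its line numbers would be straightforward. A minimal sketch (`extract_lines` is a hypothetical helper; it assumes 1-based, inclusive line numbers coming from the model's JSON hierarchy):

    ```php
    <?php

    // Extract the text between two 1-based line numbers, inclusive.
    // Assumes $text still contains its newline characters.
    function extract_lines(string $text, int $start, int $end): string
    {
        $lines = explode("\n", $text);
        // array_slice is 0-based, so shift the 1-based start down by one.
        return implode("\n", array_slice($lines, $start - 1, $end - $start + 1));
    }

    $text = "Title\nIntro\nSection 1\nBody\n";
    echo extract_lines($text, 3, 4); // prints "Section 1\nBody"
    ```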

    Now, a workaround that I am thinking of is to use the Solr ID to retrieve the Drupal object (node, file or comment) and process that instead. But this method is prone to error and far less efficient.

    This is why I would like to know if I could preserve the newline characters in the Search API Solr extractions.

  • πŸ‡ΊπŸ‡ΈUnited States SomebodySysop

    Any suggestions? Anything?

  • πŸ‡ΊπŸ‡ΈUnited States SomebodySysop

    For anyone else looking into something like this, I was able to find a workaround solution with this: https://solr.apache.org/guide/6_6/uploading-data-with-solr-cell-using-ap...

    Using this command, I got a Solr output object:

    curl "http://mySolrServer:8983/solr/labor/update/extract?extractOnly=true" --data-binary @/home/ron/workarea/ai/tika/article.pdf -H 'Content-type:application/pdf' > /home/ron/workarea/ai/tika/article.xml

    Then, using this PHP script to read the .xml file, I was able to get the text of the PDF as it is currently stored in the Solr index, but with the linefeeds included.

    <?php

    // Read the response saved by the curl command above. Despite the .xml
    // extension, Solr returns JSON by default, with the extracted XHTML
    // stored under an empty-string key.
    $json = file_get_contents('article.xml');

    // Parse the JSON data
    $data = json_decode($json, true);

    // Extract the XHTML content from the JSON response
    $xml_content = $data[''];

    // Create a new DOMDocument object and load the XHTML
    $dom = new DOMDocument();
    $dom->loadXML($xml_content);

    // Get the <body> element
    $body = $dom->getElementsByTagName('body')->item(0);

    // Collect the text of each child element of <body>, re-adding a
    // newline after each one so line structure is preserved
    $text_content = '';
    foreach ($body->childNodes as $element) {
        if ($element->nodeType === XML_ELEMENT_NODE) {
            $text_content .= $element->textContent . "\n";
        }
    }

    // Print the extracted text
    echo $text_content;
    // Or save it to a file:
    file_put_contents('extracted_text.txt', $text_content);
    

    For comparison, this is the source pdf: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/13_Article_1...

    And this is the Solr (Tika) output: https://www.drupal.org/files/issues/2024-05-18/extracted_text_from_tika.txt

    Still contains page numbers, headers and footers -- but so does the Solr index object.
