- Issue created by @SomebodySysop
- 🇩🇪 Germany mkalkbrenner
Newlines or breaks?
And why do you need them? Can you describe the use-case?
- 🇺🇸 United States SomebodySysop
Newlines.
Here is the context: https://community.openai.com/t/using-gpt-4-api-to-semantically-chunk-doc...
I am developing a module called "SolrAI" which essentially facilitates RAG (Retrieval Augmented Generation) on a Drupal site. Search API Solr is a critical component of this module as it creates (via the Solr extractor) the text entities that are sent to the vector store for indexing.
Right now, this text is "chunked" (split into equal segments of x characters or tokens) by simply cutting it sequentially every x characters.
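To illustrate what I mean (a minimal sketch in Python, not the module's actual code), fixed-size chunking amounts to:

```python
def chunk_text(text: str, size: int) -> list[str]:
    """Naive fixed-size chunking: cut every `size` characters,
    ignoring sentence, paragraph, and section boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Note how it splits mid-word and mid-sentence:
# chunk_text("The quick brown fox jumps", 10)
# -> ['The quick ', 'brown fox ', 'jumps']
```

This is exactly the weakness semantic chunking is meant to address: the cut points have no relationship to the document's structure.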
I would like to introduce "semantic chunking" into my application. Here, the AI model would analyze each text entity, create a JSON file of its hierarchy, and "chunk" it based upon that hierarchy.
The problem is, I need to be able to locate the segments within the text to extract. The best way I have found so far to do that (see the discussion in the link above) is by identifying specific lines in the text. This currently is not possible with the text rendered by the Solr extractor because it removes the newline characters.
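To make the requirement concrete: if the newlines were preserved, a chunker could address segments by the line numbers coming back from the model. A minimal sketch (a hypothetical helper, assuming 1-based inclusive line ranges):

```python
def extract_segment(text: str, start_line: int, end_line: int) -> str:
    """Return lines start_line..end_line (1-based, inclusive).
    Only works if the extractor preserved the newline characters;
    text with newlines stripped collapses to a single 'line'."""
    lines = text.split("\n")
    return "\n".join(lines[start_line - 1:end_line])

# extract_segment("Title\nSection 1\nbody\nSection 2", 2, 3)
# -> 'Section 1\nbody'
```

With the current extractor output (newlines removed), every segment lookup would return the whole text, which is why line-based addressing is impossible today.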
Now, a workaround that I am thinking of is to use the Solr ID to retrieve the Drupal object (node, file or comment) and process that instead. But this method is prone to error and far less efficient.
This is why I would like to know if I could preserve the newline characters in the Search API Solr extractions.
- 🇺🇸 United States SomebodySysop
For anyone else looking into something like this, I was able to find a workaround solution with this: https://solr.apache.org/guide/6_6/uploading-data-with-solr-cell-using-ap...
Using this command, I got a Solr output object:
curl "http://mySolrServer:8983/solr/labor/update/extract?extractOnly=true" --data-binary @/home/ron/workarea/ai/tika/article.pdf -H 'Content-type:application/pdf' > /home/ron/workarea/ai/tika/article.xml
Then, using this PHP script to read the .xml file, I was able to get the text of the PDF as it is currently stored in the Solr index, but with the linefeeds included.
// Read the JSON response from the XML file
$json = file_get_contents('article.xml');

// Parse the JSON data
$data = json_decode($json, true);

// Extract the XML content from the JSON response
$xml_content = $data[''];

// Create a new DOMDocument object and load the XML content
$dom = new DOMDocument();
$dom->loadXML($xml_content);

// Get the <body> element
$body = $dom->getElementsByTagName('body')->item(0);

// Iterate over the child elements of the <body> element,
// collecting the text content of each, one per line
$text_content = '';
foreach ($body->childNodes as $element) {
    if ($element->nodeType === XML_ELEMENT_NODE) {
        $text_content .= $element->textContent . "\n";
    }
}

// Print or save the extracted text
echo $text_content;
// Or save to a file:
file_put_contents('extracted_text.txt', $text_content);
For comparison, this is the source pdf: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/13_Article_1...
And this is the Solr (Tika) output: https://www.drupal.org/files/issues/2024-05-18/extracted_text_from_tika.txt
Still contains page numbers, headers and footers -- but so does the Solr index object.