- Issue created by @somebodysysop
- 🇩🇪Germany mkalkbrenner 🇩🇪
Newlines or breaks?
And why do you need them? Can you describe the use-case?
- 🇺🇸United States somebodysysop
Newlines.
Here is the context: https://community.openai.com/t/using-gpt-4-api-to-semantically-chunk-doc...
I am developing a module called "SolrAI" which essentially facilitates RAG (Retrieval Augmented Generation) on a Drupal site. Search API Solr is a critical component of this module as it creates (via the Solr extractor) the text entities that are sent to the vector store for indexing.
Right now, this text is "chunked" (split into equal segments of x characters or tokens) by simply chopping it sequentially every x characters.
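For illustration, the current approach amounts to something like this (a minimal sketch, not the module's actual code; the function name and default size are made up):

// Naive fixed-size chunking: cut the text every $size characters,
// regardless of sentence or section boundaries.
function chunk_fixed(string $text, int $size = 1000): array {
  return mb_str_split($text, $size);
}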
I would like to introduce "semantic chunking" into my application. Here, the AI model would analyze each text entity, create a JSON file of its hierarchy, and "chunk" it based upon that hierarchy.
The problem is, I need to be able to locate the segments within the text to extract. The best way I have found so far to do that (see the discussion in the link above) is by identifying specific lines in the text. This currently is not possible with the text rendered by the Solr extractor because it removes the newline characters.
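To illustrate what I'm after (the JSON structure and names here are hypothetical, just to show the idea): the model would return the document hierarchy with line ranges, and the chunker would slice those lines out of the extracted text; this only works if the newlines are still there:

// A minimal sketch of line-based extraction. $hierarchy is a hypothetical
// structure decoded from the model's JSON output, e.g.:
// [['title' => 'Section 1', 'start_line' => 1, 'end_line' => 42], ...]
function chunk_by_lines(string $text, array $hierarchy): array {
  $lines = explode("\n", $text);
  $chunks = [];
  foreach ($hierarchy as $section) {
    // Line numbers are 1-based; array_slice() offsets are 0-based.
    $chunks[$section['title']] = implode("\n", array_slice(
      $lines,
      $section['start_line'] - 1,
      $section['end_line'] - $section['start_line'] + 1
    ));
  }
  return $chunks;
}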
Now, a workaround that I am thinking of is to use the Solr ID to retrieve the Drupal object (node, file or comment) and process that instead. But this method is prone to error and far less efficient.
This is why I would like to know if I could preserve the newline characters in the Search API Solr extractions.
- 🇺🇸United States somebodysysop
For anyone else looking into something like this, I was able to find a workaround with this: https://solr.apache.org/guide/6_6/uploading-data-with-solr-cell-using-ap...
Using this command, I got a Solr output object:
curl "http://mySolrServer:8983/solr/labor/update/extract?extractOnly=true" --data-binary @/home/ron/workarea/ai/tika/article.pdf -H 'Content-type:application/pdf' > /home/ron/workarea/ai/tika/article.xml
Then, using this PHP script to read the .xml file, I was able to get the text of the PDF as it is currently stored in the Solr index, but with the linefeeds included. (Despite the .xml extension, the response body is JSON on recent Solr versions, where JSON is the default response writer; on older versions you may need to append &wt=json to the URL.)
// Read the JSON response from the XML file.
$json = file_get_contents('article.xml');

// Parse the JSON data.
$data = json_decode($json, true);

// Extract the XML content from the JSON response.
$xml_content = $data[''];

// Create a new DOMDocument object and load the XML content.
$dom = new DOMDocument();
$dom->loadXML($xml_content);

// Get the <body> element.
$body = $dom->getElementsByTagName('body')->item(0);

// Initialize an empty string to store the extracted text.
$text_content = '';

// Iterate over the child elements of the <body> element.
foreach ($body->childNodes as $element) {
  if ($element->nodeType === XML_ELEMENT_NODE) {
    // Extract the text content of the element.
    $text_content .= $element->textContent . "\n";
  }
}

// Print or save the extracted text.
echo $text_content;
// Or save to a file:
// file_put_contents('extracted_text.txt', $text_content);
For comparison, this is the source pdf: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/13_Article_1...
And this is the Solr (Tika) output: https://www.drupal.org/files/issues/2024-05-18/extracted_text_from_tika.txt
Still contains page numbers, headers and footers -- but so does the Solr index object.
- 🇩🇪Germany mkalkbrenner 🇩🇪
It seems a new text-based field type is required for that use case. That is "easy".
But I need the complete requirements for that field type, for example: it should behave like XY, but with newlines included. Another problem might be Search API itself, which already splits a text into chunks. So we would need to disable that functionality or recombine the chunks by adding newlines.
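For example (just a sketch; $chunks stands for the segments Search API produced), the recombination could be as simple as:

// Rejoin the split segments, restoring a newline between them.
$recombined = implode("\n", $chunks);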
But again, I can't implement it based on guessing the requirements. I need your help to do it.
- 🇺🇸United States somebodysysop
As I think about it, I don't want to compromise the current Search API Solr features, which work fine, for a functionality that I can work around.
What I decided to do was simply use another text extraction method rather than depend on the default Solr text extraction. To that end, I've discovered extractors (like LlamaParse) that can extract to Markdown, which is even better.
So, in the end, not worth the trouble of creating a new field type. We've got enough of them as it is!
I appreciate you looking into this, but we might as well close it as "won't fix" or something.
Thanks!