Hey! I think there is an error with the sentence splitter. I'm trying to use the hierarchical node parser, but I keep getting this error. I tried modifying the chunk sizes, but it didn't help. What is the reason for this error?

RecursionError                            Traceback (most recent call last)
<ipython-input-16-dcee5b2f043b> in <cell line: 1>()
----> 1 nodes = node_parser.get_nodes_from_documents(documents)

9 frames
... last 1 frames repeated, from the frame below ...

/usr/local/lib/python3.10/dist-packages/llama_index/text_splitter/sentence_splitter.py in _split(self, text, chunk_size)
    178                 new_splits.append(_Split(split, is_sentence=is_sentence))
    179             else:
--> 180                 ns = self._split(split, chunk_size=chunk_size)
    181                 if len(ns) == 0:
    182                     print("0 length split")

RecursionError: maximum recursion depth exceeded while calling a Python object
Please stand by @DrSebastianK as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
I'm sorry to hear that you're experiencing issues with the sentence splitter in LlamaIndex. Based on the error message you provided, it seems like you're encountering a RecursionError. This typically happens when a function calls itself (either directly or indirectly) too many times, exceeding Python's maximum recursion depth.

However, without more specific information about your implementation and the documents you're trying to parse, it's difficult to pinpoint the exact cause of the error. The error could be due to a variety of reasons such as the structure of your documents, the configuration of your node parser, or the specific text splitter you're using.
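As a point of reference (not a fix), Python's default recursion depth limit is around 1000 and can be inspected or raised with the standard sys module. Raising it is only a stopgap, since the underlying cause is usually input that the splitter cannot break down any further:

Plain Text
import sys

# Inspect the current recursion limit (CPython's default is 1000)
print(sys.getrecursionlimit())

# Stopgap only: raising the limit may postpone the error, but it does not fix the
# splitting problem that causes the deep recursion in the first place
sys.setrecursionlimit(5000)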

In the LlamaIndex documentation, it's mentioned that you can customize the text_splitter used by the node parser. The default is TokenTextSplitter, but you can also use SentenceSplitter or CodeSplitter. Each of these splitters has different configuration options, including the default separator and additional config options, which can be useful if the language of your documents differs significantly from English. Here is an example of how to customize the SentenceSplitter:

Plain Text
import tiktoken
from llama_index.text_splitter import SentenceSplitter
from llama_index.node_parser import SimpleNodeParser

text_splitter = SentenceSplitter(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=20,
    paragraph_separator="\n\n\n",
    secondary_chunking_regex="[^,.;。]+[,.;。]?",
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
)

node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)

(Source: LlamaIndex Documentation)
If you're still encountering issues, I would recommend checking the structure of your documents and the configuration of your node parser. If the problem persists, it might be helpful to reach out to the LlamaIndex community or the maintainers of the project for more specific guidance.
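One quick way to do that check is to scan the loaded documents for empty text or very long stretches without sentence-ending punctuation, which is the kind of input that can drive a sentence splitter into deep recursion. A minimal diagnostic sketch, assuming documents is a list of llama_index Document objects with a .text attribute:

Plain Text
import re

# Flag documents that are empty or contain unusually long runs of text with no
# sentence-ending punctuation (assumes `documents` holds llama_index Document objects)
for i, doc in enumerate(documents):
    text = doc.text or ""
    if not text.strip():
        print(f"Document {i} is empty")
        continue
    # Longest stretch of characters between sentence-ending punctuation marks
    longest_run = max((len(chunk) for chunk in re.split(r"[.!?。]", text)), default=0)
    print(f"Document {i}: {len(text)} chars, longest unbroken run: {longest_run} chars")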
----------------------
Note: you can ask me a follow-up question by @-mentioning me again :speech_balloon:
----------------------
@DrSebastianK are you able to share the text you are parsing? I'd like to be able to reproduce this locally
@Logan M I used a directory of about 200 PDFs with the BibTeX parser of langchain (then creating llama_index documents from it). I don't want to bother you with handling the full directory, but I attached a PDF which is an element of the directory and also results in an error (using the code step by step from the Auto Merging Retriever notebook). I just tested it and it still produces an error, but a slightly different one:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-24-d7192aad1549> in <cell line: 3>()
1 from llama_index.node_parser import HierarchicalNodeParser, SimpleNodeParser
2
----> 3 node_parser = HierarchicalNodeParser.from_defaults(docs0)

2 frames
/usr/local/lib/python3.10/dist-packages/llama_index/text_splitter/sentence_splitter.py in __init__(self, separator, chunk_size, chunk_overlap, tokenizer, paragraph_separator, chunking_tokenizer_fn, secondary_chunking_regex, callback_manager)
     86     ):
     87         """Initialize with parameters."""
---> 88         if chunk_overlap > chunk_size:
     89             raise ValueError(
     90                 f"Got a larger chunk overlap ({chunk_overlap}) than chunk size "

TypeError: '>' not supported between instances of 'int' and 'Document'

It also results in an error from the sentence splitter, so I think that if this error is handled, it should also resolve the aforementioned recursion error.
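The TypeError above is consistent with the documents being passed to from_defaults(), which takes configuration such as chunk sizes rather than the documents themselves; a minimal sketch of the intended call order, assuming docs0 holds the documents:

Plain Text
from llama_index.node_parser import HierarchicalNodeParser

# Build the parser from defaults (configuration only), then hand the documents
# to get_nodes_from_documents rather than to from_defaults
node_parser = HierarchicalNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(docs0)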
Interesting, let me look into this today!
Thanks, I appreciate your help!
@Logan M Sorry, please consider the last error my own mistake. But the first, recursive error still persists. I think the problem is that the system does not support this BibTeX format. By the way, I use this code to create llama_index documents:

Plain Text
from llama_index import Document

# List of all document groups
all_groups = [texts]

# Create Document objects using fields from each BibTeX entry
documents = []
for group in all_groups:
    for document in group:
        # Access the metadata of each document
        metadata = document.metadata

        # Extract the necessary fields from the metadata
        doc_id = metadata["id"]
        published_year = metadata["published_year"]
        text = document.page_content  # Assuming this is where the document's content is stored

        # Prepare the metadata for the Document
        doc_metadata = {
            "url": metadata["url"],
            "Evidence type": metadata["Evidence type"],
            "published_year": published_year,
            # Using .get() to avoid a KeyError if a field doesn't exist
            "authors": metadata.get("authors", None),
            "title": metadata.get("title", None),
            "publication": metadata.get("publication", None),
        }

        documents.append(
            Document(
                text=text,
                id=doc_id,
                metadata=doc_metadata,
                excluded_llm_metadata_keys=['authors', 'title', 'publication', 'url', 'published_year'],
                excluded_embed_metadata_keys=['authors', 'title', 'publication', 'url', 'published_year'],
            )
        )
Hmm wait I'm confused on the process here lol

So you parsed that PDF you attached, but also parsed an associated BibTeX file as metadata? Or something else?

Confused on how you initially got the document object in the code above πŸ€”
I was able to approximate what you had using that PDF, but I don't think it's quite the same. But in my case, it works fine

Plain Text
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import HierarchicalNodeParser

documents = SimpleDirectoryReader(input_files=['./test_pdf.pdf']).load_data()

for doc in documents:
    metadata = {
        "url": "https://www.google.com",
        "Evidence Type": "evidence type",
        "published_year": "2023",
        "authors": "L. Markewich, K. Sebastian",
        "title": "A title",
        "publication": "A publication",
    }

    doc.metadata = {**metadata}
    doc.excluded_llm_metadata_keys = ["authors", "title", "publication", "url", "published_year"]
    doc.excluded_embed_metadata_keys = ["authors", "title", "publication", "url", "published_year"]

parser = HierarchicalNodeParser.from_defaults()

nodes = parser.get_nodes_from_documents(documents)

print(len(nodes))
  1. First I export a collection (including PDF attachments) from Zotero in BibTeX format. This way I get the relevant metadata of each article.
  2. I use the following code to load the files from the previously created directory:

Plain Text
from langchain.document_loaders import BibtexLoader

# Specify the path to the .bib BibTeX file
file_path = "/content/tmd-comparativestudies/tmd-comparativestudies.bib"

# Create an instance of the BibtexLoader
# max_content_chars is set this high to avoid cropping documents and losing important information
loader = BibtexLoader(file_path=file_path, max_content_chars=10000000)

# Load the documents from the BibTeX file
# (a FileNotFoundError would otherwise stop the process, hence the try/except)
try:
    comparative = loader.load()
except FileNotFoundError as e:
    print(f"File not found: {e}")
    comparative = []  # set an empty list to continue the script

# Iterate over the loaded documents
for study in comparative:
    # Access the metadata of each document
    metadata = study.metadata

    # Set 'Evidence type' field to 'Comparative Study'
    metadata['Evidence type'] = 'Comparative Study'

    # Delete 'abstract' field if it exists
    if 'abstract' in metadata:
        del metadata['abstract']

    print(metadata)

  3. Then I use the following code to create llama_index Document objects from the loaded texts:

Plain Text
from llama_index import Document

# List of all document groups
all_groups = [comparative]

# Create Document objects using fields from each BibTeX entry
documents = []
for group in all_groups:
    for document in group:
        # Access the metadata of each document
        metadata = document.metadata

        # Extract necessary fields from the metadata with the .get() method
        doc_id = metadata.get("id", None)
        published_year = metadata.get("published_year", None)
        text = document.page_content if hasattr(document, 'page_content') else ""

        # Prepare the metadata for the Document
        doc_metadata = {
            "url": metadata.get("url", None),
            "Evidence type": metadata.get("Evidence type", None),
            "published_year": published_year,
            "authors": metadata.get("authors", None),
            "title": metadata.get("title", None),
            "publication": metadata.get("publication", None),
        }

        documents.append(
            Document(
                text=text,
                id=doc_id,
                metadata=doc_metadata,
                excluded_llm_metadata_keys=['url', 'published_year'],
                excluded_embed_metadata_keys=['url', 'published_year'],
            )
        )
Then I try using the documents with the hierarchical node parser, and that is where I eventually encounter the error. If you want, I can provide you with the directory.
Finally I found the source of the error. There was a problem with the assignment of the metadata. The working code is:

Plain Text
from llama_index import Document

# List of all document groups
all_groups = [comparative]

# Create Document objects using fields from each BibTeX entry
documents = []
for group in all_groups:
    for document in group:
        # Manually create a new metadata dictionary and exclude specific keys
        doc_metadata = {
            "url": document.metadata.get("url", None),
            "Evidence type": document.metadata.get("Evidence type", None),
            "published_year": document.metadata.get("published_year", None),
            # Exclude other keys as needed
        }

        documents.append(
            Document(
                text=document.page_content if hasattr(document, 'page_content') else "",
                id=document.metadata.get("id", None),
                metadata=doc_metadata,
                excluded_llm_metadata_keys=['url', 'published_year'],
                excluded_embed_metadata_keys=['url', 'published_year'],
            )
        )
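With the documents built this way, the hierarchical parsing step from the earlier snippet runs on them unchanged; a minimal sketch of that follow-up:

Plain Text
from llama_index.node_parser import HierarchicalNodeParser

# Parse the corrected documents into hierarchical nodes
parser = HierarchicalNodeParser.from_defaults()
nodes = parser.get_nodes_from_documents(documents)
print(len(nodes))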
Thanks for the help anyways! πŸ™‚
Nice! πŸ‘πŸ’ͺ