
Updated 3 months ago

Still having a problem with incorrect page_labels

I'm still having a problem with incorrect page_labels: after SentenceSplitter runs, the nodes lose the correct page_label, and the document metadata points to the last page of the PDF. I can't find any solution for this problem.
11 comments
I can see one metadata with page_label 4 in your shared dict
If you do not use the sentence splitter, do you get the correct page_label?
Yes, without SentenceSplitter everything is correct.
I think SentenceSplitter only operates on the text. It should not alter the metadata, and certainly not page_label alone.

Can you debug this a little with a debugger?
Will try debugging, one min
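A quick check that does not need a full debugger is to group the split nodes by their source document id and look at which page labels each id carries. The following is only a minimal sketch: the helper name `page_labels_by_doc` is hypothetical, and it assumes nothing more than that each node exposes a `metadata` dict and a `ref_doc_id` attribute, as LlamaIndex nodes do. The sample objects are stand-ins, not real nodes.

```python
from collections import defaultdict
from types import SimpleNamespace


def page_labels_by_doc(nodes):
    """Map each source document id to the set of page_label values on its nodes."""
    labels = defaultdict(set)
    for node in nodes:
        labels[node.ref_doc_id].add(node.metadata.get("page_label"))
    return dict(labels)


# Stand-in objects for illustration; real nodes would come from the splitter.
sample_nodes = [
    SimpleNamespace(ref_doc_id="report.pdf", metadata={"page_label": "1"}),
    SimpleNamespace(ref_doc_id="report.pdf", metadata={"page_label": "2"}),
]
result = page_labels_by_doc(sample_nodes)
```

If a single document id maps to several page labels, that suggests several pages are sharing one id.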
Providing a minimal example would be helpful too.
document_reader = SimpleDirectoryReader(input_files=[file], filename_as_id=True)
documents = document_reader.load_data(show_progress=True)
for doc in documents:
    doc.id_ = filename
    doc.metadata["user"] = user
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents, show_progress=True)
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context, show_progress=True)
Sorry for the late response.
Found the problem:
for doc in documents:
    doc.id_ = filename
    doc.metadata["user"] = user
After removing these lines, everything seems to be fine with the page_label values.
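A likely explanation, which the thread itself does not confirm: the reader returns one document per PDF page, so giving every document the same `id_` makes all pages collide wherever documents are keyed by id, and the last page wins. The sketch below models that with plain Python only. `Page` and `store_by_id` are hypothetical stand-ins rather than LlamaIndex classes or internals, and the filename and user values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for one per-page document; not a LlamaIndex class."""
    id_: str
    metadata: dict = field(default_factory=dict)


def store_by_id(pages):
    """Simplified id-keyed store: a later page with the same id overwrites an earlier one."""
    return {page.id_: page for page in pages}


filename, user = "report.pdf", "alice"  # illustrative values only

# Same id on every page, as in the loop that caused the problem.
shared = [
    Page(id_=filename, metadata={"page_label": str(n), "user": user})
    for n in (1, 2, 3)
]
shared_store = store_by_id(shared)

# Per-page ids keep the pages distinct while still tagging each one with the user.
unique = [
    Page(id_=f"{filename}_page_{n}", metadata={"page_label": str(n), "user": user})
    for n in (1, 2, 3)
]
unique_store = store_by_id(unique)

print(len(shared_store), shared_store[filename].metadata["page_label"])
print(len(unique_store))
```

In this simplified model the shared id leaves a single entry carrying the last page's label, while per-page ids keep all three. If the user tag is still needed, it may be enough to keep the metadata line and drop only the `id_` assignment, though the thread removed both lines and does not confirm which one was responsible.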