
Updated 4 months ago

Still having a problem with incorrect


Still having a problem with incorrect page_labels: after SentenceSplitter runs, the nodes lose their correct page_label, and the metadata points to the last page of the PDF. I can't find any solution for this problem.

11 comments

WhiteFang_Jr

I can see one metadata entry with page_label 4 in the dict you shared.


WhiteFang_Jr

If you do not use the sentence splitter, do you get the correct page_label?

jokubas.s

Yes, without SentenceSplitter everything is correct.

WhiteFang_Jr

I think the sentence splitter only operates on the text. It should not alter the metadata, and certainly not page_label alone.

Can you step through it with a debugger?
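
A quick check along these lines can help before reaching for a full debugger. This is only a minimal sketch: it assumes each document and node exposes a `metadata` dict (as LlamaIndex `Document` and node objects do), and the helper name `page_label_counts` is made up for illustration. The stand-in objects are placeholders for the real `documents` and `nodes` lists.

```python
from collections import Counter
from types import SimpleNamespace


def page_label_counts(items):
    """Count how many items carry each page_label value in their metadata."""
    return Counter(item.metadata.get("page_label") for item in items)


# Stand-ins for loaded documents and split nodes; in real use, pass the
# `documents` and `nodes` lists from your own pipeline instead.
documents = [SimpleNamespace(metadata={"page_label": str(p)}) for p in (1, 2, 3)]
nodes = [SimpleNamespace(metadata={"page_label": "3"}) for _ in range(3)]

print("documents:", page_label_counts(documents))
print("nodes:    ", page_label_counts(nodes))
```

If the node counts collapse onto a single page_label while the document counts do not, the labels are being lost somewhere between loading and splitting.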

jokubas.s

Will try debugging, one min

Logan M

Providing a minimal example would be helpful too.

jokubas.s

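# Import paths assume llama-index >= 0.10.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# `file`, `filename`, `user`, and `storage_context` are defined elsewhere.
document_reader = SimpleDirectoryReader(input_files=[file], filename_as_id=True)
documents = document_reader.load_data(show_progress=True)

# Tag every loaded document with the file name and the user.
for doc in documents:
    doc.id_ = filename
    doc.metadata["user"] = user

splitter = SentenceSplitter(chunk_size=1024)

# Split into nodes and build the index from them.
nodes = splitter.get_nodes_from_documents(documents, show_progress=True)
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context, show_progress=True)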

jokubas.s

Sorry for the late response.

jokubas.s

Found the problem:

jokubas.s

        for doc in documents:
            doc.id_ = filename
            doc.metadata["user"] = user

jokubas.s

After removing these lines, the page_label values seem to be correct.
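
One plausible explanation, not confirmed in this thread: the PDF reader returns one document per page, and if the splitter looks up each node's parent document by id to copy its metadata, giving every page the same `id_` makes every lookup resolve to the last page. The plain-Python sketch below imitates that lookup with a dict. `Page`, `copy_parent_metadata`, and the file name are hypothetical stand-ins, not LlamaIndex APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for one per-page document."""
    id_: str
    metadata: dict = field(default_factory=dict)


def copy_parent_metadata(pages):
    """Imitate a splitter that finds each chunk's parent by document id."""
    parent_by_id = {p.id_: p for p in pages}  # duplicate ids: last page wins
    return [dict(parent_by_id[p.id_].metadata) for p in pages]


filename = "report.pdf"  # made-up file name

# Same id on every page, as in the loop that was removed.
shared = [Page(filename, {"page_label": str(n)}) for n in (1, 2, 3)]
print([m["page_label"] for m in copy_parent_metadata(shared)])  # ['3', '3', '3']

# Unique id per page: each chunk keeps its own page_label.
unique = [Page(f"{filename}_page_{n}", {"page_label": str(n)}) for n in (1, 2, 3)]
print([m["page_label"] for m in copy_parent_metadata(unique)])  # ['1', '2', '3']
```

If a custom id is still needed, keeping it unique per page may avoid the collision. Under this explanation, setting only `doc.metadata["user"]` and leaving `id_` alone should also be safe.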
