
Updated last year


I'm looking through the nodes coming from my PDF documents. It seems like PDFs loaded via SimpleDirectoryReader always get split into separate document objects, one per page. Is that right? If so, when they get further chunked into nodes, does that mean a node will never span multiple pages? That seems to limit the flexibility of the nodes: for example, a node can't capture an entire concept if the concept happens to continue across a page break.

3 comments

Logan M

You can merge all the per-page documents from one PDF into a single document object if you want.

PDFs are split by page by default to help with citing sources.

Darthus

Is there a function/argument to do that, or is it just a matter of iterating through the text elements and stitching them together?

Logan M

Just the latter. In Python, it's basically a single line of code:


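# Import path for older llama_index releases; newer ones use `from llama_index.core import Document`.
from llama_index import Document

documents = ...  # the per-page Document objects, e.g. loaded with SimpleDirectoryReader
# Join with a blank line so the end of one page does not fuse with the start of the next.
document = Document(text="\n\n".join(x.text for x in documents))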
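If the loaded list contains pages from several PDFs, the same stitching can be done per file instead of across everything. The following is only a rough sketch: `PageDoc` is a hypothetical stand-in for llama_index's `Document` so the example runs without the library, `merge_pages_by_file` is an illustrative helper rather than a llama_index function, and the `file_name` and `page_label` metadata keys are assumptions about what the reader attaches to each page.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a llama_index Document (text plus metadata)."""
    text: str
    metadata: dict = field(default_factory=dict)


def merge_pages_by_file(pages, key="file_name", sep="\n\n"):
    """Merge per-page documents into one document per source file.

    Assumes each page carries the source file under metadata[key] and,
    optionally, a page label under metadata["page_label"].
    """
    grouped = defaultdict(list)
    for page in pages:
        grouped[page.metadata.get(key, "")].append(page)

    merged = []
    for name, group in grouped.items():
        merged.append(
            PageDoc(
                text=sep.join(p.text for p in group),
                # Keep the page labels so the merged document can still be traced to pages.
                metadata={
                    key: name,
                    "page_labels": [p.metadata.get("page_label") for p in group],
                },
            )
        )
    return merged


pages = [
    PageDoc("Intro starts here", {"file_name": "a.pdf", "page_label": "1"}),
    PageDoc("and continues here.", {"file_name": "a.pdf", "page_label": "2"}),
    PageDoc("Another file.", {"file_name": "b.pdf", "page_label": "1"}),
]
merged = merge_pages_by_file(pages)
```

With real `Document` objects the same grouping should carry over, since only `.text` and `.metadata` are read. Note the trade-off Logan mentions: once pages are merged, a node can span a page break, but it no longer maps to a single page for citation.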
