long-context text chunks with PDFs

can't you just increase the chunk_size?
the challenge is that if a single page's content is shorter than the chunk size, the resulting Document will still contain only that page's content
for real? Does the default splitter always split the single page even if the content is < chunk_size?
This isn't the splitter, it's that the default PDF loader loads a Document object per page no matter what

You can easily combine that back into a single document object though
you just lose out on tracking which page the text came from
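
A minimal sketch of the merge-back approach described above, assuming this thread is about LlamaIndex (where the default PDF reader used by SimpleDirectoryReader returns one Document per page); the filename and import paths reflect a recent install and are assumptions, not taken from the thread:

```python
# Minimal sketch, assuming a recent LlamaIndex with core imports under llama_index.core
# and a hypothetical local file "sample.pdf".
from llama_index.core import Document, SimpleDirectoryReader

# The default PDF reader yields one Document per page.
page_docs = SimpleDirectoryReader(input_files=["sample.pdf"]).load_data()

# Merge every page back into a single Document; the per-page page_label metadata is lost.
combined = Document(text="\n\n".join(d.text for d in page_docs))
```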
ah thanks @Logan M! yeah my thought with going that route is manually setting the page_label metadata value for the updated, larger documents (e.g. a chunk that comprises pages 1-10 gets a new page_label value of '1-10'); would be curious to hear if you think there's a smarter way to get at that issue
that sounds pretty reasonable to me
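
A sketch of the page-range labeling idea discussed above, reusing the per-page Documents from the earlier snippet; the group size and helper name are illustrative, not part of any library API:

```python
# Sketch: merge consecutive page Documents and record the covered range in page_label
# (e.g. pages 1-10 become a single Document with page_label "1-10").
from llama_index.core import Document

def combine_pages(page_docs, pages_per_doc=10):
    """Merge consecutive page Documents, labeling each result with its page range."""
    merged = []
    for i in range(0, len(page_docs), pages_per_doc):
        group = page_docs[i : i + pages_per_doc]
        first = group[0].metadata.get("page_label", str(i + 1))
        last = group[-1].metadata.get("page_label", str(i + len(group)))
        merged.append(
            Document(
                text="\n\n".join(d.text for d in group),
                metadata={"page_label": f"{first}-{last}"},
            )
        )
    return merged
```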