long-context text chunks with PDFs

can't you just increase the chunk_size?
the challenge is that if a single page's content is shorter than the chunk size, the resulting Document will still contain only that page's content
for real? Does the default splitter always split the single page even if the content is < chunk_size?
This isn't the splitter, it's that the default PDF loader loads a Document object per page no matter what

You can easily combine that back into a single document object though
you just lose out on tracking which page the text came from
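
A minimal sketch of the merge-back approach described above, assuming this thread is about LlamaIndex (where the default PDF reader used by SimpleDirectoryReader returns one Document per page); the filename and import paths reflect a recent install and are assumptions, not taken from the thread:

```python
# Minimal sketch, assuming a recent LlamaIndex with core imports under llama_index.core
# and a hypothetical local file "sample.pdf".
from llama_index.core import Document, SimpleDirectoryReader

# The default PDF reader yields one Document per page.
page_docs = SimpleDirectoryReader(input_files=["sample.pdf"]).load_data()

# Merge every page back into a single Document; the per-page page_label metadata is lost.
combined = Document(text="\n\n".join(d.text for d in page_docs))
```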
ah thanks @Logan M! yeah my thought with going that route is manually setting the page_label metadata value for the updated, larger documents (e.g. a chunk that comprises pages 1-10 gets a new page_label value of '1-10'); would be curious to hear if you think there's a smarter way to get at that issue
that sounds pretty reasonable to me
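
A sketch of the page-range labeling idea discussed above, reusing the per-page Documents from the earlier snippet; the group size and helper name are illustrative, not part of any library API:

```python
# Sketch: merge consecutive page Documents and record the covered range in page_label
# (e.g. pages 1-10 become a single Document with page_label "1-10").
from llama_index.core import Document

def combine_pages(page_docs, pages_per_doc=10):
    """Merge consecutive page Documents, labeling each result with its page range."""
    merged = []
    for i in range(0, len(page_docs), pages_per_doc):
        group = page_docs[i : i + pages_per_doc]
        first = group[0].metadata.get("page_label", str(i + 1))
        last = group[-1].metadata.get("page_label", str(i + len(group)))
        merged.append(
            Document(
                text="\n\n".join(d.text for d in group),
                metadata={"page_label": f"{first}-{last}"},
            )
        )
    return merged
```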