LlamaIndex

Updated last year

`SimpleDirectoryReader` chunks a PDF

At a glance

The community members are discussing how to control the node sizing when using the SimpleDirectoryReader to chunk PDF files. They explore using the SimpleNodeParser to overwrite the defaults, and discuss how to un-chunk the documents. The community members also discuss how to properly save and load the docstore when using Chroma as the vector database, noting that the docstore needs to be manually saved and loaded, and cannot be stored directly in Chroma. Overall, the discussion covers customizing the node parsing and handling the docstore when using vector databases.

SimpleDirectoryReader chunks a PDF into nodes by default (one per page). How do I control the node sizing? Do I need to call out each file type explicitly, or can I use something like SimpleNodeParser to override the defaults?

33 comments

Yeah with node parser and adjusting chunk size: https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/root.html#customization

Yeah I saw that, but it doesn't change how the docs exist. For example, they're already chunked into pages at documents=. I can't unchunk them.

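from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.text_splitter import SentenceSplitter

# Load the files; PDFs come back as one Document per page
documents = SimpleDirectoryReader("./data").load_data()

# Re-split at index time with the chunk size and overlap set here
text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
service_context = ServiceContext.from_defaults(text_splitter=text_splitter)

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)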

You can un-chunk them though

from llama_index import Document
mega_document = Document(text="\n\n".join(doc.text for doc in documents))

just concat into one big document

Sorry, I mean I can do that. But why does SimpleDirectoryReader chunk?

They are pre-parsed into pages so that tracking sources is easier (i.e. page numbers)

it's in the metadata of each document object
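If one document per file is wanted instead of one per page, the per-page metadata can be used to merge pages back together without losing track of where the text came from. The following is only a rough sketch: it uses plain dicts as a stand-in for Document objects, and the metadata keys `file_name` and `page_label` are assumptions about what the PDF loader records, so check the metadata on your own documents first.

```python
from collections import defaultdict


def merge_pages_by_file(pages):
    """Merge per-page records into one record per source file.

    `pages` is a list of dicts with "text" and "metadata" keys, a
    hypothetical plain-dict stand-in for Document objects.
    """
    grouped = defaultdict(list)
    for page in pages:
        # Assumed metadata key: "file_name"
        grouped[page["metadata"]["file_name"]].append(page)

    merged = []
    for file_name, file_pages in grouped.items():
        merged.append(
            {
                # Blank line between pages so boundary words do not fuse
                "text": "\n\n".join(p["text"] for p in file_pages),
                "metadata": {
                    "file_name": file_name,
                    # Keep the page labels so sources stay traceable
                    "page_labels": [
                        p["metadata"]["page_label"] for p in file_pages
                    ],
                },
            }
        )
    return merged


pages = [
    {"text": "Intro.", "metadata": {"file_name": "a.pdf", "page_label": "1"}},
    {"text": "Details.", "metadata": {"file_name": "a.pdf", "page_label": "2"}},
    {"text": "Notes.", "metadata": {"file_name": "b.pdf", "page_label": "1"}},
]
merged = merge_pages_by_file(pages)
```

Each merged record could then be wrapped in a Document and passed to the index, where the text splitter re-chunks it at whatever size is configured.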

But to override the behavior of DEFAULT_FILE_READER_CLASS I need to do so for each file type correct?

If I have markdown, pdf, whatever to load, can I chunk them all the same way if I wanted?

Sorry for the probably dumb questions, I'm learning Python and LLMs as I go.

Not for each file type

It will still use the same old default loaders for other file types
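To illustrate what "chunk them all the same way" means, here is a simplified sketch of one splitter applied uniformly to text from any file type. It is not how SentenceSplitter works internally (that one counts tokens and respects sentence boundaries); this hypothetical `split_with_overlap` helper just slides a character window, to show how `chunk_size` and `chunk_overlap` interact.

```python
def split_with_overlap(text, chunk_size=1024, chunk_overlap=20):
    """Split text into character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# The same settings are applied regardless of where the text came from
texts = {"notes.md": "a" * 25, "paper.pdf": "b" * 12}
chunks = {
    name: split_with_overlap(text, chunk_size=10, chunk_overlap=2)
    for name, text in texts.items()
}
```

The point is that the loader only decides how files become documents; the chunk sizes are decided later by whichever splitter is configured, independent of file type.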

One more. I'm using Chroma locally for testing and the docstore is empty. How do I properly save and then load the docstore?

Do I just have to run storage_context.docstore.add_documents? Or something else?

In other words I thought Chroma saved the doc store already

When using vector db integrations, we serialize all the nodes into the db, to simplify storage

You'll have to manually use/save/load the docstore if you need it

So for chroma docstore = SimpleDocumentStore.get_nodes(???) for loading from the DB?

Like I said, Python rookie.

Not quite

from llama_index.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()
docstore.add_documents(documents)

docstore.persist(persist_path="./docstore.json")
loaded_docstore = SimpleDocumentStore.from_persist_path("./docstore.json")

something like that

But that docstore in Chroma...it'd be in the embedding metadata table I believe

I'll do a simple one to files, then add back chroma to compare the DB to the json version

Basically using this example, I can add docs to docstore okay but can't get the docstore to load from the chromadb. https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/ChromaIndexDemo.html

You need to load/persist the docstore to disk like the above example I gave, or use a redis or mongodb docstore integration. You can't put the docstore in chromadb sadly
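For anyone unsure what "persist the docstore to disk" amounts to, here is a toy illustration of the round trip using only the standard library. `TinyDocstore` is a hypothetical stand-in, not the library's SimpleDocumentStore; the method names only mirror the example above, and the document shape (dicts with `id`, `text`, `metadata`) is an assumption for the sketch.

```python
import json
import os
import tempfile


class TinyDocstore:
    """Hypothetical stand-in for a docstore: id -> text and metadata."""

    def __init__(self):
        self.docs = {}

    def add_documents(self, documents):
        for doc in documents:
            self.docs[doc["id"]] = {
                "text": doc["text"],
                "metadata": doc["metadata"],
            }

    def persist(self, persist_path):
        # The docstore lives in its own JSON file, separate from the vector db
        with open(persist_path, "w", encoding="utf-8") as f:
            json.dump(self.docs, f)

    @classmethod
    def from_persist_path(cls, persist_path):
        store = cls()
        with open(persist_path, encoding="utf-8") as f:
            store.docs = json.load(f)
        return store


documents = [
    {"id": "doc-1", "text": "Intro.", "metadata": {"page_label": "1"}},
    {"id": "doc-2", "text": "Details.", "metadata": {"page_label": "2"}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "docstore.json")
    store = TinyDocstore()
    store.add_documents(documents)
    store.persist(path)
    loaded = TinyDocstore.from_persist_path(path)
```

The vector database holds the embeddings and serialized nodes for retrieval, while a file like this (or a redis/mongodb docstore) is what has to be saved and reloaded separately if the docstore itself is needed.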

Ahh. Looking at the vector store comparison I saw the checkmark under docs for chroma: https://docs.llamaindex.ai/en/latest/module_guides/storing/vector_stores.html

Sounds like I misunderstood it.

Thought I could "recreate" the files from the metadata stored in Chroma.

Ah yea. However, when you perform a query or retrieval, you can get the source nodes and their metadata

So what would I add to this to save and then open the docstore? Does llamaindex support that? Or do I need to use the chroma library?

This being that 😄

Hey, thanks for all the help! It led me to exactly what I wanted to create. I know you get bombarded by a bunch of requests all the time, but I really appreciate it!

Glad you were able to create what you needed! 💪
