Updated last year

`SimpleDirectoryReader` chunks a PDF

At a glance
SimpleDirectoryReader chunks a PDF into nodes by default (pages), how do I control the node sizing? Do I need to call out each file type explicitly? Or can I use something like SimpleNodeParser to overwrite the defaults?
33 comments
Yeah I saw that, but it doesn't change how the docs exist. For example, they're already chunked into pages at documents=. I can't unchunk them.
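from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.text_splitter import SentenceSplitter

# Load every file under ./data (PDFs arrive as one document per page)
documents = SimpleDirectoryReader("./data").load_data()

# The splitter, not the reader, decides the final node size
text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
service_context = ServiceContext.from_defaults(text_splitter=text_splitter)

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)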
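To make the two splitter parameters concrete, here is a minimal sketch of what `chunk_size` and `chunk_overlap` mean. It is a simplified stand-in, not how `SentenceSplitter` works internally: the real splitter counts tokens and tries to respect sentence boundaries, while this hypothetical `sliding_chunks` helper just slides a character window.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into character windows that share chunk_overlap characters.

    Simplified illustration only: SentenceSplitter measures tokens and
    prefers sentence boundaries; this helper measures raw characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk starts with the final character of the previous one, which is the role the overlap plays in keeping context across chunk boundaries.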
You can un-chunk them though

mega_document = Document(text="".join([doc.text for doc in documents]))
just concat into one big document
Sorry, I mean I can do that. But why does SimpleDirectoryReader chunk?
They are pre-parsed into pages so that tracking sources is easier (i.e. page numbers)
it's in the metadata of each document object
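If there are several files in the folder, one variation on the "one big document" idea is to merge pages per file rather than across everything. The sketch below assumes each page-level document exposes `.text` and a `.metadata` dict with a `file_name` key, and that the reader returns pages in order; the `merge_pages_by_file` helper and the `SimpleNamespace` stand-ins are hypothetical, used so the example runs without the library.

```python
from collections import defaultdict
from types import SimpleNamespace


def merge_pages_by_file(documents, separator="\n\n"):
    """Group page-level documents by file name and join their text.

    Assumes each document has .text and .metadata["file_name"], and that
    pages arrive in reading order. Returns {file_name: merged_text}.
    """
    pages = defaultdict(list)
    for doc in documents:
        pages[doc.metadata.get("file_name", "unknown")].append(doc.text)
    return {name: separator.join(texts) for name, texts in pages.items()}


# Stand-in objects shaped like page-level documents (not real Document instances)
fake_pages = [
    SimpleNamespace(text="Intro.", metadata={"file_name": "a.pdf", "page_label": "1"}),
    SimpleNamespace(text="Details.", metadata={"file_name": "a.pdf", "page_label": "2"}),
    SimpleNamespace(text="Notes.", metadata={"file_name": "b.pdf", "page_label": "1"}),
]
merged = merge_pages_by_file(fake_pages)
print(merged)
```

Each merged string can then be wrapped in `Document(text=...)` as in the one-liner above. The trade-off is that page numbers are no longer attached to the resulting nodes.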
But to override the behavior of DEFAULT_FILE_READER_CLASS I need to do so for each file type correct?
If I have markdown, pdf, whatever to load, can I chunk them all the same way if I wanted?
Sorry for the probably dumb questions, I'm learning Python and LLMs as I go.
Not for each file type
It will still use the same old default loaders for other file types
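A rough model of that behavior, as a sketch: an override table is consulted first, and any extension it does not list falls back to the defaults. The reader names here are hypothetical strings, not LlamaIndex classes, and `pick_reader` is not the library's code. In LlamaIndex itself the override table is the `file_extractor` argument of `SimpleDirectoryReader`, a dict mapping extensions such as `".pdf"` to reader instances.

```python
from pathlib import Path

# Hypothetical stand-ins for the built-in per-extension readers
DEFAULT_READERS = {
    ".pdf": "default_pdf_reader",
    ".md": "default_markdown_reader",
}


def pick_reader(path, overrides, defaults=DEFAULT_READERS, fallback="plain_text_reader"):
    """Return the reader for a file: an override wins, then the default, then plain text."""
    ext = Path(path).suffix.lower()
    if ext in overrides:
        return overrides[ext]
    return defaults.get(ext, fallback)


# Override only PDFs; everything else keeps its usual reader
overrides = {".pdf": "my_whole_file_pdf_reader"}
chosen = {name: pick_reader(name, overrides) for name in ["a.pdf", "notes.md", "log.txt"]}
print(chosen)
```

So only the file types you want handled differently need an entry.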
One more. I'm using Chroma locally for testing and the docstore is empty. How do I properly save and then load the docstore?
Do I just have to run storage_context.docstore.add_documents? Or something else?
In other words I thought Chroma saved the doc store already
When using vector db integrations, we serialize all the nodes into the db, to simplify storage

You'll have to manually use/save/load the docstore if you need it
So for chroma docstore = SimpleDocumentStore.get_nodes(???) for loading from the DB?
Like I said, Python rookie.
Not quite

from llama_index.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()
docstore.add_documents(documents)

docstore.persist(persist_path="./docstore.json")
loaded_docstore = SimpleDocumentStore.from_persist_path("./docstore.json")
something like that
But that docstore in Chroma...it'd be in the embedding metadata table I believe
I'll do a simple one to files, then add back chroma to compare the DB to the json version
Basically using this example, I can add docs to docstore okay but can't get the docstore to load from the chromadb. https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/ChromaIndexDemo.html
You need to load/persist the docstore to disk like the above example I gave, or use a redis or mongodb docstore integration. You can't put the docstore in chromadb sadly
Ahh. Looking at the vector store comparison I saw the checkmark under docs for chroma: https://docs.llamaindex.ai/en/latest/module_guides/storing/vector_stores.html
Misunderstood it sounds like
Thought I could "recreate" the files from the metadata stored in Chroma.
Ah yea. However, when you perform a query or retrieval, you can get the source nodes and their metadata
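As a sketch of reading those sources back: a query response exposes `source_nodes`, where each entry carries a `node` (with its `metadata` dict) and a `score`. The `collect_sources` helper below is hypothetical, and the `file_name` / `page_label` keys assume the default PDF loader's metadata. A `SimpleNamespace` stand-in replaces a real response so the example runs without an index or an LLM.

```python
from types import SimpleNamespace


def collect_sources(response):
    """Return (file_name, page_label, score) for every source node in a response.

    Assumes the metadata keys written by the default PDF loader; other
    loaders may use different keys, in which case .get() yields None.
    """
    sources = []
    for hit in response.source_nodes:
        meta = hit.node.metadata
        sources.append((meta.get("file_name"), meta.get("page_label"), hit.score))
    return sources


# Stand-in shaped like a query response (not a real LlamaIndex object)
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            node=SimpleNamespace(metadata={"file_name": "report.pdf", "page_label": "3"}),
            score=0.82,
        )
    ]
)
sources = collect_sources(fake_response)
print(sources)
```

With a real index, the same function would be called on the object returned by `query_engine.query(...)`. This gives you source tracking but not the original files, which Chroma alone cannot rebuild.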
So what would I add to this to save and then open the docstore? Does llamaindex support that? Or do I need to use the chroma library?
This being that 😄
Hey, thanks for all the help! It led me to exactly what I wanted to create. I know you get bombarded by a bunch of requests all the time, but I really appreciate it!
Glad you were able to create what you needed! 💪