Yeah, I saw that, but it doesn't change how the docs exist. For example, they're already chunked into pages at the documents = line below. I can't unchunk them.
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.text_splitter import SentenceSplitter

# each file comes back as one Document per page
documents = SimpleDirectoryReader("./data").load_data()
text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
service_context = ServiceContext.from_defaults(text_splitter=text_splitter)
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)
You can un-chunk them though
from llama_index import Document

mega_document = Document(text="".join([doc.text for doc in documents]))
just concat into one big document
Sorry, I mean I can do that. But why does SimpleDirectoryReader chunk?
They are pre-parsed into pages so that tracking sources is easier (i.e. page numbers)
it's in the metadata of each document object
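Roughly like this — each page becomes its own document object with source info in its metadata. This is a plain-Python illustration (no llama_index needed); field names like page_label and file_name are what the default PDF reader typically attaches:

```python
# Illustrative sketch: one entry per page, metadata recording the source.
pages = [
    {"text": "First page text.",  "metadata": {"page_label": "1", "file_name": "report.pdf"}},
    {"text": "Second page text.", "metadata": {"page_label": "2", "file_name": "report.pdf"}},
]

# When a chunk from page 2 is retrieved later, its metadata says where it came from.
hit = pages[1]
print(hit["metadata"]["page_label"])  # "2"
print(hit["metadata"]["file_name"])   # "report.pdf"
```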
But to override the behavior of DEFAULT_FILE_READER_CLASS, I need to do so for each file type, correct?
If I have markdown, pdf, whatever to load, can I chunk them all the same way if I wanted?
Sorry for the probably dumb questions, I'm learning Python and LLMs as I go.
It will still use the same old default loaders for other file types
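Conceptually it's a dict from file extension to reader class: your overrides are merged on top of the defaults, and anything you don't list falls through to the stock loader. A plain-Python sketch of that dispatch (names here are stand-ins, not the actual llama_index internals):

```python
from pathlib import Path

# Stand-ins for the built-in per-extension readers (DEFAULT_FILE_READER_CLASS).
default_readers = {".pdf": "PDFReader", ".md": "MarkdownReader"}

# Your overrides are merged on top; unlisted extensions keep their defaults.
overrides = {".pdf": "MyFlatReader"}
readers = {**default_readers, **overrides}

def pick_reader(path: str) -> str:
    ext = Path(path).suffix.lower()
    return readers.get(ext, "FlatTextReader")  # fallback for unknown types

print(pick_reader("paper.pdf"))  # MyFlatReader (overridden)
print(pick_reader("notes.md"))   # MarkdownReader (default kept)
```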
One more. I'm using Chroma locally for testing and the docstore is empty. How do I properly save and then load the docstore?
Do I just have to run storage_context.docstore.add_documents? Or something else?
In other words I thought Chroma saved the doc store already
When using vector db integrations, we serialize all the nodes into the db, to simplify storage
You'll have to manually use/save/load the docstore if you need it
So for chroma docstore = SimpleDocumentStore.get_nodes(???)
for loading from the DB?
Like I said, Python rookie.
Not quite
from llama_index.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()
docstore.add_documents(documents)

# save to disk
docstore.persist(persist_path="./docstore.json")

# load it back later
loaded_docstore = SimpleDocumentStore.from_persist_path("./docstore.json")
But that docstore in Chroma... it'd be in the embedding metadata table, I believe
I'll do a simple run persisting to files first, then add Chroma back and compare the DB to the JSON version
You need to load/persist the docstore to disk like the above example I gave, or use a redis or mongodb docstore integration. You can't put the docstore in chromadb sadly
Sounds like I misunderstood, then. I thought I could "recreate" the files from the metadata stored in Chroma.
Ah yea. However, when you perform a query or retrieval, you can get the source nodes and their metadata
So what would I add to this to save and then open the docstore? Does llamaindex support that? Or do I need to use the chroma library?
Hey, thanks for all the help! It led me to exactly what I wanted to create. I know you get bombarded by a bunch of requests all the time, but I really appreciate it!
Glad you were able to create what you needed! 💪