If we use `SimpleDirectoryReader.load_data` and `VectorStoreIndex.from_documents`, we will retrieve Node chunks, not the entire file. Do we have to manually put each file into a Node, or are there better ways? Thanks!

```python
node = TextNode(text="ADD_MARKDOWN_FILE_CONTEXT HERE")
index = VectorStoreIndex([node])
# Query and retrieve the file from here
```
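As a toy illustration of the "one node per file" idea (plain-Python stand-ins, not llama_index's real `TextNode`/`SimpleDirectoryReader`; `WholeFileNode` and `load_whole_files` are hypothetical names): if each file becomes exactly one node and no splitter runs, a retrieval hit is the whole file.

```python
# Toy sketch (plain Python, not llama_index's real classes): load each
# markdown file as a single node so a retrieval hit returns the whole file.
from pathlib import Path
import tempfile

class WholeFileNode:
    """Stand-in for a TextNode holding one entire file."""
    def __init__(self, text: str, file_name: str):
        self.text = text
        self.file_name = file_name

def load_whole_files(directory: str) -> list:
    # One node per file: no sentence splitting, so nothing is chunked.
    return [
        WholeFileNode(p.read_text(), p.name)
        for p in sorted(Path(directory).glob("*.md"))
    ]

# Demo with a temporary directory standing in for your docs folder.
with tempfile.TemporaryDirectory() as d:
    Path(d, "a.md").write_text("# Doc A\nfull content of A")
    Path(d, "b.md").write_text("# Doc B\nfull content of B")
    nodes = load_whole_files(d)

print([n.file_name for n in nodes])  # prints ['a.md', 'b.md']
```

Note that chunking normally happens inside `from_documents` (via a node parser), not inside the reader, so handing unsplit nodes straight to the index constructor is the usual way to skip it.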
You can use `docstore.get_document(doc_id)`.

I haven't used the `docstore` before, but it looks very useful! I am trying this example: https://docs.llamaindex.ai/en/stable/examples/docstore/DocstoreDemo.html When I also want to persist the indexes created in the example, I wasn't able to load them, because loading needs an `index_id` and I don't know where to get it. e.g.:

```python
# from the example
storage_context = StorageContext.from_defaults(docstore=docstore)
summary_index = SummaryIndex(nodes, storage_context=storage_context)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)
keyword_table_index = SimpleKeywordTableIndex(
    nodes, storage_context=storage_context
)

# added code to persist and reload
vector_index.storage_context.persist("storage/docstoredemo_vector_index")
local_vector_store_index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="storage/docstoredemo_vector_index")
)
```
```
File ~/github/llama_index/llama_index/indices/loading.py:40, in load_index_from_storage(storage_context, index_id, **kwargs)
     36     raise ValueError(
     37         "No index in storage context, check if you specified the right persist_dir."
     38     )
     39 elif len(indices) > 1:
---> 40     raise ValueError(
     41         f"Expected to load a single index, but got {len(indices)} instead. "
     42         "Please specify index_id."
     43     )
     45 return indices[0]

ValueError: Expected to load a single index, but got 3 instead. Please specify index_id.
```
I also tried specifying the ID:

```python
local_vector_store_index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="storage/docstoredemo_vector_index"),
    index_id="vector_index",
)
```

but that doesn't work either...

In the demo all three indices share one `StorageContext`, so persisting it writes all three, and each index gets a generated ID by default. Call `index.set_index_id("my_index")` before persisting, then pass that same string as `index_id` when loading.
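To see why the loader complains, here is a toy model of the storage layout (illustrative names like `persist_indices`/`load_index` and the JSON file name are made up, not llama_index internals): one persist directory can hold several indices keyed by ID, and loading without an ID is only unambiguous when there is exactly one.

```python
# Toy model of a persist_dir that holds several indices keyed by index_id.
import json
import os
import tempfile

def persist_indices(indices: dict, persist_dir: str) -> None:
    # All indices sharing a storage context land in the same store.
    os.makedirs(persist_dir, exist_ok=True)
    with open(os.path.join(persist_dir, "index_store.json"), "w") as f:
        json.dump(indices, f)

def load_index(persist_dir: str, index_id=None):
    with open(os.path.join(persist_dir, "index_store.json")) as f:
        indices = json.load(f)
    if index_id is None:
        # Ambiguous unless the store holds exactly one index.
        if len(indices) > 1:
            raise ValueError(
                f"Expected to load a single index, but got {len(indices)} "
                "instead. Please specify index_id."
            )
        return next(iter(indices.values()))
    return indices[index_id]

with tempfile.TemporaryDirectory() as d:
    # The demo's shared storage context holds all three indices.
    persist_indices(
        {"summary_index": "summary-struct",
         "vector_index": "vector-struct",
         "keyword_table_index": "keyword-struct"},
        d,
    )
    try:
        load_index(d)  # no ID: ambiguous, raises
    except ValueError as e:
        err = str(e)
    vector = load_index(d, index_id="vector_index")  # unambiguous
```

This mirrors the shape of the error above: the load only succeeds once an ID disambiguates which of the three indices you mean.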
Regarding the `doc_id` in `docstore.get_document(doc_id)`: from the example notebook,

```python
nodes = SentenceSplitter().get_nodes_from_documents(documents)
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
```

the `doc_id` is still the `node_id` (i.e. the `id_` attribute) of the Node object, is that correct?

`TextNode` inherits from `BaseNode`, so we can do `VectorStoreIndex(docs)` directly. Why do we still need `VectorStoreIndex.from_documents(docs)`? Or are they essentially the same now?

```python
# create parser and parse document into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# create storage context using default stores
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
    index_store=SimpleIndexStore(),
)

# create (or load) docstore and add nodes
storage_context.docstore.add_documents(nodes)

# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)
```
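As a toy illustration of the difference between the two entry points (plain Python; `ToyIndex` and `split_into_nodes` are made-up stand-ins, not llama_index classes): `from_documents()` parses documents into chunks first, while the plain constructor indexes whatever nodes you hand it, unchanged.

```python
# Toy stand-ins for the two VectorStoreIndex entry points.

def split_into_nodes(document: str, chunk_size: int = 4) -> list:
    # Crude word-based splitter standing in for SentenceSplitter.
    words = document.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

class ToyIndex:
    def __init__(self, nodes):
        # Like VectorStoreIndex(nodes): nodes are stored as given.
        self.nodes = list(nodes)

    @classmethod
    def from_documents(cls, documents):
        # Like VectorStoreIndex.from_documents(docs): chunk, then construct.
        return cls(n for d in documents for n in split_into_nodes(d))

doc = "one two three four five six seven eight"
chunked = ToyIndex.from_documents([doc])  # 2 chunks of 4 words
as_is = ToyIndex([doc])                   # 1 unsplit "node"
print(len(chunked.nodes), len(as_is.nodes))  # prints: 2 1
```

So they are not the same: the constructor path only makes sense once you have already produced nodes yourself (e.g. with `SentenceSplitter`).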
Also, if the `storage_context`'s `docstore` already has the `nodes` added, why do we have to provide the same `nodes` again when we construct the `index`? And what happens if different nodes are provided when building the index?

`from_documents`
will parse the documents into smaller chunks (i.e. nodes). Initializing directly with the constructor skips that step.

With `VectorStoreIndex.from_documents`, my Documents will be chunked into Nodes, and I can do all the normal operations and retrieve Node objects. That Node object has a `ref_doc_id` attribute that can link me back to the original Document object. In this case, is there a good way to get the entire Document content in its original form from all the nodes with this `ref_doc_id` (essentially reversing the chunking process)?

Yes, you can keep the original Document in the docstore and look it up by `ref_doc_id`. It's just a little inefficient, because the Document and the Nodes essentially contain the same information.
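The "reverse chunking" idea can be sketched in plain Python (the `Node` dataclass and `reassemble` helper here are hypothetical stand-ins, not llama_index's `TextNode`; the sketch assumes chunks arrive in original order with no overlap):

```python
# Minimal sketch: group nodes by ref_doc_id and join their text back together.
from dataclasses import dataclass

@dataclass
class Node:
    text: str
    ref_doc_id: str  # links a chunk back to its source Document

def reassemble(nodes, sep: str = " ") -> dict:
    """Map each ref_doc_id to its chunks joined in order."""
    docs = {}
    for node in nodes:  # assumed to be in original document order
        docs.setdefault(node.ref_doc_id, []).append(node.text)
    return {doc_id: sep.join(chunks) for doc_id, chunks in docs.items()}

nodes = [
    Node("The quick brown fox", "doc-1"),
    Node("jumps over the lazy dog.", "doc-1"),
    Node("Hello world.", "doc-2"),
]
print(reassemble(nodes)["doc-1"])
# prints: The quick brown fox jumps over the lazy dog.
```

Caveat: real splitters may use chunk overlap or normalize whitespace, so a join like this is not guaranteed to reproduce the original text byte-for-byte; fetching the stored Document from the docstore by its ID is the faithful route when you have it.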