Hi, if we have a list of md files in a directory, and we want to retrieve a selected few of those files (based on a query) and supply those files as context for a prompt, what is the best way to achieve this in LlamaIndex? (If we do the normal simple_directory_reader.load_data and VectorStoreIndex.from_documents, we will retrieve Node chunks, not the entire file. Do we have to manually put each file into a Node, or are there better ways?) Thanks
You will need to create Nodes on your side and then pass these nodes to the VectorStoreIndex:

Plain Text
from llama_index import VectorStoreIndex
from llama_index.schema import TextNode

# one Node per markdown file (TextNode is a pydantic model, so the text= keyword is required)
node = TextNode(text="ADD_MARKDOWN_FILE_CONTEXT_HERE")
index = VectorStoreIndex([node])

# query and retrieve the file from here

But make sure your node does not cross the max token limit for the LLM, or it will fail to generate a response.
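For a rough size check before indexing, something like this sketch works (tiktoken and the model name are my assumptions; pick the encoding for whatever LLM you use, and the file path is hypothetical):

Plain Text
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
md_text = open("my_file.md").read()  # hypothetical file

# leave headroom for the prompt template and the generated answer
if len(enc.encode(md_text)) > 3000:
    print("file is too large for a single node in the context window")
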
Yeah, in this case we have to manually manage the size. Are there better ways? Can we leverage the Document Summary Index? It seems to maintain both a summary and all nodes associated with the document. I am checking out the example at https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary.html. It seems to be able to retrieve all nodes for the top-matched document, but I am not sure how easy or difficult it will be to extend it to multiple documents (with all their respective nodes).
Yea, you can use the document summary index. It will retrieve based on summaries and send the entire document.

Another option is just writing a custom node-postprocessor to replace retrieved nodes with their parent documents
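A minimal sketch of the document summary index option (assuming old-style top-level llama_index imports and a local "data" directory; building the index calls the LLM to summarize each document):

Plain Text
from llama_index import DocumentSummaryIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()

# builds an LLM-written summary per document; at query time the
# retriever matches the query against summaries and returns all
# nodes belonging to the matching document(s)
doc_summary_index = DocumentSummaryIndex.from_documents(documents)
response = doc_summary_index.as_query_engine().query("your question here")
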
Thanks @Logan M, the node postprocessor approach is a great one. I am looking at https://docs.llamaindex.ai/en/stable/module_guides/querying/node_postprocessors/root.html. Do you have any pointers on how I can obtain the full parent document from a retrieved node? (Other than, e.g., reading the file again if the retrieved node's metadata has the file path, but are there better ways to do it?)
You could also insert the parent documents into a docstore object, and use docstore.get_document(doc_id)
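A minimal sketch of that docstore-backed idea (replace_with_parent_docs is a hypothetical helper, not a LlamaIndex API; it assumes the parent Documents were added to the docstore):

Plain Text
from llama_index.schema import NodeWithScore

# hypothetical helper: swap retrieved node chunks for their parent documents
def replace_with_parent_docs(nodes_with_scores, docstore):
    out, seen = [], set()
    for nws in nodes_with_scores:
        parent_id = nws.node.ref_doc_id  # id of the source Document
        if parent_id is None or parent_id in seen:
            continue
        seen.add(parent_id)
        parent = docstore.get_document(parent_id)  # requires the Document in the docstore
        out.append(NodeWithScore(node=parent, score=nws.score))
    return out
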
Thanks, I was not aware of docstore, but it looks very useful! I am trying this example: https://docs.llamaindex.ai/en/stable/examples/docstore/DocstoreDemo.html. When I also want to persist the indexes created in the example, I wasn't able to load them back, because it needs an index_id which I don't know where to get. E.g.,

Plain Text
from llama_index import (
    SimpleKeywordTableIndex,
    StorageContext,
    SummaryIndex,
    VectorStoreIndex,
    load_index_from_storage,
)

# from the example
storage_context = StorageContext.from_defaults(docstore=docstore)
summary_index = SummaryIndex(nodes, storage_context=storage_context)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)
keyword_table_index = SimpleKeywordTableIndex(
    nodes, storage_context=storage_context
)

# added code to persist
vector_index.storage_context.persist("storage/docstoredemo_vector_index")

local_vector_store_index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="storage/docstoredemo_vector_index")
)


This gives the error:

Plain Text
File ~/github/llama_index/llama_index/indices/loading.py:40, in load_index_from_storage(storage_context, index_id, **kwargs)
     36     raise ValueError(
     37         "No index in storage context, check if you specified the right persist_dir."
     38     )
     39 elif len(indices) > 1:
---> 40     raise ValueError(
     41         f"Expected to load a single index, but got {len(indices)} instead. "
     42         "Please specify index_id."
     43     )
     45 return indices[0]

ValueError: Expected to load a single index, but got 3 instead. Please specify index_id.


I also tried local_vector_store_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="storage/docstoredemo_vector_index"), index_id="vector_index") but that doesn't work either ...
Yea, you should set the index_id for each index if you share the storage context like that 👍

index.set_index_id("my_index")
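A sketch of the full round trip with explicit ids (the id strings and persist dir are arbitrary names; the three indexes are the ones from the snippet above):

Plain Text
# give each index a stable id before persisting the shared storage context
summary_index.set_index_id("summary_index")
vector_index.set_index_id("vector_index")
keyword_table_index.set_index_id("keyword_index")

storage_context.persist("storage/docstoredemo")

# load one specific index back out of the shared persist dir
local_vector_index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="storage/docstoredemo"),
    index_id="vector_index",
)
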
Thanks, that worked.

I am still confused about what the doc_id is in docstore.get_document(doc_id). From the example notebook:

Plain Text
from llama_index.node_parser import SentenceSplitter
from llama_index.storage.docstore import SimpleDocumentStore

nodes = SentenceSplitter().get_nodes_from_documents(documents)
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)

So it seems everything is still operating at the node level, not the document level. If that is the case, the doc_id is still the node_id or the id_ attribute of the Node object, is that correct?
Correct -- doc_id is older terminology we can't shake lol

Just assume almost everything to do with the docstore also works for nodes
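A quick check of that, using the nodes and docstore from the snippet above:

Plain Text
# the "doc_id" the docstore expects is just the node's id
node = nodes[0]
fetched = docstore.get_document(node.node_id)
assert fetched.text == node.text
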
I see, that makes sense. A couple of follow-up questions:

  1. I saw that "Document" is also derived from TextNode, which derives from BaseNode. So we can do VectorStoreIndex(docs); why do we still need VectorStoreIndex.from_documents(docs)? Or are they essentially the same now?
  2. In the following code from the example notebook:
Plain Text
from llama_index import StorageContext, VectorStoreIndex
from llama_index.node_parser import SentenceSplitter
from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.storage.index_store import SimpleIndexStore
from llama_index.vector_stores import SimpleVectorStore

# create parser and parse document into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# create storage context using default stores
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
    index_store=SimpleIndexStore(),
)

# create (or load) docstore and add nodes
storage_context.docstore.add_documents(nodes)

# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)

Since the storage_context's docstore already has the nodes added, why do we have to provide the same nodes again when we construct the index? What happens if different nodes are provided when building the index?
  1. from_documents will parse the documents into smaller chunks (i.e. nodes). Initializing directly with the constructor skips that step.
  2. The docstore merely holds the nodes. When you create the vector store index here, the vector store itself is populated with a mapping of node ID to embedding vectors, and the index store is updated with a list of IDs that belong to that index.
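In other words, roughly (a sketch; the exact default node parser may differ by version):

Plain Text
# these two are roughly equivalent
index_a = VectorStoreIndex.from_documents(documents)

nodes = SentenceSplitter().get_nodes_from_documents(documents)
index_b = VectorStoreIndex(nodes)
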
Thanks! Now I see. When I use VectorStoreIndex.from_documents, my Documents will be chunked into Nodes, and I can do all the normal operations and retrieve Node objects. That Node object actually has a ref_doc_id attribute that can link me back to the original Document object. In this case, is there a good way to get the entire Document content in its original form from all nodes with this ref_doc_id (essentially reversing the chunking process)?

If that's not available, from your earlier reply, I think a workaround is that I can also store the Document objects directly in the docstore, and then use docstore.get_document() to retrieve the full Document content based on the ref_doc_id. It's just a little inefficient, because the Document and the Nodes essentially contain the same information.
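A sketch of that workaround, reusing the documents, nodes, docstore, and index from the snippets above:

Plain Text
# store the full Documents alongside their chunked nodes
# (Document is a TextNode subclass, so the docstore accepts it)
docstore.add_documents(documents)
docstore.add_documents(nodes)

# after retrieval, map any node back to its original Document
retrieved_node = index.as_retriever().retrieve("your question here")[0].node
full_doc = docstore.get_document(retrieved_node.ref_doc_id)
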