I’m manually rebuilding an index from my vector_store and it’s breaking a few things, although the chat engine works fine. I think it has to do with my Metadata Handling.

Issues:
  1. My query engine is not honoring the node_text_template. In image_1, the node is properly formatted and metadata keys are excluded as expected (see the sketch after this list for what that should look like). In image_2, they’re not, even though the node._node_content.text_template is explicitly "text_template": "[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----". This means I'm sending the LLM junk that could mislead it.
  2. I’m getting chat responses and no errors, but my citations aren’t showing up. When inspecting sub_question_answer_pair.sources between image_1 and image_2, the only difference is that the former seems to be missing _node_content.
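For reference, a minimal sketch of what honoring the template should look like on a single node (field values are illustrative):
Plain Text
from llama_index.schema import MetadataMode, TextNode

node = TextNode(
    text="The quarterly results were...",
    metadata={"file_name": "report.pdf", "doc_id": "abc-123"},
    excluded_llm_metadata_keys=["doc_id"],
    text_template="[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----",
)
# should print the excerpt with the template applied and doc_id stripped,
# matching image_1
print(node.get_content(metadata_mode=MetadataMode.LLM))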
Pasting relevant code snippets in thread. Appreciate any help here 🙏
Attachments
image_1.png
image_2.png
26 comments
I think I have a cache issue that drops my index_structs, so load_index_from_storage fails to load indexes that have already been built.

As a failsafe, I made check_and_rebuild_indicies_from_vector_store to rebuild the index if the nodes already exist in the database.

Freshly created indexes work great. rebuilt_index results in the issues in the main thread.
Are you using a vector db? or just the default?
oh the code is there lol
yea the way you are saving/rebuilding seems a little strange.

Any reason to not use fsspec for aws?

Plain Text
index.storage_context.persist(persist_dir="s3_dir", fs=aws_fsspec)
...
storage_context = StorageContext.from_defaults(persist_dir="s3_dir", fs=aws_fsspec)
index = load_index_from_storage(storage_context, service_context=service_context)
fs is actually defined elsewhere, but it is s3fs.S3FileSystem
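Something like this, as a sketch (the bucket path is a placeholder; credentials come from the environment):
Plain Text
import s3fs

# credentials resolved from env vars / AWS config
aws_fsspec = s3fs.S3FileSystem(anon=False)

index.storage_context.persist(persist_dir="my-bucket/index_storage", fs=aws_fsspec)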
I want to refactor the caching/persisting issue separately (it needs a bigger overhaul, including partitioning the docstore)
I guess the whole check_and_rebuild_indicies_from_vector_store function feels a little confusing.

Why do you need to rebuild the index without using load_index_from_storage() ?
Plain Text
I think I have a cache issue that drops my index_structs, so load_index_from_storage fails to load indexes that have already been built.


hmm 🤔
persist() should be saving the index structs in the index store
I first try to load it, but if the index_struct gets lost (perhaps just because I'm on local dev, and something wonky happens), I don't want to have to rebuild the index w/ transformations/embeddings, and all that jazz

Just want to rebuild it:
Plain Text
if index_id in existing_index_ids:
    logger.info(f"Found existing index {index_id}")
    loaded_index = load_index_from_storage(
        storage_context,
        index_id=index_id,
        service_context=service_context,
    )
    indices.append(loaded_index)
else:
    logger.info(f"Could not find existing index {index_id}. Checking if nodes exist in vector store...")
    rebuilt_index = await check_and_rebuild_indicies_from_vector_store(
        index_id=index_id,
        service_context=service_context,
        fs=fs
    )
    indices.append(rebuilt_index)
oh boy 😅 Hmm, I guess overall this feels more confusing than it should be, at least at first glance. A bit difficult to parse where the issue is.

I see you are also using CustomPGVectorStore -- by default, as you probably know, the docstore and index store won't be populated if stores_text=True on the vector store.

You can override that behavior by setting store_nodes_override=True in the vector index constructor.
Yea, wish I knew when/why index_structs get dropped. I've had a hard time reproducing it, so I just wanted a workaround I had control over 🫤
I've never had the issue personally 😅 But I suspect it might be related to needing to set store_nodes_override=True when creating your index
Plain Text
index = VectorStoreIndex.from_documents(
    documents, 
    service_context=service_context, 
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    store_nodes_override=True
)

index.storage_context.persist(...)

index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./storage", vector_store=vector_store))
some flow like that
If I do VectorStoreIndex.from_documents, then it tries to rebuild the index and creates new embeddings
rebuilt_index = VectorStoreIndex.from_vector_store works, but idk if it has store_nodes_override
Yea it will re-embed, but I think you kind of need to start fresh to fix this issue 😅 store_nodes_override only matters when indexing new data

from_vector_store was really only meant for remote vector dbs. Although it kind of works here, the preferred way to do it is load_index_from_storage() or alternatively VectorStoreIndex([], storage_context=storage_context, service_context=service_context)
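For reference, a minimal sketch of that second option (assuming the vector_store and service_context defined earlier in the thread; the empty node list means nothing gets re-embedded):
Plain Text
from llama_index import StorageContext, VectorStoreIndex

# wrap the existing vector store; no documents are passed, so no new
# embeddings are created
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    [],
    storage_context=storage_context,
    service_context=service_context,
)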
Tried a bunch of approaches, including the latter.

Only one that didn't duplicate nodes and rerun embeddings is from_vector_store
I guess I'm saying you'll need to properly reingest/re-embed your data at least once in order to save the proper index data, because it seems like somewhere along the way things got borked. If you rebuild and save properly, there should be no issues, but you likely need to disregard data you've already saved.

Or at least, try with a subset of data
Plain Text
index = VectorStoreIndex.from_documents(
    documents, 
    service_context=service_context, 
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    store_nodes_override=True
)

index.storage_context.persist(...)

index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./storage", vector_store=vector_store))


This flow should be bullet proof tbh. Create the index and re-embed, but use store_nodes_override.

Persist from there, and then you should have no issues using load_index_from_storage
After a long slog, I figured out the issue. It was just my metadata handling.

In my code above, I was querying the vector_store, returning a list of [TextNode], saving it as all_vectors, and passing that directly when rebuilding the index. The issue is that when casting the query results as TextNodes, the old metadata gets re-wrapped as new metadata, so after N rounds of query-save-cast, the metadata would be nested N levels deep. That's why on >=2 chat calls, the text_template and excluded_llm_metadata_keys weren't being honored -- they weren't in the right place!
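Roughly what was happening, as an illustrative sketch (not the actual code; the _node_content value is abbreviated):
Plain Text
from llama_index.schema import TextNode

# what's actually stored in the vector store: real metadata plus a
# serialized copy of the whole node under "_node_content"
stored_metadata = {
    "file_name": "doc.pdf",
    "_node_content": '{"text_template": "...", "excluded_llm_metadata_keys": ["..."], ...}',
}

# naive re-wrap of a query result: the old payload (including _node_content)
# becomes plain metadata on the new node, so text_template and
# excluded_llm_metadata_keys sink one level deeper on every round
naive_node = TextNode(text="some excerpt", metadata=stored_metadata)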

The solution was to rebuild the nodes' metadata using metadata_dict_to_node:
Plain Text
from llama_index.vector_stores.utils import metadata_dict_to_node

rebuilt_nodes = []
for vector in all_vectors:
    # reconstructs the original TextNode (text_template, excluded keys, etc.)
    # from the serialized _node_content in the stored metadata
    node = metadata_dict_to_node(vector.metadata, vector.text)
    rebuilt_nodes.append(node)
logger.info(f"Rebuilt {len(rebuilt_nodes)} nodes from the vector_store")


Then I could add rebuilt_nodes to the docstore, use that to create a new storage_context, and use that to rebuild the index with from_vector_store 😮‍💨.

This approach doesn't recreate embeddings, and I didn't end up needing from_documents / store_nodes_override / load_index_from_storage 🥳
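A rough sketch of that wiring (names assumed from earlier in the thread; I'm showing the empty-constructor form mentioned above, since it accepts an explicit storage_context):
Plain Text
from llama_index import StorageContext, VectorStoreIndex
from llama_index.storage.docstore import SimpleDocumentStore

# put the properly reconstructed nodes in a docstore so citations resolve
docstore = SimpleDocumentStore()
docstore.add_documents(rebuilt_nodes)

# pair the docstore with the existing, already-embedded vector store
storage_context = StorageContext.from_defaults(
    docstore=docstore,
    vector_store=vector_store,
)

# rebuild on top of the existing embeddings -- nothing is re-embedded
rebuilt_index = VectorStoreIndex(
    [],
    storage_context=storage_context,
    service_context=service_context,
)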
Note: the from_documents / store_nodes_override / load_index_from_storage approach gave me the same outcome as the original post. I figured out what was happening by logging node metadata at every step and combing through the changes. That's how I realized the metadata was being mutated by the TextNode casting.
ah yea ok that makes sense. Nice!
Thanks for your help, @Logan M!