I’m manually rebuilding an index from my vector_store and it’s breaking a few things, although the chat engine works fine. I think it has to do with my Metadata Handling.

Issues:
  1. My query engine is not honoring the node_text_template. In image_1, the node is properly formatted and metadata keys are excluded as expected (see the sketch after this list for what that should look like). In image_2, they’re not, even though the node._node_content.text_template is explicitly "text_template": "[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----". This means I'm sending the LLM junk that could mislead it.
  2. I’m getting chat responses and no errors, but my citations aren’t showing up. When inspecting sub_question_answer_pair.sources between image_1 and image_2, the only difference is that the former seems to be missing _node_content.
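For reference, a minimal sketch of what honoring the template should look like on a single node (field values are illustrative):
Plain Text
from llama_index.schema import MetadataMode, TextNode

node = TextNode(
    text="The quarterly results were...",
    metadata={"file_name": "report.pdf", "doc_id": "abc-123"},
    excluded_llm_metadata_keys=["doc_id"],
    text_template="[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----",
)
# should print the excerpt with the template applied and doc_id stripped,
# matching image_1
print(node.get_content(metadata_mode=MetadataMode.LLM))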
Pasting relevant code snippets in thread. Appreciate any help here 🙏
Attachments
image_1.png
image_2.png
26 comments
I think I have a cache issue that drops my index_structs, so load_index_from_storage fails to load indexes that have already been built.

As a failsafe, I made check_and_rebuild_indicies_from_vector_store to rebuild the index if the nodes already exist in the database.

Freshly created indexes work great. rebuilt_index results in the issues in the main thread.
Are you using a vector db? or just the default?
oh the code is there lol
yea the way you are saving/rebuilding seems a little strange.

Any reason to not use fsspec for aws?

Plain Text
index.storage_context.persist(persist_dir="s3_dir", fs=aws_fsspec)
...
storage_context = StorageContext.from_defaults(persist_dir="s3_dir", fs=aws_fsspec)
index = load_index_from_storage(storage_context, service_context=service_context)
fs is actually defined elsewhere, but it is s3fs.S3FileSystem
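Something like this, as a sketch (the bucket path is a placeholder; credentials come from the environment):
Plain Text
import s3fs

# credentials resolved from env vars / AWS config
aws_fsspec = s3fs.S3FileSystem(anon=False)

index.storage_context.persist(persist_dir="my-bucket/index_storage", fs=aws_fsspec)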
I want to refactor the caching/persisting issue separately (it needs a bigger overhaul, including partitioning the docstore)
I guess the whole check_and_rebuild_indicies_from_vector_store function feels a little confusing.

Why do you need to rebuild the index without using load_index_from_storage() ?
Plain Text
I think I have a cache issue that drops my index_structs, so load_index_from_storage fails to load indexes that have already been built.


hmm 🤔
persist() should be saving the index structs in the index store
I first try to load it, but if the index_struct gets lost (perhaps just because I'm on local dev, and something wonky happens), I don't want to have to rebuild the index w/ transformations/embeddings, and all that jazz

Just want to rebuild it:
Plain Text
if index_id in existing_index_ids:
    logger.info(f"Found existing index {index_id}")
    loaded_index = load_index_from_storage(
        storage_context,
        index_id=index_id,
        service_context=service_context,
    )
    indices.append(loaded_index)
else:
    logger.info(f"Could not find existing index {index_id}. Checking if nodes exist in vector store...")
    rebuilt_index = await check_and_rebuild_indicies_from_vector_store(
        index_id=index_id,
        service_context=service_context,
        fs=fs
    )
    indices.append(rebuilt_index)
oh boy 😅 Hmm, I guess overall this feels more confusing than it should be, at least at first glance. A bit difficult to parse where the issue is.

I see you are also using CustomPGVectorStore -- by default, as you probably know, the docstore and index store won't be populated if stores_text=True on the vector store.

You can override that behavior by setting store_nodes_override=True in the vector index constructor.
Yea, wish I knew when/why index_structs get dropped. I've had a hard time reproducing it, so I just wanted a workaround I had control over 🫤
I've never had the issue personally 😅 But I suspect it might be related to needing to set store_nodes_override=True when creating your index
Plain Text
index = VectorStoreIndex.from_documents(
    documents, 
    service_context=service_context, 
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    store_nodes_override=True
)

index.storage_context.persist(...)

index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./storage", vector_store=vector_store))
some flow like that
If I do VectorStoreIndex.from_documents, then it tries to rebuild the index and creates new embeddings
rebuilt_index = VectorStoreIndex.from_vector_store works, but idk if it has store_nodes_override
Yea it will re-embed, but I think you kind of need to start fresh to fix this issue 😅 store_nodes_override only matters when indexing new data

from_vector_store was really only meant for remote vector dbs. Although it kind of works here, the preferred way to do it is load_index_from_storage() or alternatively VectorStoreIndex([], storage_context=storage_context, service_context=service_context)
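For reference, a minimal sketch of that second option (assuming the vector_store and service_context defined earlier in the thread; the empty node list means nothing gets re-embedded):
Plain Text
from llama_index import StorageContext, VectorStoreIndex

# wrap the existing vector store; no documents are passed, so no new
# embeddings are created
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    [],
    storage_context=storage_context,
    service_context=service_context,
)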
Tried a bunch of approaches, including the latter.

Only one that didn't duplicate nodes and rerun embeddings is from_vector_store
I guess I'm saying you'll need to properly reingest/re-embed your data at least once in order to save the proper index data, because it seems like somewhere along the way things got borked. If you rebuild and save properly, there should be no issues, but you likely need to disregard data you've already saved.

Or at least, try with a subset of data
Plain Text
index = VectorStoreIndex.from_documents(
    documents, 
    service_context=service_context, 
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    store_nodes_override=True
)

index.storage_context.persist(...)

index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./storage", vector_store=vector_store))


This flow should be bullet proof tbh. Create the index and re-embed, but use store_nodes_override.

Persist from there, and then you should have no issues using load_index_from_storage
After a long slog, I figured out the issue. It was just my metadata handling.

In my code above, I was querying the vector_store, returning a list of [TextNode], saving it as all_vectors, and passing that directly when rebuilding the index. The issue is that when casting the query results as TextNodes, the old metadata gets re-wrapped as new metadata, so after N rounds of query-save-cast, the metadata would be nested N levels deep. That's why on >=2 chat calls, the text_template and excluded_llm_metadata_keys weren't being honored -- they weren't in the right place!
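Roughly what was happening, as an illustrative sketch (not the actual code; the _node_content value is abbreviated):
Plain Text
from llama_index.schema import TextNode

# what's actually stored in the vector store: real metadata plus a
# serialized copy of the whole node under "_node_content"
stored_metadata = {
    "file_name": "doc.pdf",
    "_node_content": '{"text_template": "...", "excluded_llm_metadata_keys": ["..."], ...}',
}

# naive re-wrap of a query result: the old payload (including _node_content)
# becomes plain metadata on the new node, so text_template and
# excluded_llm_metadata_keys sink one level deeper on every round
naive_node = TextNode(text="some excerpt", metadata=stored_metadata)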

The solution was to rebuild the nodes' metadata using metadata_dict_to_node:
Plain Text
from llama_index.vector_stores.utils import metadata_dict_to_node

rebuilt_nodes = []
for vector in all_vectors:
    # reconstructs the original TextNode (text_template, excluded keys, etc.)
    # from the serialized _node_content in the stored metadata
    node = metadata_dict_to_node(vector.metadata, vector.text)
    rebuilt_nodes.append(node)
logger.info(f"Rebuilt {len(rebuilt_nodes)} nodes from the vector_store")


Then I could add rebuilt_nodes to the docstore, use that to create a new storage_context, and use that to rebuild the index with from_vector_store 😮‍💨.

This approach doesn't recreate embeddings, and I didn't end up needing from_documents / store_nodes_override / load_index_from_storage 🥳
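A rough sketch of that wiring (names assumed from earlier in the thread; I'm showing the empty-constructor form mentioned above, since it accepts an explicit storage_context):
Plain Text
from llama_index import StorageContext, VectorStoreIndex
from llama_index.storage.docstore import SimpleDocumentStore

# put the properly reconstructed nodes in a docstore so citations resolve
docstore = SimpleDocumentStore()
docstore.add_documents(rebuilt_nodes)

# pair the docstore with the existing, already-embedded vector store
storage_context = StorageContext.from_defaults(
    docstore=docstore,
    vector_store=vector_store,
)

# rebuild on top of the existing embeddings -- nothing is re-embedded
rebuilt_index = VectorStoreIndex(
    [],
    storage_context=storage_context,
    service_context=service_context,
)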
Note: the from_documents / store_nodes_override / load_index_from_storage approach gave me the same outcome as the original post. I figured out what was happening by logging node metadata at every step and combing through the changes. That's how I realized the metadata was being mutated by the TextNode casting.
ah yea ok that makes sense. Nice!
Thanks for your help, @Logan M!