Got a simple question about using the storage context for multiple indices

Got a simple question about using the storage context for multiple indices. So far I've only created/used a single index, but now I want to try a SummaryIndex alongside my VectorStoreIndex. In the storage context I'm using, I define the docstore, index store, and vector store, so I understand how all of that relates to the VectorStoreIndex. But now I want to create a SummaryIndex, and I'm unclear whether I should create a new storage context or just pass my existing storage context to the constructor of the SummaryIndex. When I persist the storage context, what controls the file name of the SummaryIndex data structure? For example, for the docstore I can pass a file path to name the docstore whatever I want, but there is no 'summary_store' argument. Or is there? Right now I'm using the same documents as the vector store. Thanks for any clarity!
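Roughly what I have in mind, as a sketch (not my exact code; names are illustrative, and this assumes the llama_index API of the time):

```python
from llama_index import VectorStoreIndex, SummaryIndex, StorageContext

# one storage context shared by both indices over the same documents
storage_context = StorageContext.from_defaults()

vector_index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
summary_index = SummaryIndex.from_documents(documents, storage_context=storage_context)

# persisting writes docstore.json, index_store.json, etc.; the SummaryIndex
# structure lives inside index_store.json rather than in its own file
storage_context.persist(persist_dir="./storage")
```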
I tried passing my existing storage_context to the new SummaryIndex and I'm noticing issues. I can initially create the SummaryIndex with .from_documents(), but .refresh_ref_docs() does not work. I'm wondering if storing my docs in ChromaDB is causing an issue?
```python
# imports assume the 0.8-era llama_index package layout
import chromadb
from llama_index import StorageContext, SummaryIndex
from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.storage.index_store import SimpleIndexStore
from llama_index.vector_stores import ChromaVectorStore

collection_name = "technical_notes_vector_store"
client_load = chromadb.PersistentClient(path="C:/projects/technical-notes-llm-report/data/06_models/chroma_db")
collection_load = client_load.get_or_create_collection(collection_name)
vector_store_load = ChromaVectorStore(chroma_collection=collection_load)

storage_context_load = StorageContext.from_defaults(
    docstore=SimpleDocumentStore.from_persist_path("C:/projects/technical-notes-llm-report/data/06_models/chroma_db/docstore.json"),
    index_store=SimpleIndexStore.from_persist_path("C:/projects/technical-notes-llm-report/data/06_models/chroma_db/index_store.json"),
    vector_store=vector_store_load,
)

# service_context and valid_docs (a list of Documents) are defined elsewhere
summary_index = SummaryIndex.from_documents([], storage_context=storage_context_load, service_context=service_context)
summary_index.refresh_ref_docs(valid_docs, service_context=service_context)  # -> [all False]
len(summary_index.ref_doc_info.keys())  # -> 0
```
You can persist each index into the same storage context; you just need to give each index an index_id:

```python
index.set_index_id("test")
```

The single docstore.json and index_store.json will hold all the shared info

But I think you'll need to use load_index_from_storage(storage_context_load, index_id="test") for this to work properly? maybe? πŸ€”
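Something like this, as a rough sketch (the IDs are just examples):

```python
from llama_index import load_index_from_storage

# give each index a unique ID before persisting into the shared context
vector_index.set_index_id("vector")
summary_index.set_index_id("summary")
storage_context.persist(persist_dir="./storage")

# later, rebuild the storage context and load each index by its ID
vector_index = load_index_from_storage(storage_context_load, index_id="vector")
summary_index = load_index_from_storage(storage_context_load, index_id="summary")
```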
Thanks Logan, I was setting the index_id and I am able to fetch both; the issue is getting refresh_ref_docs() to work. I keep running into issues trying to add documents, and now I'm getting errors trying to use refresh_ref_docs() on my original VectorStoreIndex.
I keep getting the following error:
```
object of type 'Document' has no len()
********** Trace: refresh
    |_exception -> 0.0 seconds
**********
```
I don't understand the error, because that's the error .refresh_ref_docs() is throwing and I am passing a list of Documents. The only thing that stands out is that the documents are long, but they are LlamaIndex Document instances. I'll upload one; maybe you can recommend something?
Yea I don't think it's an issue with the length or anything.

Do you have the full error? Are you extra sure you are passing a list of document objects to the refresh?
How do I get more of the error? Yeah, I have two lists: one has 144 documents and the other has 105. I'm using this code to try the refresh, and each document is now generating the error:
```python
if len(valid_documents):
    for doc in valid_documents:
        try:
            results_valid = vector_index.refresh_ref_docs(doc)
        except Exception as e:
            print(f"Error refreshing valid doc:\n{doc} \n{e}")
```
You should pass in the entire valid_documents list, no? The code tells me you are passing them in one by one
Yeah, because I wanted to see if the problem was isolated to a specific document; if I pass in the list, it's harder to see which document causes the problem. I was originally passing in the entire list, and for two cycles everything worked perfectly, and then on this third run they all cause the error
If it helps, here are two pickle files for the two lists of documents.
I think your code will work if you just do `vector_index.refresh_ref_docs([doc])`
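i.e., something like this (an untested sketch of the suggested fix):

```python
for i, doc in enumerate(valid_documents):
    # refresh_ref_docs() expects a sequence of Documents, so wrap the single doc
    results_valid = vector_index.refresh_ref_docs([doc])
    print(f"item {i} inserted/updated: {results_valid}")
```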
I'll try, thank you, but the error started before I switched to one doc per refresh
I only added the try/except loop once I started getting the exception on the entire list
When I mentioned the full error, I meant the entire traceback (not just the final error)

That can help me track down where in the code it's breaking. The full traceback should be just printed to the terminal/screen
Ah, sorry! I tried your adjustment and now all the documents get added the long way, i.e., one by one. One follow-up question: during the refresh(), every document that gets added is generating dozens of warnings:
```
WARNING Delete of nonexisting embedding ID: 855e55b3-228e-426b-826a-c9f57285accc
```

I've never tried adding these documents before, so I'm not sure what is going on with this.
Huh, haven't seen that one before lol
lol! It's hilarious how beginners can break all kinds of things ❀️
Hi Logan, here is the full error I'm getting periodically when trying to refresh_ref_docs():
```
C:\projects\kedro-workbench\src\kedro_workbench\pipelines\refresh_llama_vector_index\nodes.py:199
in refresh_llama_vector_index

   196
   197     for i, doc in enumerate(valid_documents):
   198         print(f"item {i}\n{doc}\n")
 ❱ 199         results_valid = vector_index.refresh_ref_docs([doc])
   200         print(f"item {i} inserted/updated: \n{results_valid}\n")
   201 else:
   202     print(f"valid_documents is not a list of Documents {type(valid_documents)}")

C:\anaconda3\envs\kedro_workbench_venv\lib\site-packages\llama_index\indices\base.py:313 in refresh_ref_docs
C:\anaconda3\envs\kedro_workbench_venv\lib\site-packages\llama_index\indices\base.py:277 in update_ref_doc
C:\anaconda3\envs\kedro_workbench_venv\lib\site-packages\llama_index\indices\vector_store\base.py:312 in delete_ref_doc
C:\anaconda3\envs\kedro_workbench_venv\lib\site-packages\llama_index\data_structs\data_structs.py:198 in delete

KeyError: 'ecf981de-26ce-4e4f-ab94-1aa1a0a9bf80'

********** Trace: refresh
    |_exception -> 0.0 seconds
    |_exception -> 0.0 seconds
**********
```
I can't see anything off with the Document that causes the exception:
```python
Document(id_='version-1140182337-june-2-2023', embedding=None, metadata={'id': 'version-1140182337-june-2-2023', 'source': 'https://learn.microsoft.com/en-us/deployedge/microsoft-edge-relnote-mobile-stable-channel#version-1140182337-june-2-2023', 'collection': 'mobile_stable_channel_notes', 'published': '02-06-2023', 'day_of_week': 'Friday', 'content_link:display_text_2': 'https://learn.microsoft.com/en-us/deployedge/microsoft-edge-relnote-mobile-stable-channel#policy-update', 'content_link:Manage': 'https://learn.microsoft.com/en-us/mem/intune/apps/manage-microsoft-edge#ios-website-data-store', 'hash': 'ddec246a025a172b64c2371fe2133be0d63c1d1f9878d45c1345d14492edb399', 'keywords': 'Android, iOS', 'cve_fixes': '', 'cve_mentions': ''}, excluded_embed_metadata_keys=['id', 'day_of_week', 'hash', 'keywords', 'content_link:display_text_2', 'content_link:Manage'], excluded_llm_metadata_keys=['id', 'day_of_week', 'hash'], relationships={}, hash='86e31e35970e870d2ef95543d9227b0696fc501ff77fbf6a7a681fd443108ba9', text="Version 114.0.1823.37: June 2, 2023 \nVersion 114.0.1823.37: June 2, 2023 \nFixed various bugs and performance issues. \nPolicy update \niOS Website data store access. \n Currently, the persistent data store is only statically used by personal accounts. Because work or school accounts can't use this data store, browsing data rather than cookies are lost when their sessions end. This new policy lets organizations access the data store dynamically, which persists browsing data for work or school accounts, giving users a better browsing experience. For more information, see this policy in \nManage Microsoft Edge on iOS and Android with Intune \n. \n", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
```
The other interesting thing is that when this document is attempted, a whole bunch of those `WARNING Delete of nonexisting embedding ID: b23ff1f8-99e9-4702-be0b-64f9ca10a52e` warnings appear. In some cases the warnings go crazy; in others, nothing. I found that this warning is Chroma-specific.
  • I think there is something wrong with how embeddings are being stored or referenced with the ChromaVectorStore. I successfully, added ~2500 documents using the steps you provided earlier but this issue with the nonexistent embed IDs kept coming up.
  • I just tried creating a basic query_engine for the VectorStoreIndex and the first time I ran it, literally 2500 of those Warnings popped up and then the query_engine responds:
```
[09/27/23 07:41:28] WARNING Number of requested results 5 is greater than number of
                    elements in index 0, updating n_results = 0   local_persistent_hnsw.py:320
********** Trace: query
    |_query -> 0.907415 seconds
      |_retrieve -> 0.906342 seconds
        |_embedding -> 0.894871 seconds
      |_synthesize -> 0.001073 seconds
**********
None
```
The documents are still there, but after all the warnings about nonexistent embedding IDs finish, the index breaks...
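One way to sanity-check this (an untested diagnostic sketch; paths and names reused from the snippets above) is to compare the node IDs the index thinks it owns against the embedding IDs actually stored in Chroma:

```python
import chromadb

client = chromadb.PersistentClient(path="C:/projects/technical-notes-llm-report/data/06_models/chroma_db")
collection = client.get_or_create_collection("technical_notes_vector_store")

# all embedding IDs currently stored in the Chroma collection
chroma_ids = set(collection.get()["ids"])

# all node IDs the (already loaded) vector index believes it owns
index_node_ids = {
    node_id
    for info in vector_index.ref_doc_info.values()
    for node_id in info.node_ids
}

print(f"embeddings in Chroma: {len(chroma_ids)}")
print(f"nodes tracked by the index: {len(index_node_ids)}")
# tracked nodes with no matching embedding would explain the
# "Delete of nonexisting embedding ID" warnings during refresh
print(f"tracked but missing from Chroma: {len(index_node_ids - chroma_ids)}")
```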